
SYSTAT 10.2®
Statistics I

WWW.SYSTAT.COM

For more information about SYSTAT Software Inc. products, please visit our WWW site at http://www.systat.com or contact:

Marketing Department
SYSTAT Software Inc.
501 Canal Boulevard, Suite F
Richmond, CA 94804-2028
Tel: (800) 797-7401, (866) 797-8288
Fax: (800) 797-7406

Windows is a registered trademark of Microsoft Corporation.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SYSTAT Software Inc., 501 Canal Boulevard, Suite F, Richmond, CA 94804-2028.

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective companies.

SYSTAT™ 10.2 Statistics I
Copyright © 2002 by SYSTAT Software Inc.
All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

1 2 3 4 5 6 7 8 9 0 05 04 03 02 01 00

ISBN 81-88341-04-5

Contents

1  Introduction to Statistics  I-1
   Descriptive Statistics  I-1
      Know Your Batch  I-2
      Sum, Mean, and Standard Deviation  I-3
      Stem-and-Leaf Plots  I-3
      The Median  I-4
      Sorting  I-5
      Standardizing  I-6
   Inferential Statistics  I-7
      What Is a Population?  I-7
      Picking a Simple Random Sample  I-8
      Specifying a Model  I-10
      Estimating a Model  I-10
      Confidence Intervals  I-11
      Hypothesis Testing  I-12
      Checking Assumptions  I-14
   References  I-16

2  Bootstrapping and Sampling  I-17
   Statistical Background  I-17
   Bootstrapping in SYSTAT  I-20
      Bootstrap Main Dialog Box  I-20
      Using Commands  I-20
      Usage Considerations  I-20
   Examples  I-21
   Computation  I-28
      Algorithms  I-28
      Missing Data  I-28
   References  I-29

3  Classification and Regression Trees  I-31
   Statistical Background  I-31
      The Basic Tree Model  I-32
      Categorical or Quantitative Predictors  I-35
      Regression Trees  I-35
      Classification Trees  I-36
      Stopping Rules, Pruning, and Cross-Validation  I-37
      Loss Functions  I-38
      Geometry  I-38
   Classification and Regression Trees in SYSTAT  I-41
      Trees Main Dialog Box  I-41
      Using Commands  I-43
      Usage Considerations  I-44
   Examples  I-44
   Computation  I-51
      Algorithms  I-51
      Missing Data  I-51
   References  I-51

4  Cluster Analysis  I-53
   Statistical Background  I-54
      Types of Clustering  I-54
      Correlations and Distances  I-55
      Hierarchical Clustering  I-56
      Partitioning via K-Means  I-60
      Additive Trees  I-62
   Cluster Analysis in SYSTAT  I-64
      Hierarchical Clustering Main Dialog Box  I-64
      K-Means Main Dialog Box  I-67
      Additive Trees Main Dialog Box  I-68
      Using Commands  I-69
      Usage Considerations  I-70
   Examples  I-71
   Computation  I-84
      Algorithms  I-84
      Missing Data  I-84
   References  I-84

5  Conjoint Analysis  I-87
   Statistical Background  I-87
      Additive Tables  I-88
      Multiplicative Tables  I-89
      Computing Table Margins Based on an Additive Model  I-91
      Applied Conjoint Analysis  I-92
   Conjoint Analysis in SYSTAT  I-93
      Conjoint Analysis Main Dialog Box  I-93
      Using Commands  I-95
      Usage Considerations  I-95
   Examples  I-96
   Computation  I-112
      Algorithms  I-112
      Missing Data  I-113
   References  I-113

6  Correlations, Similarities, and Distance Measures  I-115
   Statistical Background  I-116
      The Scatterplot Matrix (SPLOM)  I-117
      The Pearson Correlation Coefficient  I-117
      Other Measures of Association  I-119
      Transposed Data  I-122
      Hadi Robust Outlier Detection  I-123
   Correlations in SYSTAT  I-124
      Correlations Main Dialog Box  I-124
      Using Commands  I-128
      Usage Considerations  I-129
   Examples  I-129
   Computation  I-145
      Algorithms  I-145
      Missing Data  I-146
   References  I-146

7  Correspondence Analysis  I-147
   Statistical Background  I-147
      The Simple Model  I-147
      The Multiple Model  I-148
   Correspondence Analysis in SYSTAT  I-149
      Correspondence Analysis Main Dialog Box  I-149
      Using Commands  I-150
      Usage Considerations  I-150
   Examples  I-151
   Computation  I-156
      Algorithms  I-156
      Missing Data  I-156
   References  I-156

8  Crosstabulation  I-157
   Statistical Background  I-158
      Making Tables  I-158
      Significance Tests and Measures of Association  I-160
   Crosstabulations in SYSTAT  I-166
      One-Way Frequency Tables Main Dialog Box  I-166
      Two-Way Frequency Tables Main Dialog Box  I-167
      Multiway Frequency Tables Main Dialog Box  I-170
      Using Commands  I-171
      Usage Considerations  I-172
   Examples  I-173
   Computation  I-203
   References  I-203

9  Descriptive Statistics  I-205
   Statistical Background  I-206
      Location  I-206
      Spread  I-207
      The Normal Distribution  I-207
      Non-Normal Shape  I-208
      Subpopulations  I-209
   Descriptive Statistics in SYSTAT  I-211
      Basic Statistics Main Dialog Box  I-211
      Stem Main Dialog Box  I-213
      Cronbach Main Dialog Box  I-214
      Using Commands  I-215
      Usage Considerations  I-215
   Examples  I-216
   Computation  I-225
      Algorithms  I-225
   References  I-225

10  Design of Experiments  I-227
   Statistical Background  I-228
      The Research Problem  I-228
      Types of Investigation  I-229
      The Importance of Having a Strategy  I-230
      The Role of Experimental Design in Research  I-231
      Types of Experimental Designs  I-231
      Factorial Designs  I-232
      Response Surface Designs  I-236
      Mixture Designs  I-239
      Optimal Designs  I-244
      Choosing a Design  I-248
   Design of Experiments in SYSTAT  I-250
      Design of Experiments Wizard  I-250
      Classic Design of Experiments  I-251
      Using Commands  I-252
      Usage Considerations  I-252
   Examples  I-253
   References  I-273

11  Discriminant Analysis  I-275
   Statistical Background  I-276
      Linear Discriminant Model  I-276
   Discriminant Analysis in SYSTAT  I-283
      Discriminant Analysis Main Dialog Box  I-283
      Using Commands  I-287
      Usage Considerations  I-288
   Examples  I-288
   References  I-326

12  Factor Analysis  I-327
   Statistical Background  I-327
      A Principal Component  I-328
      Factor Analysis  I-331
      Principal Components versus Factor Analysis  I-334
      Applications and Caveats  I-334
   Factor Analysis in SYSTAT  I-335
      Factor Analysis Main Dialog Box  I-335
      Using Commands  I-339
      Usage Considerations  I-339
   Examples  I-341
   Computation  I-362
      Algorithms  I-362
      Missing Data  I-362
   References  I-363

13  Linear Models  I-365
   Simple Linear Models  I-365
      Equation for a Line  I-366
      Least Squares  I-369
      Estimation and Inference  I-369
      Standard Errors  I-371
      Hypothesis Testing  I-371
      Multiple Correlation  I-372
      Regression Diagnostics  I-373
   Multiple Regression  I-376
      Variable Selection  I-379
      Using an SSCP, a Covariance, or a Correlation Matrix as Input  I-381
   Analysis of Variance  I-382
      Effects Coding  I-383
      Means Coding  I-384
      Models  I-385
      Hypotheses  I-386
      Multigroup ANOVA  I-386
      Factorial ANOVA  I-387
      Data Screening and Assumptions  I-388
      Levene Test  I-388
      Pairwise Mean Comparisons  I-389
      Linear and Quadratic Contrasts  I-390
   Repeated Measures  I-393
      Assumptions in Repeated Measures  I-394
      Issues in Repeated Measures Analysis  I-395
   Types of Sums of Squares  I-396
      SYSTAT's Sums of Squares  I-397

14  Linear Models I: Linear Regression  I-399
   Linear Regression in SYSTAT  I-400
      Regression Main Dialog Box  I-400
      Using Commands  I-403
      Usage Considerations  I-403
   Examples  I-404
   References  I-430

15  Linear Models II: Analysis of Variance  I-431
   Analysis of Variance in SYSTAT  I-432
      ANOVA: Estimate Model  I-432
      ANOVA: Hypothesis Test  I-434
      Repeated Measures  I-436
      Using Commands  I-438
      Usage Considerations  I-438
   Examples  I-439
   Computation  I-485
      Algorithms  I-485
   References  I-485

16  Linear Models III: General Linear Models  I-487
   General Linear Models in SYSTAT  I-488
      Model Estimation (in GLM)  I-488
      Pairwise Comparisons  I-493
      Hypothesis Tests  I-495
      Post hoc Tests for Repeated Measures  I-495
      Using Commands  I-501
      Usage Considerations  I-501
   Examples  I-503
   Computation  I-546
      Algorithms  I-546
   References  I-546

17  Logistic Regression  I-549
   Statistical Background  I-549
      Binary Logit  I-550
      Multinomial Logit  I-552
      Conditional Logit  I-552
      Discrete Choice Logit  I-554
      Stepwise Logit  I-556
   Logistic Regression in SYSTAT  I-557
      Estimate Model Main Dialog Box  I-557
      Deciles of Risk  I-561
      Quantiles  I-562
      Simulation  I-563
      Hypothesis  I-563
      Using Commands  I-564
      Usage Considerations  I-565
   Examples  I-566
   Computation  I-609
      Algorithms  I-609
      Missing Data  I-609
   References  I-613

18  Loglinear Models  I-617
   Statistical Background  I-618
      Fitting a Loglinear Model  I-620
   Loglinear Models in SYSTAT  I-621
      Loglinear Model Main Dialog Box  I-621
      Frequency Tables (Tabulate)  I-625
      Using Commands  I-626
      Usage Considerations  I-626
   Examples  I-627
   Computation  I-646
      Algorithms  I-646
   References  I-646

Index  649

List of Examples

Additive Trees  I-82
Analysis of Covariance  I-462
ANOVA Assumptions and Contrasts  I-442
Automatic Stepwise Regression  I-417
Basic Statistics  I-216
Binary Logit  I-566
Binary Logit with Interactions  I-569
Binary Logit with Multiple Predictors  I-568
Box-Behnken Design  I-264
Box-Cox Model  I-103
Box-Hunter Fractional Factorial Design  I-256
By-Choice Data Format  I-598
Canonical Correlation Analysis  I-544
Canonical Correlations: Using Text Output  I-26
Central Composite Response Surface Design  I-269
Choice Data  I-96
Classification Tree  I-45
Cochran's Test of Linear Trend  I-194
Conditional Logistic Regression  I-588
Confidence Interval on a Median  I-25
Confidence Intervals for One-Way Table Percentages  I-199
Contrasts  I-313
Correspondence Analysis (Simple)  I-151
Covariance Alternatives to Repeated Measures  I-532
Crossover and Changeover Designs  I-520
Cross-Validation  I-321
Deciles of Risk and Model Diagnostics  I-574
Discrete Choice Models  I-591
Discriminant Analysis  I-536
Discriminant Analysis Using Automatic Backward Stepping  I-298
Discriminant Analysis Using Automatic Forward Stepping  I-293
Discriminant Analysis Using Complete Estimation  I-288
Discriminant Analysis Using Interactive Stepping  I-306
Employment Discrimination  I-107
Factor Analysis Using a Covariance Matrix  I-353
Factor Analysis Using a Rectangular File  I-356
Fisher's Exact Test  I-192
Fractional Factorial Design  I-254
Fractional Factorial Designs  I-512
Frequency Input  I-177
Full Factorial Designs  I-253
Hadi Robust Outlier Detection  I-140
Hierarchical Clustering: Clustering Cases  I-75
Hierarchical Clustering: Clustering Variables and Cases  I-79
Hierarchical Clustering: Clustering Variables  I-78
Hierarchical Clustering: Distance Matrix Input  I-81
Hotelling's T-Square  I-535
Hypothesis Testing  I-604
Incomplete Block Designs  I-510
Interactive Stepwise Regression  I-420
Iterated Principal Axis  I-348
K-Means Clustering  I-71
Latin Square Designs  I-518
Latin Squares  I-258
Linear Models  I-21
Loglinear Modeling of a Four-Way Table  I-627
Mantel-Haenszel Test  I-200
Maximum Likelihood  I-344
McNemar's Test of Symmetry  I-197
Missing Category Codes  I-178
Missing Cells Designs (the Means Model)  I-523
Missing Data: EM Estimation  I-135
Missing Data: Pairwise Deletion  I-134
Mixed Models  I-459
Mixture Design  I-265
Mixture Design with Constraints  I-266
Mixture Models  I-544
Multinomial Logit  I-582
Multiple Correspondence Analysis  I-153
Multiple Linear Regression  I-413
Multivariate Analysis of Variance  I-480
Multiway Tables  I-181
Nested Designs  I-513
Odds Ratios  I-189
One-Way ANOVA  I-439
One-Way ANOVA  I-503
One-Way Repeated Measures  I-464
One-Way Tables  I-173
Optimal Designs: Coordinate Exchange  I-270
Partial Correlations  I-545
Pearson Correlations  I-129
Percentages  I-179
Plackett-Burman Design  I-263
Principal Components Analysis (Within Groups)  I-540
Principal Components  I-341
Probabilities Associated with Correlations  I-137
Quadratic Model  I-315
Quantiles  I-579
Quasi-Maximum Likelihood  I-607
Randomized Block Designs  I-510
Regression Tree with Box Plots  I-47
Regression Tree with Dit Plots  I-49
Regression with Ecological or Grouped Data  I-428
Regression without the Constant  I-429
Repeated Measures Analysis of Covariance  I-478
One Within Factor with Ordered Levels  I-470
One Within Factor  I-472
Repeated Measures ANOVA for Two Trial Factors  I-475
Residuals and Diagnostics for Simple Linear Regression  I-410
Rotation  I-350
S2 and S3 Coefficients  I-143
Grouping Variables  I-218
One Grouping Variable  I-217
Screening Effects  I-638
Separate Variance Hypothesis Tests  I-461
Single-Degree-of-Freedom Designs  I-457
Spearman Correlations  I-143
Spearman Rank Correlation  I-24
Split Plot Designs  I-515
Stem-and-Leaf Plot  I-221
Stepwise Regression  I-600
Structural Zeros  I-641
Tables with Ordered Categories  I-196
Tables without Analyses  I-645
Taguchi Design  I-260
Testing Nonzero Null Hypotheses  I-427
Testing whether a Single Coefficient Equals Zero  I-424
Testing whether Multiple Coefficients Equal Zero  I-426
Tetrachoric Correlation  I-145
Transformations  I-132
Transformations  I-407
Two-Way ANOVA  I-447
Two-Way Table Statistics (Long Results)  I-188
Two-Way Table Statistics  I-186
Two-Way Tables  I-175
Weighting Means  I-532
Word Frequency  I-100


Chapter 1
Introduction to Statistics

Leland Wilkinson

Statistics and state have the same root. Statistics are the numbers of the state. More generally, they are any numbers or symbols that formally summarize our observations of the world. As we all know, summaries can mislead or elucidate. Statistics also refers to the introductory course we all seem to hate in college. When taught well, however, it is this course that teaches us how to use numbers to elucidate rather than to mislead.

Statisticians specialize in many areas—probability, exploratory data analysis, modeling, social policy, decision making, and others. While they may philosophically disagree, statisticians nevertheless recognize at least two fundamental tasks: description and inference. Description involves characterizing a batch of data in simple but informative ways. Inference involves generalizing from a sample of data to a larger population of possible data. Descriptive statistics help us to observe more acutely, and inferential statistics help us to formulate and test hypotheses.

Any distinctions, such as this one between descriptive and inferential statistics, are potentially misleading. Let’s look at some examples, however, to see some differences between these approaches.

Descriptive Statistics

Descriptive statistics may be single numerical summaries of a batch, such as an average. Or, they may be more complex tables and graphs. What distinguishes descriptive statistics is their reference to a given batch of data rather than to a more general population or class. While there are exceptions, we usually examine descriptive statistics to understand the structure of a batch. A closely related field is


called exploratory data analysis. Both exploratory and descriptive methods may lead us to formulate laws or test hypotheses, but their focus is on the data at hand.

Consider, for example, the following batch. These are numbers of arrests by sex in 1985 for selected crimes in the United States. The source is the FBI Uniform Crime Reports. What can we say about differences between the patterns of arrests of men and women in the United States in 1985?

CRIME         MALES   FEMALES
murder        12904      1815
rape          28865       303
robbery      105401      8639
assault      211228     32926
burglary     326959     26753
larceny      744423    334053
auto          97835     10093
arson         13129      2003
battery      416735     75937
forgery       46286     23181
fraud        151773    111825
embezzle       5624      3184
vandal       181600     20192
weapons      134210     10970
vice          29584     67592
sex           74602      6108
drugs        562754     90038
gambling      21995      3879
family        35553      5086
dui         1208416    157131
drunk        726214     70573
disorderly   435198     99252
vagrancy      24592      3001
runaway       53808     72473

Know Your Batch

First, we must be careful in characterizing the batch. These statistics do not cover the gamut of U.S. crimes. We left out curfew and loitering violations, for example. Not all reported crimes are included in these statistics. Some false arrests may be included.



State laws vary on the definitions of some of these crimes. Agencies may modify arrest statistics for political purposes. Know where your batch came from before you use it.

Sum, Mean, and Standard Deviation

Were there more male than female arrests for these crimes in 1985? The following output shows us the answer. Males were arrested for 5,649,688 crimes (not 5,649,688 males—some may have been arrested more than once). Females were arrested 1,237,007 times.

                    MALES       FEMALES
N of cases             24            24
Minimum          5624.000       303.000
Maximum       1208416.000    334053.000
Sum           5649688.000   1237007.000
Mean           235403.667     51541.958
Standard Dev   305947.056     74220.864

How about the average (mean) number of arrests for a crime? For males, this was 235,403 and for females, 51,542. Does the mean make any sense to you as a summary statistic? Another statistic in the table, the standard deviation, measures how much these numbers vary around the average. The standard deviation is the square root of the average squared deviation of the observations from their mean. It, too, has problems in this instance. First of all, both the mean and standard deviation should represent what you could observe in your batch, on average: the mean number of fish in a pond, the mean number of children in a classroom, the mean number of red blood cells per cubic millimeter. Here, we would have to say, “the mean murder-rape-robbery-…-runaway type of crime.” Second, even if the mean made sense descriptively, we might question its use as a typical crime-arrest statistic. To see why, we need to examine the shape of these numbers.
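These summaries are easy to verify outside SYSTAT. Here is a minimal Python sketch (Python is used purely for illustration; the hypothetical males list is simply the male column of the arrest table):

from statistics import mean, stdev

# Male arrest counts from the table above.
males = [12904, 28865, 105401, 211228, 326959, 744423, 97835, 13129,
         416735, 46286, 151773, 5624, 181600, 134210, 29584, 74602,
         562754, 21995, 35553, 1208416, 726214, 435198, 24592, 53808]

print(sum(males))              # 5649688 arrests in total
print(round(mean(males), 3))   # 235403.667
print(round(stdev(males), 3))  # sample standard deviation (n - 1 divisor)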

Stem-and-Leaf Plots

Let’s look at a display that compresses these data a little less drastically. The stem-and-leaf plot is like a tally. We pick a most significant digit or digits and tally the next digit to the right. By using trailing digits instead of tally marks, we preserve extra digits in the data. Notice the shape of the tally. There are mostly smaller numbers of arrests and a few crimes (such as larceny and driving under the influence of alcohol) with larger



numbers of arrests. Another way of saying this is that the data are positively skewed toward larger numbers for both males and females.

Stem and Leaf Plot of variable: MALES, N = 24

Minimum:        5624.000
Lower hinge:   29224.500
Median:       101618.000
Upper hinge:  371847.000
Maximum:     1208416.000

    0 H 011222234579
    1 M 0358
    2   1
    3 H 2
    4   13
    5   6
    6
    7   24
  * * * Outside Values * * *
   12   0

Stem and Leaf Plot of variable: FEMALES, N = 24

Minimum:         303.000
Lower hinge:    4482.500
Median:        21686.500
Upper hinge:   74205.000
Maximum:      334053.000

    0 H 00000000011
    0 M 2223
    0
    0 H 6777
    0   99
    1   1
    1
    1   5
  * * * Outside Values * * *
    3   3
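The tally itself is simple to mimic. A rough Python sketch of the idea follows (an illustration, not SYSTAT's STEM procedure; it omits the hinge markers H and M and skips empty stems). Here the stem is the count in hundreds of thousands and the leaf is the next digit:

from collections import defaultdict

def stem_and_leaf(values, stem_unit=100_000):
    # stem = value in units of stem_unit; leaf = the next digit to the right
    tally = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v // (stem_unit // 10), 10)
        tally[stem].append(leaf)
    for stem in sorted(tally):
        print(f"{stem:3d} | {''.join(str(leaf) for leaf in tally[stem])}")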

The Median

When data are skewed like this, the mean gets pulled from the center of the majority of numbers toward the extreme with the few. A statistic that is not as sensitive to extreme values is the median. The median is the value above which half the data fall. More precisely, if you sort the data, the median is the middle value or the average of the two middle values. Notice that for males the median is 101,618, and for females, 21,686. Both are considerably smaller than the means and more typical of the majority of the numbers. This is why the median is often used for representing skewed data, such as incomes, populations, or reaction times.

We still have the same representativeness problem that we had with the mean, however. Even if the medians corresponded to real data values in this batch (which they don’t because there is an even number of observations), it would be hard to characterize what they would represent.
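As a quick numerical illustration of how differently the two summaries behave on skewed data (again plain Python for illustration; the hypothetical females list is the female column of the arrest table):

from statistics import mean, median

# Female arrest counts from the table above.
females = [1815, 303, 8639, 32926, 26753, 334053, 10093, 2003,
           75937, 23181, 111825, 3184, 20192, 10970, 67592, 6108,
           90038, 3879, 5086, 157131, 70573, 99252, 3001, 72473]

print(round(mean(females), 3))  # 51541.958 -- pulled toward the extremes
print(median(females))          # 21686.5   -- closer to the bulk of the data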



Sorting

Most people think of means, standard deviations, and medians as the primary descriptive statistics. They are useful summary quantities when the observations represent values of a single variable. We purposely chose an example where they are less appropriate, however, even when they are easily computable. There are better ways to reveal the patterns in these data. Let’s look at sorting as a way of uncovering structure.

I was talking once with an FBI agent who had helped to uncover the Chicago machine’s voting fraud scandal some years ago. He was a statistician, so I was curious what statistical methods he used to prove the fraud. He replied, “We sorted the voter registration tape alphabetically by last name. Then we looked for duplicate names and addresses.” Sorting is one of the most basic and powerful data analysis techniques. The stem-and-leaf plot, for example, is a sorted display.

We can sort on any numerical or character variable. It depends on our goal. We began this chapter with a question: Are there differences between the patterns of arrests of men and women in the United States in 1985? How about sorting the male and female arrests separately? If we do this, we will get a list of crimes in order of decreasing frequency within sex.

MALES        FEMALES
dui          larceny
larceny      dui
drunk        fraud
drugs        disorderly
disorderly   drugs
battery      battery
burglary     runaway
assault      drunk
vandal       vice
fraud        assault
weapons      burglary
robbery      forgery
auto         vandal
sex          weapons
runaway      auto
forgery      robbery
family       sex
vice         family
rape         gambling
vagrancy     embezzle
gambling     vagrancy
arson        arson
murder       murder
embezzle     rape


You might want to connect similar crimes with lines. The number of crossings would indicate differences in ranks.
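That ranking is a one-line sort once the counts are in a suitable structure. A Python sketch (illustrative only; the hypothetical counts dictionary holds the male and female columns of the arrest table):

# (male, female) arrest counts keyed by crime.
counts = {
    "murder":   (12904, 1815),    "rape":       (28865, 303),
    "robbery":  (105401, 8639),   "assault":    (211228, 32926),
    "burglary": (326959, 26753),  "larceny":    (744423, 334053),
    "auto":     (97835, 10093),   "arson":      (13129, 2003),
    "battery":  (416735, 75937),  "forgery":    (46286, 23181),
    "fraud":    (151773, 111825), "embezzle":   (5624, 3184),
    "vandal":   (181600, 20192),  "weapons":    (134210, 10970),
    "vice":     (29584, 67592),   "sex":        (74602, 6108),
    "drugs":    (562754, 90038),  "gambling":   (21995, 3879),
    "family":   (35553, 5086),    "dui":        (1208416, 157131),
    "drunk":    (726214, 70573),  "disorderly": (435198, 99252),
    "vagrancy": (24592, 3001),    "runaway":    (53808, 72473),
}

# Decreasing frequency within each sex, printed side by side.
by_males = sorted(counts, key=lambda c: counts[c][0], reverse=True)
by_females = sorted(counts, key=lambda c: counts[c][1], reverse=True)
for m, f in zip(by_males, by_females):
    print(f"{m:12s}{f}")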

Standardizing

This ranking is influenced by prevalence. The most frequent crimes occur at the top of the list in both groups. Comparisons within crimes are obscured by this influence. Men committed almost 100 times as many rapes as women, for example, yet rape is near the bottom of both lists. If we are interested in contrasting the sexes on patterns of crime while holding prevalence constant, we must standardize the data. There are several ways to do this. You may have heard of standardized test scores for aptitude tests. These are usually produced by subtracting means and then dividing by standard deviations. Another method is simply to divide by row or column totals. For the crime data, we will divide by totals within rows (each crime). Doing so gives us the proportion of each arresting crime committed by men or women. The total of these two proportions will thus be 1.

Now, a contrast between men and women on this standardized value should reveal variations in arrest patterns within crime type. By subtracting the female proportion from the male, we will highlight primarily male crimes with positive values and female crimes with negative. Next, sort these differences and plot them in a simple graph. The following shows the result:

[Graph: crimes sorted by the male minus female difference in arrest proportions, from predominantly male crimes at the top to predominantly female crimes at the bottom]



Now we can see clear contrasts between males and females in arrest patterns. The predominantly aggressive crimes appear at the top of the list. Rape now appears where it belongs—an aggressive, rather than sexual, crime. A few crimes dominated by females are at the bottom.
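The standardize-subtract-sort procedure just described is equally compact. This sketch reuses the hypothetical counts dictionary from the sorting sketch above (so it is not self-contained on its own):

# Male proportion minus female proportion of each crime's arrests;
# positive values mark predominantly male crimes.
diffs = {crime: (m - f) / (m + f) for crime, (m, f) in counts.items()}
for crime in sorted(diffs, key=diffs.get, reverse=True):
    print(f"{crime:12s}{diffs[crime]: .3f}")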

Inferential Statistics

We often want to do more than describe a particular sample. In order to generalize, formulate a policy, or test a hypothesis, we need to make an inference. Making an inference implies that we think a model describes a more general population from which our data have been randomly sampled. Sometimes it is difficult to imagine a population from which you have gathered data. A population can be “all possible voters,” “all possible replications of this experiment,” or “all possible moviegoers.” When you make inferences, you should have a population in mind.

What Is a Population?

We are going to use inferential methods to estimate the mean age of the unusual population contained in the 1980 edition of Who’s Who in America. We could enter all 73,500 ages into a SYSTAT file and compute the mean age exactly. If it were practical, this would be the preferred method. Sometimes, however, a sampling estimate can be more accurate than an entire census. For example, biases are introduced into large censuses from refusals to comply, keypunch or coding errors, and other sources. In


these cases, a carefully constructed random sample can yield less-biased information about the population.

This is an unusual population because it is contained in a book and is therefore finite. We are not about to estimate the mean age of the rich and famous. After all, Spy magazine used to have a regular feature listing all of the famous people who are not in Who's Who. And bogus listings may escape the careful fact checking of the Who's Who research staff. When we get our estimate, we might be tempted to generalize beyond the book, but we would be wrong to do so. For example, if a psychologist measures opinions in a random sample from a class of college sophomores, his or her conclusions should begin with the statement, "College sophomores at my university think…" If the word "people" is substituted for "college sophomores," it is the experimenter's responsibility to make clear that the sample is representative of the larger group on all attributes that might affect the results.

Picking a Simple Random Sample

That our population is finite should cause us no problems as long as our sample is much smaller than the population. Otherwise, we would have to use special techniques to adjust for the bias it would cause. How do we choose a simple random sample from a population? We use a method that ensures that every possible sample of a given size has an equal chance of being chosen. The following methods are not random:

- Pick the first name on every tenth page (some names have no chance of being chosen).

- Close your eyes, flip the pages of the book, and point to a name (Tversky and others have done research that shows that humans cannot behave randomly).

- Randomly pick the first letter of the last name and randomly choose from the names beginning with that letter (there are more names beginning with C, for example, than with I).

The way to pick randomly from a book, file, or any finite population is to assign a number to each name or case and then pick a sample of numbers randomly. You can use SYSTAT to generate a random number between 1 and 73,500, for example, with the expression:

1 + INT(73500*URN)
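For comparison, the same draw in Python (illustrative; URN is SYSTAT's uniform random number function on the interval 0 to 1):

import random

# Each of the 73,500 case numbers is equally likely.
case_number = random.randint(1, 73500)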


There are too many pages in Who’s Who to use this method, however. As a short cut, I randomly generated a page number and picked a name from the page using the random number generator. This method should work well provided that each page has approximately the same number of names (between 19 and 21 in this case). The sample is shown below:

AGE   SEX       AGE   SEX
 60   male       38   female
 74   male       44   male
 39   female     49   male
 78   male       62   male
 66   male       76   female
 63   male       51   male
 45   male       51   male
 56   male       75   male
 65   male       65   female
 51   male       41   male
 52   male       67   male
 59   male       50   male
 67   male       55   male
 48   male       45   male
 36   female     49   male
 34   female     58   male
 68   male       47   male
 50   male       55   male
 51   male       67   male
 47   male       58   male
 81   male       76   male
 56   male       70   male
 49   male       69   male
 58   male       46   male
 58   male       60   male
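The two-stage shortcut described above can be sketched the same way. The page count below is a hypothetical figure for illustration, derived from roughly 73,500 names at about 20 names per page:

import random

page = random.randint(1, 3675)         # assumed ~3675 pages (hypothetical)
name_on_page = random.randint(1, 20)   # assumes 19-21 names per page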


Specifying a Model

To make an inference about age, we need to construct a model for our population:

a = µ + ε

This model says that the age (a) of someone we pick from the book can be described by an overall mean age (µ) plus an amount of error (ε) specific to that person and due to random factors that are too numerous and insignificant to describe systematically. Notice that we use Greek letters to denote things that we cannot observe directly and Roman letters for those that we do observe. Of the unobservables in the model, µ is called a parameter, and ε, a random variable. A parameter is a constant that helps to describe a population. Parameters indicate how a model is an instance of a family of models for similar populations. A random variable varies like the tossing of a coin.

There are two more parameters associated with the random variable ε but not appearing in the model equation. One is its mean (µε), which we have rigged to be 0, and the other is its standard deviation (σε, or simply σ). Because a is simply the sum of µ (a constant) and ε (a random variable), its standard deviation is also σ.

In specifying this model, we assume the following:

- The model is true for every member of the population.

- The error, plus or minus, that helps determine one population member's age is independent of (not predictable from) the error for other members.

- The errors in predicting all of the ages come from the same random distribution with a mean of 0 and a standard deviation of σ.

Estimating a Model

Because we have not sampled the entire population, we cannot compute the parameter values directly from the data. We have only a small sample from a much larger population, so we can estimate the parameter values only by using some statistical method on our sample data. When our three assumptions are appropriate, the sample mean will be a good estimate of the population mean. Without going into all of the details, the sample estimate will be, on average, close to the values of the mean in the population.



We can use various methods in SYSTAT to estimate the mean. One way is to specify our model using Linear Regression. Select AGE and add it to the Dependent list. With commands:

REGRESSION
MODEL AGE=CONSTANT

This model says that AGE is a function of a constant value (µ). The rest is error (ε). Another method is to compute the mean from the Basic Statistics routines. The result is shown below:

AGE

N OF CASES        50
MEAN          56.700
STANDARD DEV  11.620
STD. ERROR     1.643

Our best estimate of the mean age of people in Who’s Who is 56.7 years.
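The same estimate is easy to reproduce directly. In this Python sketch (illustration only), ages is assumed to hold the 50 sampled values tabled earlier:

from statistics import mean, stdev

def summarize(ages):
    se = stdev(ages) / len(ages) ** 0.5   # standard error of the mean
    return round(mean(ages), 3), round(stdev(ages), 3), round(se, 3)

# summarize(ages) -> (56.7, 11.62, 1.643), matching the output above.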

Confidence Intervals

Our estimate seems reasonable, but it is not exactly correct. If we took more samples of size 50 and computed estimates, how much would we expect them to vary? First, it should be plain without any mathematics to see that the larger our sample, the closer will be our sample estimate to the true value of µ in the population. After all, if we could sample the entire population, the estimates would be the true values. Even so, the variation in sample estimates is a function only of the sample size and the variation of the ages in the population. It does not depend on the size of the population (number of people in the book). Specifically, the standard deviation of the sample mean is the standard deviation of the population divided by the square root of the sample size. This standard error of the mean is listed on the output above as 1.643. On average, we would expect our sample estimates of the mean age to vary by plus or minus a little more than one and a half years, assuming samples of size 50.

If we knew the shape of the sampling distribution of mean age, we would be able to complete our description of the accuracy of our estimate. There is an approximation that works quite well, however. If the sample size is reasonably large (say, greater than 25), then the mean of a simple random sample is approximately normally distributed. This is true even if the population distribution is not normal, provided the sample size is large.



We now have enough information from our sample to construct a normal approximation of the distribution of our sample mean. The following figure shows this approximation to be centered at the sample estimate of 56.7 years. Its standard deviation is taken from the standard error of the mean, 1.643 years.

[Figure: normal approximation to the sampling distribution of the mean, centered at 56.7; density plotted against mean age from 50 to 65]

We have drawn the graph so that the central area comprises 95% of all the area under the curve (from about 53.5 to 59.9). From this normal approximation, we have built a 95% symmetric confidence interval that gives us a specific idea of the variability of our estimate. If we did this entire procedure again—sample 50 names, compute the mean and its standard error, and construct a 95% confidence interval using the normal approximation—then we would expect that 95 intervals out of a hundred so constructed would cover the real population mean age. Remember, population mean age is not necessarily at the center of the interval that we just constructed, but we do expect the interval to be close to it.
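Numerically, the interval comes straight from the normal approximation; 1.96 is the two-sided 95% point of the standard normal distribution (a sketch using the estimates above):

mean_age, se = 56.7, 1.643
lower, upper = mean_age - 1.96 * se, mean_age + 1.96 * se
print(f"95% CI: {lower:.1f} to {upper:.1f}")   # about 53.5 to 59.9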

Hypothesis Testing

From the sample mean and its standard error, we can also construct hypothesis tests on the mean. Suppose that someone believed that the average age of those listed in Who’s Who is 61 years. After all, we might have picked an unusual sample just through the luck of the draw. Let’s say, for argument, that the population mean age is 61 and the standard deviation is 11.62. How likely would it be to find a sample mean age of 56.7? If it is very unlikely, then we would reject this null hypothesis that the population mean is 61. Otherwise, we would fail to reject it.



There are several ways to represent an alternative hypothesis against this null hypothesis. We could make a simple alternative value of 56.7 years. Usually, however, we make the alternative composite—that is, it represents a range of possibilities that do not include the value 61. Here is how it would look:

H0: µ = 61 (null hypothesis)

HA: µ ≠ 61 (alternative hypothesis)

We would reject the null hypothesis if our sample value for the mean were outside of a set of values that a population value of 61 could plausibly generate. In this context, “plausible” means more probable than a conventionally agreed upon critical level for our test. This value is usually 0.05. A result that would be expected to occur fewer than five times in a hundred samples is considered significant and would be a basis for rejecting our null hypothesis.

Constructing this hypothesis test is mathematically equivalent to sliding the normal distribution in the above figure to center over 61. We then look at the sample value 56.7 to see if it is outside of the middle 95% of the area under the curve. If so, we reject the null hypothesis.

[Figure: the same normal curve recentered at 61, with the sample value 56.7 marked against the middle 95% of the area]

The following t test output shows a p value (probability) of 0.012 for this test. Because this value is lower than 0.05, we would reject the null hypothesis that the mean age is 61. This is equivalent to saying that the value of 61 does not appear in the 95% confidence interval.

One-sample t test of AGE with 50 cases; Ho: Mean = 61.000

   Mean =  56.700     95.00% CI = 53.398 to 60.002
     SD =  11.620              t = -2.617
     df =  49               Prob =  0.012
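The t statistic itself is a one-line computation, sketched below; given the raw ages, scipy.stats.ttest_1samp(ages, 61) carries out the same two-sided test:

t = (56.7 - 61.0) / 1.643   # (sample mean - hypothesized mean) / SE
print(round(t, 3))          # -2.617, df = 49, two-sided p = 0.012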



The mathematical duality between confidence intervals and hypothesis testing may lead you to wonder which is more useful. The answer is that it depends on the context. Scientific journals usually follow a hypothesis testing model because their null hypothesis value for an experiment is usually 0 and the scientist is attempting to reject the hypothesis that nothing happened in the experiment. Any rejection is usually taken to be interesting, even when the sample size is so large that even tiny differences from 0 will be detected.

Those involved in making decisions—epidemiologists, business people, engineers—are often more interested in confidence intervals. They focus on the size and credibility of an effect and care less whether it can be distinguished from 0. Some statisticians, called Bayesians, go a step further and consider statistical decisions as a form of betting. They use sample information to modify prior hypotheses. See Box and Tiao (1973) or Berger (1985) for further information on Bayesian statistics.

Checking Assumptions

Now that we have finished our analyses, we should check some of the assumptions we made in doing them. First, we should examine whether the data look normally distributed. Although sample means will tend to be normally distributed even when the population isn’t, it helps to have a normally distributed population, especially when we do not know the population standard deviation. The stem-and-leaf plot gives us a quick idea:


Stem and leaf plot of variable: AGE, N = 50

Minimum:     34.000
Lower hinge: 49.000
Median:      56.000
Upper hinge: 66.000
Maximum:     81.000

   3   4
   3   689
   4   14
   4 H 556778999
   5   0011112
   5 M 556688889
   6   0023
   6 H 55677789
   7   04
   7   5668
   8   1


There is another plot, called a dot histogram (dit) plot, which looks like a stem-and-leaf plot. We can use different symbols to denote males and females in this plot, however, to see if there are differences in these subgroups. Although there are not enough females in the sample to be sure of a difference, it is nevertheless a good idea to examine it. The dot histogram reveals four of the six females to be younger than everyone else.

A better test of normality is to plot the sorted age values against the corresponding values of a mathematical normal distribution. This is called a normal probability plot. If the data are normally distributed, then the plotted values should fall approximately on a straight line. Our data plot fairly straight. Again, different symbols are used for the males and females. The four young females appear in the bottom left corner of the plot.

Does this possible difference in ages by gender invalidate our results? No, but it suggests that we might want to examine the gender differences further to see whether or not they are significant.


References

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. 2nd ed. New York: Springer Verlag.

Box, G. E. P. and Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, Mass.: Addison-Wesley.


Chapter 2
Bootstrapping and Sampling

Leland Wilkinson and Laszlo Engelman

Bootstrapping is not a module in SYSTAT. It is a procedure available in most modules where appropriate. Bootstrapping is so important as a general statistical methodology, however, that it deserves a separate chapter. SYSTAT handles bootstrapping as a single option to the ESTIMATE command or its equivalent in each module. The computations are handled without producing a scratch file of the bootstrapped samples. This saves disk space and computer time. Bootstrap, jackknife, and other samples are simply computed “on-the-fly.”

Statistical Background

Bootstrap (Efron and Tibshirani, 1993) is the most recent and most powerful of a variety of strategies for producing estimates of parameters in samples taken from unknown probability distributions. Efron and LePage (1992) summarize the problem most succinctly. We have a set of real-valued observations $x_1, \ldots, x_n$ independently sampled from an unknown probability distribution $F$. We are interested in estimating some parameter $\theta$ by using the information in the sample data with an estimator $\hat{\theta} = t(x)$. Some measure of the estimate's accuracy is as important as the estimate itself; we want a standard error of $\hat{\theta}$ and, even better, a confidence interval on the true value $\theta$.


Classical statistical methods provide a powerful way of handling this problem when $F$ is known and $\theta$ is simple—when $\theta$, for example, is the mean of the normal distribution. Focusing on the standard error of the mean, we have:

$$se\{\bar{x}; F\} = \sqrt{\frac{\sigma^2(F)}{n}}$$

Substituting the unbiased estimate for $\sigma^2(F)$,

$$\hat{\sigma}^2(F) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

we have:

$$se(\bar{x}) = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n(n - 1)}}$$

Parametric methods often work fairly well even when the distribution is contaminated or only approximately known because the central limit theorem shows that sums of independent random variables with finite variances tend to be normal in large samples even when the variables themselves are not normal. But problems arise for estimates more complicated than a mean—medians, sample correlation coefficients, or eigenvalues—especially in small or medium-sized samples and even, in some cases, in large samples.

Strategies for approaching this problem “nonparametrically” have involved using the empirical distribution $\hat{F}$ to obtain information needed for the standard error estimate. One approach is Tukey's jackknife (Tukey, 1958), which is offered in SAMPLE=JACKKNIFE. Tukey proposed computing $n$ subsets of $(x_1, \ldots, x_n)$, each consisting of all of the cases except the $i$th deleted case (for $i = 1, \ldots, n$). He produced standard errors as a function of the $n$ estimates from these subsets.

Another approach has involved subsampling, usually via simple random samples. This option is offered in SAMPLE=SIMPLE. A variety of researchers in the 1950's and 1960's explored these methods empirically (for example, Block, 1960; see Noreen, 1989, for others).


This method amounts to a Monte Carlo study in which the sample is treated as the population. It is also closely related to methodology for permutation tests (Fisher, 1935; Dwass, 1957; Edgington, 1980).

The bootstrap (Efron, 1979) has been the focus of most recent theoretical research. $\hat{F}$ is defined as:

$$\hat{F}: \text{probability } 1/n \text{ on } x_i \text{ for } i = 1, 2, \ldots, n$$

Then, since

$$\sigma^2(\hat{F}) = \overline{(x - \bar{x})^2}$$

we have:

$$se\{\bar{x}; \hat{F}\} = \sqrt{\frac{\overline{(x - \bar{x})^2}}{n}}$$

The computer algorithm for getting the samples for generating $\hat{F}$ is to sample from $(x_1, \ldots, x_n)$ with replacement. Efron and other researchers have shown that the general procedure of generating samples and computing estimates $\hat{\theta}$ yields “$\hat{\theta}$ data” on which we can make useful inferences. For example, instead of computing only $\hat{\theta}$ and its standard error, we can do histograms, densities, order statistics (for symmetric and asymmetric confidence intervals), and other computations on our estimates. In other words, there is much to learn from the bootstrap sample distributions of the estimates themselves.
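The resampling loop at the heart of this procedure is easy to express outside SYSTAT. Here is a minimal Python sketch of the naive bootstrap standard error (an illustration of the idea only; SYSTAT computes its samples internally, without a scratch file):

import random
import statistics

def bootstrap_se(data, estimator, reps=1000, seed=13579):
    # Draw `reps` samples of size n with replacement, apply the
    # estimator to each, and return the SD of the replicate estimates.
    rng = random.Random(seed)
    n = len(data)
    estimates = [estimator([rng.choice(data) for _ in range(n)])
                 for _ in range(reps)]
    return statistics.stdev(estimates)

# Example: bootstrap standard error of a sample median
x = [34, 41, 44, 49, 50, 56, 58, 63, 66, 75]
print(bootstrap_se(x, statistics.median))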

There are some concerns, however. The naive bootstrap computed this way (with SAMPLE=BOOT and STATS for computing means and standard deviations) is not especially good for long-tailed distributions. It is also not suited for time-series or stochastic data. See LePage and Billard (1992) for recent research on and solutions to some of these problems. There are also several simple improvements to the naive bootstrap. One is the pivot, or bootstrap-t method, discussed in Efron and Tibshirani (1993). This is especially useful for confidence intervals on the mean of an unknown distribution. Efron (1982) discusses other applications. There are also refinements based on correction for bias in the bootstrap sample itself (DiCiccio and Efron, 1996).

In general, however, the naive bootstrap can help you get better estimates of standard errors and confidence intervals than many large-sample approximations, such as Fisher's z transformation for Pearson correlations or Wald tests for coefficients in nonlinear models. And in cases in which no good approximations are available (see some of the examples below), the bootstrap is the only way to go.


Bootstrapping in SYSTAT

Bootstrap Main Dialog Box

No dialog box exists for performing bootstrapping; therefore, you must use SYSTAT’s command language. To do a bootstrap analysis, simply add the sample type to the command that initiates model estimation (usually ESTIMATE).

Using Commands

The syntax is:

ESTIMATE / SAMPLE=BOOT(m,n) SIMPLE(m,n) JACK

The arguments m and n stand for the number of samples and the sample size of each sample. The parameter n is optional and defaults to the number of cases in the file.

The BOOT option generates samples with replacement, SIMPLE generates samples without replacement, and JACK generates a jackknife set.
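Schematically, the three options correspond to three resampling schemes. The following Python sketch illustrates the schemes themselves, not SYSTAT's internal code:

import random

def boot(data, m, n, seed=1):
    # BOOT(m,n): m samples of size n drawn with replacement
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(n)] for _ in range(m)]

def simple(data, m, n, seed=1):
    # SIMPLE(m,n): m samples of size n drawn without replacement
    rng = random.Random(seed)
    return [rng.sample(data, n) for _ in range(m)]

def jack(data):
    # JACK: the n leave-one-out subsets
    return [data[:i] + data[i + 1:] for i in range(len(data))]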

Usage Considerations

Types of data. Bootstrapping works on procedures with rectangular data only.

Print options. It is best to set PRINT=NONE; otherwise, you will get 16 miles of output. If you want to watch, however, set PRINT=LONG and have some fun.

Quick Graphs. Bootstrapping produces no Quick Graphs. You use the file of bootstrap estimates and produce the graphs you want. See the examples.

Saving files. If you are doing this for more than entertainment (watching output fly by), save your data into a file before you use the ESTIMATE / SAMPLE command. See the examples.

BY groups. By all means. Are you a masochist?


Case frequencies. Yes, FREQ=<variable> works. This feature does not use extra memory.

Case weights. Use case weighting if it is available in a specific module.

Examples

A few examples will serve to illustrate bootstrapping. They cover only a few of the statistical modules, however. We will focus on the tools you can use to manipulate output and get the summary statistics you need for bootstrap estimates.

Example 1 Linear Models

This example involves the famous Longley (1967) regression data. These real data were collected by James Longley at the Bureau of Labor Statistics to test the limits of regression software. The predictor variables in the data set are highly collinear, and several coefficients of variation are extremely large. The input is:

USE LONGLEY
GLM
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
SAVE BOOT / COEF
ESTIMATE / SAMPLE=BOOT(2500,16)
OUTPUT TEXT1
USE LONGLEY
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
ESTIMATE
USE BOOT
STATS
STATS X(1..6)
OUTPUT *
BEGIN
DEN X(1..6) / NORM
DEN X(1..6)
END

Notice that we save the coefficients into the file BOOT. We request 2500 bootstrap samples of size 16 (the number of cases in the file). Then we fit the Longley data with a single regression to compare the result to our bootstrap. Finally, we use the bootstrap file and compute basic statistics on the bootstrap estimated regression coefficients. The OUTPUT command is used to save this part of the output to a file. We should not use it earlier in the program unless we want to save the output for the 2500 regressions. To view the bootstrap distributions, we create histograms of the coefficients.


The resulting output is:

Variables in the SYSTAT Rectangular file are:
   DEFLATOR  GNP  UNEMPLOY  ARMFORCE  POPULATN  TIME  TOTAL

Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995
Adjusted squared multiple R: 0.992   Standard error of estimate: 304.854

Effect        Coefficient    Std Error  Std Coef  Tolerance       t  P(2 Tail)
CONSTANT    -3482258.635   890420.384     0.0        .        -3.911     0.004
DEFLATOR          15.062       84.915     0.046     0.007      0.177     0.863
GNP               -0.036        0.033    -1.014     0.001     -1.070     0.313
UNEMPLOY          -2.020        0.488    -0.538     0.030     -4.136     0.003
ARMFORCE          -1.033        0.214    -0.205     0.279     -4.822     0.001
POPULATN          -0.051        0.226    -0.101     0.003     -0.226     0.826
TIME            1829.151      455.478     2.480     0.001      4.016     0.003

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression      1.84172E+08    6   3.06954E+07   330.285   0.000
Residual         836424.056    9     92936.006

Durbin-Watson D Statistic      2.559
First Order Autocorrelation   -0.348

Variables in the SYSTAT Rectangular file are:
   CONSTANT  X(1..6)

                  X(1)     X(2)      X(3)     X(4)      X(5)       X(6)
N of cases        2500     2500      2500     2500      2500       2499
Minimum       -816.248   -0.846   -12.994   -8.864    -2.591  -5050.438
Maximum       1312.052    0.496     7.330    2.617  3142.235  12645.703
Mean            20.648   -0.049    -2.214   -1.118     1.295   1980.382
Standard Dev   128.301    0.064     0.903    0.480    62.845    980.870

Page 43: Statistics I

I-23

Bootstrapping and Sampling

Following is the plot of the results:

[Figure: histograms of the 2,500 bootstrap estimates of the six coefficients X(1) through X(6), drawn with and without superimposed normal curves; vertical axes show counts and proportion per bar.]

The bootstrapped standard errors are all larger than the normal-theory standard errors. The most dramatically different are the ones for the POPULATN coefficient (62.845 versus 0.226). It is well known that multicollinearity leads to large standard errors for regression coefficients, but the bootstrap makes this even clearer.

Normal curves have been superimposed on the histograms, showing that the coefficients are not normally distributed. We have run a relatively large number of samples (2500) to reveal these long-tailed distributions. Were these data to be analyzed formally, it would take a huge number of samples to get useful standard errors.

Beaton, Rubin, and Barone (1976) used a randomization technique to highlight this problem. They added a uniform random extra digit to Longley's data so that their data sets rounded to Longley's values, and they found in a simulation that the variance of the simulated coefficient estimates was, in many cases, larger than the error in the miscalculated solutions produced by the more poorly designed regression programs.



Example 2 Spearman Rank Correlation

This example involves law school data from Efron and Tibshirani (1993). They use these data to illustrate the usefulness of the bootstrap for calculating standard errors on the Pearson correlation. Here, we make similar calculations to obtain a 95% confidence interval on the Spearman correlation.

The bootstrap estimates are saved into a temporary file. The file format is CORRELATION, meaning that 1000 correlation matrices will be saved, stacked on top of each other in the file. Consequently, we need BASIC to sift through and delete every odd line (the diagonal of the matrix). We also have to remember to change the file type to RECTANGULAR so that we can sort and do other things later. Another approach would have been to use the rectangular form of the correlation output:

SPEARMAN LSAT*GPA

Next, we reuse the new file and sort the correlations. Finally, we print the nearest values to the percentiles. Following is the input:

CORR
GRAPH NONE
USE LAW
RSEED=54321
SAVE TEMP
SPEARMAN LSAT GPA / SAMPLE=BOOT(1000,15)
BASIC
USE TEMP
TYPE=RECTANGULAR
IF CASE<>2*INT(CASE/2) THEN DELETE
SAVE BLAW
RUN
USE BLAW
SORT LSAT
IF CASE=975 THEN PRINT "95% CI Upper:",LSAT
IF CASE=25 THEN PRINT "95% CI Lower:",LSAT
OUTPUT TEXT2
RUN
OUTPUT *
DENSITY LSAT

Following is the output, our asymmetric confidence interval:


95% CI Lower: 0.476
95% CI Upper: 0.953
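The same sort-and-pick logic works for any statistic saved from a bootstrap run. A small Python equivalent of the percentile interval computed above (a sketch with hypothetical helper names, not part of SYSTAT):

def percentile_ci(estimates, alpha=0.05):
    # Naive percentile interval: sort the replicate estimates and read
    # off the order statistics nearest the alpha/2 and 1 - alpha/2 points.
    s = sorted(estimates)
    n = len(s)
    lower = s[int(n * alpha / 2) - 1]        # the 25th of 1000 sorted values
    upper = s[int(n * (1 - alpha / 2)) - 1]  # the 975th of 1000
    return lower, upper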


The histogram of the entire file shows the overall shape of the distribution. Notice its asymmetry.

[Figure: histogram of the 1,000 bootstrap Spearman correlations of LSAT with GPA, showing a long lower tail.]

Example 3 Confidence Interval on a Median

We will use the STATS module to compute a 95% confidence interval on the median (Efron, 1979). The input is:

STATS
GRAPH NONE
USE OURWORLD
SAVE TEMP
STATS LIFE_EXP / MEDIAN,SAMPLE=BOOT(1000,57)
BASIC
USE TEMP
SAVE TEMP2
IF STATISTC$<>"Median" THEN DELETE
RUN
USE TEMP2
SORT LIFE_EXP
IF CASE=975 THEN PRINT "95% CI Upper:",LIFE_EXP
IF CASE=25 THEN PRINT "95% CI Lower:",LIFE_EXP
OUTPUT TEXT3
RUN
OUTPUT *
DENSITY LIFE_EXP


Following is the output:

95% CI Lower: 63.000
95% CI Upper: 71.000

Following is the histogram of the bootstrap sample medians:

[Figure: histogram of the 1,000 bootstrap sample medians of LIFE_EXP (roughly 50 to 80 years), with count and proportion-per-bar axes.]

Keep in mind that we are using the naive bootstrap method here, trusting the unmodified distribution of the bootstrap sample to set percentiles. Looking at the bootstrap histogram, we can see that the distribution is skewed and irregular. There are improvements that can be made in these estimates. Also, we have to be careful about how we interpret a confidence interval on a median.

Example 4 Canonical Correlations: Using Text Output

Most statistics can be bootstrapped by saving into SYSTAT files, as shown in the examples. Sometimes you may want to search through bootstrap output for a single number and compute standard errors or graphs for that statistic. The following example uses SETCOR to compute the distribution of the two canonical correlations relating the species to measurements in the Fisher Iris data. The same correlations are computed in the DISCRIM procedure. Following is the input:

SETCOR
USE IRIS
MODEL SPECIES=SEPALLEN..PETALWID
CATEGORY SPECIES
OUTPUT TEMP
ESTIMATE / SAMPLE=BOOT(500,150)
OUTPUT *
BASIC
GET TEMP
INPUT A$,B$
LET R1=.
LET R2=.
LET FOUND=.
IF A$='Canonical' AND B$='correlations' ,
   THEN LET FOUND=CASE
IF LAG(FOUND,2)<>. THEN FOR
   LET R1=VAL(A$)
   LET R2=VAL(B$)
NEXT
IF R1=. AND R2=. THEN DELETE
SAVE CC
RUN
EXIT
USE CC
DENSITY R1 R2 / DIT


Notice how the BASIC program searches through the output file TEMP.DAT for the words Canonical correlations at the beginning of a line. Two lines later, the actual numbers are in the output, so we use the LAG function to check when we are at that point after having located the string. Then we convert the printed values back to numbers with the VAL() function. If you are concerned with precision, use a larger format for the output. Finally, we delete unwanted rows and save the results into the file CC. From that file, we plot the two canonical correlations. For fun, we do a dot histogram (dit) plot.


Following is the graph:

Notice the stripes in the plot on the left. These reveal the three-digit rounding we incurred by using the standard FORMAT=3.

Computation

Computations are done by the respective statistical modules. Sampling is done on the data.

Algorithms

Bootstrapping and other sampling is implemented via a one-pass algorithm that does not use extra storage for the data. Samples are generated using the SYSTAT uniform random number generator. It is always a good idea to reset the seed when running a problem so that you can be certain where the random number generator started if it becomes necessary to replicate your results.

Missing Data

Cases with missing data are handled by the specific module.


References

Beaton, A. E., Rubin, D. B., and Barone, J. L. (1976). The acceptability of regression solutions: Another look at computational accuracy. Journal of the American Statistical Association, 71, 158–168.

Block, J. (1960). On the number of significant findings to be expected by chance. Psychometrika, 25, 369–380.

DiCiccio, T. J. and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11, 189–228.

Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics, 28, 181–187.

Edgington, E. S. (1980). Randomization tests. New York: Marcel Dekker.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1–26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Vol. 38 of CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, Penn.: SIAM.

Efron, B. and LePage, R. (1992). Introduction to bootstrap. In R. LePage and L. Billard (eds.), Exploring the Limits of Bootstrap. New York: John Wiley & Sons, Inc.

Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.

Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.

Longley, J. W. (1967). An appraisal of least squares for the electronic computer from the point of view of the user. Journal of the American Statistical Association, 62, 819–841.

Noreen, E. W. (1989). Computer intensive methods for testing hypotheses: An introduction. New York: John Wiley & Sons, Inc.

Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathematical Statistics, 29, 614.


Chapter 3
Classification and Regression Trees

Leland Wilkinson

The TREES module computes classification and regression trees. Classification trees include those models in which the dependent variable (the predicted variable) is categorical. Regression trees include those in which it is continuous. Within these types of trees, the TREES module can use categorical or continuous predictors, depending on whether a CATEGORY statement includes some or all of the predictors.

For any of the models, a variety of loss functions is available. Each loss function is expressed in terms of a goodness-of-fit statistic—the proportion of reduction in error (PRE). For regression trees, this statistic is equivalent to the multiple $R^2$. Other loss functions include the Gini index, “twoing” (Breiman et al., 1984), and the phi coefficient.

TREES produces graphical trees called mobiles (Wilkinson, 1995). At the end of each branch is a density display (box plot, dot plot, histogram, etc.) showing the distribution of observations at that point. The branches balance (like a Calder mobile) at each node so that the branch is level, given the number of observations at each end. The physical analogy is most obvious for dot plots, in which the stacks of dots (one for each observation) balance like marbles in bins.

TREES can also produce a SYSTAT BASIC program to code new observations and predict the dependent variable. This program can be saved to a file and run from the command window or submitted as a program file.

Statistical Background

Trees are directed graphs beginning with one node and branching to many. They are fundamental to computer science (data structures), biology (classification), psychology (decision theory), and many other fields. Classification and regression trees are used for prediction. In the last two decades, they have become popular as alternatives to regression, discriminant analysis, and other procedures based on algebraic models. Tree-fitting methods have become so popular that several commercial programs now compete for the attention of market researchers and others looking for software.


Different commercial programs produce different results with the same data, however. Worse, some programs provide no documentation or supporting materials to explain their algorithms. The result is a marketplace of competing claims, jargon, and misrepresentation. Reviews of these packages (for example, Levine, 1991; Simon, 1991) use words like “sorcerer,” “magic formula,” and “wizardry” to describe the algorithms and express frustration at vendors’ scant documentation. Some vendors, in turn, have represented tree programs as state-of-the-art “artificial intelligence” procedures capable of discovering hidden relationships and structures in databases.

Despite the marketing hyperbole, most of the now-popular tree-fitting algorithms have been around for decades. The modern commercial packages are mainly microcomputer ports (with attractive interfaces) of the mainframe programs that originally implemented these algorithms. Warnings of abuse of these techniques are not new either (for example, Einhorn, 1972; Bishop, Fienberg, and Holland, 1975). Originally proposed as automatic procedures for detecting interactions among variables, tree-fitting methods are actually closely related to classical cluster analysis (Hartigan, 1975).

This introduction will attempt to sort out some of the differences between algorithms and illustrate their use on real data. In addition, tree analyses will be compared to discriminant analysis and regression.

The Basic Tree Model

The figure below shows a tree for predicting decisions by a medical school admissions committee (Milstein et al., 1975). It was based on data for a sample of 727 applicants. We selected a tree procedure for this analysis because it was easy to present the results to the Yale Medical School admissions committee and because the tree model could serve as a basis for structuring their discussions about admissions policy.

[Figure: binary tree for the medical school admissions data (n = 727). The first split is on GRADE POINT AVERAGE at 3.47; lower nodes split on MCAT VERBAL and MCAT QUANTITATIVE; terminal nodes are labeled REJECT or INTERVIEW, with counts of misclassified cases in parentheses.]

Notice that the values of the predicted variable (the committee’s decision to reject or interview) are at the bottom of the tree and the predictors (Medical College Admissions Test and college grade point average) come into the system at each node of the tree.


The top node contains the entire sample. Each remaining node contains a subset of the sample in the node directly above it. Furthermore, each node contains the sum of the samples in the nodes connected to and directly below it. The tree thus splits samples.

Each node can be thought of as a cluster of objects, or cases, that is to be split by further branches in the tree. The numbers in parentheses below the terminal nodes show how many cases are incorrectly classified by the tree. A similar tree data structure is used for representing the results of single and complete linkage and other forms of hierarchical cluster analysis (Hartigan, 1975). Tree prediction models add two ingredients: the predictor and predicted variables labeling the nodes and branches.

The tree is binary because each node is split into only two subsamples. Classification or regression trees do not have to be binary, but most are. Despite the marketing claims of some vendors, nonbinary, or multibranch, trees are not superior to binary trees. Each is a permutation of the other, as shown in the figure below.

The tree on the left (ternary) is not more parsimonious than that on the right (binary). Both trees have the same number of parameters, or split points, and any statistics associated with the tree on the left can be converted trivially to fit the one on the right. A computer program for scoring either tree (IF ... THEN ... ELSE) would look identical. For display purposes, it is often convenient to collapse binary trees into multibranch trees, but this is not necessary.



Some programs that do multibranch splits do not allow further splitting on a predictor once it has been used. This has an appealing simplicity. However, it can lead to unparsimonious trees. It is unnecessary to make this restriction before fitting a tree.

The figure below shows an example of this problem. The upper right tree classifies objects on an attribute by splitting once on shape, once on fill, and again on shape. This allows the algorithm to separate the objects into only four terminal nodes having common values. The upper left tree splits on shape and then only on fill. By not allowing any other splits on shape, the tree requires five terminal nodes to classify correctly. This problem cannot be solved by splitting first on fill, as the lower left tree shows. In general, restricting splits to only one branch for each predictor results in more terminal nodes.

[Figure: trees classifying the same objects by shape and fill. Splitting on shape, then fill, then shape again separates the objects into four terminal nodes; trees restricted to one split per predictor need five terminal nodes, whether shape or fill is split first.]


Categorical or Quantitative Predictors

The predictor variables in the figure on p. 33 are quantitative, so splits are created by determining cut points on a scale. If predictor variables are categorical, as in the figure above, splits are made between categorical values. It is not necessary to categorize predictors before computing trees. This is as dubious a practice as recoding data well-suited for regression into categories in order to use chi-square tests. Those who recommend this practice are turning silk purses into sows’ ears. In fact, if variables are categorized before doing tree computations, then poorer fits are likely to result. Algorithms are available for mixed quantitative and categorical predictors, analogous to analysis of covariance.

Regression Trees

Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a quantitative variable. They called the method Automatic Interaction Detection (AID). The algorithm performs stepwise splitting. It begins with a single cluster of cases and searches a candidate set of predictor variables for a way to split the cluster into two clusters. Each predictor is tested for splitting as follows: sort all the $n$ cases on the predictor and examine all $n - 1$ ways to split the cluster in two. For each possible split, compute the within-cluster sum of squares about the mean of the cluster on the dependent variable. Choose the best of the $n - 1$ splits to represent the predictor's contribution. Now do this for every other predictor. For the actual split, choose the predictor and its cut point that yields the smallest overall within-cluster sum of squares.
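In outline, the least-squares split search for one quantitative predictor takes only a few lines. The Python sketch below illustrates the criterion just described (it is not the TREES implementation):

def sse(values):
    # Within-group sum of squares about the group mean
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    # Sort cases on predictor x and evaluate all n - 1 cut points,
    # keeping the one with the smallest total within-group SSE.
    pairs = sorted(zip(x, y))
    best_cut, best_loss = None, float("inf")
    for i in range(1, len(pairs)):
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        loss = sse(left) + sse(right)
        if loss < best_loss:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut
            best_loss = loss
    return best_cut, best_loss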

Categorical predictors require a different approach. Since categories are unordered, all possible splits between categories must be considered. For deciding on one split of $k$ categories into two groups, this means that $2^{k-1} - 1$ possible splits must be considered. Once a split is found, its suitability is measured on the same within-cluster sum of squares as for a quantitative predictor.
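Enumerating the candidate splits for an unordered categorical predictor is a set-partition problem. A Python sketch (the function name is ours, for illustration):

from itertools import combinations

def category_splits(categories):
    # Yield each way to divide k unordered categories into two nonempty
    # groups exactly once: 2**(k - 1) - 1 splits in all.
    cats = list(categories)
    first = cats[0]
    for r in range(1, len(cats)):
        for group in combinations(cats, r):
            if first in group:  # anchor one category to skip mirror images
                yield set(group), set(cats) - set(group)

print(sum(1 for _ in category_splits("abcd")))  # 7 splits for k = 4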

Morgan and Sonquist called their algorithm AID because it naturally incorporates interaction among predictors. Interaction is not correlation. It has to do, instead, with conditional discrepancies. In the analysis of variance, interaction means that a trend within one level of a variable is not parallel to a trend within another level of the same variable. In the ANOVA model, interaction is represented by cross-products between predictors. In the tree model, it is represented by branches from the same node that have different splitting predictors further down the tree.


The figure below shows a tree without interactions on the left and with interactions on the right. Because interaction trees are a natural by-product of the AID splitting algorithm, Morgan and Sonquist called the procedure “automatic.” In fact, AID trees without interactions are quite rare for real data, so the procedure is indeed automatic. To search for interactions using stepwise regression or ANOVA linear modeling, we would have to generate $2^p$ interactions among $p$ predictors and compute partial correlations for every one of them in order to decide which ones to include in our formal model.

[Figure: left, a tree without interactions (both second-level nodes split on the same variable); right, a tree with interactions (branches from the same node split on different variables further down).]

Classification Trees

Regression trees parallel regression/ANOVA modeling, in which the dependent variable is quantitative. Classification trees parallel discriminant analysis and algebraic classification methods. Kass (1980) proposed a modification to AID called CHAID for categorized dependent and independent variables. His algorithm incorporated a sequential merge-and-split procedure based on a chi-square test statistic. Kass was concerned about computation time (although this has since proved an unnecessary worry), so he decided to settle for a suboptimal split on each predictor instead of searching for all possible combinations of the categories. Kass’s algorithm is like sequential crosstabulation. For each predictor:

• Crosstabulate the $m$ categories of the predictor with the $k$ categories of the dependent variable.

• Find the pair of categories of the predictor whose $2 \times k$ subtable is least significantly different on a chi-square test and merge these two categories.

• If the chi-square test statistic is not “significant” according to a preset critical value, repeat this merging process for the selected predictor until no nonsignificant chi-square is found for a subtable.

• Choose the predictor variable whose chi-square is the largest and split the sample into $l \leq m$ subsets, where $l$ is the number of categories resulting from the merging process on that predictor.

• Continue splitting, as with AID, until no significant chi-squares result.
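The heart of the merge step is a search for the least-different pair of predictor categories. A rough Python sketch of that one step, using scipy's chi-square test (an illustration of the idea, not Kass's exact procedure):

from scipy.stats import chi2_contingency

def least_different_pair(rows, labels):
    # Find the pair of predictor categories whose 2 x k subtable is
    # least significantly different on a chi-square test.
    best = None
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            chi2, p, dof, expected = chi2_contingency([rows[i], rows[j]])
            if best is None or p > best[0]:
                best = (p, labels[i], labels[j])
    return best  # (p value, category, category); merge if p is large

# Example: three predictor categories crossed with two outcome classes
print(least_different_pair([[20, 30], [22, 28], [40, 10]], ["a", "b", "c"]))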



The CHAID algorithm saves computer time, but it is not guaranteed to find the splits that predict best at a given step. Only by searching all possible category subsets can we do that. CHAID is also limited to categorical predictors, so it cannot be used for quantitative or mixed categorical-quantitative models, as in the figure on p. 33. Nevertheless, it is an effective way to search heuristically through rather large tables quickly.

Note: Within the computer science community, there is a categorical splitting literature that often does not cite the statistical work and is, in turn, not frequently cited by statisticians (although this has changed in recent years). Quinlan (1986, 1992), the best known of these researchers, developed a set of algorithms based on information theory. These methods, called ID3, iteratively build decision trees based on training samples of attributes.

Stopping Rules, Pruning, and Cross-Validation

AID, CHAID, and other forward-sequential tree-fitting methods share a problem with other tree-clustering methods—where do we stop? If we keep splitting, a tree will end up with only one case, or object, at each terminal node. We need a method for producing a smaller tree other than the exhaustive one. One way is to use stepwise statistical tests, as in the F-to-enter or alpha-to-enter rule for forward stepwise regression. We compute a test statistic (chi-square, F, etc.), choose a critical level for the test (sometimes modifying it with the Bonferroni inequality), and stop splitting any branch that fails to meet the test (see Wilkinson, 1979, for a review of this procedure in forward selection regression).

Breiman et al. (1984) showed that this method tends to yield trees with too many branches and can also fail to pursue branches that can add significantly to the overall fit. They advocate, instead, pruning the tree. After computing an exhaustive tree, their program eliminates nodes that do not contribute to the overall prediction. They add another essential ingredient, however—the cost of complexity. This measure is similar to other cost statistics, such as Mallows' $C_p$ (Neter, Wasserman, and Kutner, 1985), which add a penalty for increasing the number of parameters in a model. Breiman's method is not like backward elimination stepwise regression. It resembles forward stepwise regression with a cutting back on the final number of steps using a different criterion than the F-to-enter. This method still cannot do as well as an exhaustive search, which would be prohibitive for most practical problems.

Regardless of how a tree is pruned, it is important to cross-validate it. As with stepwise regression, the prediction error for a tree applied to a new sample can be considerably higher than for the training sample on which it was constructed. Whenever possible, data should be reserved for cross-validation.


Loss Functions

Different loss functions are appropriate for different forms of data. TREES offers a variety of functions that are scaled as proportional reduction in error (PRE) statistics. This allows you to try different loss functions on a problem and compare their predictive validity.

For regression trees, the most appropriate loss functions are least squares, trimmed mean, and least absolute deviations. Least-squares loss yields the classic AID tree. At each split, cases are classified so that the within-group sum of squares about the mean of the group is as small as possible. The trimmed mean loss works the same way but first trims 20% of outlying cases (10% at each extreme) in a splittable subset before computing the mean and sum of squares. It can be useful when you expect outliers in subgroups and don’t want them to influence the split decisions. LAD loss computes least absolute deviations about the mean rather than squares. It, too, gives less weight to extreme cases in each potential group.

For classification trees, use the phi coefficient (the default), Gini index, or “twoing.” The phi coefficient is $\chi^2 / n$ for a $2 \times k$ table formed by the split on $k$ categories of the dependent variable. The Gini index is a variance estimate based on all comparisons of possible pairs of values in a subgroup. Finally, twoing is a word coined by Breiman et al. to describe splitting $k$ categories as if it were a two-category splitting problem. For more information about the effects of Gini and twoing on computations, see Breiman et al. (1984).
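For intuition, the Gini index of a single node can be computed directly from the class proportions. A Python sketch of the usual definition (SYSTAT scales its loss functions as PRE statistics, so its reported values will differ):

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    # It is 0 for a pure node and largest when classes are evenly mixed.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a"] * 50))               # 0.0, a pure node
print(gini(["a"] * 25 + ["b"] * 25))  # 0.5, maximally mixed for k = 2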

Geometry

Most discussions of trees versus other classifiers compare tree graphs and algebraic equations. There is another graphic view of what a tree classifier does, however. If we look at the cases embedded in the space of the predictor variables, we can ask how a linear discriminant analysis partitions the cases and how a tree classifier partitions them.


The figure below shows how cases are split by a linear discriminant analysis. There are three subgroups of cases in this example. The cutting planes are positioned approximately halfway between each pair of group centroids. Their orientation is determined by the discriminant analysis. With three predictors and four groups, there are six cutting planes, although only four planes show in the figure. The fourth group is assumed to be under the bottom plane in the figure. In general, if there are $g$ groups, the linear discriminant model cuts them with $g(g-1)/2$ planes.

[Figure: three-dimensional scatter of cases on predictors X, Y, and Z, partitioned by oblique discriminant cutting planes.]

The figure below shows how a tree-fitting algorithm cuts the same data. Only the nearest subgroup (dark spots) shows; the other three groups are hidden behind the rear and bottom cutting planes. Notice that the cutting planes are parallel to the axes. While this would seem to restrict the discrimination compared to the more flexible angles allowed the discriminant planes, the tree model allows interactions between variables, which do not appear in the ordinary linear discriminant model. Notice, for example, that one plane splits on the X variable, but the second plane that splits on the Y variable cuts only the values to the left of the X partition. The tree model can continue to cut any of these subregions separately, unlike the discriminant model, which can cut only globally and with planes. This is a mixed blessing, however, since tree methods, as we have seen, can over-fit the data. It is critical to test them on new samples.

[Figure: the same three-dimensional scatter partitioned by a tree classifier; the cutting planes are parallel to the X, Y, and Z axes.]


Tree models are not usually related by authors to dimensional plots in this way, but it is helpful to see that they have a geometric interpretation. Alternatively, we can construct algebraic expressions for trees. They would require dummy variables for any categorical predictors and interaction (or product) terms for every split whose descendants (or lower nodes) did not involve the same variables on both sides.

Classification and Regression Trees in SYSTAT

Trees Main Dialog Box

To open the Trees dialog box, from the menus choose:

Statistics
  Classification
    Trees…


Model selection and estimation are available in the main Trees dialog box:

Dependent. The variable you want to examine. The dependent variable should be a continuous or categorical numeric variable (for example, INCOME).

Independent(s). Select one or more continuous or categorical variables (grouping variables).

Expand Model. Adds all possible sums and differences of the predictors to the model.

Loss. Select a loss function from the drop-down list.

• Least squares. The least-squares loss (AID) minimizes the sum of squared deviations.

• Trimmed mean. The trimmed mean loss (TRIM) “trims” the extreme observations (20%) prior to computing the mean.

• Least absolute deviations. The least absolute deviations loss (LAD).

• Phi coefficient. The phi coefficient loss computes the correlation between two dichotomous variables.

• Gini index. The Gini index loss measures inequality or dispersion.

• Twoing. The twoing loss function.

Display nodes as. Select the type of density display. The following types are available:

• Box plot. Plot that uses boxes to show a distribution shape, central tendency, and variability.

• Dit plot. Dot histogram. Produces a density display that looks similar to a histogram. Unlike histograms, dot histograms represent every observation with a unique symbol, so they are especially suited for small- to moderate-size samples of continuous data.

• Dot plot. Plot that displays dots at the exact locations of data values.

• Jitter plot. Density plot that calculates the exact locations of the data values, but jitters points randomly on a short vertical axis to keep points from colliding.

• Stripe. Places vertical lines at the location of data values along a horizontal data scale and looks like supermarket bar codes.

• Text. Displays text output in the tree diagram including the mode, sample size, and impurity value.


Stopping Criteria

The Stopping Criteria dialog box contains the parameters for controlling stopping.

Specify the criteria for splitting to stop.

Number of splits. Maximum number of splits.

Minimum proportion. Minimum proportion reduction in error for the tree allowed at any split.

Split minimum. Minimum split value allowed at any node.

Minimum objects at end of trees. Minimum count allowed at any node.

Using Commands

After selecting a file with USE filename, continue with:

TREES
MODEL yvar = xvarlist / EXPAND
ESTIMATE / PMIN=d, SMIN=d, NMIN=n, NSPLIT=n,
           LOSS=LSQ TRIM LAD PHI GINI TWOING,
           DENSITY=STRIPE JITTER DOT DIT BOX


Usage Considerations

Types of data. TREES uses rectangular data only.

Print options. The default output includes the splitting history and summary statistics. PRINT=LONG adds a BASIC program for classifying new observations. You can cut and paste this BASIC program into a text window and run it in the BASIC module to classify new data on the same variables for cross-validation and prediction.

Quick Graphs. TREES produces a Quick Graph for the fitted tree. The nodes may contain text describing split parameters or they may contain density graphs of the data being split. A dashed line indicates that the split is not significant.

Saving files. TREES does not save files. Use the BASIC program under PRINT=LONG to classify your data, compute residuals, etc., on old or new data.

BY groups. TREES analyzes data by groups. Your file need not be sorted on the BY variable(s).

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. FREQ = <variable> increases the number of cases by the FREQ variable.

Case weights. WEIGHT is not available in TREES.

Examples

The following examples illustrate the features of the TREES module. The first example shows a classification tree for the Fisher-Anderson iris data set. The second example is a regression tree on an example taken from Breiman et al. (1984), and the third is a regression tree predicting the danger of a mammal being eaten by predators.


Example 1 Classification Tree

This example shows a classification tree analysis of the Fisher-Anderson iris data set featured in Discriminant Analysis. We use the Gini loss function and display a graphical tree, or mobile, with dot histograms, or dit plots. The input is:

USE IRIS
LAB SPECIES / 1='SETOSA',2='VERSICOLOR',3='VIRGINICA'
TREES
MODEL SPECIES=SEPALLEN,SEPALWID,PETALLEN,PETALWID
ESTIMATE / LOSS=GINI,DENSITY=DIT

Following is the output:

Variables in the SYSTAT Rectangular file are:
   SPECIES  SEPALLEN  SEPALWID  PETALLEN  PETALWID

 Split  Variable      PRE  Improvement
     1  PETALLEN    0.500        0.500
     2  PETALWID    0.890        0.390

Fitting Method: Gini Index
Predicted variable: SPECIES
Minimum split index value: 0.050
Minimum improvement in PRE: 0.050
Maximum number of nodes allowed: 22
Minimum count allowed in each node: 5
The final tree contains 3 terminal nodes
Proportional reduction in error: 0.890

 Node  from  Count  Mode        Impurity  Split Var  Cut Value  Fit
    1     0    150
    2     1     50  SETOSA         0.0
    3     1    100
    4     3     54  VERSICOLOR     0.084
    5     3     46  VIRGINICA      0.021

The PRE for the whole tree is 0.89 (similar to $R^2$ for a regression model), which is not bad. Before exulting, however, we should keep in mind that while Fisher chose the iris data set to demonstrate his discriminant model on real data, it is barely worthy of the effort. We can classify the data almost perfectly by looking at a scatterplot of petal length against petal width.

The unique SYSTAT display of the tree is called a mobile (Wilkinson, 1995). The dit plots are ideal for illustrating how it works. Imagine each case is a marble in a box at each node. The mobile simply balances all of the boxes. The reason for doing this is that we can easily see splits that cut only a few cases out of a group. These nodes will hang out conspicuously. It is fairly evident in the first split, for example, which cuts the population into half as many cases on the right (petal length less than 3) as on the left.


This display has a second important characteristic that is different from other tree displays. The mobile coordinates the polarity of the terminal nodes (red on color displays) rather than the direction of the splits. This design has three consequences: we can evaluate the distributions of the subgroups on a common scale, we can see the direction of the splits on each splitting variable, and we can look at the distributions on the terminal nodes from left to right to see how the whole sample is split on the dependent variable.

The first consequence means that every box containing data is a miniature density display of the subgroup’s values on a common scale (same limits and same direction). We don’t need to “drill down” on the data in a subgroup to see its distribution. It is immediately apparent in the tree. If you prefer box plots or other density displays, simply use

DENSITY = BOX

or another density as an ESTIMATE option. Dit plots are most suitable for classification trees, however; because they spike at the category values, they look like bar charts for categorical data. For continuous data, dit plots look like histograms. Although they are my favorite density display for this purpose, they can be time consuming to draw on large samples, so box plots are the default graphical display. If you omit DENSITY altogether, you will get a text summary inside each box.

The second consequence of ordering the splits according to the polarity of the dependent (rather than the independent) variable is that the direction of the split can be recognized immediately by looking at which side (left or right) the split is displayed on. Notice that PETALLEN < 3.000 occurs on the left side of the first split. This means that the relation between petal length and species (coded 1..3) is positive. The same is true for petal width within the second split group because the split banner occurs on the left. Banners on the right side of a split indicate a negative relationship between the dependent variable and the splitting variable within the group being split, as in the regression tree examples.

The third consequence of ordering the splits is that we can look at the terminal nodes from left to right and see the consequences of the split in order. In the present example, notice that the three species are ordered from left to right in the same order that they are coded. You can change this ordering for a categorical variable with the CATEGORY and ORDER commands. Adding labels, as we did here, makes the output more interpretable.


Example 2 Regression Tree with Box Plots

This example shows a simple AID model. The data set is Boston housing prices, cited in Belsley, Kuh, and Welsch (1980) and used in Breiman et al. (1984). We are predicting median home values (MEDV) from a set of demographic variables. The input is:

USE BOSTON
TREES
MODEL MEDV=CRIM..LSTAT
ESTIMATE / PMIN=.005,DENSITY=BOX


Following is the output:

The Quick Graph of the tree more clearly reveals the sample-size feature of the mobile display. Notice that a number of the splits, because they separate out a few cases only, are extremely unbalanced. This can be interpreted in two ways, depending on context. On the one hand, it can mean that outliers are being separated so that subsequent splits can be more powerful. On the other hand, it can mean that a split is wasted by focusing on the outliers when further splits don’t help to improve the prediction. The former case appears to apply in our example. The first split separates out a few expensive housing tracts (the median values have a positively skewed distribution for all tracts), which makes subsequent splits more effective. The box plots in the terminal nodes are narrow.

Variables in the SYSTAT Rectangular file are:
   CRIM  ZN  INDUS  CHAS  NOX  RM  AGE  DIS  RAD  TAX  PTRATIO  B
   LSTAT  MEDV

 Split  Variable     PRE  Improvement
     1  RM         0.453        0.453
     2  RM         0.524        0.072
     3  LSTAT      0.696        0.171
     4  PTRATIO    0.706        0.010
     5  LSTAT      0.723        0.017
     6  DIS        0.782        0.059
     7  CRIM       0.809        0.027
     8  NOX        0.815        0.006

Fitting Method: Least Squares
Predicted variable: MEDV
Minimum split index value: 0.050
Minimum improvement in PRE: 0.005
Maximum number of nodes allowed: 22
Minimum count allowed in each node: 5
The final tree contains 9 terminal nodes
Proportional reduction in error: 0.815

 Node  from  Count    Mean      SD    Split Var  Cut Value    Fit
    1     0    506  22.533   9.197    RM             6.943  0.453
    2     1    430  19.934   6.353    LSTAT         14.430  0.422
    3     1     76  37.238   8.988    RM             7.454  0.505
    4     3     46  32.113   6.497    LSTAT         11.660  0.382
    5     3     30  45.097   6.156    PTRATIO       18.000  0.405
    6     2    255  23.350   5.110    DIS            1.413  0.380
    7     2    175  14.956   4.403    CRIM           7.023  0.337
    8     5     25  46.820   3.768
    9     5      5  36.480   8.841
   10     4     41  33.500   4.594
   11     4      5  20.740   9.080
   12     6      5  45.580   9.883
   13     6    250  22.905   3.866
   14     7    101  17.138   3.392    NOX            0.538  0.227
   15     7     74  11.978   3.857
   16    14     24  20.021   3.067
   17    14     77  16.239   2.975


Example 3 Regression Tree with Dit Plots

This example involves predicting the danger of a mammal being eaten by predators (Allison and Cicchetti, 1976). The predictors are hours of dreaming and nondreaming sleep, gestational age, body weight, and brain weight. Although the danger index has only five values, we are treating it as a quantitative variable with meaningful numerical values. The input is:

USE SLEEP
TREES
MODEL DANGER=BODY_WT,BRAIN_WT,SLO_SLEEP,DREAM_SLEEP,GESTATE
ESTIMATE / DENSITY=DIT


The resulting output is:

Variables in the SYSTAT Rectangular file are:
   SPECIES$  BODY_WT  BRAIN_WT  SLO_SLEEP  DREAM_SLEEP  TOTAL_SLEEP
   LIFE  GESTATE  PREDATION  EXPOSURE  DANGER

18 cases deleted due to missing data.

 Split  Variable         PRE  Improvement
     1  DREAM_SLEEP    0.404        0.404
     2  BRAIN_WT       0.479        0.074
     3  SLO_SLEEP      0.547        0.068

Fitting Method: Least Squares
Predicted variable: DANGER
Minimum split index value: 0.050
Minimum improvement in PRE: 0.050
Maximum number of nodes allowed: 22
Minimum count allowed in each node: 5
The final tree contains 4 terminal nodes
Proportional reduction in error: 0.547

 Node  from  Count   Mean      SD    Split Var     Cut Value    Fit
    1     0     44  2.659   1.380    DREAM_SLEEP       1.200  0.404
    2     1     14  3.929   1.072    BRAIN_WT         58.000  0.408
    3     1     30  2.067   1.081    SLO_SLEEP        12.800  0.164
    4     2      6  3.167   1.169
    5     2      8  4.500   0.535
    6     3     23  2.304   1.105
    7     3      7  1.286   0.488


The prediction is fairly good (PRE = 0.547). The Quick Graph of this tree illustrates another feature of mobiles. The dots in each terminal node are assigned a separate color. This way, we can follow their path up the tree each time they are merged. If the prediction is perfect, the top density plot will have colored dots perfectly separated. The extent to which the colors are mixed in the top plot is a visual indication of the badness-of-fit of the model. The fairly good separation of colors for the sleep data is quite clear on the computer screen or with color printing but less evident in a black-and-white figure.

Computation

Computations are in double precision.

Algorithms

TREES uses algorithms from Breiman et al. (1984) for its splitting computations.

Missing Data

Missing data are eliminated from the calculation of the loss function for each split separately.

References

Allison, T. and Cicchetti, D. (1976). Sleep in mammals: Ecological and constitutional correlates. Science, 194, 732–734.

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley & Sons, Inc.

Bishop, Y. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis. Cambridge, Mass.: MIT Press.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. I. (1984). Classification and regression trees. Belmont, Calif.: Wadsworth.

Einhorn, H. (1972). Alchemy in the behavioral sciences. Public Opinion Quarterly, 3, 367–378.

Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.

Page 71: Statistics I

I-51

Classificat ion and Regression Trees

Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29, 119–127.

Levine, M. (1991). Statistical analysis for the executive. Byte, 17, 183–184.

Milstein, R. M., Burrow, G. N., Wilkinson, L., and Kessen, W. (1975). Prediction of screening decisions in a medical school admission process. Journal of Medical Education, 51, 626–633.

Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58, 415–434.

Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed. Homewood, Ill.: Richard D. Irwin, Inc.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.

Quinlan, J. R. (1992). C4.5: Programs for machine learning. New York: Morgan Kaufmann.

Simon, B. (1991). Knowledge seeker: Statistics for decision makers. PC Magazine (January 29), 50.

Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin, 86, 168–174.

Wilkinson, L. (1995). Mobiles. Department of Statistics, Northwestern University, Evanston, Ill.


Chapter 4
Cluster Analysis

Leland Wilkinson, Laszlo Engelman, James Corter, and Mark Coward

SYSTAT provides a variety of cluster analysis methods on rectangular or symmetric data matrices. Cluster analysis is a multivariate procedure for detecting natural groupings in data. It resembles discriminant analysis in one respect—the researcher seeks to classify a set of objects into subgroups although neither the number nor members of the subgroups are known.

Cluster provides three procedures for clustering: Hierarchical Clustering, K-means, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-means Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables, or both cases and variables simultaneously; K-means clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Eight distance metrics are available with Hierarchical Clustering and K-means, including metrics for quantitative and frequency count data. Hierarchical Clustering has six methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values.


Statistical Background

Cluster analysis is a multivariate procedure for detecting groupings in data. The objects in these groups may be:

• Cases (observations or rows of a rectangular data file). For example, if health indicators (numbers of doctors, nurses, hospital beds, life expectancy, etc.) are recorded for countries (cases), then developed nations may form a subgroup or cluster separate from underdeveloped countries.

• Variables (characteristics or columns of the data). For example, if causes of death (cancer, cardiovascular, lung disease, diabetes, accidents, etc.) are recorded for each U.S. state (case), the results show that accidents are relatively independent of the illnesses.

• Cases and variables (individual entries in the data matrix). For example, certain wines are associated with good years of production. Other wines have other years that are better.

Types of Clustering

Clusters may be of two sorts: overlapping or exclusive. Overlapping clusters allow the same object to appear in more than one cluster. Exclusive clusters do not. All of the methods implemented in SYSTAT are exclusive.

There are three approaches to producing exclusive clusters: hierarchical, partitioned, and additive trees. Hierarchical clusters consist of clusters that completely contain other clusters that completely contain other clusters, and so on. Partitioned clusters contain no other clusters. Additive trees use a graphical representation in which distances along branches reflect similarities among the objects.

The cluster literature is diverse and contains many descriptive synonyms: hierarchical clustering (McQuitty, 1960; Johnson, 1967); single linkage clustering (Sokal and Sneath, 1963), and joining (Hartigan, 1975). Output from hierarchical methods can be represented as a tree (Hartigan, 1975) or a dendrogram (Sokal and Sneath, 1963). (The linkage of each object or group of objects is shown as a joining of branches in a tree. The “root” of the tree is the linkage of all clusters into one set, and the ends of the branches lead to each separate object.)


Correlations and Distances

To produce clusters, we must be able to compute some measure of dissimilarity between objects. Similar objects should appear in the same cluster, and dissimilar objects, in different clusters. All of the methods available in CORR for producing matrices of association can be used in cluster analysis, but each has different implications for the clusters produced. Incidentally, CLUSTER converts correlations to dissimilarities by negating them.

In general, the correlation measures (Pearson, Mu2, Spearman, Gamma, Tau) are not influenced by differences in scales between objects. For example, correlations between states using health statistics will not in general be affected by some states having larger average numbers or variation in their numbers. Use correlations when you want to measure the similarity in patterns across profiles regardless of overall magnitude.

On the other hand, the other measures such as Euclidean and City (city-block distance) are significantly affected by differences in scale. For health data, two states will be judged to be different if they have differing overall incidences even when they follow a common pattern. Generally, you should use the distance measures when variables are measured on common scales.
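To make the contrast concrete, here is a minimal sketch in plain Python (for illustration only, not SYSTAT code) comparing Euclidean distance with a 1 minus Pearson dissimilarity for two profiles that share the same pattern but differ tenfold in level:

  from math import sqrt

  def euclidean(x, y):
      return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

  def pearson_dissimilarity(x, y):
      n = len(x)
      mx, my = sum(x) / n, sum(y) / n
      sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
      sxx = sum((a - mx) ** 2 for a in x)
      syy = sum((b - my) ** 2 for b in y)
      return 1 - sxy / sqrt(sxx * syy)

  state1 = [10.0, 20.0, 30.0]    # a profile shape
  state2 = [100.0, 200.0, 300.0] # same shape, ten times the level

  print(euclidean(state1, state2))              # about 336.7: scale dominates
  print(pearson_dissimilarity(state1, state2))  # 0.0: identical pattern

The correlation-based measure calls the two profiles identical; the distance-based measure does not.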

Standardizing Data

Before you compute a dissimilarity measure, you may need to standardize your data across the measured attributes. Standardizing puts measurements on a common scale. In general, standardizing makes overall level and variation comparable across measurements. Consider the following data:

OBJECT   X1   X2   X3    X4
  A      10    2   11   900
  B      11    3   15   895
  C      13    4   12   760
  D      14    1   13   874

If we are clustering the four cases (A through D), variable X4 will determine almost entirely the dissimilarity between cases, whether we use correlations or distances. If we are clustering the four variables, whichever correlation measure we use will adjust for the larger mean and standard deviation on X4. Thus, we should probably standardize within columns if we are clustering rows and use a correlation measure if we are clustering columns.



In the example below, case A will have a disproportionate influence if we are clustering columns.

OBJECT   X1    X2    X3    X4
  A     410   311   613   514
  B       1     3     2     4
  C      10    11    12    10
  D      12    13    13    11

We should probably standardize within rows before clustering columns. This requires transposing the data before standardization. If we are clustering rows, on the other hand, we should use a correlation measure to adjust for the larger mean and standard deviation of case A.

These are not immutable laws. The suggestions are only to make you realize that scales can influence distance and correlation measures.
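As a sketch of the two strategies (plain Python for illustration; in SYSTAT, the STANDARDIZE command and transposing the file serve this purpose), applied to the first small table above:

  from statistics import mean, stdev

  def zscore(values):
      m, s = mean(values), stdev(values)
      return [(v - m) / s for v in values]

  data = [[10, 2, 11, 900],   # case A
          [11, 3, 15, 895],   # case B
          [13, 4, 12, 760],   # case C
          [14, 1, 13, 874]]   # case D

  # Standardize within columns (before clustering rows): each variable
  # gets mean 0 and standard deviation 1, so X4 no longer dominates.
  std_cols = list(zip(*(zscore(col) for col in zip(*data))))

  # Standardize within rows (transpose first, before clustering columns).
  std_rows = [zscore(row) for row in data]

  print(std_cols[0])  # case A on the standardized-by-column scale
  print(std_rows[0])  # case A standardized within its own row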

Hierarchical Clustering

To understand hierarchical clustering, it’s best to look at an example. The following data reflect various attributes of selected performance cars.


ACCEL   BRAKE   SLALOM   MPG    SPEED   NAME$
 5.0     245     61.3    17.0    153    Porsche 911T
 5.3     242     61.9    12.0    181    Testarossa
 5.8     243     62.6    19.0    154    Corvette
 7.0     267     57.8    14.5    145    Mercedes 560
 7.6     271     59.8    21.0    124    Saab 9000
 7.9     259     61.7    19.0    130    Toyota Supra
 8.5     263     59.9    17.5    131    BMW 635
 8.7     287     64.2    35.0    115    Civic CRX
 9.3     258     64.1    24.5    129    Acura Legend
10.8     287     60.8    25.0    100    VW Fox GL
13.0     253     62.3    27.0     95    Chevy Nova


Cluster Displays

SYSTAT displays the output of hierarchical clustering in several ways. For joining rows or columns, SYSTAT prints a tree. For matrix joining, it prints a shaded matrix.

Trees. A tree is printed with a unique ordering in which every branch is lined up such that the most similar objects are closest to each other. If a perfect seriation (one-dimensional ordering) exists in the data, the tree reproduces it. The algorithm for ordering the tree is given in Gruvaeus and Wainer (1972). This ordering may differ from that of trees printed by other clustering programs if they do not use a seriation algorithm to determine how to order branches. The advantage of using seriation is most apparent for single linkage clusterings.

If you join rows, the end branches of the tree are labeled with case numbers or labels. If you join columns, the end branches of the tree are labeled with variable names.

Direct display of a matrix. As an alternative to trees, SYSTAT can produce a shaded display of the original data matrix in which rows and columns are permuted according to an algorithm in Gruvaeus and Wainer (1972). Different characters represent the magnitude of each number in the matrix (Ling, 1973). A legend showing the range of data values that these characters represent appears with the display.

Cutpoints between these values and their associated characters are selected to heighten contrast in the display. The method for increasing contrast is derived from techniques used in computer pattern recognition, in which gray-scale histograms for visual displays are modified to heighten contrast and enhance pattern detection. To find these cutpoints, we sort the data and look for the largest gaps between adjacent values. Tukey’s gapping method (Wainer and Schacht, 1978) is used to determine how many gaps (and associated characters) should be chosen to heighten contrast for a given set of data. This procedure, time consuming for large matrices, is described in detail in Wilkinson (1978).

If you have a course to grade and are looking for a way to find rational cutpoints in the grade distribution, you might want to use this display to choose the cutpoints. Cluster the n × 1 matrix of numeric grades (n students by 1 grade) and let SYSTAT choose the cutpoints. Only cutpoints asymptotically significant at the 0.05 level are chosen. If no cutpoints are chosen in the display, give everyone an A, flunk them all, or hand out numeric grades (unless you teach at Brown University or Hampshire College).
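A toy sketch of the gap-hunting idea in Python, using a hypothetical grade list; SYSTAT's actual procedure additionally applies Tukey's gapping test to decide how many cutpoints are significant:

  def gap_cutpoints(values, k=2):
      # Sort the values and place cutpoints in the k largest gaps
      # between adjacent values.
      v = sorted(values)
      gaps = [(v[i + 1] - v[i], i) for i in range(len(v) - 1)]
      gaps.sort(reverse=True)                 # largest gaps first
      cut_idx = sorted(i for _, i in gaps[:k])
      return [(v[i] + v[i + 1]) / 2 for i in cut_idx]

  grades = [52, 55, 58, 71, 73, 74, 76, 88, 90, 93]
  print(gap_cutpoints(grades, k=2))           # [64.5, 82.0]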


Clustering Rows

First, let's look at possible clusters of the cars in the example. Since the variables are on such different scales, we will standardize them before doing the clustering. This will give acceleration comparable influence to braking, for example. Then we select Pearson correlations as the basis for dissimilarity between cars. The result is:

[Cluster Tree: dendrogram joining the 11 cars (Porsche 911T, Testarossa, Corvette, Mercedes 560, Saab 9000, Toyota Supra, BMW 635, Civic CRX, Acura Legend, VW Fox GL, Chevy Nova); distance scale 0.0 to 0.6]

If you look at the correlation matrix for the cars, you will see how these clusters hang together. Cars within the same cluster (for example, Corvette, Testarossa, Porsche) generally correlate highly.

          Porsche   Testa    Corv    Merc    Saab
Porsche     1.00
Testa       0.94     1.00
Corv        0.94     0.87    1.00
Merc        0.09     0.21   -0.24    1.00
Saab       -0.51    -0.52   -0.76    0.66    1.00
Toyota      0.24     0.43    0.40   -0.38   -0.68
BMW        -0.32    -0.10   -0.56    0.85    0.63
Civic      -0.50    -0.73   -0.39   -0.52    0.26
Acura      -0.05    -0.10    0.30   -0.98   -0.77
VW         -0.96    -0.93   -0.98    0.08    0.70
Chevy      -0.73    -0.70   -0.49   -0.53   -0.13

          Toyota    BMW     Civic   Acura     VW
Toyota      1.00
BMW        -0.25     1.00
Civic      -0.30    -0.50    1.00
Acura       0.53    -0.79    0.35    1.00
VW         -0.35     0.39    0.55   -0.16    1.00
Chevy      -0.03    -0.06    0.32    0.54    0.53


Clustering Columns

We can cluster the performance attributes of the cars more easily. Here, we do not need to standardize within cars (by rows) because all of the values are comparable between cars. Again, to give each variable comparable influence, we will use Pearson correlations as the basis for the dissimilarities. The result based on the data standardized by variable (column) is:

[Cluster Tree: dendrogram joining the five variables ACCEL, BRAKE, SLALOM, MPG, and SPEED; distance scale 0.0 to 1.2]

Clustering Rows and Columns

To cluster the rows and columns jointly, we should first standardize the variables to give each of them comparable influence on the clustering of cars. Once we have standardized the variables, we can use Euclidean distances because the scales are comparable. We used single linkage to produce the following result:

[Permuted data matrix: the cars (rows) and variables (columns) reordered by the joint clustering, with symbols shading standardized values from -2 to 3]


This figure displays the standardized data matrix itself with rows and columns permuted to reveal clustering and each data value replaced by one of three symbols. Notice that the rows are ordered according to overall performance, with the fastest cars at the top.

Matrix clustering is especially useful for displaying large correlation matrices. You may want to cluster the correlation matrix this way and then use the ordering to produce a scatterplot matrix that is organized by the multivariate structure.

Partitioning via K-Means

To produce partitioned clusters, you must decide in advance how many clusters you want. K-means clustering searches for the best way to divide your objects into the chosen number of clusters so that the clusters are separated as well as possible. The procedure begins by picking "seed" cases, one for each cluster, that are spread apart from the center of all of the cases as much as possible. Then it assigns all cases to the nearest seed. Next, it attempts to reassign each case to a different cluster in order to reduce the within-groups sum of squares. This continues until the within-groups sum of squares can no longer be reduced.
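Here is a compact sketch of the reassignment step in plain Python. It is not SYSTAT's implementation (SYSTAT's seeding rule is described under Computation at the end of this chapter); it only shows how iterative nearest-centroid reassignment settles into stable groups:

  def centroid(cluster):
      n = len(cluster)
      return [sum(row[j] for row in cluster) / n
              for j in range(len(cluster[0]))]

  def sq_dist(x, y):
      return sum((a - b) ** 2 for a, b in zip(x, y))

  def reassign(data, labels, k, max_iter=20):
      # Nearest-centroid reassignment until labels stabilize.
      # Assumes no cluster empties out along the way.
      for _ in range(max_iter):
          cents = [centroid([x for x, g in zip(data, labels) if g == j])
                   for j in range(k)]
          new = [min(range(k), key=lambda j: sq_dist(x, cents[j]))
                 for x in data]
          if new == labels:
              break
          labels = new
      return labels

  # Start from an arbitrary split; reassignment recovers the two
  # obvious groups.
  data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9]]
  print(reassign(data, [0, 1, 0, 1, 0], k=2))   # [0, 0, 0, 1, 1]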



K-means clustering does not search through every possible partitioning of the data, so it is possible that some other solution might have smaller within-groups sums of squares. Nevertheless, it has performed relatively well on globular data separated in several dimensions in Monte Carlo studies of cluster algorithms.

Because it focuses on reducing within-groups sums of squares, k-means clustering is like a multivariate analysis of variance in which the groups are not known in advance. The output includes analysis of variance statistics, although you should be cautious in interpreting them. Remember, the program is looking for large F ratios in the first place, so you should not be too impressed by large values.

Following is a three-group analysis of the car data. The clusters are similar to those we found by joining. K-means clustering uses Euclidean distances instead of Pearson correlations, so there are minor differences because of scaling. To keep the influences of all variables comparable, we standardized the data before running the analysis.

Summary Statistics for 3 Clusters

Variable   Between SS   DF   Within SS   DF   F-Ratio    Prob

ACCEL         7.825      2      2.175     8    14.389    0.002
BRAKE         5.657      2      4.343     8     5.211    0.036
SLALOM        5.427      2      4.573     8     4.747    0.044
MPG           7.148      2      2.852     8    10.027    0.007
SPEED         7.677      2      2.323     8    13.220    0.003
-------------------------------------------------------------------------------
Cluster Number: 1

    Members                    Statistics
Case          Distance | Variable  Minimum    Mean  Maximum  St.Dev.
Mercedes 560    0.60   | ACCEL      -0.45    -0.14    0.17     0.23
Saab 9000       0.31   | BRAKE      -0.15     0.23    0.61     0.28
Toyota Supra    0.49   | SLALOM     -1.95    -0.89    0.11     0.73
BMW 635         0.16   | MPG        -1.01    -0.47   -0.01     0.37
                       | SPEED      -0.34     0.00    0.50     0.31
-------------------------------------------------------------------------------
Cluster Number: 2

    Members                    Statistics
Case          Distance | Variable  Minimum    Mean  Maximum  St.Dev.
Civic CRX       0.81   | ACCEL       0.26     0.99    2.05     0.69
Acura Legend    0.67   | BRAKE      -0.53     0.62    1.62     1.00
VW Fox GL       0.71   | SLALOM     -0.37     0.72    1.43     0.74
Chevy Nova      0.76   | MPG         0.53     1.05    2.15     0.65
                       | SPEED      -1.50    -0.91   -0.14     0.53
-------------------------------------------------------------------------------
Cluster Number: 3

    Members                    Statistics
Case          Distance | Variable  Minimum    Mean  Maximum  St.Dev.
Porsche 911T    0.25   | ACCEL      -1.29    -1.13   -0.95     0.14
Testarossa      0.43   | BRAKE      -1.22    -1.14   -1.03     0.08
Corvette        0.31   | SLALOM     -0.10     0.23    0.59     0.28
                       | MPG        -1.40    -0.78   -0.32     0.45
                       | SPEED       0.82     1.21    1.94     0.52


Additive Trees

Sattath and Tversky (1977) developed additive trees for modeling similarity/dissimilarity data. Hierarchical clustering methods require objects in the same cluster to have identical distances to each other. Moreover, these distances must be smaller than the distances between clusters. These restrictions prove problematic for similarity data, and as a result hierarchical clustering cannot fit this data well.
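Stated in symbols (standard terminology, not quoted from this manual), hierarchical trees impose the ultrametric condition on the fitted distances:

$$d(x, z) \le \max\{\, d(x, y),\; d(y, z) \,\} \quad \text{for all objects } x, y, z,$$

which forces the two largest distances in every triple to be equal. Direct similarity judgments rarely satisfy this, which is why additive trees relax it.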

In contrast, additive trees use tree branch length to represent distances between objects. Allowing the within-cluster distances to vary yields a tree diagram with varying branch lengths. Objects within a cluster can be compared by focusing on the horizontal distance along the branches connecting them. The additive tree for the car data follows:

[Additive Tree: branch diagram for the 11 cars (Porsche, Testa, Corv, Merc, Saab, Toyota, BMW, Civic, Acura, VW, Chevy)]


The distances between nodes of the graph are:

Node   Length   Child
  1     0.10    Porsche
  2     0.49    Testa
  3     0.14    Corv
  4     0.52    Merc
  5     0.19    Saab
  6     0.13    Toyota
  7     0.11    BMW
  8     0.71    Civic
  9     0.30    Acura
 10     0.42    VW
 11     0.62    Chevy
 12     0.06    1,2
 13     0.08    8,10
 14     0.49    12,3
 15     0.18    13,11
 16     0.35    9,15
 17     0.04    14,6
 18     0.13    17,16
 19     0.0     5,18
 20     0.04    4,7
 21     0.0     20,19

Each object is a node in the graph. In this example, the first 11 nodes represent the cars. Other graph nodes correspond to "groupings" of the objects. Here, the 12th node represents Porsche and Testa.

The distance between any two nodes is the sum of the (horizontal) lengths between them. The distance between Chevy and VW is 0.62 + 0.08 + 0.42 = 1.12. The distance between Chevy and Civic is 0.62 + 0.08 + 0.71 = 1.41. Consequently, Chevy is more similar to VW than to Civic.


Cluster Analysis in SYSTAT

Hierarchical Clustering Main Dialog Box

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two “closest” objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

To obtain a hierarchical cluster analysis, from the menus choose:

Statistics
  Classification
    Hierarchical Clustering…

You must select the elements of the data file to cluster (Join):

• Rows. Rows (cases) of the data matrix are clustered.

• Columns. Columns (variables) of the data matrix are clustered.

• Matrix. Rows and columns of the data matrix are clustered—they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured).


• Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

• Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

• Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

• Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

• Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

• Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

For some data, the last four methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971). These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as k-means) “inadmissible” clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered “admissible.” Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms. Consult his paper for further details.
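For concreteness, a minimal Python sketch (not SYSTAT code) of how three of the linkage rules above score the distance between two clusters of points:

  from math import dist  # Euclidean distance between points (Python 3.8+)

  def single_linkage(c1, c2):
      return min(dist(x, y) for x in c1 for y in c2)   # closest pair

  def complete_linkage(c1, c2):
      return max(dist(x, y) for x in c1 for y in c2)   # farthest pair

  def average_linkage(c1, c2):
      return sum(dist(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

  a = [(0.0, 0.0), (0.0, 1.0)]
  b = [(3.0, 0.0), (5.0, 0.0)]
  print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))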

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save cluster identifier variable. Saves cluster identifiers to a SYSTAT file. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.


Clustering Distances

Both hierarchical clustering and k-means clustering allow you to select the type of distance metric to use between objects. From the Distance drop-down list, you can select:

• Gamma. Distances are computed using 1 minus the Goodman-Kruskal gamma correlation coefficient. Use this metric with rank order or ordinal scales. Missing values are excluded from computations.

• Pearson. Distances are computed using 1 minus the Pearson product-moment correlation coefficient for each pair of objects. Use this metric for quantitative variables. Missing values are excluded from computations.

• RSquared. Distances are computed using 1 minus the square of the Pearson product-moment correlation coefficient for each pair of objects. Use this metric with quantitative variables. Missing values are excluded from computations.

• Euclidean. Clustering is computed using normalized Euclidean distance (root mean squared distances). Use this metric with quantitative variables. Missing values are excluded from computations.

• Minkowski. Clustering is computed using the pth root of the mean pth powered distances of coordinates (a sketch of this and the Euclidean form follows this list). Use this metric for quantitative variables. Missing values are excluded from computations. Use the Power text box to specify the value of p.

• Chisquare. Distances are computed as the chi-square measure of independence of rows and columns on 2-by-n frequency tables, formed by pairs of cases (or variables). Use this metric when the data are counts of objects or events.

• Phisquare. Distances are computed as the phi-square (chi-square/total) measure on 2-by-n frequency tables, formed by pairs of cases (or variables). Use this metric when the data are counts of objects or events.

• Percent (available for hierarchical clustering only). Clustering uses a distance metric that is the percentage of comparisons of values resulting in disagreements in two profiles. Use this metric with categorical or nominal scales.

• MW (available for k-means clustering only). Distances are computed as the increment in the within sum of squares of deviations if the case (or variable) would belong to a cluster. The case (or variable) is moved into the cluster that minimizes the within sum of squares of deviations. Use this metric with quantitative variables. Missing values are excluded from computations.
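As promised above, here is a rough Python sketch of the normalized Euclidean and Minkowski forms, with missing values (None) dropped pairwise as the descriptions state. This illustrates the formulas only; it is not SYSTAT's code:

  def minkowski(x, y, p=2.0):
      # pth root of the mean pth-powered coordinate differences,
      # computed over pairwise-complete coordinates only.
      pairs = [(a, b) for a, b in zip(x, y)
               if a is not None and b is not None]
      return (sum(abs(a - b) ** p for a, b in pairs) / len(pairs)) ** (1.0 / p)

  profile1 = [5.0, 245.0, 61.3, 17.0, None]
  profile2 = [13.0, 253.0, 62.3, None, 95.0]
  print(minkowski(profile1, profile2, p=2))   # normalized Euclidean
  print(minkowski(profile1, profile2, p=1))   # city-block analogue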


K-Means Main Dialog Box

K-means clustering splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

K-means starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed. K-means reassigns cases until the within-groups sum of squares can no longer be reduced.

To obtain a k-means cluster analysis, from the menus choose:

Statistics
  Classification
    K-means Clustering…

The following options can be specified:

Groups. Enter the number of desired clusters. If the number (Groups) of clusters is not specified, two are computed (one split of the data).

Iterations. Enter the maximum number of iterations. If not stated, this maximum is 20.

Save identifier variable. Saves cluster identifiers to a SYSTAT file.

Distance. Specifies the distance metric used to compare clusters.


Additive Trees Main Dialog Box

Additive trees were developed by Sattath and Tversky (1977) for modeling similarity/dissimilarity data, which are not fit well by hierarchical joining trees. Hierarchical trees imply that all within-cluster distances are smaller than all between-cluster distances and that within-cluster distances are equal. This so-called “ultrametric” condition seldom applies to real similarity data from direct judgment. Additive trees, on the other hand, represent similarities with a network model in the shape of a tree. Distances between objects are represented by the lengths of the branches connecting them in the tree.

To obtain additive trees, from the menus choose:

Statistics
  Classification
    Additive Tree Clustering…

The following options can be specified:

Data. Display the raw data matrix.

Transformed. Include the transformed data (distance-like measures) with the output.

Model. Display the model (tree) distances between the objects.

Residuals. Show the differences between the distance-transformed data and the model distances.

Nonumbers. Objects in the tree graph are not numbered.


Nosubtract. Controls use of an additive constant. Additive Trees assumes interval-scaled data, which implies complete freedom in choosing an additive constant, so by default it adds or subtracts a constant to satisfy the triangle inequality exactly. Use Nosubtract to allow strict inequality and subtract no constant.

Height. Prints the distance of each node from the root.

Minvar. Combines the last few remaining clusters into the root node by searching for the root that minimizes the variances of the distances from the root to the leaves.

Using Commands

For the hierarchical tree method:

CLUSTER
  USE filename
  IDVAR var$
  PRINT
  SAVE filename / NUMBER=n DATA
  JOIN varlist / POLAR DISTANCE=metric POWER=p LINKAGE=method

The distance metric is EUCLIDEAN, GAMMA, PEARSON, RSQUARED, MINKOWSKI, CHISQUARE, PHISQUARE, or PERCENT. For MINKOWSKI, specify the root using POWER=p.

The linkage methods include SINGLE, COMPLETE, CENTROID, AVERAGE, MEDIAN, and WARD.

For the k-means splitting method:

CLUSTER
  USE filename
  IDVAR var$
  PRINT
  SAVE filename / NUMBER=n DATA
  KMEANS varlist / NUMBER=n ITER=n DISTANCE=metric POWER=p

The distance metric is EUCLIDEAN, GAMMA, PEARSON, RSQUARED, MINKOWSKI, CHISQUARE, PHISQUARE, or MW. For MINKOWSKI, specify the root using POWER=p.


For additive trees:

CLUSTER
  USE filename
  ADD varlist / DATA TRANSFORMED MODEL RESIDUALS TREE NUMBERS,
      NOSUBTRACT HEIGHT MINVAR ROOT=n1,n2

Usage Considerations

Types of data. Hierarchical Clustering works on either rectangular SYSTAT files or files containing a symmetric matrix, such as those produced with Correlations. K-Means works only on rectangular SYSTAT files. Additive Trees works only on symmetric (similarity or dissimilarity) matrices.

Print options. Using PRINT=LONG for Hierarchical Clustering yields an ASCII representation of the tree diagram (instead of the Quick Graph). This option is useful if you are joining more than 100 objects.

Quick Graphs. Cluster analysis includes Quick Graphs for each procedure. Hierarchical Clustering and Additive Trees have tree diagrams. For each cluster, K-Means displays a profile plot of the data and a display of the variable means and standard deviations. To omit Quick Graphs, specify GRAPH NONE.

Saving files. CLUSTER saves cluster indices as a new variable.

BY groups. CLUSTER analyzes data by groups.

Bootstrapping. Bootstrapping is available in this procedure.

Labeling output. For Hierarchical Clustering and K-Means, be sure to consider using ID Variable (on the Data menu) for labeling the output.



Examples

Example 1 K-Means Clustering

The data in the file SUBWORLD are a subset of cases and variables from the OURWORLD file:

URBAN      Percentage of the population living in cities
BIRTH_RT   Births per 1000 people
DEATH_RT   Deaths per 1000 people
B_TO_D     Ratio of births to deaths
BABYMORT   Infant deaths during the first year per 1000 live births
GDP_CAP    Gross domestic product per capita (in U.S. dollars)
LIFEEXPM   Years of life expectancy for males
LIFEEXPF   Years of life expectancy for females
EDUC       U.S. dollars spent per person on education
HEALTH     U.S. dollars spent per person on health
MIL        U.S. dollars spent per person on the military
LITERACY   Percentage of the population who can read

The distributions of the economic variables (GDP_CAP, EDUC, HEALTH, and MIL) are skewed with long right tails, so these variables are analyzed in log units.

This example clusters countries (cases). The input is:

CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
KMEANS urban birth_rt death_rt babymort lifeexpm,
       lifeexpf gdp_cap b_to_d literacy educ,
       mil health / NUMBER=4

Note that KMEANS must be specified last.


The resulting output is:

Distance metric is Euclidean distance
k-means splitting cases into 4 groups

Summary statistics for all cases

Variable     Between SS   df   Within SS   df    F-ratio
URBAN          18.6065     3     9.3935    25    16.5065
BIRTH_RT       26.2041     3     2.7959    26    81.2260
DEATH_RT       23.6626     3     5.3374    26    38.4221
BABYMORT       26.0275     3     2.9725    26    75.8869
GDP_CAP        26.9585     3     2.0415    26   114.4464
EDUC           25.3712     3     3.6288    26    60.5932
HEALTH         24.9226     3     3.0774    25    67.4881
MIL            24.7870     3     3.2130    25    64.2893
LIFEEXPM       24.7502     3     4.2498    26    50.4730
LIFEEXPF       25.9270     3     3.0730    26    73.1215
LITERACY       24.8535     3     4.1465    26    51.9470
B_TO_D         22.2918     3     6.7082    26    28.7997
** TOTAL **   294.3624    36    50.6376   309

-------------------------------------------------------------------------------
Cluster 1 of 4 contains 12 cases

    Members                     Statistics
Case          Distance | Variable   Minimum    Mean  Maximum  St.Dev.
Austria         0.28   | URBAN       -0.17     0.60    1.59     0.54
Belgium         0.09   | BIRTH_RT    -1.14    -0.93   -0.83     0.10
Denmark         0.19   | DEATH_RT    -0.77     0.00    0.26     0.35
France          0.14   | BABYMORT    -0.85    -0.81   -0.68     0.05
Switzerland     0.26   | GDP_CAP      0.33     1.01    1.28     0.26
UK              0.14   | EDUC         0.47     0.95    1.28     0.28
Italy           0.16   | HEALTH       0.52     0.99    1.31     0.23
Sweden          0.23   | MIL          0.28     0.81    1.11     0.25
WGermany        0.31   | LIFEEXPM     0.23     0.75    0.99     0.23
Poland          0.39   | LIFEEXPF     0.43     0.79    1.07     0.18
Czechoslov      0.26   | LITERACY     0.54     0.72    0.75     0.06
Canada          0.30   | B_TO_D      -1.09    -0.91   -0.46     0.18

-------------------------------------------------------------------------------
Cluster 2 of 4 contains 5 cases

    Members                     Statistics
Case          Distance | Variable   Minimum    Mean  Maximum  St.Dev.
Ethiopia        0.40   | URBAN       -2.01    -1.69   -1.29     0.30
Guinea          0.52   | BIRTH_RT     1.46     1.58    1.69     0.10
Somalia         0.38   | DEATH_RT     1.28     1.85    3.08     0.76
Afghanistan     0.38   | BABYMORT     1.38     1.88    2.41     0.44
Haiti           0.30   | GDP_CAP     -2.00    -1.61   -1.27     0.30
                       | EDUC        -2.41    -1.58   -1.10     0.51
                       | HEALTH      -2.22    -1.64   -1.29     0.44
                       | MIL         -1.76    -1.51   -1.37     0.17
                       | LIFEEXPM    -2.78    -1.90   -1.38     0.56
                       | LIFEEXPF    -2.47    -1.91   -1.48     0.45
                       | LITERACY    -2.27    -1.83   -0.76     0.62
                       | B_TO_D      -0.38    -0.02    0.25     0.26

-------------------------------------------------------------------------------
Cluster 3 of 4 contains 11 cases

    Members                     Statistics
Case          Distance | Variable   Minimum    Mean  Maximum  St.Dev.
Argentina       0.45   | URBAN       -0.88     0.16    1.14     0.76
Brazil          0.32   | BIRTH_RT    -0.60     0.07    0.92     0.49
Chile           0.40   | DEATH_RT    -1.28    -0.70    0.00     0.42
Colombia        0.42   | BABYMORT    -0.70    -0.06    0.55     0.47
Uruguay         0.61   | GDP_CAP     -0.75    -0.38    0.04     0.28
Ecuador         0.36   | EDUC        -0.89    -0.39    0.14     0.36
ElSalvador      0.52   | HEALTH      -0.91    -0.47    0.28     0.38
Guatemala       0.65   | MIL         -1.25    -0.59    0.37     0.49
Peru            0.37   | LIFEEXPM    -0.63     0.06    0.77     0.49
Panama          0.51   | LIFEEXPF    -0.57     0.04    0.61     0.44
Cuba            0.58   | LITERACY    -0.94     0.20    0.73     0.51
                       | B_TO_D      -0.65     0.63    1.68     0.76

-------------------------------------------------------------------------------
Cluster 4 of 4 contains 2 cases

    Members                     Statistics
Case          Distance | Variable   Minimum    Mean  Maximum  St.Dev.
Iraq            0.29   | URBAN       -0.30     0.06    0.42     0.51
Libya           0.29   | BIRTH_RT     0.92     1.27    1.61     0.49
                       | DEATH_RT    -0.77    -0.77   -0.77     0.0
                       | BABYMORT     0.44     0.47    0.51     0.05
                       | GDP_CAP     -0.25     0.05    0.36     0.43
                       | EDUC        -0.04     0.44    0.93     0.68
                       | HEALTH      -0.51    -0.04    0.42     0.66
                       | MIL          1.34     1.40    1.46     0.08
                       | LIFEEXPM    -0.09    -0.04    0.02     0.08
                       | LIFEEXPF    -0.30    -0.21   -0.11     0.13
                       | LITERACY    -0.94    -0.86   -0.77     0.12
                       | B_TO_D       1.61     2.01    2.42     0.57

[Cluster Parallel Coordinate Plots: four panels, one per cluster (1 through 4); each panel plots every case's z scores across the variables (ordered by F-ratio) on a scale from -3 to 4, with cases indexed on the vertical axis]


For each variable, cluster analysis compares the between-cluster mean square (Between SS/df) to the within-cluster mean square (Within SS/df) and reports the F-ratio. However, do not use these F ratios to test significance because the clusters are formed to characterize differences. Instead, use these statistics to characterize relative discrimination. For example, the log of gross domestic product (GDP_CAP) and BIRTH_RT are better discriminators between countries than URBAN or DEATH_RT. For a good graphical view of the separation of the clusters, you might rotate the data using the three variables with the highest F ratios.
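As a quick arithmetic check on the first row of the summary table:

$$F_{\mathrm{URBAN}} = \frac{18.6065 / 3}{9.3935 / 25} = \frac{6.2022}{0.3757} \approx 16.51,$$

which agrees with the reported F-ratio of 16.5065.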

Following the summary statistics, for each cluster, cluster analysis prints the distance from each case (country) in the cluster to the center of the cluster. Descriptive statistics for these countries appear on the right. For the first cluster, the standard scores for LITERACY range from 0.54 to 0.75 with an average of 0.72. B_TO_D ranges from -1.09 to -0.46. Thus, for these predominantly European countries, literacy is well above the average for the sample and the birth-to-death ratio is below average. In cluster 2, LITERACY ranges from -2.27 to -0.76 for these five countries, and B_TO_D ranges from -0.38 to 0.25. Thus, the countries in cluster 2 have a lower literacy rate and a greater potential for population growth than those in cluster 1. The fourth cluster (Iraq and Libya) has an average birth-to-death ratio of 2.01, the highest among the four clusters.

[Cluster Profile Plots: four panels, one per cluster; variables ordered by F-ratio, with a dot at each cluster mean and horizontal lines marking one standard deviation above and below]

Cluster Parallel Coordinates

The variables in this Quick Graph are ordered by their F ratios. In the top left plot, there is one line for each country in cluster 1 that connects its z scores for each of the variables. Zero marks the average for the complete sample. The lines for these 12 countries all follow a similar pattern: above average values for GDP_CAP, below for BIRTH_RT, and so on. The lines in cluster 3 do not follow such a tight pattern.

Cluster Profiles

The variables in cluster profile plots are ordered by the F ratios. The vertical line under each cluster number indicates the grand mean across all data. A variable mean within each cluster is marked by a dot. The horizontal lines indicate one standard deviation above or below the mean. The countries in cluster 1 have above average means of gross domestic product, life expectancy, literacy, and urbanization, and spend considerable money on health care and the military, while the means of their birth rates, infant mortality rates, and birth-to-death ratios are low. The opposite is true for cluster 2.

Example 2 Hierarchical Clustering: Clustering Cases

This example uses the SUBWORLD data (see the k-means example for a description) to cluster cases. The input is:

CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
     lifeexpf gdp_cap b_to_d literacy educ mil health


The resulting output is:

Distance metric is Euclidean distance
Single linkage method (nearest neighbor)

 Cluster        and Cluster     Were joined     No. of members
 containing     containing      at distance     in new cluster
------------    ------------    ------------    --------------
WGermany        Belgium           0.0869               2
WGermany        Denmark           0.1109               3
WGermany        UK                0.1127               4
Sweden          WGermany          0.1275               5
Austria         Sweden            0.1606               6
Austria         France            0.1936               7
Austria         Italy             0.1943               8
Austria         Canada            0.2112               9
Uruguay         Argentina         0.2154               2
Switzerland     Austria           0.2364              10
Czechoslov      Poland            0.2411               2
Switzerland     Czechoslov        0.2595              12
Guatemala       ElSalvador        0.3152               2
Guatemala       Ecuador           0.3155               3
Uruguay         Chile             0.3704               3
Cuba            Uruguay           0.3739               4
Haiti           Somalia           0.3974               2
Switzerland     Cuba              0.4030              16
Guatemala       Brazil            0.4172               4
Peru            Guatemala         0.4210               5
Colombia        Peru              0.4433               6
Ethiopia        Haiti             0.4743               3
Panama          Colombia          0.5160               7
Switzerland     Panama            0.5560              23
Libya           Iraq              0.5704               2
Afghanistan     Guinea            0.5832               2
Ethiopia        Afghanistan       0.5969               5
Switzerland     Libya             0.8602              25
Switzerland     Ethiopia          0.9080              30


The numerical results consist of the joining history. The countries at the top of the panel are joined first at a distance of 0.087. The last entry represents the joining of the largest two clusters to form one cluster of all 30 countries. Switzerland is in one of the clusters and Ethiopia is in the other.

The clusters are best illustrated using a tree diagram. Because the example joins rows (cases) and uses COUNTRY as an ID variable, the branches of the tree are labeled with countries. If you join columns (variables), then variable names are used. The scale for the joining distances is printed at the bottom. Notice that Iraq and Libya, which form their own cluster as they did in the k-means example, are the second-to-last cluster to link with others. They join with all the countries listed above them at a distance of 0.860. Finally, at a distance of 0.908, the five countries at the bottom of the display are added to form one large cluster.

Polar Dendrogram

Adding the POLAR option to JOIN yields a polar dendrogram.


Example 3 Hierarchical Clustering: Clustering Variables

This example joins columns (variables) instead of rows (cases) to see which variables cluster together. The input is:

CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
     lifeexpf gdp_cap b_to_d literacy,
     educ mil health / COLS

The resulting output is:

Distance metric is Euclidean distance
Single linkage method (nearest neighbor)

 Cluster        and Cluster     Were joined     No. of members
 containing     containing      at distance     in new cluster
------------    ------------    ------------    --------------
LIFEEXPF        LIFEEXPM          0.1444               2
HEALTH          GDP_CAP           0.2390               2
EDUC            HEALTH            0.2858               3
LIFEEXPF        LITERACY          0.3789               3
BABYMORT        BIRTH_RT          0.3859               2
EDUC            LIFEEXPF          0.4438               6
MIL             EDUC              0.4744               7
MIL             URBAN             0.5414               8
B_TO_D          BABYMORT          0.8320               3
B_TO_D          DEATH_RT          0.8396               4
MIL             B_TO_D            1.5377              12


The scale at the bottom of the tree for the distance (1 - r) ranges from 0.0 to 1.5. The smallest distance is 0.011—thus, the correlation of LIFEEXPM with LIFEEXPF is 0.989.

Example 4 Hierarchical Clustering: Clustering Variables and Cases

To produce a shaded display of the original data matrix in which rows and columns are permuted according to an algorithm in Gruvaeus and Wainer (1972), use the MATRIX option. Different shadings or colors represent the magnitude of each number in the matrix (Ling, 1973).

If you use the MATRIX option with Euclidean distance, be sure that the variables are on comparable scales because both rows and columns of the matrix are clustered. Joining a matrix containing inches of annual rainfall and annual growth of trees in feet, for example, would split columns more by scales than by covariation. In cases like this, you should standardize your data before joining.

The input is:

CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
     lifeexpf gdp_cap b_to_d literacy educ,
     mil health / MATRIX


The resulting output is:

[Permuted Data Matrix: countries (rows) and variables (columns) reordered by the joint clustering, with shading representing standardized values from -3 to 4]

This clustering reveals three groups of countries and two groups of variables. The countries with more urban dwellers and literate citizens, longest life-expectancies, highest gross domestic product, and most expenditures on health care, education, and the military are on the top left of the data matrix; countries with the highest rates of death, infant mortality, birth, and population growth (see B_TO_D) are on the lower right. You can also see that, consistent with the k-means and join examples, Iraq and Libya spend much more on military, education, and health than their immediate neighbors.



Example 5 Hierarchical Clustering: Distance Matrix Input

This example clusters a matrix of distances. The data, stored as a dissimilarity matrix in the CITIES data file, are airline distances in hundreds of miles between 10 global cities. The data are adapted from Hartigan (1975).

The input is:

CLUSTER
USE cities
JOIN berlin bombay capetown chicago london,
     montreal newyork paris sanfran seattle

Following is the output:

Single linkage method (nearest neighbor)

 Cluster        and Cluster     Were joined     No. of members
 containing     containing      at distance     in new cluster
------------    ------------    ------------    --------------
PARIS           LONDON            2.0000               2
NEWYORK         MONTREAL          3.0000               2
BERLIN          PARIS             5.0000               3
CHICAGO         NEWYORK           7.0000               3
SEATTLE         SANFRAN           7.0000               2
SEATTLE         CHICAGO          17.0000               5
BERLIN          SEATTLE          33.0000               8
BOMBAY          BERLIN           39.0000               9
BOMBAY          CAPETOWN         51.0000              10


The tree is printed in seriation order. Imagine a trip around the globe to these cities. SYSTAT has identified the shortest path between cities. The itinerary begins at San Francisco, leads to Seattle, Chicago, New York, and so on, and ends in Capetown.

Note that the CITIES data file contains the distances between the cities; SYSTAT did not have to compute those distances. When you save the file, be sure to save it as a dissimilarity matrix.

This example is used both to illustrate direct distance input and to give you an idea of the kind of information contained in the order of the SYSTAT cluster tree. For distance data, the seriation reveals shortest paths; for typical sample data, the seriation is more likely to replicate in new samples so that you can recognize cluster structure.

Example 6 Additive Trees

This example uses the ROTHKOPF data file. The input is:

CLUSTER
USE rothkopf
ADD a .. z

The output includes:

Similarities linearly transformed into distances.
77.0000 needed to make distances positive.
104.0000 added to satisfy triangle inequality.
Checking 14950 quadruples.
Checking 1001 quadruples.
Checking 330 quadruples.
Checking 70 quadruples.
Checking 1 quadruples.

Stress formula 1 =       0.0609
Stress formula 2 =       0.3985
r(monotonic) squared =   0.8412
r-squared (p.v.a.f.) =   0.7880

Node   Length    Child
  1    23.3958   A
  2    15.3958   B
  3    14.8125   C
  4    13.3125   D
  5    24.1250   E
  6    34.8370   F
  7    15.9167   G
  8    27.8750   H
  9    25.6042   I
 10    19.8333   J
 11    13.6875   K
 12    28.6196   L
 13    21.8125   M
 14    22.1875   N
 15    19.0833   O
 16    14.1667   P
 17    18.9583   Q
 18    21.4375   R
 19    28.0000   S
 20    23.8750   T
 21    23.0000   U
 22    27.1250   V
 23    21.5625   W
 24    14.6042   X
 25    17.1875   Y
 26    18.0417   Z
 27    16.9432   1, 9
 28    15.3804   2, 24
 29    15.7159   3, 25
 30    19.5833   4, 11
 31    26.0625   5, 20
 32    23.8426   7, 15
 33     6.1136   8, 22
 34    17.1750   10, 16
 35    18.8068   13, 14
 36    13.7841   17, 26
 37    15.6630   18, 23
 38     8.8864   19, 21
 39     4.5625   27, 35
 40     1.7000   29, 36
 41     8.7995   33, 38
 42     4.1797   39, 31
 43     1.1232   12, 28
 44     5.0491   34, 40
 45     2.4670   42, 41
 46     4.5849   30, 43
 47     2.6155   32, 44
 48     2.7303   6, 37
 49     0.0      45, 48
 50     3.8645   46, 47
 51     0.0      50, 49

(SYSTAT also displays the raw data, as well as the model distances.)

[Additive Tree: branch diagram for the 26 letters A through Z]


Computation

Algorithms

JOIN follows the standard hierarchical amalgamation method described in Hartigan (1975). The algorithm in Gruvaeus and Wainer (1972) is used to order the tree.

KMEANS follows the algorithm described in Hartigan (1975). Modifications from Hartigan and Wong (1979) improve speed. There is an important difference between SYSTAT’s KMEANS algorithm and that of Hartigan (or implementations of Hartigan’s in BMDP, SAS, and SPSS). In SYSTAT, seeds for new clusters are chosen by finding the case farthest from the centroid of its cluster. In Hartigan’s algorithm, seeds forming new clusters are chosen by splitting on the variable with largest variance.

Missing Data

In cluster analysis, all distances are computed with pairwise deletion of missing values. Since missing data are excluded from distance calculations by pairwise deletion, they do not directly influence clustering when you use the MATRIX option for JOIN. To use the MATRIX display to analyze patterns of missing data, create a new file in which missing values are recoded to 1, and all other values, to 0. Then use JOIN with MATRIX to see whether missing values cluster together in a systematic pattern.
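A sketch of the recoding step (plain Python here just to show the indicator idea; the clustering itself would still be done in SYSTAT with JOIN and the MATRIX option):

  def missing_indicator(data):
      # Build a 0/1 indicator matrix: 1 marks a missing value.
      return [[1 if v is None else 0 for v in row] for row in data]

  raw = [[1.2, None, 3.4],
         [None, 2.2, 3.1],
         [1.1, 2.0, None]]
  print(missing_indicator(raw))   # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]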

References

Campbell, D. T. and Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.

Fisher, L. and Van Ness, J. W. (1971). Admissible clustering procedures. Biometrika, 58, 91–104.

Gower, J. C. (1967). A comparison of some methods of cluster analysis. Biometrics, 23, 623–637.

Gruvaeus, G. and Wainer, H. (1972). Two additions to hierarchical cluster analysis. The British Journal of Mathematical and Statistical Psychology, 25, 200–206.

Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 9, 139–150.

Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28, 100–108.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241–254.


Ling, R. F. (1973). A computer generated aid for cluster analysis. Communications of the ACM, 16, 355–361.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. 5th Berkeley symposium on mathematics, statistics, and probability, Vol. 1, 281–298.

McQuitty, L. L. (1960). Hierarchical syndrome analysis. Educational and Psychological Measurement, 20, 293–303.

Milligan, G. W. (1980). An examination of the effects of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.

Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319–345.

Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409–1438.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. San Francisco: W. H. Freeman and Company.

Wainer, H. and Schacht, S. (1978). Gapping. Psychometrika, 43, 203–212.

Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.

Wilkinson, L. (1978). Permuting a matrix to a simple structure. Proceedings of the American Statistical Association.


Chapter 5

Conjoint Analysis

Leland Wilkinson

Conjoint analysis fits metric and nonmetric conjoint measurement models to observed data. It is designed to be a general additive model program using a simple optimization procedure. As such, conjoint analysis can handle measurement models not normally amenable to other specialized conjoint programs.

Statistical Background

Conjoint measurement (Luce and Tukey, 1964; Krantz, 1964; Luce, 1966; Tversky, 1967; Krantz and Tversky, 1971) is an axiomatic theory of measurement that defines the conditions under which there exist measurement scales for two or more variables that jointly define a common scale under an additive composition rule. This theory became the basis for a group of related numerical techniques for fitting additive models, called conjoint analysis (Green and Rao, 1971; Green, Carmone, and Wind, 1972; Green and DeSarbo, 1978; Green and Srinivasan, 1978, 1990; Louviere, 1988, 1994). For an interesting historical comment on Sir Ronald Fisher’s “appropriate scores” method for fitting additive models, see Heiser and Meulman (1995).

To see how conjoint analysis is based on additive models, we’ll first graph an additive table and then examine a multiplicative table to encounter one example of a non-additive table. Then we’ll consider the problem of computing margins of a general table based on an additive model.


Additive Tables

The following is an additive table. Notice that any cell (in roman) is the sum of the corresponding row and column marginal values (in italic).

A\B     1    2    3
  4     5    6    7
  3     4    5    6
  2     3    4    5
  1     2    3    4

A common way to represent a two-way table like this is with a graph. I made a file (PCONJ.SYD) containing all possible ordered pairs of the row and column indices. Then I formed Y values by adding the indices:

USE PCONJ
LET Y=A+B
LINE Y*A / GROUP=B, OVERLAY

The following graph of the additive table shows a plot of Y (the values in the cells) against A (rows) stratified by B (columns) in the legend. Notice that the lines are parallel.

Since we really have a three-dimensional graph (Y*A*B), it is sometimes convenient to represent a two-way table as a 3-D or contour plot rather than as a stratified line graph. Following is the input to do so:

PLOT Y*A*B / SMOO=QUAD, CONTOUR, XMIN=0, XMAX=4, YMIN=0, YMAX=5, INDENT


The following contour plot of the additive table shows the result. Notice that the lines in the contour plot are parallel for additive tables. Furthermore, although I used a quadratic smoother, the contours are linear because I used a simple linear combination of A and B to make Y.

[Contour plot: B on the horizontal axis (0 to 4), A on the vertical axis (0 to 5), with parallel linear contours for Y]

Multiplicative Tables

Following is a multiplicative table. Notice that any cell is the product of the corresponding marginal values. We commonly encounter these tables in cookbooks (for sizing recipes) or in, well, multiplication tables. These tables are one instance of two-way tables that are not additive.

A\B     1    2    3
  4     4    8   12
  3     3    6    9
  2     2    4    6
  1     1    2    3



Let's look at a graph of this multiplicative table:

LET Y=A*B
LINE Y*A / GROUP=B, OVERLAY

Notice that the lines are not parallel.

And the following figure shows the contour plot for the multiplicative model. Notice, again, that the contours are not parallel.

Multiplicative tables and graphs may be pleasing to look at, but they’re not simple. We all learned to add before multiplying. Scientists often simplify multiplicative functions by logging them, since logs of products are sums of logs. This is also one of the reasons we are told to be suspicious of fan-fold interactions (as in the line graph of the multiplicative table) in the analysis of variance. If we can log the variables and remove them (usually improving the residuals in the process), we should do so because it leaves us with a simple linear model.



Computing Table Margins Based on an Additive Model

If we believe in Occam’s razor and assume that additive tables are generally preferable to non-additive, we may want to fit additive models to a table of numbers before accepting a more complex model. So far, we have been assuming that the marginal indices are known. Testing for additivity is simply a matter of using these indices in a formal model. What if the marginal indices are not known? All we have is a table of numbers bordered by labeled categories. Can we find marginal values such that a linear model based on these values would reproduce the table?

This is exactly what conjoint analysis does. Conjoint analysis originated in an axiomatic approach to measurement (Luce and Tukey, 1964). An additive model underlies a basic axiom of "fundamental measurement"—scale values of separate measurements can be added to produce a joint measurement. This powerful property allows us to say that for all measurements a and b we have made on a set of objects, (a + b) > a and (a + b) > b, assuming that a and b are positive.

The following table is an example of such data. How do we find values for ai and bj such that yij = ai + bj? Luce and Tukey devised rules for computing these values, assuming that the cell values can be fit by the additive model.

        b1     b2     b3
a4     1.38   2.07   2.48
a3     1.10   1.79   2.20
a2     0.69   1.38   1.79
a1     0.00   0.69   1.10

The following figure shows a solution. The values for a are a1 = 0.00, a2 = 0.69, a3 = 1.10, and a4 = 1.38. The values for b are b1 = 0.00, b2 = 0.69, and b3 = 1.10.
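As a check, each cell is reproduced by adding its margins, for example

$$y_{42} = a_4 + b_2 = 1.38 + 0.69 = 2.07.$$

(These marginal values are natural logarithms: ln 2 is about 0.69, ln 3 about 1.10, and ln 4 about 1.38, so this is a logged multiplicative table, which is exactly why an additive model reproduces it.)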


Applied Conjoint Analysis

In the last few decades, conjoint analysis has become popular, especially among market researchers and some economists, for analyzing consumer preferences for goods based on multiple attributes. Green and Srinivasan (1978, 1990), Crowe (1980), and Louviere (1988) summarize this activity. The focus of most of these techniques has been on the development of products with attributes ideally suited to consumer preferences. Several trends in this area have been apparent.

First, psychometricians decided that the axiomatic approach was impractical for large data sets and for data in which the conjoint measurement axioms were violated or contained errors (for example, Emery and Barron, 1979). This trend was partly a consequence of the development of numerical methods that could fit conjoint models nonmetrically (Kruskal, 1965; Kruskal and Carmone, 1969; Srinivasan and Shocker, 1973; De Leeuw et al., 1976). Green and Srinivasan (1978) coined the term conjoint analysis for the application of these numerical methods.

Second, applied researchers began to substitute linear methods (usually least-squares linear regression or ANOVA) for nonmetric algorithms. The justification for this was usually practical—the results appeared to be similar for all of the fitting methods, so why not use the simple linear ones? Louviere (1988) articulates this position, partly based on results from Green and Srinivasan (1978) and partly from his own experience with real data sets. This argument is similar to one made by Weeks and Bentler (1979), in which multidimensional scalings using a linear distance function produced configurations almost indistinguishable from those using monotonic or moderately nonlinear distance functions. This is a rather ad hoc conclusion, however, and does not justify ignoring possible nonlinearities in the modeling process. We will look at such a case in the examples.

Third, recent conjoint analysis applied methodology has moved toward designing experiments rather than analyzing received ratings. Green and Srinivasan (1990) and Louviere (1991) have pioneered this approach. Response surfaces for fractional designs are analyzed to identify optimal combinations of product features. In SYSTAT, this approach amounts to using DESIGN for setting up an experimental design and then GLM for analyzing the results. With PRINT LONG, least-squares means are produced for factorial designs. Otherwise, response surfaces can be plotted.

Fourth, discrete choice logistic regression has recently emerged as a rival to conjoint analysis for modeling choice and preference behavior (Hensher and Johnson, 1981). Steinberg (1992) describes the advantages and limitations of this approach. The LOGIT procedure in SYSTAT offers this method.


Finally, a commercial industry supplying the practical tools for conjoint studies has produced a variety of software packages. Oppewal (1995) reviews some of these. In many cases, more efforts are devoted to “card decks” and other stimulus materials management than to the actual analysis of the models. CONJOINT in SYSTAT represents the opposite end of the spectrum from these approaches. CONJOINT presents methods for fitting these models that are inspired more by Luce and Tukey’s and Green and Rao’s original theoretical formulations than by the practical requirements of data collection. The primary goal of SYSTAT CONJOINT is to provide tools for scaling small- to moderate-sized data sets in which additive models can simplify the presentation of data. Metric and nonmetric loss functions are available for exploring the effects of nonlinearity on scaling. The examples highlight this distinction.

Conjoint Analysis in SYSTAT

Conjoint Analysis Main Dialog Box

To open the Conjoint Analysis dialog box, from the menus choose:

Statistics
  Classification
    Conjoint Analysis…


Conjoint analyses are computed by specifying and then estimating a model.

Dependent(s). Select the variable(s) you want to examine. The dependent variable(s) should be continuous numeric variables (for example, INCOME).

Independent(s). Select one or more continuous or categorical variables (grouping variables).

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 50.

Convergence. Enter the relative change in estimates—if all such changes are less than the specified value, convergence is assumed.

Polarity. Enter the polarity of the preferences when doing preference mapping. If the smaller number indicates the least and the higher number the most, select Positive. For example, a questionnaire may include the question “please rate a list of movies where one star is the worst and five stars is the best.” If the higher number indicates a lower ranking and the lower number indicates a higher ranking, select Negative. For example, a questionnaire may include the question “please rank your favorite sports team where 1 is the best and 10 is the worst.”

Loss. Specify a loss function to apply in model estimation:

• Stress. Conjoint analysis minimizes Kruskal’s STRESS.

• Tau. Conjoint analysis maximizes Kendall’s tau-b.

Regression. Specify the regression form:

• Monotonic. Regression function is monotonically increasing or decreasing. If LOSS=STRESS, this is Kruskal’s MONANOVA model.

• Linear. Regression function is ordinary linear regression.

• Log. Regression function is logarithmic.

• Power. Regression function is of the form y = ax^c. This is useful for Box-Cox models.

Save file. Saves parameter estimates into filename.SYD.



Using Commands

To request a conjoint analysis:

CONJOINT
MODEL depvarlist = indvarlist
ESTIMATE / ITERATIONS=n CONVERGENCE=d ,
           LOSS = STRESS or TAU ,
           REGRESSION = MONOTONIC or LINEAR or LOG or POWER ,
           POLARITY = POSITIVE or NEGATIVE

Usage Considerations

Types of data. CONJOINT uses rectangular data only.

Print options. The output is standard for all print options.

Quick Graphs. Quick Graphs produced by CONJOINT are utility functions for each predictor variable in the model.

Saving files. CONJOINT saves parameter estimates as one case into a file if you precede ESTIMATE with SAVE.

BY groups. CONJOINT analyzes data by groups. Your file need not be sorted on the BY variable(s).

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. FREQ=<variable> increases the number of cases by the FREQ variable.

Case weights. WEIGHT is not available in CONJOINT.



Examples

Example 1 Choice Data

The classical application of conjoint analysis is to product choice. The following example from Green and Rao (1971) shows how to fit a nonmetric conjoint model to some typical choice data. The input is:

CONJOINT
USE BRANDS
MODEL RESPONSE=DESIGN$..GUARANT$
ESTIMATE / POLARITY=NEGATIVE

Following is the output:

Iterative Conjoint Analysis

Monotonic Regression Model
Data are ranks
Loss function is Kruskal STRESS

Factors and Levels
DESIGN$    A        B      C
BRAND$     Bissell  Glory  K2R
PRICE      1.19     1.39   1.59
SEAL$      NO       YES
GUARANT$   NO       YES

Convergence Criterion: 0.000010
Maximum Iterations: 50


Iteration      Loss      Max parameter change
    1       0.5389079        0.2641755
    2       0.4476390        0.2711012
    3       0.3170808        0.2482502
    4       0.1746641        0.3290621
    5       0.1285278        0.1702260
    6       0.1050734        0.1906332
    7       0.0877708        0.1261961
    8       0.0591691        0.2336527
    9       0.0407008        0.1665511
   10       0.0166571        0.1448756
   11       0.0101404        0.1399945
   12       0.0058237        0.2048317
   13       0.0013594        0.1900774
   14       0.0006314        0.0345039
   15       0.0001157        0.0466520
   16       0.0000065        0.0192437
   17       0.0000000        0.0155169
   18       0.0000000        0.0032732
   19       0.0000000        0.0000032
   20       0.0000000        0.0000000

Parameter Estimates (Part Worths)

        A        B        C  Bissell    Glory      K2R
   -0.331    0.400    0.209   -0.122   -0.226   -0.195

 PRICE(1) PRICE(2) PRICE(3)       NO      YES       NO
    0.302    0.159   -0.429   -0.131   -0.102   -0.039

      YES
    0.504

Goodness of Fit (Kendall tau)
RESPONSE    1.000

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0

        A        B        C  Bissell    Glory      K2R
    0.856    0.699    0.935    0.922    0.843    0.856

 PRICE(1) PRICE(2) PRICE(3)       NO      YES       NO
    0.778    0.922    0.817    0.948    0.974    0.987

      YES
    0.791


[Quick Graphs: utility plots (Measure against level) for DESIGN$, BRAND$, PRICE, SEAL$, and GUARANT$, and a Shepard diagram of Joint Score against Data]


The fitting method chosen for this example is the default nonmetric loss using Kruskal’s STRESS statistic. This is the same method used in the MONANOVA program (Kruskal and Carmone, 1969). Although the minimization algorithm differs from that program, the result should be comparable.

The iterations converged to a perfect fit (LOSS = 0). That is, there exists a set of parameter estimates such that their sums fit the observed data perfectly when Kendall’s tau-b is used to measure fit. This rarely occurs with real data.

The parameter estimates are scaled to have zero sum and unit sum of squares. There is a single goodness-of-fit value for this example because there is one response.

The root-mean-square deleted goodness-of-fit values are the goodness of fit when each respective parameter is set to zero. This serves as an informal test of sensitivity. The lowest value for this example is for the B parameter, indicating that the estimate for B cannot be changed without substantially affecting the overall goodness of fit.

The Shepard diagram displays the goodness of fit in a scatterplot. The Data axis represents the observed data values. The Joint Score axis represents the values of the combined parameter estimates. For example, if we have parameters a1, a2, a3 and b1, b2, then every case measured on, say, a2 and b1 will be represented by a point in the plot whose ordinate (y value) is a2 + b1. This example involves only one condition per “card” or case, so that the Shepard diagram has no duplicate values on the y axis. Conjoint analysis can easily handle duplicate measurements either with multiple dependent variables (multiple subjects exposed to common stimuli) or with duplicate values for the same subject (replications).
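To make this construction concrete, here is a minimal sketch (Python with numpy and scipy, not SYSTAT code; the 3-by-2 design, the ranks, and the raw part worths are invented for illustration) of scaling part worths to zero sum and unit sum of squares, forming joint scores, computing the Kendall tau goodness of fit, and computing deleted goodness-of-fit values by zeroing one parameter at a time:

import numpy as np
from scipy.stats import kendalltau

# hypothetical 3 x 2 design: six "cards", one observed rank per card
cards = [(i, j) for i in range(3) for j in range(2)]
ranks = np.array([6.0, 3.0, 1.0, 5.0, 4.0, 2.0])

w = np.array([-0.50, 0.10, 0.40, -0.20, 0.20])   # raw part worths (A1..A3, B1..B2)
w = w - w.mean()                                 # zero sum
w = w / np.sqrt((w ** 2).sum())                  # unit sum of squares
a, b = w[:3], w[3:]

joint = np.array([a[i] + b[j] for i, j in cards])   # ordinate in the Shepard diagram
print(kendalltau(ranks, joint)[0])                  # goodness of fit (Kendall tau)

# deleted goodness of fit: zero one part worth at a time and recompute the fit
for k in range(len(w)):
    wk = w.copy(); wk[k] = 0.0
    jk = np.array([wk[i] + wk[3 + j] for i, j in cards])
    print(k, kendalltau(ranks, jk)[0])

With a single response, the deleted values printed at the end play the role of the root-mean-square deleted goodness-of-fit values in the output above.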

The fitted jagged line is the best-fitting monotonic regression of these fitted values on the observed data. For a similar diagram, see the Multidimensional Scaling chapter in SYSTAT 10 Statistics II, and note carefully the warnings there about “degenerate” solutions and other problems.
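The monotonic fit itself can be obtained with the classic pool-adjacent-violators algorithm. The following bare-bones sketch (Python; equal weights, nondecreasing fit) is an illustration of the technique, not SYSTAT’s implementation:

def pava(y):
    # least-squares nondecreasing fit to y: pool adjacent violators
    vals, counts = [], []
    for v in y:
        vals.append(float(v)); counts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            total = vals[-1] * counts[-1] + vals[-2] * counts[-2]
            cnt = counts[-1] + counts[-2]
            vals[-2:] = [total / cnt]
            counts[-2:] = [cnt]
    out = []
    for v, c in zip(vals, counts):
        out.extend([v] * c)
    return out

# joint scores sorted by the observed data give the jagged fitted line
print(pava([0.3, 0.1, 0.4, 0.2, 0.8]))   # [0.2, 0.2, 0.3, 0.3, 0.8]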

You may want to try this example with REGRESSION = LINEAR to see how the results compare. The linear fit yields an almost perfect Pearson correlation. This also means that GLM (MGLH) can produce nearly the same estimates:

GLM
MODEL RESPONSE = CONSTANT + DESIGN$..GUARANT$
CATEGORY DESIGN$..GUARANT$
PRINT LONG
ESTIMATE

The PRINT LONG statement causes GLM to print the least-squares estimates of the marginal means that, for an additive model, are the parameters we seek. The GLM parameter estimates will differ from the ones printed here only by a constant and scaling parameter. Conjoint analysis always scales parameter estimates to have zero


sum and unit sum of squares. This way, they can be thought of as utilities over the experimental domain—some negative, some positive.

Example 2 Word Frequency

The data set WORDS contains the most frequently used words in American English (Carroll et al., 1971). Three measures have been added to the data. The first is the (most likely) part of speech (PART$). The second is the number of letters (LETTERS) in the word. The third is a measure of the meaning (MEANING$). This admittedly informal measure represents the amount of harm done to comprehension (1 = a little, 4 = a lot) by omitting the word from a sentence. While linguists may argue over these classifications, they do reveal basic differences. Instead of using a measure of frequency, we will work with the rank order itself to see if there is enough information to fit a model. This time, we will maximize Kendall’s tau-b directly.

Following is the input:

USE WORDS
CONJOINT
LET RANK=CASE
MODEL RANK = LETTERS PART$ MEANING
ESTIMATE / LOSS=TAU,POLARITY=NEGATIVE

Following is the output:

Iterative Conjoint Analysis

Monotonic Regression Model
Data are ranks
Loss function is 1-(1+tau)/2

Factors and Levels
LETTERS   1          2       3            4
PART$     adjective  adverb  conjunction  preposition  pronoun  verb
MEANING   1          2       3

Convergence Criterion: 0.000010


Maximum Iterations: 50

Iteration      Loss      Max parameter change
    1       0.2042177        0.0955367
    2       0.1988071        0.0911670
    3       0.1897893        0.0708985
    4       0.1861822        0.0308284
    5       0.1843787        0.0259976
    6       0.1825751        0.0131758
    7       0.1825751        0.0000175
    8       0.1825751        0.0000000

Parameter Estimates (Part Worths)

 LETTERS(1)  LETTERS(2)  LETTERS(3)  LETTERS(4)   adjective      adverb
      0.154       0.174      -0.076      -0.270      -0.119      -0.273

conjunction preposition     pronoun        verb  MEANING(1)  MEANING(2)
     -0.262       0.215       0.173      -0.162       0.749      -0.121

 MEANING(3)
     -0.182

Goodness of Fit (Kendall tau)
RANK    0.635

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0

 LETTERS(1)  LETTERS(2)  LETTERS(3)  LETTERS(4)   adjective      adverb
      0.628       0.610       0.635       0.606       0.635       0.617

conjunction preposition     pronoun        verb  MEANING(1)  MEANING(2)
      0.602       0.613       0.610       0.631       0.494       0.617

 MEANING(3)
      0.610


The Shepard diagram reveals a slightly curvilinear relationship between the data and the fitted values. We can parameterize that relationship by refitting the model as follows:

ESTIMATE / REGRESSION=POWER,POLARITY=NEGATIVE

SYSTAT will then print Computed Exponent: 1.392. We will further examine this type of power function in the Box-Cox example.

[Quick Graphs: Shepard diagram of Joint Score against Data, and utility plots (Measure against level) for LETTERS, PART$, and MEANING]

The output tells us that, in general, shorter words are higher on the list, adverbs are lower, and prepositions are higher. Also, the most frequently occurring words are generally the most disposable. These statements must be made in the context of the model, however. To the extent that the separate statements are inaccurate when the data are examined separately for each, the additive model is violated. This is another


way of saying that the additive model is appropriate when there are no interactions or configural effects. Incidentally, when these data are analyzed with GLM using the (inverse transformed) word frequencies themselves rather than rank order in the list, the conclusions are substantially the same.

Example 3 Box-Cox Model

Box and Cox (1964) devised a maximum likelihood estimator for the exponent in the following model:

$$E\{y^{(\lambda)}\} = X\beta$$

where X is a matrix of known values, β is a vector of unknown parameters associated with the transformed observations, and the residuals of the model are assumed to be normally distributed and independent. The transformation itself is assumed to take the following form:

$$y^{(\lambda)} = \begin{cases} (y^{\lambda} - 1)/\lambda & \lambda \neq 0 \\ \log(y) & \lambda = 0 \end{cases}$$
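For readers who want to experiment outside SYSTAT, the transformation is easy to code. A minimal sketch (Python, numpy assumed; y must be positive):

import numpy as np

def boxcox_transform(y, lam):
    # Box-Cox family: (y**lam - 1)/lam for lam != 0, log(y) at lam = 0
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

For comparison, scipy.stats.boxcox returns the transformed values together with a maximum likelihood estimate of the exponent.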

Following is a SYSTAT program (originally coded by Grant Blank) to compute the Box-Cox exponent and its standard error. The comments document the program flow:

USE BOXCOX

REM First we need GLM to code dummy variables.
GLM
CATEGORY TREATMEN,POISON
MODEL Y=CONSTANT+TREATMEN+POISON
SAVE TEMP / MODEL
ESTIMATE

REM Now use STATS to compute geometric mean.
STATS
USE TEMP
SAVE GMEAN
LET LY=LOG(Y)
STATS LY / MEAN

REM Now duplicate the geometric mean for every case.
MERGE GMEAN(LY) TEMP (Y,X(1..5))
IF CASE=1 THEN LET GMEAN=EXP(LY)
IF CASE>1 THEN LET GMEAN=LAG(GMEAN)

REM Now estimate the exponent, following Box&Cox
NONLIN
MODEL Y = B0 + B1*X(1) + B2*X(2) + B3*X(3) + B4*X(4) + B5*X(5)
LOSS = ((Y^POWER-1)/(POWER*GMEAN^(POWER-1))-ESTIMATE)^2
ESTIMATE



This program produces an estimate of –0.750 for lambda, with a 95% Wald confidence interval of (-1.181, -0.319). This is in agreement with the results in the original paper. Box and Cox recommend rounding the exponent to –1 because of its natural interpretation (rate of dying from poison). In general, it is wise to round such transformations to interpretable values such as ... –1, –0.5, 0, 0.5, 2 ... to facilitate the interpretation of results.
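As a check on the reported interval: a 95% Wald interval is the estimate plus or minus 1.96 times its standard error, so the half-width (−0.319 − (−1.181)) / 2 ≈ 0.431 implies a standard error of about 0.431 / 1.96 ≈ 0.22, and both the optimum −0.750 and the rounded value −1 sit comfortably inside the interval.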

The Box-Cox procedure is based on a specific model that assumes normality in the transformed data and that focuses on the dependent variable. We might ask whether it is worthwhile to examine transformations of this sort without assuming normality and resorting to maximum likelihood for our answer. This is especially appropriate if our general method is to find an “optimal” estimate of the exponent and then round it to the nearest interpretable value based on a confidence interval. Indeed, two discussants of the Box and Cox paper, John Hartigan and John Tukey, asked just that.

The conjoint model offers one approach to this question. Specifically, we can use a power function relating the y data values to the predictor variables in our model and see how it converges.

Following is the input:


USE BOXCOX
CONJOINT
MODEL Y=POISON TREATMEN
ESTIMATE / REGRESS=POWER


Following is the output:

Iterative Conjoint Analysis

Power Regression Model
Data are dissimilarities
Loss function is least squares

Factors and Levels
POISON     1    2    3
TREATMEN   1    2    3    4

Convergence Criterion: 0.000010
Maximum Iterations: 50

Iteration      Loss      Max parameter change
    1       0.1977795        0.1024469
    2       0.1661894        0.0530742
    3       0.1594770        0.1473320
    4       0.1571216        0.0973117
    5       0.1562271        0.0156619
    6       0.1559910        0.0193429
    7       0.1559285        0.0149959
    8       0.1559166        0.0034746
    9       0.1559135        0.0024772
   10       0.1559131        0.0016637
   11       0.1559129        0.0005579
   12       0.1559134        0.0004575
   13       0.1559129        0.0000321
   14       0.1559130        0.0000188
   15       0.1559127        0.0000021

Computed Exponent: -1.015

Parameter Estimates (Part Worths)

 POISON(1) POISON(2) POISON(3) TREATMEN(1) TREATMEN(2) TREATMEN(3)
    -0.375    -0.138     0.634       0.423      -0.414       0.133

TREATMEN(4)
    -0.264

Goodness of Fit (Pearson correlation)
Y   -0.919

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0

 POISON(1) POISON(2) POISON(3) TREATMEN(1) TREATMEN(2) TREATMEN(3)
     0.872     0.912     0.785       0.866       0.868       0.914

TREATMEN(4)
     0.898


On each iteration, CONJOINT transforms the observed (y) values by the current estimate of the exponent, regresses them on the currently weighted X variables (using the conjoint parameter estimates), and computes the loss from the residuals of that regression. Over iterations, this loss is minimized and we get to view the final fit in the plotted Shepard diagram.
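A stripped-down sketch of a single loss evaluation in that scheme (Python; an ordinary least-squares regression on a dummy-coded design matrix X stands in for CONJOINT’s internal machinery, and standardizing the transformed values to keep losses comparable across exponents is our assumption, not a documented detail):

import numpy as np

def power_loss(y, X, lam):
    # transform y by the current exponent, regress on X, return residual SS
    yt = np.log(y) if lam == 0 else (y ** lam - 1.0) / lam
    yt = (yt - yt.mean()) / yt.std()              # scale-free comparison across lam
    beta, *_ = np.linalg.lstsq(X, yt, rcond=None)
    resid = yt - X @ beta
    return float((resid ** 2).sum())

An outer direct search over lam that minimizes this quantity mimics the iterations shown in the output.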

The CONJOINT program produced an estimate of –1.015 for the exponent. Draper and Hunter (1969) reanalyzed the poison data using several criteria suggested in the discussion to Box and Cox’s paper and elsewhere (minimizing interaction F ratio, maximizing main-effects F ratios, and minimizing Levene’s test for heterogeneity of within-group variances). They found the “best” exponent to be in the neighborhood of –1.

[Quick Graphs: Shepard diagram of Joint Score against Data, and utility plots (Measure against level) for POISON and TREATMENT]


Example 4 Employment Discrimination

The following table shows the mean salaries (SALNOW) of employees at a Chicago bank. These data are from the BANK.SYD data set used in many SYSTAT manuals. The bank was involved in a discrimination lawsuit, and the focus of our interest is whether we can represent the salaries by a simple additive model. At the time these data were collected, there were no black females with a graduate school education working at the bank. The education variable records the highest level reached.

                High School   College   Grad School
White Males          11735      16215         28251
Black Males          11513      13341         20472
White Females         9600      13612         11640
Black Females         8874      10278

Let’s regress beginning salary (SALBEG) and current salary (SALNOW) on the gender and education data. To represent our model, we will code the categories with integers: for gender/race, 1=black females, 2=white females, 3=black males, 4=white males; for education, 1=high school, 2=college, 3=grad school. These codings order the salaries for both racial/gender status and educational levels.

Following is the input:


USE BANK
IF SEX=1 AND MINORITY=1 THEN LET GROUP=1
IF SEX=1 AND MINORITY=0 THEN LET GROUP=2
IF SEX=0 AND MINORITY=1 THEN LET GROUP=3
IF SEX=0 AND MINORITY=0 THEN LET GROUP=4
LET EDUC=1
IF EDLEVEL>12 THEN LET EDUC=2
IF EDLEVEL>16 THEN LET EDUC=3
LABEL GROUP / 1="Black_Females",2="White_Females", 3="Black_Males",4="White_Males"
LABEL EDUC / 1="High_School",2="College",3="Grad_School"
CONJOINT
MODEL SALBEG,SALNOW=GROUP EDUC
ESTIMATE / REGRESS=POWER


Following is the output:

Iterative Conjoint Analysis

Power Regression Model
Data are dissimilarities
Loss function is least squares

Factors and Levels
GROUP   Black_Female  White_Female  Black_Males  White_Males
EDUC    High_School   College       Grad_School

Convergence Criterion: 0.000010
Maximum Iterations: 50

Iteration      Loss      Max parameter change
    1       0.3932757        0.0931128
    2       0.3734472        0.2973392
    3       0.3631769        0.2928259
    4       0.3606965        0.1416823
    5       0.3589525        0.0244544
    6       0.3585654        0.0090515
    7       0.3584647        0.0252027
    8       0.3584328        0.0068830
    9       0.3584239        0.0016764
   10       0.3584233        0.0047662
   11       0.3584215        0.0009750
   12       0.3584225        0.0001914
   13       0.3584253        0.0001697
   14       0.3584253        0.0000182
   15       0.3584231        0.0000021
   16       0.3584189        0.0000004

Computed Exponent: -0.072

Parameter Estimates (Part Worths)

  GROUP(1)  GROUP(2)  GROUP(3)  GROUP(4)   EDUC(1)   EDUC(2)
    -0.366    -0.200    -0.034     0.144    -0.356    -0.010

   EDUC(3)
     0.823

Goodness of Fit (Pearson correlation)
SALBEG   SALNOW
 0.815    0.787

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0

  GROUP(1)  GROUP(2)  GROUP(3)  GROUP(4)   EDUC(1)   EDUC(2)
     0.782     0.785     0.801     0.795     0.753     0.801

   EDUC(3)
     0.696


The computed exponent (–0.072) suggests that a log transformation would be appropriate for fitting a parametric model. The two salary measurements (salary at time of hire and at time of the study) perform similarly, although beginning salary shows a slightly better fit to the additive model (0.815 versus 0.787). You can see the difference in the two printed Shepard diagrams. The estimates of the parameters show clear orderings in the categories.

Check for sensitivity of the parameter estimates by examining the root-mean-square deleted goodness of fit values. The reported values are averages of the fits for both SALBEG and SALNOW when the respective parameter is set to zero. Here we find that the greatest change in goodness of fit corresponds to a change in the Grad School parameter.

[Quick Graphs: Shepard diagrams of Joint Score against Data for SALBEG and SALNOW, and utility plots (Measure against level) for GROUP and EDUC]


Transformed Additive Model

The transformed additive model removes the highly significant interaction for SALNOW and almost removes it for SALBEG in these data. You can see this by recoding the education and gender/race variables with the parameter estimates from the conjoint analysis:

IF GROUP=1 THEN LET G=-.365
IF GROUP=2 THEN LET G=-.2
IF GROUP=3 THEN LET G=-.033
IF GROUP=4 THEN LET G=.147
IF EDUC=1 THEN LET E=-.359
IF EDUC=2 THEN LET E=-.011
IF EDUC=3 THEN LET E=.822
LET LSALB=LOG(SALBEG)
LET LSALN=LOG(SALNOW)
GLM
MODEL LSALB,LSALN = CONSTANT+E+G+E*G
ESTIMATE
HYPOTHESIS
EFFECT=E*G
TEST

Following is the output:

Number of cases processed: 474

Dependent variable means

               LSALB     LSALN
               8.753     9.441

Regression coefficients B = (X'X)^-1 X'Y

               LSALB     LSALN
CONSTANT       8.829     9.531
E              0.576     0.653
G              0.723     0.722
E*G            0.558     0.351

Multiple correlations

               LSALB     LSALN
               0.817     0.789

Squared multiple correlations

               LSALB     LSALN
               0.667     0.622

Adjusted R^2 = 1-(1-R^2)*(N-1)/df, where N = 474, and df = 470

               LSALB     LSALN
               0.665     0.620


*** WARNING ***
Case 297 has large leverage   (Leverage = 0.128)

Test for effect called: E*G

Univariate F Tests

Effect         SS      df        MS         F         P
LSALB       0.275       1     0.275     6.596     0.011
Error      19.628     470     0.042
LSALN       0.109       1     0.109     1.818     0.178
Error      28.219     470     0.060

Multivariate Test Statistics

Wilks' Lambda =            0.986
F-Statistic =              3.447   df = 2, 469   Prob = 0.033

Pillai Trace =             0.014
F-Statistic =              3.447   df = 2, 469   Prob = 0.033

Hotelling-Lawley Trace =   0.015
F-Statistic =              3.447   df = 2, 469   Prob = 0.033

Ordered Scatterplots

Finally, let's use SYSTAT to produce scatterplots of beginning and current salary ordered by the conjoint coefficients. The SYSTAT code to do this can be found in the file CONJO4.SYC. The spacing of the scatterplots should tell the story.


The story is mainly in this graph: regardless of educational level, minorities and women received lower salaries. There are a few exceptions to the general pattern, but overall the bank had reason to settle the lawsuit.

Computation

All computations are in double precision.

Algorithms

CONJOINT uses a direct search optimization method to minimize the loss function. This enables minimization of Kendall’s tau. There is no guarantee that the program will find the global minimum of tau, so it is wise to try several regression types and the STRESS loss to be sure that they all reach approximately the same neighborhood.


Missing Data

Missing values are processed by omitting them from the loss function.

References

Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–252.

Brogden, H. E. (1977). The Rasch model, the law of comparative judgment and additive conjoint measurement. Psychometrika, 42, 631–634.

Carmone, F. J., Green, P. E., and Jain, A. K. (1978). Robustness of conjoint analysis: Some Monte Carlo results. Journal of Marketing Research, 15, 300–303.

Carroll, J. B., Davies, P., and Richmond, B. (1971). The word frequency book. Boston, Mass.: Houghton Mifflin.

Carroll, J. D. and Green, P. E. (1995). Psychometric methods in marketing research: Part I, conjoint analysis. Journal of Marketing Research, 32, 385–391.

Crowe, G. (1980). Conjoint measurements design considerations. PMRS Journal, 1, 8–13.

De Leeuw, J., Young, F. W., and Takane, Y. (1976). Additive structure in qualitative data: An alternating least squares method with optimal scaling features. Psychometrika, 41, 471–503.

Draper, N. R. and Hunter, W. G. (1969). Transformations: Some examples revisited. Technometrics, 11, 23–40.

Emery, D. R. and Barron, F. H. (1979). Axiomatic and numerical conjoint measurement: An evaluation of diagnostic efficacy. Psychometrika, 44, 195–210.

Green, P. E., Carmone, F. J., and Wind, Y. (1972). Subjective evaluation models and conjoint measurement. Behavioral Science, 17, 288–299.

Green, P. E. and DeSarbo, W. S. (1978). Additive decomposition of perceptions data via conjoint analysis. Journal of Consumer Research, 5, 58–65.

Green, P. E. and Rao, V. R. (1971). Conjoint measurement for quantifying judgmental data. Journal of Marketing Research, 8, 355–363.

Green, P. E. and Srinivasan, V. (1978). Conjoint analysis in consumer research: Issues and outlook. Journal of Consumer Research, 5, 103–123.

Green, P. E. and Srinivasan, V. (1990). Conjoint analysis in marketing: New developments with implications for research and practice. Journal of Marketing, 54, 3–19.

Heiser, W. J. and Meulman, J. J. (1995). Nonlinear methods for the analysis of homogeneity and heterogeneity. In W. J. Krzanowski (ed.), Recent advances in descriptive multivariate analysis, 51–89. Oxford: Clarendon Press.


Hensher, D. A. and Johnson, L. W. (1981). Applied discrete choice modeling. London: Croom Helm.

Krantz, D. H. (1964). Conjoint measurement: The Luce-Tukey axiomatization and some extensions. Journal of Mathematical Psychology, 1, 248–277.

Krantz, D. H. and Tversky, A. (1971). Conjoint measurement analysis of composition rules in psychology. Psychological Review, 78, 151–169.

Kruskal, J. B. (1965). Analysis of factorial experiments by estimating monotone transformations of the data. Journal of the Royal Statistical Society, Series B, 27, 251–263.

Kruskal, J. B. and Carmone, F. J. (1969). MONANOVA: A Fortran-IV program for monotone analysis of variance (non-metric analysis of factorial experiments). Behavioral Science, 14, 165–166.

Louviere, J. J. (1988). Analyzing decision making: Metric conjoint analysis. Newbury Park, Calif.: Sage Publications.

Louviere, J. J. (1991). Experimental choice analysis: Introduction and review. Journal of Business Research. 23, 291–297.

Louviere, J. J. (1994). Conjoint analysis. In R. Bagozzi (ed.), Handbook of Marketing Research, 223–259. Oxford: Blackwell Publishers.

Luce, R. D. (1966). Two extensions of conjoint measurement. Journal of Mathematical Psychology, 3, 348–370.

Luce, R. D. and Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1–27.

Nygren, T. E. (1986). A two-stage algorithm for assessing violations of additivity via axiomatic and numerical conjoint analysis. Psychometrika, 51, 483–491.

Oppewal, H. (1995). A review of conjoint software. Journal of Retailing and Consumer Services, 2, 55–61.

Srinivasan, V. and Shocker, A. D. (1973). Linear programming techniques for multidimensional analysis of preference. Psychometrika, 38, 337–369.

Steinberg, D. (1992). Applications of logit models in market research. 1992 Sawtooth-SYSTAT Software Conference Proceedings, 405–424. Ketchum, Idaho: Sawtooth Software, Inc.

Tversky, A. (1967). A general theory of polynomial conjoint measurement. Journal of Mathematical Psychology, 4, 1–20.

Umesh, U. N. and Mishra, S. (1990). A Monte Carlo investigation of conjoint analysis index-of-fit: Goodness of fit, significance and power. Psychometrika, 55, 33–44.

Weeks, D. G. and Bentler, P. M. (1979). A comparison of linear and monotone multidimensional scaling models. Psychological Bulletin, 86, 349–354.


Chapter 6

Correlations, Similarities, and Distance Measures

Leland Wilkinson, Laszlo Engelman, and Rick Marcantonio

Correlations computes correlations and measures of similarity and distance. It prints the resulting matrix and, if requested, saves it in a SYSTAT file for further analysis, such as multidimensional scaling, cluster, or factor analysis.

For continuous data, Correlations provides the Pearson correlation, covariances, and sums of squares of deviations from the mean and sums of cross-products of deviations (SSCP). In addition to the usual probabilities, the Bonferroni and Dunn-Sidak adjustments are available with Pearson correlations. If distances are desired, Euclidean or city-block distances are available. Similarity measures for continuous data include the Bray-Curtis coefficient and the QSK quantitative symmetric coefficient (or Kulczynski measure).

For rank-order data, Correlations provides Goodman-Kruskal’s gamma, Guttman’s mu2, Spearman’s rho, and Kendall’s tau.

For binary data, Correlations provides S2, the positive matching dichotomy coefficient; S3, Jaccard’s dichotomy coefficient; S4, the simple matching dichotomy coefficient; S5, Anderberg’s dichotomy coefficient; and S6, Tanimoto’s dichotomy coefficient. When underlying distributions are assumed to be normal, the tetrachoric correlation is available.

When data are missing, listwise and pairwise deletion methods are available for all measures. An EM algorithm is an option for maximum likelihood estimates of correlation, covariance, and cross-products of deviations matrices. For robust ML estimates where outliers are downweighted, the user can specify the degrees of freedom for the t distribution or contamination for a normal distribution. Correlations includes a graphical display of the pattern of missing values. Little’s MCAR test is printed with the display. The EM algorithm also identifies cases with extreme Mahalanobis distances.


Hadi’s robust outlier detection and estimation procedure is an option for correlations, covariances, and SSCP; cases identified as outliers by the procedure are not used to compute estimates.

Statistical Background

SYSTAT computes many different measures of the strength of association between variables. The most popular measure is the Pearson correlation, which is appropriate for describing linear relationships between continuous variables. However, CORR offers a variety of alternative measures of similarity and distance appropriate if the data are not continuous.

Let’s look at an example. The following data, from the CARS file, are taken from various issues of Car and Driver and Road & Track magazine. They are the car enthusiasts’ equivalent of Consumer Reports performance ratings. The cars rated include some of the most expensive and exotic cars in the world (for example, Ferrari Testarossa) as well as some of the least expensive but sporty cars (for example, Honda Civic CRX). The attributes measured are 0–60 m.p.h. acceleration, braking distance in feet from 60–0 m.p.h., slalom times (speed over a twisty course), miles per gallon, and top speed in miles per hour.

ACCEL   BRAKE   SLALOM   MPG    SPEED   NAME$

 5.0     245     61.3    17.0    153    Porsche 911T
 5.3     242     61.9    12.0    181    Testarossa
 5.8     243     62.6    19.0    154    Corvette
 7.0     267     57.8    14.5    145    Mercedes 560
 7.6     271     59.8    21.0    124    Saab 9000
 7.9     259     61.7    19.0    130    Toyota Supra
 8.5     263     59.9    17.5    131    BMW 635
 8.7     287     64.2    35.0    115    Civic CRX
 9.3     258     64.1    24.5    129    Acura Legend
10.8     287     60.8    25.0    100    VW Fox GL
13.0     253     62.3    27.0     95    Chevy Nova


The Scatterplot Matrix (SPLOM)

A convenient summary that shows the relationships between the performance variables is to arrange them in a matrix. A matrix is a rectangular array. We can put any sort of numbers in the cells of the matrix, but we will focus on measures of association. Before doing that, however, let’s examine a graphical matrix, the scatterplot matrix (SPLOM).

[Figure: SPLOM of ACCEL, BRAKE, SLALOM, MPG, and SPEED, with histograms on the diagonal]

This matrix shows the histograms of each variable on the diagonal and the scatterplots (x-y plots) of each variable against the others. For example, the scatterplot of acceleration versus braking is at the top of the matrix. Since the matrix is symmetric, only the bottom half is shown. In other words, the plot of acceleration versus braking is the same as the transposed scatterplot of braking versus acceleration.

The Pearson Correlation Coefficient

Now, assume that we want a single number that summarizes how well we could predict acceleration from braking using a straight line. For linear regression, we discuss how we calculate such a line, but it is enough here to know that we are interested in drawing a line through the area covered by the points in the scatterplot such that, on average, the acceleration of a car could be predicted rather well by the value on the line corresponding to its braking. The closer the points cluster around this line, the better would be the prediction.



In addition, we want this number to represent simultaneously how well we can predict braking from acceleration using a similar line. This symmetry we seek is fundamental to all the measures available in CORR. It means that, whatever the scales on which we measure our variables, the coefficient of association we compute will be the same for either prediction. If this symmetry makes no sense for a certain data set, then you probably should not be using CORR.

The most common measure of association is the Pearson correlation coefficient, which varies between –1 and +1. A Pearson correlation of 0 indicates that neither of two variables can be predicted from the other by using a linear equation. A Pearson correlation of 1 indicates that one variable can be predicted perfectly by a positive linear function of the other, and vice versa. And a value of –1 indicates the same, except that the function has a negative sign for the slope of the line.

Following is the Pearson correlation matrix corresponding to this SPLOM:

Pearson Correlation Matrix

             ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL        1.000
BRAKE        0.466    1.000
SLALOM       0.176   -0.097    1.000
MPG          0.651    0.622    0.597    1.000
SPEED       -0.908   -0.665   -0.115   -0.768    1.000

Number of Observations: 11

Try superimposing in your mind the correlation matrix on the SPLOM. The Pearson correlation for acceleration versus braking is 0.466. This correlation is positive and moderate in size. On the other hand, the correlation between acceleration and speed is negative and quite large (–0.908). You can see in the lower left corner of the SPLOM that the points cluster around a downward sloping line. In fact, all of the correlations of speed with the other variables are negative, which makes sense since greater speed implies greater performance. The same is true for slalom performance, but this is clouded by the fact that some small but slower cars like the Honda Civic CRX are extremely agile.

Keep in mind that the Pearson correlation measures linear predictability. Do not assume that a Pearson correlation near 0 implies no relationship between variables. Many nonlinear associations (U- and S-shaped curves, for example) can have Pearson correlations of 0.
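A two-line demonstration of that warning (a Python sketch, not SYSTAT code): a perfect U-shaped dependence has a Pearson correlation of essentially zero:

import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2                        # y is perfectly determined by x, but not linearly
print(np.corrcoef(x, y)[0, 1])    # ~0: no linear predictability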



Other Measures of Association

CORR offers a variety of other association measures. There is not room here to discuss all of them, but let’s review some briefly.

Measures for Rank-Order Data

Several measures are available for rank-order data: Goodman-Kruskal’s gamma, Guttman’s mu2, Spearman’s rho, and Kendall’s tau. Each measures an aspect of rank-order association. The one closest to Pearson is the Spearman. Spearman’s rho is simply a Pearson correlation computed on the same data after converting them to ranks. Goodman-Kruskal’s gamma and Kendall’s tau reflect the tendency for two cases to have similar orderings on two variables. However, the former focuses on cases which are not tied in rank orderings. If no ties exist, these two measures will be equal.
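A small sketch of those relationships (Python with scipy; the data are the first few cars from the table above):

import numpy as np
from scipy.stats import rankdata, pearsonr, spearmanr, kendalltau

accel = np.array([5.0, 5.3, 5.8, 7.0, 7.6, 7.9])
brake = np.array([245., 242., 243., 267., 271., 259.])

# Spearman's rho is just Pearson computed on ranks
print(spearmanr(accel, brake)[0])
print(pearsonr(rankdata(accel), rankdata(brake))[0])   # identical value

# Kendall's tau-b; with no ties, Goodman-Kruskal's gamma equals this value
print(kendalltau(accel, brake)[0])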

Following is the same matrix computed for Spearman's rho:

Matrix of Spearman Correlation Coefficients

             ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL        1.000
BRAKE        0.501    1.000
SLALOM       0.245   -0.305    1.000
MPG          0.815    0.502    0.487    1.000
SPEED       -0.891   -0.651   -0.109   -0.884    1.000

Number of observations: 11

It is often useful to compute both a Spearman and Pearson matrix on the same data. The absolute difference between the two can reveal unusual features. For example, the greatest difference for our data is on the slalom-braking correlation. This is because the Honda Civic CRX is so fast through the slalom, despite its inferior brakes, that it attenuates the Pearson correlation between slalom and braking. The Spearman correlation reduces its influence.

Dissimilarity and Distance Measures

These measures include the Bray-Curtis (BC) dissimilarity measure, the quantitative symmetric dissimilarity coefficient, the Euclidean distance, and the city-block distance.



Euclidean and city-block distance measures have been widely available in software packages for many years; Bray-Curtis and QSK are less common. For each pair of variables,

$$BC_{ij} = \frac{\sum_k \left| x_{ik} - x_{jk} \right|}{\sum_k x_{ik} + \sum_k x_{jk}}$$

$$QSK_{ij} = 1 - \frac{1}{2} \sum_k \min(x_{ik}, x_{jk}) \left( \frac{1}{\sum_k x_{ik}} + \frac{1}{\sum_k x_{jk}} \right)$$

where i and j are variables and k indexes cases. After an extensive computer simulation study, Faith, Minchin, and Belbin (1987) concluded that BC and QSK were “effective as robust measures” in terms of both rank and linear correlation. The use of these measures is similar to that for Correlations (Pearson, Covariance, and SSCP), except the EM, Prob, Bonferroni, Dunn-Sidak, and Hadi options are not available.
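Translated directly into code, the two measures look like this (a Python sketch under the assumption of nonnegative data; not SYSTAT's implementation):

import numpy as np

def bray_curtis(xi, xj):
    # sum of absolute differences over the sum of both totals
    return np.abs(xi - xj).sum() / (xi.sum() + xj.sum())

def qsk(xi, xj):
    # 1 - (1/2) * sum(min) * (1/sum(xi) + 1/sum(xj))
    m = np.minimum(xi, xj).sum()
    return 1.0 - 0.5 * (m / xi.sum() + m / xj.sum())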

Measures for Binary Data

Correlations offers the following association measures for binary data: positive matching dichotomy coefficients (S2), Jaccard’s dichotomy coefficients (S3), simple matching dichotomy coefficients (S4), Anderberg’s dichotomy coefficients (S5), Tanimoto’s dichotomy coefficients (S6), and tetrachoric correlations.

Dichotomy coefficients. These coefficients relate variables whose values may represent the presence or absence of an attribute or simply two values. They are documented in Gower (1985). These coefficients were chosen for SYSTAT because they are metric and produce symmetric positive semidefinite (Gramian) matrices, provided that you do not use the pairwise deletion option. This makes them suitable for multidimensional scaling and factoring as well as clustering. The counts a, b, c, and d come from the 2 x 2 table of a pair of binary variables:

                 xj
               1     0
    xi   1     a     b     a+b
         0     c     d     c+d
              a+c   b+d

The similarity coefficients are computed as follows:

S2 = a / (a+b+c+d)           Proportion of pairs with both values present
S3 = a / (a+b+c)             Proportion of pairs with both values present given that at least one occurs
S4 = (a+d) / (a+b+c+d)       Proportion of pairs where the values of both variables agree
S5 = a / (a+2(b+c))          S3 standardized by all possible patterns of agreement and disagreement
S6 = (a+d) / (a+2(b+c)+d)    S4 standardized by all possible patterns of agreement and disagreement



When the absence of an attribute in both variables is deemed to convey no information, d should not be included in the coefficient (see S3 and S5).
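A sketch of the counts and coefficients (Python; the function name is ours, not a SYSTAT call):

import numpy as np

def dichotomy_coefficients(x, y):
    # S2-S6 from two 0/1 vectors, using the a, b, c, d counts defined above
    x, y = np.asarray(x), np.asarray(y)
    a = int(((x == 1) & (y == 1)).sum())
    b = int(((x == 1) & (y == 0)).sum())
    c = int(((x == 0) & (y == 1)).sum())
    d = int(((x == 0) & (y == 0)).sum())
    return {"S2": a / (a + b + c + d),
            "S3": a / (a + b + c),
            "S4": (a + d) / (a + b + c + d),
            "S5": a / (a + 2 * (b + c)),
            "S6": (a + d) / (a + 2 * (b + c) + d)}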

Tetrachoric correlation. While the data for this measure are binary, they are assumed to be a random sample from a bivariate normal distribution. For example, let’s draw a horizontal line and a vertical line on this bivariate normal distribution and count the number of observations in each quadrant.

[Figure: scatterplot of a bivariate normal sample divided into quadrants at x0 and y0; the quadrant counts are 5 (upper left), 19 (upper right), 17 (lower left), and 4 (lower right)]


A large proportion of the observations fall in the upper right and lower left quadrants because the relationship is positive (the Pearson correlation is approximately 0.70). Correspondingly, if there were a strong negative relationship, the points would concentrate in the upper left and lower right quadrants. If the original observations are no longer available but you do have the frequency counts for the four quadrants, try a tetrachoric correlation.

The computations for the tetrachoric correlation begin by finding estimates of the inverse cumulative marginal distributions:

z value for x0 = Φ⁻¹((17 + 5)/45)   and   z value for y0 = Φ⁻¹((17 + 4)/45)

and using these values as limits when integrating the bivariate normal density expressed in terms of ρ, the correlation, and then solving for ρ.
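A sketch of that computation for the quadrant counts shown in the figure (Python; scipy's bivariate normal CDF, available in scipy 1.0 and later, and a root finder stand in for SYSTAT's internal algorithm):

import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

n_ul, n_ur, n_ll, n_lr = 5, 19, 17, 4        # quadrant counts from the figure
n = n_ul + n_ur + n_ll + n_lr                # 45 observations in all

zx = norm.ppf((n_ul + n_ll) / n)             # threshold on x from its margin
zy = norm.ppf((n_ll + n_lr) / n)             # threshold on y from its margin
target = n_ll / n                            # observed P(X < x0 and Y < y0)

def lower_left(rho):
    # P(X < zx, Y < zy) for a standard bivariate normal with correlation rho
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return bvn.cdf([zx, zy])

rho_hat = brentq(lambda r: lower_left(r) - target, -0.99, 0.99)
print(rho_hat)                               # should land near the 0.70 cited above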

If you have the original data, don’t bother dichotomizing them because the tetrachoric correlation has an efficiency of 0.40 compared with the efficient Pearson correlation estimate.

Transposed Data

You can use CORR to compute measures of association on the rows or columns of your data. Simply transpose the data and then use CORR. This makes sense when you want to assess similarity between rows. We might be interested in identifying similar cars from our performance measures, for example. Recall that you cannot transpose a file that contains character data.

When you compute association measures across rows, however, be sure that the variables are on comparable scales. Otherwise, a single variable will influence most of the association. With the cars data, braking and speed are so large that they would almost uniquely determine the similarity between cars. Consequently, we standardized the data before transposing them. That way, the correlations measure the similarities comparably across attributes.
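A sketch of that preparation step (Python; random data stand in for the car attributes):

import numpy as np

X = np.random.default_rng(1).normal(size=(11, 5))    # cases x attributes
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # put attributes on one scale
car_by_car = np.corrcoef(Z)                          # correlations between rows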



Following is the Pearson correlation matrix for our cars:

Pearson Correlation Matrix

            PORSCHE  FERRARI CORVETTE MERCEDES     SAAB
PORSCHE       1.000
FERRARI       0.940    1.000
CORVETTE      0.939    0.868    1.000
MERCEDES      0.093    0.212   -0.240    1.000
SAAB         -0.506   -0.523   -0.760    0.664    1.000
TOYOTA        0.238    0.429    0.402   -0.379   -0.680
BMW          -0.319   -0.095   -0.557    0.854    0.634
HONDA        -0.504   -0.730   -0.393   -0.519    0.265
ACURA        -0.046   -0.102    0.298   -0.978   -0.770
VW           -0.962   -0.928   -0.980    0.079    0.704
CHEVY        -0.731   -0.698   -0.491   -0.532   -0.131

             TOYOTA      BMW    HONDA    ACURA       VW
TOYOTA        1.000
BMW          -0.247    1.000
HONDA        -0.298   -0.500    1.000
ACURA         0.533   -0.788    0.349    1.000
VW           -0.353    0.391    0.552   -0.156    1.000
CHEVY        -0.034   -0.064    0.320    0.536    0.525

              CHEVY
CHEVY         1.000

Number of observations: 5

Hadi Robust Outlier Detection

Hadi robust outlier detection identifies specific cases as outliers (if there are any) and then uses the acceptable cases to compute the requested measure in the usual way. Following are the steps for this procedure:

• Compute a “robust” covariance matrix by finding the median (instead of the mean) for each variable and using Σ(xi − median)² in the calculation of each covariance. If the resulting matrix is singular, reconstruct another after inflating the smallest eigenvalues by a small amount.

• Use this robust estimate of the covariance matrix to compute Mahalanobis distances and then use the distance to rank the cases.

• Use the half of the sample with the lowest ranks to compute the usual covariance matrix (that is, deviations from the mean).

• Use this covariance matrix to compute new distances for the complete sample and rerank the cases.


• After ranking, select the same number of cases with small ranks as before but add the case with the next largest rank and repeat the process, each time updating the covariance matrix, computing and sorting new distances, and increasing the subsample size by one.

• Continue adding cases until the entering one exceeds an internal limit based on a chi-square statistic (see Hadi, 1994). The cases remaining (not entered) are identified as outliers.

• Use the cases that are not identified as outliers to compute the measure requested in the usual way, as sketched below.
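The following is a much-simplified sketch of this stepwise idea (Python; the starting subset size, the pseudo-inverse, and the chi-square entry limit shown are assumptions standing in for the details in Hadi, 1994, so treat it as an illustration only):

import numpy as np
from scipy.stats import chi2

def mahal_sq(X, center, cov):
    # squared Mahalanobis distance of every row of X from center
    diff = X - center
    return np.einsum('ij,jk,ik->i', diff, np.linalg.pinv(cov), diff)

def hadi_outliers(X, alpha=0.05):
    n, p = X.shape
    med = np.median(X, axis=0)
    dev = X - med
    S_robust = dev.T @ dev / (n - 1)           # covariance about the medians
    order = np.argsort(mahal_sq(X, med, S_robust))
    subset = order[: n // 2]                   # best half by the robust distances
    cutoff = chi2.ppf(1 - alpha / n, p)        # assumed stand-in for the internal limit
    while subset.size < n:
        center = X[subset].mean(axis=0)
        S = np.cov(X[subset], rowvar=False)
        d2 = mahal_sq(X, center, S)
        order = np.argsort(d2)
        entering = order[subset.size]          # case with the next largest rank
        if d2[entering] > cutoff:
            break                              # cases not yet entered are outliers
        subset = order[: subset.size + 1]
    return np.setdiff1d(np.arange(n), subset)  # indices flagged as outliers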

Correlations in SYSTAT

Correlations Main Dialog Box

To open the Correlations dialog box, from the menus choose:

Statistics
  Correlations
    Simple…

Variables. Available only if One is selected for Sets. All selected variables are correlated with all other variables in the list, producing a triangular correlation matrix.


Rows. Available only if Two is selected for Sets. Selected variables are correlated with all column variables, producing a rectangular matrix.

Columns. Available only if Two is selected for Sets. Selected variables are correlated with all row variables, producing a rectangular matrix.

Sets. One set creates a single, triangular correlation matrix of all variables in the Variable(s) list. Two sets creates a rectangular matrix of variables in the Row(s) list correlated with variables in the Column(s) list.

Listwise. Listwise deletion of missing data. Any case with missing data for any variable in the list is excluded.

Pairwise. Pairwise deletion of missing data. Only cases with missing data for one of the variables in the pair being correlated are excluded.

Save file. Saves the correlation matrix to a file.

Types. Type of data or measure. You can select from a variety of distance measures, as well as measures for continuous data, rank-order data, and binary data.

Measures for Continuous Data

The following measures are available for continuous data:

• Pearson. Produces a matrix of Pearson product-moment correlation coefficients. Pearson correlations vary between –1 and +1. A value of 0 indicates that neither of two variables can be predicted from the other by using a linear equation. A Pearson correlation of 1 or –1 indicates that one variable can be predicted perfectly by a linear function of the other.

• Covariance. Produces a covariance matrix.

• SSCP. Produces a sum of cross-products matrix. If the Pairwise option is chosen, sums are weighted by N/n, where n is the count for a pair.

The Pearson, Covariance, and SSCP measures are related. The entries in an SSCP matrix are sums of squares of deviations (from the mean) and sums of cross-products of deviations. If you divide each entry by (n − 1), variances result from the sums of squares and covariances from the sums of cross-products. Divide each covariance by the product of the standard deviations (of the two variables) and the result is a correlation.
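A sketch of that chain of relationships (Python, using a few rows of the cars data):

import numpy as np

X = np.array([[5.0, 245., 61.3],
              [5.3, 242., 61.9],
              [5.8, 243., 62.6],
              [7.0, 267., 57.8]])      # ACCEL, BRAKE, SLALOM for four cars
n = X.shape[0]
dev = X - X.mean(axis=0)
sscp = dev.T @ dev                     # sums of squares and cross-products
cov = sscp / (n - 1)                   # divide each entry by (n - 1)
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)          # divide by products of standard deviations
assert np.allclose(corr, np.corrcoef(X, rowvar=False))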


Distance and Dissimilarity Measures

Correlations offers two dissimilarity measures and two distance measures:

• Bray-Curtis. Produces a matrix of dissimilarity measures for continuous data.

• QSK. Produces a matrix of symmetric dissimilarity coefficients. Also called the Kulczynski measure.

• Euclidean. Produces a matrix of Euclidean distances normalized by the sample size.

• City. Produces a matrix of “city-block,” or first-power, distances (sum of absolute discrepancies) normalized by the sample size.

Measures for Rank-Order Data

If your data are simply ranks of attributes, or if you want to see how well variables are associated when you pay attention to rank ordering, you should consider the following measures available for ranked data:

• Spearman. Produces a matrix of Spearman rank-order correlation coefficients. This measure is a nonparametric version of the Pearson correlation coefficient, based on the ranks of the data rather than the actual values.

• Gamma. Produces a matrix of Goodman-Kruskal’s gamma coefficients.

• MU2. Produces a matrix of Guttman’s mu2 monotonicity coefficients.

• Tau. Produces a matrix of Kendall’s tau-b rank-order coefficients.

Measures for Binary Data

These coefficients relate variables assuming only two values. The dichotomy coefficients work only for dichotomous data scored as 0 or 1.

The following measures are available for binary data:

• Positive matching (S2). Produces a matrix of positive matching dichotomy coefficients.

• Jaccard (S3). Produces a matrix of Jaccard’s dichotomy coefficients.

• Simple matching (S4). Produces a matrix of simple matching dichotomy coefficients.

• Anderberg (S5). Produces a matrix of Anderberg’s dichotomy coefficients.


• Tanimoto (S6). Produces a matrix of Tanimoto’s dichotomy coefficients.

• Tetra. Produces a matrix of tetrachoric correlations.

Correlations Options

To specify options for correlations, click Options in the Correlations dialog box.

The following options are available:

Probabilities. Requests the probability for each correlation coefficient under the null hypothesis that the correlation is 0; this is most appropriate when testing a single coefficient. Bonferroni and Dunn-Sidak use adjusted probabilities for multiple tests. Available only for Pearson product-moment correlations.
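The two adjustments are simple functions of a raw probability p and the number of tests m; a sketch under the standard definitions (Python):

def bonferroni(p, m):
    # Bonferroni-adjusted probability, capped at 1
    return min(1.0, m * p)

def dunn_sidak(p, m):
    # Dunn-Sidak adjustment, slightly less conservative than Bonferroni
    return 1.0 - (1.0 - p) ** m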

(EM) Estimation. Requests the EM algorithm to estimate Pearson correlation, covariance, or SSCP matrices from data with missing values. Little’s MCAR test is displayed with a graphical display of the pattern of missing values. For robust estimates where outliers are downweighted, select Normal or t.

• Normal produces maximum likelihood estimates for a contaminated multivariate normal sample. For the contaminated normal, SYSTAT assumes that the distribution is a mixture of two normal distributions (same mean, different variances) with a specified probability of contamination. The Probability value is the probability of contamination (for example, 0.10), and Variance is the variance


of contamination. Downweighting for the normal model tends to be concentrated in a few outlying cases.

• t produces maximum likelihood estimates for a t distribution, where df is the degrees of freedom. Downweighting for the multivariate t model tends to be more spread out than for the normal model. The degree of downweighting is inversely related to the degrees of freedom.

Iterations. Specifies the maximum number of iterations for computing the estimates.

Convergence. Defines the convergence criterion. If the relative changes of the covariance entries are all less than the specified value, convergence is assumed.

Hadi outlier identification and estimation. Requests the HADI multivariate outlier detection algorithm to identify outliers and to compute the correlation, covariance, or SSCP matrix from the remaining cases. Tolerance omits variables with a multiple R-square value greater than (1 – n), where n is the specified tolerance value.

Using Commands

First, specify your data with USE filename. Then, type CORR and choose your measure and type:

Full matrix:          MEASURE varlist / options
Portion of matrix:    MEASURE rowlist * collist / options

MEASURE is one of:

BC, QSK, EUCLIDEAN, CITY, SPEARMAN, GAMMA, MU2, TAU, TETRA, S2, S3, S4, S5, S6, PEARSON, COVARIANCE, SSCP

For PEARSON, COVARIANCE, and SSCP, the following options are available:

EM   T=df   NORMAL=n1,n2   ITER=n   CONV=n   HADI   TOL=n

In addition, PEARSON offers BONF, DUNN, and PROB as options.


Usage Considerations

Types of data. CORR uses rectangular data only.

Print options. With PRINT=LONG, SYSTAT prints the mean of each variable. In addition, for EM estimation, SYSTAT prints an iteration history, missing value patterns, Little’s MCAR test, and mean estimates.

Quick Graphs. CORR includes a SPLOM (matrix of scatterplots) where the data in each plot correspond to a value in the matrix.

Saving files. CORR saves the correlation matrix or other measure computed. SYSTAT automatically defines the type of file as CORR, DISS, COVA, SSCP, SIMI, or RECT.

BY groups. CORR analyzes data by groups. Your file need not be sorted on the BY variable(s).

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. FREQ=<variable> increases the number of cases by the FREQ variable.

Case weights. WEIGHT is available in CORR.

Examples

Example 1 Pearson Correlations

This example uses data from the OURWORLD file that contains records (cases) for 57 countries. We are interested in correlations among variables recording the percentage of the population living in cities, birth rate, gross domestic product per capita, dollars expended per person for the military, ratio of birth rates to death rates, life expectancy (in years) for males and females, percentage of the population who can read, and gross national product per capita in 1986. The input is:

CORR
USE ourworld
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, literacy gnp_86


The output follows:

Pearson correlation matrix

             URBAN BIRTH_RT  GDP_CAP      MIL   B_TO_D LIFEEXPM LIFEEXPF
URBAN        1.000
BIRTH_RT    -0.800    1.000
GDP_CAP      0.625   -0.762    1.000
MIL          0.597   -0.672    0.899    1.000
B_TO_D      -0.307    0.511   -0.659   -0.607    1.000
LIFEEXPM     0.776   -0.922    0.664    0.582   -0.211    1.000
LIFEEXPF     0.801   -0.949    0.704    0.619   -0.265    0.989    1.000
LITERACY     0.800   -0.930    0.637    0.562   -0.274    0.911    0.935
GNP_86       0.592   -0.689    0.964    0.873   -0.560    0.633    0.665

          LITERACY   GNP_86
LITERACY     1.000
GNP_86       0.611    1.000

Number of observations: 49

The correlations for all pairs of the nine variables are shown here. The bottom of the output panel shows that the sample size is 49, but the data file has 57 countries. If a country has one or more missing values, SYSTAT, by default, omits all of the data for the case. This is called listwise deletion.

The Quick Graph is a matrix of scatterplots with one plot for each entry in the correlation matrix and histograms of the variables on the diagonal. For example, the plot of BIRTH_RT against URBAN is at the top left under the histogram for URBAN.

[Quick Graph: SPLOM of URBAN, BIRTH_RT, GDP_CAP, MIL, B_TO_D, LIFEEXPM, LIFEEXPF, LITERACY, and GNP_86]


If linearity does not hold for your variables, your results may be meaningless. A good way to assess linearity, the presence of outliers, and other anomalies is to examine the plot for each pair of variables in the scatterplot matrix. The relationships between GDP_CAP and BIRTH_RT, B_TO_D, LIFEEXPM, and LIFEEXPF do not appear to be linear. Also, the points in the MIL versus GDP_CAP and GNP_86 versus MIL displays clump in the lower left corner. It is not wise to use correlations for describing these relations.

Altering the Format

The correlation matrix for this example wraps (the results for nine variables do not fit in one panel). You can squeeze in more results by specifying a field width and the number of decimal places. For example, the same correlations printed in a field 6 characters wide are shown below. We request only 2 digits to the right of the decimal instead of 3.

(Using the command language, press F9 to retrieve the previous PEARSON statement instead of retyping it.)

CORR
USE ourworld
FORMAT 6 2
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, literacy gnp_86

The output is:

Pearson correlation matrix

          URBAN BIRTH_ GDP_CA    MIL B_TO_D LIFEEX LIFEEX LITERA GNP_86
URBAN      1.00
BIRTH_RT  -0.80   1.00
GDP_CAP    0.62  -0.76   1.00
MIL        0.60  -0.67   0.90   1.00
B_TO_D    -0.31   0.51  -0.66  -0.61   1.00
LIFEEXPM   0.78  -0.92   0.66   0.58  -0.21   1.00
LIFEEXPF   0.80  -0.95   0.70   0.62  -0.26   0.99   1.00
LITERACY   0.80  -0.93   0.64   0.56  -0.27   0.91   0.93   1.00
GNP_86     0.59  -0.69   0.96   0.87  -0.56   0.63   0.67   0.61   1.00

Number of observations: 49

Notice that while the top row of variable names is truncated to fit within the field specification, the row names remain complete.


Requesting a Portion of a Matrix

You can request that only a portion of the matrix be computed. The input follows:

CORR
USE ourworld
FORMAT
PEARSON lifeexpm lifeexpf literacy gnp_86 *, urban birth_rt gdp_cap mil b_to_d

The resulting output is:

Pearson correlation matrix

             URBAN BIRTH_RT  GDP_CAP      MIL   B_TO_D
LIFEEXPM     0.776   -0.922    0.664    0.582   -0.211
LIFEEXPF     0.801   -0.949    0.704    0.619   -0.265
LITERACY     0.800   -0.930    0.637    0.562   -0.274
GNP_86       0.592   -0.689    0.964    0.873   -0.560

Number of observations: 49

These correlations correspond to the lower left corner of the first matrix.

Example 2 Transformations

If relationships between variables appear nonlinear, using a measure of linear association is not advised. Fortunately, transformations of the variables may yield linear relationships. You can then use the linear relation measures, but all conclusions regarding the relationships are relative to the transformed variables instead of the original variables.

In the Pearson correlations example, we observed nonlinear relationships involving GDP_CAP, MIL, and GNP_86. Here we log transform these variables and compare the resulting correlations to those for the untransformed variables. The input is:


CORR
USE ourworld
LET (gdp_cap,mil,gnp_86) = L10(@)
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, literacy gnp_86


Notice that we use SYSTAT's shortcut notation to make the transformation. Alternatively, you could use:

LET gdp_cap = L10(gdp_cap)
LET mil = L10(mil)
LET gnp_86 = L10(gnp_86)

The output follows:

Means

             URBAN BIRTH_RT  GDP_CAP      MIL   B_TO_D
           52.8776  25.9592   3.3696   1.6954   2.8855

          LIFEEXPM LIFEEXPF LITERACY   GNP_86
           65.4286  70.5714  74.7265   3.2791

Pearson correlation matrix

             URBAN BIRTH_RT  GDP_CAP      MIL   B_TO_D
URBAN       1.0000
BIRTH_RT   -0.8002   1.0000
GDP_CAP     0.7636  -0.9189   1.0000
MIL         0.6801  -0.8013   0.8947   1.0000
B_TO_D     -0.3074   0.5106  -0.5293  -0.5374   1.0000
LIFEEXPM    0.7756  -0.9218   0.8599   0.7267  -0.2113
LIFEEXPF    0.8011  -0.9488   0.8954   0.7634  -0.2648
LITERACY    0.7997  -0.9302   0.8337   0.7141  -0.2737
GNP_86      0.7747  -0.8786   0.9736   0.8773  -0.4411

          LIFEEXPM LIFEEXPF LITERACY   GNP_86
LIFEEXPM    1.0000
LIFEEXPF    0.9887   1.0000
LITERACY    0.9110   0.9350   1.0000
GNP_86      0.8610   0.8861   0.8404   1.0000

Number of observations: 49

[Quick Graph: SPLOM of URBAN, BIRTH_RT, GDP_CAP, MIL, B_TO_D, LIFEEXPM, LIFEEXPF, LITERACY, and GNP_86 after the log transformations]


In the scatterplot matrix, linearity has improved in the plots involving GDP_CAP, MIL, and GNP_86. Look at the difference between the correlations before and after transformation:

            gdp_cap vs.            mil vs.                gnp_86 vs.
            no       yes           no       yes           no       yes
urban       0.625    0.764         0.597    0.680         0.592    0.775
birth_rt   -0.762   -0.919        -0.672   -0.801        -0.689   -0.879
lifeexpm    0.664    0.860         0.582    0.727         0.633    0.861
lifeexpf    0.704    0.895         0.619    0.763         0.665    0.886
literacy    0.637    0.834         0.562    0.714         0.611    0.840

After log transforming the variables, linearity has improved in the plots, and many of the correlations are stronger.

Example 3 Missing Data: Pairwise Deletion

To specify pairwise deletion, the input is:

USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, literacy gnp_86 / PAIR

The output is:


Means URBAN BIRTH_RT GDP_CAP MIL B_TO_D 52.821 26.351 3.372 1.775 2.873 LIFEEXPM LIFEEXPF LITERACY GNP_86 65.088 70.123 73.563 3.293

Pearson correlation matrix URBAN BIRTH_RT GDP_CAP MIL B_TO_D URBAN 1.000 BIRTH_RT -0.781 1.000 GDP_CAP 0.778 -0.895 1.000 MIL 0.683 -0.687 0.857 1.000 B_TO_D -0.248 0.535 -0.472 -0.377 1.000 LIFEEXPM 0.796 -0.892 0.854 0.696 -0.172 LIFEEXPF 0.816 -0.924 0.891 0.721 -0.230 LITERACY 0.807 -0.930 0.832 0.646 -0.291 GNP_86 0.775 -0.881 0.974 0.881 -0.455

Page 155: Statistics I

I-135

Correlat ions, Simi lar it ies, and Distance Measures

The sample size for each variable is reported as the diagonal of the pairwise frequency table; sample sizes for complete pairs of cases are reported off the diagonal. There are 57 countries in this sample—56 reported the percentage living in cities (URBAN), and 50 reported the gross national product per capita in 1986 (GNP_86). There are 49 countries that have values for both URBAN and GNP_86.

The means are printed because we specified PRINT=LONG. Since pairwise deletion is requested, all available values are used to compute each mean—that is, these means are the same as those computed by the Statistics procedure.

Example 4 Missing Data: EM Estimation

This example uses the same variables used in the transformations example. To specify EM estimation, the input is:

LIFEEXPM LIFEEXPF LITERACY GNP_86 LIFEEXPM 1.000 LIFEEXPF 0.989 1.000 LITERACY 0.911 0.937 1.000 GNP_86 0.863 0.888 0.842 1.000 Pairwise frequency table URBAN BIRTH_RT GDP_CAP MIL B_TO_D URBAN 56 BIRTH_RT 56 57 GDP_CAP 56 57 57 MIL 55 56 56 56 B_TO_D 56 57 57 56 57 LIFEEXPM 56 57 57 56 57 LIFEEXPF 56 57 57 56 57 LITERACY 56 57 57 56 57 GNP_86 49 50 50 50 50 LIFEEXPM LIFEEXPF LITERACY GNP_86 LIFEEXPM 57 LIFEEXPF 57 57 LITERACY 57 57 57 GNP_86 50 50 50 50

CORRUSE ourworldLET (gdp_cap,mil,gnp_86) = L10(@)IDVAR = country$GRAPH = NONEPRINT = LONGPEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm, lifeexpf literacy gnp_86 / EM

Page 156: Statistics I

I-136

Chapter 6

The output follows:

SYSTAT prints missing-value patterns for the data. Forty-nine cases in the sample are complete (an X is printed for each of the nine variables). Periods are inserted where data are missing. The value of the first variable, URBAN, is missing for one case, while the value of the last variable, GNP_86, is missing for six cases. The last row of the pattern indicates that the values of the fourth variable, MIL, and the last variable, GNP_86, are both missing for one case.

EM Algorithm Iteration Maximum Error -2*log(likelihood) --------- ------------- ------------------ 1 1.092328 24135.483249 2 1.023878 7625.491302 3 0.643113 6932.605472 4 0.666125 6691.458724 5 0.857590 6573.199525 6 2.718236 6538.852550 7 0.728468 6531.689766 8 0.196577 6530.369252 9 0.077590 6530.167056 10 0.034510 6530.159651 11 0.016278 6530.176410 12 0.007986 6530.190050 13 0.004050 6530.198695 14 0.002120 6530.203895 15 0.001145 6530.207008 16 0.000637 6530.208887 No.of Missing value patterns Cases (X=nonmissing; .=missing) 49 XXXXXXXXX 1 .XXXXXXXX 6 XXXXXXXX. 1 XXX.XXXX.

Little MCAR test statistic: 35.757 df = 23 prob = 0.044

EM estimate of means URBAN BIRTH_RT GDP_CAP MIL B_TO_D 53.152 26.351 3.372 1.754 2.873 LIFEEXPM LIFEEXPF LITERACY GNP_86 65.088 70.123 73.563 3.284 EM estimated correlation matrix URBAN BIRTH_RT GDP_CAP MIL B_TO_D URBAN 1.000 BIRTH_RT -0.782 1.000 GDP_CAP 0.779 -0.895 1.000 MIL 0.700 -0.697 0.863 1.000 B_TO_D -0.259 0.535 -0.472 -0.357 1.000 LIFEEXPM 0.796 -0.892 0.854 0.713 -0.172 LIFEEXPF 0.816 -0.924 0.891 0.738 -0.230 LITERACY 0.808 -0.930 0.832 0.668 -0.291 GNP_86 0.796 -0.831 0.968 0.874 -0.342 LIFEEXPM LIFEEXPF LITERACY GNP_86 LIFEEXPM 1.000 LIFEEXPF 0.989 1.000 LITERACY 0.911 0.937 1.000 GNP_86 0.863 0.885 0.828 1.000


Little’s MCAR (missing completely at random) test has a probability less than 0.05, indicating that we reject the hypothesis that the nine missing values are randomly missing. This test has limited power when the sample of incomplete cases is small and it also offers no direct evidence on the validity of the MAR assumption.
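Little's MCAR statistic is involved to compute by hand, but the missing-value pattern summary shown above is easy to reproduce. A minimal pandas sketch with a small hypothetical data frame (the column names merely echo OURWORLD; this is not SYSTAT code):

import numpy as np
import pandas as pd

df = pd.DataFrame({"urban":  [52.0, np.nan, 61.0, 44.0],
                   "gnp_86": [3.1, 2.7, np.nan, np.nan],
                   "mil":    [1.8, 1.2, 2.0, np.nan]})

# Encode each case's pattern as X (present) / . (missing), then count
patterns = df.notna().apply(
    lambda row: "".join("X" if ok else "." for ok in row), axis=1)
print(patterns.value_counts())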

Example 5 Probabilities Associated with Correlations

To request the usual (uncorrected) probabilities for a correlation matrix using pairwise deletion:

USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
        literacy gnp_86 / PAIR PROB

The output is:

Bartlett Chi-square statistic: 815.067  df = 36  Prob = 0.000

Matrix of Probabilities
           URBAN  BIRTH_RT   GDP_CAP       MIL    B_TO_D
URBAN        0.0
BIRTH_RT   0.000       0.0
GDP_CAP    0.000     0.000       0.0
MIL        0.000     0.000     0.000       0.0
B_TO_D     0.065     0.000     0.000     0.004       0.0
LIFEEXPM   0.000     0.000     0.000     0.000     0.202
LIFEEXPF   0.000     0.000     0.000     0.000     0.085
LITERACY   0.000     0.000     0.000     0.000     0.028
GNP_86     0.000     0.000     0.000     0.000     0.001

         LIFEEXPM  LIFEEXPF  LITERACY    GNP_86
LIFEEXPM      0.0
LIFEEXPF    0.000       0.0
LITERACY    0.000     0.000       0.0
GNP_86      0.000     0.000     0.000       0.0

The p values that are appropriate for making statements regarding one specific correlation are shown here. By themselves, these values are not very informative. These p values are pseudo-probabilities because they do not reflect the number of correlations being tested. If pairwise deletion is used, the problem is even worse, although many statistics packages print probabilities as if they meant something in this case, too.


SYSTAT computes the Bartlett chi-square test whenever you request probabilities for more than one correlation. This tests a global hypothesis concerning the significance of all of the correlations in the matrix:

χ² = −(N − 1 − (2p + 5)/6) ln|R|

where N is the total sample size (or the smallest sample size for any pair in the matrix if pairwise deletion is used), p is the number of variables, and |R| is the determinant of the correlation matrix. This test is sensitive to non-normality, and the test statistic is only asymptotically distributed (for large samples) as chi-square. Nevertheless, it can serve as a guideline.

If the Bartlett test is not significant, don’t even look at the significance of individual correlations. In this example, the test is significant, which indicates that there may be some real correlations among the variables. The Bartlett test is sensitive to non-normality and can be used only as a guide. Even if the Bartlett test is significant, you cannot accept the nominal p values as the true family probabilities associated with each correlation.
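If you want to reproduce the Bartlett statistic outside SYSTAT, a minimal Python sketch follows; it uses the standard (2p + 5)/6 correction and a hypothetical 2 × 2 correlation matrix (it is a check on the formula, not SYSTAT's code):

import numpy as np
from scipy import stats

def bartlett_sphericity(R, N):
    # Test that a p x p correlation matrix R is an identity matrix,
    # given total sample size N; df = p(p-1)/2
    p = R.shape[0]
    chi2 = -(N - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df, stats.chi2.sf(chi2, df)

R = np.array([[1.0, 0.8],
              [0.8, 1.0]])
print(bartlett_sphericity(R, N=49))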

Bonferroni Probabilities with Pairwise Deletion

Let’s now examine the probabilities adjusted by the Bonferroni method, which provides protection for multiple tests. Remember that the log-transformed values from the transformations example are still in effect. The input is:

USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
        literacy gnp_86 / PAIR BONF



The output follows:

Bartlett Chi-square statistic: 815.067  df = 36  Prob = 0.000

Matrix of Bonferroni Probabilities
           URBAN  BIRTH_RT   GDP_CAP       MIL    B_TO_D
URBAN        0.0
BIRTH_RT   0.000       0.0
GDP_CAP    0.000     0.000       0.0
MIL        0.000     0.000     0.000       0.0
B_TO_D     1.000     0.001     0.008     0.150       0.0
LIFEEXPM   0.000     0.000     0.000     0.000     1.000
LIFEEXPF   0.000     0.000     0.000     0.000     1.000
LITERACY   0.000     0.000     0.000     0.000     1.000
GNP_86     0.000     0.000     0.000     0.000     0.032

         LIFEEXPM  LIFEEXPF  LITERACY    GNP_86
LIFEEXPM      0.0
LIFEEXPF    0.000       0.0
LITERACY    0.000     0.000       0.0
GNP_86      0.000     0.000     0.000       0.0

Compare these results with those for the 36 tests using uncorrected probabilities. Notice that some correlations, such as those for B_TO_D with MIL, LITERACY, and GNP_86, are no longer significant.

Bonferroni Probabilities for EM Estimates

You can request the Bonferroni adjusted probabilities for an EM estimated matrix by specifying:

USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
        literacy gnp_86 / EM BONF

The probabilities follow:

Bartlett Chi-square statistic: 821.288  df = 36  Prob = 0.000

Matrix of Bonferroni Probabilities
           URBAN  BIRTH_RT   GDP_CAP       MIL    B_TO_D
URBAN        0.0
BIRTH_RT   0.000       0.0
GDP_CAP    0.000     0.000       0.0
MIL        0.000     0.000     0.000       0.0
B_TO_D     1.000     0.001     0.008     0.248       0.0
LIFEEXPM   0.000     0.000     0.000     0.000     1.000
LIFEEXPF   0.000     0.000     0.000     0.000     1.000
LITERACY   0.000     0.000     0.000     0.000     1.000
GNP_86     0.000     0.000     0.000     0.000     0.537

         LIFEEXPM  LIFEEXPF  LITERACY    GNP_86
LIFEEXPM    0.000
LIFEEXPF    0.000       0.0
LITERACY    0.000     0.000       0.0
GNP_86      0.000     0.000     0.000     0.000
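The Bonferroni adjustment itself is simple: multiply each p value by the number of tests and cap the product at 1. A minimal Python sketch (the raw p values below are illustrative, not taken from the output):

import numpy as np

def bonferroni(p_values, m=None):
    # Multiply each p value by the number of tests m, capping at 1
    p = np.asarray(p_values, dtype=float)
    m = p.size if m is None else m
    return np.minimum(p * m, 1.0)

# With 36 correlations tested, a raw p of 0.001 becomes 0.036, while
# raw p values of 0.028 or 0.065 are capped at 1.0 -- which is why
# several entries in the matrices above jump to 1.000.
print(bonferroni([0.001, 0.028, 0.065], m=36))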


Example 6 Hadi Robust Outlier Detection

If only one or two variables have outliers among many well behaved variables, the outliers may be masked. Let’s look for outliers among four variables. The input is:

USE ourworld
CORR
LET (gdp_cap, mil) = L10(@)
GRAPH = NONE
PRINT = LONG
IDVAR = country$
PEARSON gdp_cap mil b_to_d literacy / HADI
PLOT GDP_CAP*B_TO_D*LITERACY / SPIKE XGRID YGRID AXES=BOOK,
     SCALE=L SYMBOL=GROUP$ SIZE=1.250,1.250,1.250

The output is:

These 15 outliers are identified:

Case             Distance
------------   ------------
Venezuela         4.48653
CostaRica         4.55336
Senegal           4.66615
Sudan             4.74882
Ethiopia          4.82013
Pakistan          5.05827
Libya             5.10295
Haiti             5.44901
Bangladesh        5.47974
Yemen             5.84027
Gambia            5.84202
Iraq              5.84507
Guinea            6.12308
Somalia           6.18465
Mali              6.30091

Means of variables of non-outlying cases
  GDP_CAP       MIL    B_TO_D  LITERACY
    3.634     1.967     2.533    88.183

HADI estimated correlation matrix
          GDP_CAP       MIL    B_TO_D  LITERACY
GDP_CAP     1.000
MIL         0.860     1.000
B_TO_D     -0.839    -0.753     1.000
LITERACY    0.729     0.642    -0.698     1.000

Number of observations: 56


Fifteen countries are identified as outliers. We suspect that the sample may not be homogeneous, so we request a plot labeled by GROUP$. The panel is set to PRINT=LONG; the country names appear because we specified COUNTRY$ as an ID variable. The correlations at the end of the output are computed using the 30 or so cases that are not identified as outliers.

In the plot, we see that Islamic countries tend to fall between New World and European countries with respect to birth-to-death ratio and have the lowest literacy. European countries have the highest literacy and GDP_CAP values.

[3-D spike plot of GDP_CAP against B_TO_D and LITERACY, with each country plotted as E (Europe), I (Islamic), or N (NewWorld)]

Stratifying the Analysis

We’ll use Hadi for each of the three groups separately:

USE ourworld
CORR
LET (gdp_cap, mil) = L10(@)
GRAPH = NONE
PRINT = LONG
IDVAR = country$
BY group$
PEARSON gdp_cap mil b_to_d literacy / HADI
BY


For clarity, we edited the following output by moving the panels of means to the end:

The following results are for: GROUP$ = Europe

These 1 outliers are identified:

Case             Distance
------------   ------------
Portugal          5.72050

HADI estimated correlation matrix
          GDP_CAP       MIL    B_TO_D  LITERACY
GDP_CAP     1.000
MIL         0.474     1.000
B_TO_D     -0.092    -0.173     1.000
LITERACY    0.259     0.263     0.136     1.000

Number of observations: 20

The following results are for: GROUP$ = Islamic

HADI estimated correlation matrix
          GDP_CAP       MIL    B_TO_D  LITERACY
GDP_CAP     1.000
MIL         0.877     1.000
B_TO_D      0.781     0.882     1.000
LITERACY    0.600     0.605     0.649     1.000

Number of observations: 15

The following results are for: GROUP$ = NewWorld

HADI estimated correlation matrix
          GDP_CAP       MIL    B_TO_D  LITERACY
GDP_CAP     1.000
MIL         0.674     1.000
B_TO_D     -0.246    -0.287     1.000
LITERACY    0.689     0.561    -0.045     1.000

Number of observations: 21

Means of variables of non-outlying cases (Europe)
  GDP_CAP       MIL    B_TO_D  LITERACY
    4.059     2.404     1.260    98.316

Means of variables of non-outlying cases (Islamic)
  GDP_CAP       MIL    B_TO_D  LITERACY
    2.764     1.400     3.547    36.733

Means of variables of non-outlying cases (NewWorld)
  GDP_CAP       MIL    B_TO_D  LITERACY
    3.214     1.466     3.951    79.957

When computations are done separately for each group, Portugal is the only outlier, and the within-groups correlations differ markedly from group to group and from those for the complete sample. By scanning the means, we also see that the centroids for the three groups are quite different.
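SYSTAT's HADI option implements Hadi's (1994) algorithm; its exact steps are beyond this sketch, but the following Python code illustrates the general forward-search idea behind such robust methods—grow a "clean" subset and rank all cases by Mahalanobis distance from it. This is an illustration only, not SYSTAT's algorithm, and it assumes a nonsingular covariance matrix:

import numpy as np

def robust_distances(X, keep_frac=0.5, iters=10):
    # Start from the cases closest to the coordinatewise median, then
    # iterate: estimate mean/covariance from the current clean subset
    # and recompute every case's Mahalanobis distance.
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    m = max(p + 1, int(keep_frac * n))
    order = np.argsort(np.linalg.norm(X - np.median(X, axis=0), axis=1))
    keep = order[:m]
    for _ in range(iters):
        mu = X[keep].mean(axis=0)
        cov = np.cov(X[keep], rowvar=False)
        diff = X - mu
        d = np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff))
        keep = np.argsort(d)[:m]
    return d   # large distances flag candidate outliers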


Example 7 Spearman Correlations

As an example, we request Spearman correlations for the same data used in the Pearson correlation and Transformations examples. It is often useful to compute both a Spearman and a Pearson matrix using the same data. The absolute difference between the two can reveal unusual features such as outliers and highly skewed distributions. The input is:

USE ourworld
CORR
GRAPH = NONE
SPEARMAN urban birth_rt gdp_cap mil b_to_d,
         lifeexpm lifeexpf literacy gnp_86 / PAIR

The correlation matrix follows:

Spearman correlation matrix
           URBAN  BIRTH_RT   GDP_CAP       MIL    B_TO_D
URBAN      1.000
BIRTH_RT  -0.749     1.000
GDP_CAP    0.777    -0.874     1.000
MIL        0.678    -0.670     0.848     1.000
B_TO_D    -0.381     0.689    -0.597    -0.498     1.000
LIFEEXPM   0.731    -0.856     0.834     0.633    -0.410
LIFEEXPF   0.771    -0.902     0.910     0.709    -0.501
LITERACY   0.760    -0.868     0.882     0.696    -0.576
GNP_86     0.767    -0.847     0.973     0.867    -0.543

         LIFEEXPM  LIFEEXPF  LITERACY    GNP_86
LIFEEXPM    1.000
LIFEEXPF    0.965     1.000
LITERACY    0.813     0.866     1.000
GNP_86      0.834     0.901     0.909     1.000

Note that many of these correlations are closer to the Pearson correlations for the log-transformed data than they are to the correlations for the raw data.
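The Spearman/Pearson comparison suggested above is easy to script outside SYSTAT; a minimal Python sketch with simulated (hypothetical) data in which the relation is monotone but nonlinear:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(sigma=1.0, size=49)         # skewed predictor
y = x ** 2 + rng.normal(scale=0.1, size=49)   # monotone but nonlinear in x

rho, _ = stats.spearmanr(x, y)   # rank-based correlation
r, _ = stats.pearsonr(x, y)      # linear correlation
print(rho, r)   # a large gap flags outliers or a nonlinear relation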

Example 8 S2 and S3 Coefficients

The choice among the binary S measures depends on what you want to state about your variables. In this example, we request S2 and S3 to study responses made by 256 subjects to a depression inventory (Afifi and Clark, 1984). These data are stored in the SURVEY2 data file that has one record for each respondent with answers to 20 questions about depression. Each subject was asked, for example, “Last week, did you cry less than 1 day (code 0), 1 to 2 days (code 1), 3 to 4 days (code 2), or 5 to 7 days (code 3)?” The distributions of the answers appear to be Poisson, so they are not



satisfactory for Pearson correlations. Here we dichotomize the behaviors or feelings as “Did it occur or did it not?” by using transformations of the form:

LET blue = blue <> 0

The result is true (1) when the behavior or feeling is present or false (0) when it is absent. We use SYSTAT’s shortcut notation to do this for 7 of the 20 questions. For each pair of feelings or behaviors, S2 indicates the proportion of subjects with both, and S3 indicates the proportion of times both occurred given that one occurs. To perform this example:

USE survey2
CORR
LET (blue,depress,cry,sad,no_eat,getgoing,talkless) = @ <> 0
GRAPH = NONE
S2 blue depress cry sad no_eat getgoing talkless
S3 blue depress cry sad no_eat getgoing talkless

The matrices follow:

S2 (Russell and Rao) binary similarity coefficients
             BLUE  DEPRESS      CRY      SAD   NO_EAT
BLUE        0.254
DEPRESS     0.207    0.422
CRY         0.090    0.113    0.133
SAD         0.188    0.313    0.117    0.391
NO_EAT      0.117    0.129    0.051    0.137    0.246
GETGOING    0.180    0.309    0.086    0.258    0.152
TALKLESS    0.117    0.156    0.059    0.145    0.098

         GETGOING TALKLESS
GETGOING    0.520
TALKLESS    0.172    0.246

Number of observations: 256

S3 (Jaccard) binary similarity coefficients
             BLUE  DEPRESS      CRY      SAD   NO_EAT
BLUE        1.000
DEPRESS     0.442    1.000
CRY         0.303    0.257    1.000
SAD         0.410    0.625    0.288    1.000
NO_EAT      0.306    0.239    0.155    0.273    1.000
GETGOING    0.303    0.488    0.152    0.395    0.248
TALKLESS    0.306    0.305    0.183    0.294    0.248

         GETGOING TALKLESS
GETGOING    1.000
TALKLESS    0.289    1.000

Number of observations: 256


The frequencies for DEPRESS and SAD are:

                  Sad
                  1       0
Depress   1      80      28
          0      20     128

For S2, the result is 80/256 = 0.313; for S3, 80/128 = 0.625.
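The two coefficients are easy to verify from the table counts. A minimal Python sketch (the np.repeat calls just rebuild 0/1 vectors with the cell counts above):

import numpy as np

def s2_s3(a, b):
    # S2 (Russell-Rao) and S3 (Jaccard) for two 0/1 vectors
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    both = np.sum(a & b)       # present on both variables
    either = np.sum(a | b)     # present on at least one
    return both / a.size, both / either

# Cells: 80 both, 28 depress only, 20 sad only, 128 neither (n = 256)
depress = np.repeat([1, 1, 0, 0], [80, 28, 20, 128])
sad     = np.repeat([1, 0, 1, 0], [80, 28, 20, 128])
print(s2_s3(depress, sad))   # (0.3125, 0.625)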

Example 9 Tetrachoric Correlation

As an example, we use the bivariate normal data in the SYSTAT data file named TETRA. The input is:

USE tetra
FREQ = count
CORR
TETRA x y

The output follows:

Tetrachoric correlations
         X        Y
X    1.000
Y    0.810    1.000

Number of observations: 45

For our single pair of variables, the tetrachoric correlation is 0.81.
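For readers who want to experiment outside SYSTAT, here is one common way to estimate a tetrachoric correlation in Python. This sketches the general maximum-likelihood idea—solve for the bivariate normal correlation that reproduces the observed 2 × 2 cell proportion—and is not necessarily the algorithm CORR uses:

import numpy as np
from scipy import optimize, stats

def tetrachoric(table):
    # Rows index the first variable (low/high), columns the second;
    # cell a counts cases low on both. Assumes an underlying bivariate
    # normal and a sign change for brentq over (-0.999, 0.999).
    (a, b), (c, d) = np.asarray(table, dtype=float)
    n = a + b + c + d
    h = stats.norm.ppf((a + b) / n)   # threshold for the row variable
    k = stats.norm.ppf((a + c) / n)   # threshold for the column variable

    def diff(r):
        # P(X < h, Y < k) under correlation r, minus the observed proportion
        p = stats.multivariate_normal.cdf([h, k], mean=[0.0, 0.0],
                                          cov=[[1.0, r], [r, 1.0]])
        return p - a / n

    return optimize.brentq(diff, -0.999, 0.999)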

Computation

All computations are implemented in double precision.

Algorithms

The computational algorithms use provisional means, sums of squares, and cross-products (Spicer, 1972). Starting values for the EM algorithm use all available values (see Little and Rubin, 1987, p. 42).



For the rank-order coefficients (Gamma, Mu2, Spearman, and Tau), keep in mind that these are time-consuming. Spearman requires sorting and ranking the data before doing the same work done by Pearson. Gamma and Mu2 require computations across all possible pairs of observations, so their computing time grows combinatorially with the number of cases.

Missing Data

If you have missing data, CORR can handle them in three ways: listwise deletion, pairwise deletion, and EM estimation. Listwise deletion is the default. If there are missing data and pairwise deletion is used, SYSTAT displays a table of frequencies between all possible pairs of variables after the correlation matrix.

Pairwise deletion takes considerably more computer time because the sums of cross-products for each pair must be saved in a temporary disk file. If you use pairwise deletion to compute an SSCP matrix, the sums of squares and cross-products are weighted by N/n, where N is the number of cases in the whole file and n is the number of cases with nonmissing values in a given pair.
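Outside SYSTAT, pairwise deletion is what pandas does by default when computing a correlation matrix. A minimal sketch with a hypothetical data frame:

import numpy as np
import pandas as pd

# Pairwise deletion: each correlation uses every case observed on both
# variables, so different cells can rest on different subsets of cases.
df = pd.DataFrame({"urban":  [52.0, 61.0, np.nan, 44.0, 70.0],
                   "gnp_86": [3.1, np.nan, 2.7, 2.2, 3.4]})
print(df.corr())            # pairwise-complete correlations
print(df.dropna().corr())   # listwise deletion, for comparison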

See Chapter II-1 for a complete discussion of handling missing values.

References

Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.: Lifetime Learning Publications.

Faith, D. P., Minchin, P., and Belbin, L. (1987). Compositional dissimilarity as a robust measure of ecological distance. Vegetatio, 69, 57–68.

Goodman, L. A. and Kruskal, W. H. (1954). Measures of association for cross-classification. Journal of the American Statistical Association, 49, 732–764.

Gower, J. C. (1985). Measures of similarity, dissimilarity, and distance. In Kotz, S. and Johnson, N. L. (eds.), Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley & Sons, Inc.

Hadi, A. S. (1994). A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society, Series B, 56(2).

Little, R. J. A. and Rubin, D. B. (1987). Statistical analyses with missing data. New York: John Wiley & Sons, Inc.

Shye, S., ed. (1978). Theory construction and data analysis in the behavioral sciences. San Francisco: Jossey-Bass, Inc.


Chapter 7
Correspondence Analysis

Leland Wilkinson

Correspondence analysis allows you to examine the relationship between categorical variables graphically. It computes simple and multiple correspondence analysis for two-way and multiway tables of categorical variables, respectively. Tables are decomposed into row and column coordinates, which are displayed in a graph. Categories that are similar to each other appear close to each other in the graphs.

Statistical Background

Correspondence analysis is a method for decomposing a table of data into row and column coordinates that can be displayed graphically. With this technique, a two-way table can be represented in a two-dimensional graph with points for rows and columns. These coordinates are computed with a Singular Value Decomposition (SVD), which factors a matrix into the product of three matrices: a collection of left singular vectors, a matrix of singular values, and a collection of right singular vectors. Greenacre (1984) is the most comprehensive reference. Hill (1974) and Jobson (1992) cover the major topics more briefly.

The Simple Model

The simple correspondence analysis model decomposes a two-way table. This decomposition begins with a matrix of standardized deviates, computed for each cell in the table as follows:

z_ij = (1/√N) × (o_ij − e_ij)/√e_ij


where N is the sum of the table counts n_ij over all cells, o_ij is the observed count for cell ij, and e_ij is the expected count for cell ij based on an independence model. The second term in this equation is a cell’s contribution to the χ² test-for-independence statistic. Thus, the sum of the squared z_ij over all cells in the table is the same as χ²/N. Finally, the row mass for row i is n_i./N and the column mass for column j is n_.j/N.

The next step is to compute the matrix of cross-products from this matrix of deviates:

S = Z′Z

This S matrix has t = min(r − 1, c − 1) nonzero eigenvalues, where r and c are the row and column dimensions of the original table, respectively. The sum of these eigenvalues is χ²/N (which is termed the total inertia). It is this matrix that is decomposed as follows:

S = UDV′

where U is a matrix of row vectors, V is a matrix of column vectors, and D is a diagonal matrix of the eigenvalues. The coordinates actually plotted are standardized from U (for rows), so that

χ²/N = Σ(i=1..r) (n_i./N) Σ(j=1..t) x_ij²

The coordinates are similarly standardized from V (for columns).

The Multiple Model

The multiple correspondence model decomposes higher-way tables. Suppose we have a multiway table of dimension k1 by k2 by k3 by .... The multiple model begins with an n by p matrix Z of dummy-coded profiles, where n is the total number of cases in the table and p = k1 + k2 + k3 + .... This matrix is used to create a cross-products matrix:

S = Z′Z

which is rescaled and decomposed with a singular value decomposition, as before. See Jobson (1992) for further information.
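The simple model is easy to sketch outside SYSTAT. The following Python code applies the same algebra as above (it is an illustration, not SYSTAT's implementation); the counts are the widely reprinted Greenacre (1984) smoking data used in the first example below, and the squared singular values reproduce the eigenvalue panel of that example (0.075, 0.010, 0.000; sum 0.085):

import numpy as np

def simple_ca(table):
    # Simple correspondence analysis of a two-way table via the SVD
    O = np.asarray(table, dtype=float)
    N = O.sum()
    r = O.sum(axis=1) / N                 # row masses
    c = O.sum(axis=0) / N                 # column masses
    E = np.outer(r, c) * N                # expected counts (assumed > 0)
    Z = (O - E) / (np.sqrt(N) * np.sqrt(E))   # standardized deviates
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # Principal coordinates: scale singular vectors by singular values
    # and by the square roots of the masses
    rows = (U * s) / np.sqrt(r)[:, None]
    cols = (Vt.T * s) / np.sqrt(c)[:, None]
    return rows, cols, s ** 2             # s**2 sums to the total inertia

smoke = [[4, 2, 3, 2], [4, 3, 7, 4], [25, 10, 12, 4],
         [18, 24, 33, 13], [10, 6, 7, 2]]
rows, cols, inertia = simple_ca(smoke)
print(inertia, inertia.sum())   # about (0.075, 0.010, 0.000, ~0) and 0.085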


Correspondence Analysis in SYSTAT

Correspondence Analysis Main Dialog Box

To open the Correspondence Analysis dialog box, from the menus choose:

Statistics
  Data Reduction
    Correspondence Analysis…

A correspondence analysis is conducted by specifying a model and estimating it.

Dependent(s). Select the variable(s) you want to examine. The dependent variable(s) should be categorical. To analyze a two-way table (simple correspondence analysis), select a variable defining the rows. Selecting multiple dependent variables (and no independent variables) yields a multiple correspondence model.

Independent(s). To analyze a two-way table, select a categorical variable defining the columns of the table.

Save coordinates to file. Saves coordinates and labels to a data file.

You can specify one of two methods for handling missing data:

• Pairwise deletion. Pairwise deletion examines each pair of variables and uses all cases with both values present.

• Listwise deletion. Listwise deletion deletes any case with missing data for any variable in the list.


Using Commands

First, specify your data with USE filename. For a simple correspondence analysis, continue with:

CORAN
MODEL depvar = indvar
ESTIMATE

For a multiple correspondence analysis:

CORAN
MODEL varlist
ESTIMATE

If data are aggregated and there is a variable in the file representing frequency of profiles, use FREQ to identify that variable.

Usage Considerations

Types of data. CORAN uses rectangular data only.

Print options. There are no print options.

Quick Graphs. Quick Graphs produced by CORAN are correspondence plots for the simple or multiple models.

Saving files. For simple correspondence analysis, CORAN saves the row variable coordinates in DIM(1)...DIM(N) and the column variable coordinates in FACTOR(1)...FACTOR(N), where the subscript indicates the dimension number. For multiple correspondence analysis, DIM(1)...DIM(N) contain the variable coordinates and FACTOR(1)...FACTOR(N) contain the case coordinates. Label information is saved to LABEL$.

BY groups. CORAN analyzes data by groups. Your file need not be sorted on the BY variable(s).

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. FREQ=variable increases the number of cases by the FREQ variable.

Case weights. WEIGHT is not available in CORAN.



Examples

The examples begin with a simple correspondence analysis of a two-way table from Greenacre (1984). This is followed by a multiple correspondence analysis example.

Example 1 Correspondence Analysis (Simple)

Here we illustrate a simple correspondence analysis model. The data comprise a hypothetical smoking survey in a company (Greenacre, 1984). Notice that we use value labels to describe the categories in the output and plot. The FREQ command codes the cell frequencies. The input is:

USE SMOKE
LABEL STAFF / 1="Sr.Managers",2="Jr.Managers",3="Sr.Employees",
              4="Jr.Employees",5="Secretaries"
LABEL SMOKE / 1="None",2="Light",3="Moderate",4="Heavy"
FREQ=FREQ
CORAN
MODEL STAFF=SMOKE
ESTIMATE

The resulting output is:

Variables in the SYSTAT Rectangular file are:
  STAFF  SMOKE  FREQ

Case frequencies determined by value of variable FREQ.

Categorical values encountered during processing are:
  STAFF (5 levels)
    Sr.Managers, Jr.Managers, Sr.Employees, Jr.Employees, Secretaries
  SMOKE (4 levels)
    None, Light, Moderate, Heavy

Simple Correspondence Analysis

Chi-Square = 16.442.  Degrees of freedom = 12.  Probability = 0.172.

Factor  Eigenvalue  Percent  Cum Pct
   1       0.075     87.76    87.76  -----------------------------------
   2       0.010     11.76    99.51  ----
   3       0.000       .49   100.00
 Sum       0.085   (Total Inertia)

Row Variable Coordinates
Name            Mass  Quality  Inertia  Factor 1  Factor 2
Sr.Managers    0.057    0.893    0.003     0.066     0.194
Jr.Managers    0.093    0.991    0.012    -0.259     0.243
Sr.Employees   0.264    1.000    0.038     0.381     0.011
Jr.Employees   0.456    1.000    0.026    -0.233    -0.058
Secretaries    0.130    0.999    0.006     0.201    -0.079


Row variable contributions to factors
Name           Factor 1  Factor 2
Sr.Managers       0.003     0.214
Jr.Managers       0.084     0.551
Sr.Employees      0.512     0.003
Jr.Employees      0.331     0.152
Secretaries       0.070     0.081

Row variable squared correlations with factors
Name           Factor 1  Factor 2
Sr.Managers       0.092     0.800
Jr.Managers       0.526     0.465
Sr.Employees      0.999     0.001
Jr.Employees      0.942     0.058
Secretaries       0.865     0.133

Column variable coordinates
Name        Mass  Quality  Inertia  Factor 1  Factor 2
None       0.316    1.000    0.049     0.393     0.030
Light      0.233    0.984    0.007    -0.099    -0.141
Moderate   0.321    0.983    0.013    -0.196    -0.007
Heavy      0.130    0.995    0.016    -0.294     0.198

Column variable contributions to factors
Name      Factor 1  Factor 2
None         0.654     0.029
Light        0.031     0.463
Moderate     0.166     0.002
Heavy        0.150     0.506

Column variable squared correlations with factors
Name      Factor 1  Factor 2
None         0.994     0.006
Light        0.327     0.657
Moderate     0.982     0.001
Heavy        0.684     0.310


For the simple correspondence model, CORAN prints the basic statistics and eigenvalues of the decomposition. Next are the row and column coordinates, with mass, quality, and inertia values. Mass equals the marginal total divided by the grand total. Quality is a measure (between 0 and 1) of how well a row or column point is represented by the first two factors. It is a proportion-of-variance statistic. See Greenacre (1984) for further information. Inertia is a row’s (or column’s) contribution to the total inertia. Contributions to the factors and squared correlations with the factors are the last reported statistics.

Example 2 Multiple Correspondence Analysis

This example uses automobile accident data in Alberta, Canada, reprinted in Jobson (1992). The categories are ordered with the ORDER command so that the output will show them in increasing order of severity. The data are in tabular form, so we use the FREQ command. The input is:

USE ACCIDENT
FREQ=FREQ
ORDER INJURY$ / SORT="None","Minimal","Minor","Major"
ORDER DRIVER$ / SORT="Normal","Drunk"
ORDER SEATBELT$ / SORT="Yes","No"
CORAN
MODEL INJURY$,DRIVER$,SEATBELT$
ESTIMATE

The resulting output is:

Variables in the SYSTAT Rectangular file are:
  SEATBELT$  IMPACT$  INJURY$  DRIVER$  FREQ

Case frequencies determined by value of variable FREQ.

Categorical values encountered during processing are:
  INJURY$ (4 levels)
    None, Minimal, Minor, Major
  DRIVER$ (2 levels)
    Normal, Drunk
  SEATBELT$ (2 levels)
    Yes, No

Multiple Correspondence Analysis

Factor  Eigenvalue  Percent  Cum Pct
   1       0.373     22.37    22.37  --------
   2       0.334     20.02    42.39  --------
   3       0.333     20.00    62.39  --------
   4       0.325     19.50    81.89  -------
   5       0.302     18.11   100.00  -------
 Sum       1.667   (Total Inertia)


Variable Coordinates
Name       Mass  Quality  Inertia  Factor 1  Factor 2
None      0.303    0.351    0.031     0.189     0.008
Minimal   0.018    0.251    0.315    -1.523    -1.454
Minor     0.012    0.552    0.322    -2.134     3.294
Major     0.001    0.544    0.332    -3.962   -10.976
Normal    0.313    0.496    0.020     0.179     0.014
Drunk     0.020    0.496    0.313    -2.758    -0.211
Yes       0.053    0.279    0.280     1.143    -0.402
No        0.280    0.279    0.053    -0.217     0.076

Variable contributions to factors
Name      Factor 1  Factor 2
None         0.029     0.000
Minimal      0.111     0.113
Minor        0.141     0.375
Major        0.056     0.478
Normal       0.027     0.000
Drunk        0.414     0.003
Yes          0.187     0.026
No           0.036     0.005

Variable squared correlations with factors
Name      Factor 1  Factor 2
None         0.350     0.001
Minimal      0.131     0.120
Minor        0.163     0.389
Major        0.063     0.481
Normal       0.493     0.003
Drunk        0.493     0.003
Yes          0.249     0.031
No           0.249     0.031

Case coordinates
Name  Factor 1  Factor 2
  1      0.825    -0.219
  2     -0.779    -0.349
  3     -0.110    -1.063
  4     -1.713    -1.193
  5     -0.443     1.676
  6     -2.047     1.547
  7     -1.441    -6.558
  8     -3.045    -6.687
  9      0.825    -0.219
 10     -0.779    -0.349
 11     -0.110    -1.063
 12     -1.713    -1.193
 13     -0.443     1.676
 14     -2.047     1.547
 15     -1.441    -6.558
 16     -3.045    -6.687
 17      0.825    -0.219
 18     -0.779    -0.349
 19     -0.110    -1.063
 20     -1.713    -1.193
 21     -0.443     1.676
 22     -2.047     1.547
 23     -1.441    -6.558
 24      0.825    -0.219
 25     -0.779    -0.349
 26     -0.110    -1.063
 27     -1.713    -1.193
 28     -0.443     1.676
 29     -1.441    -6.558
 30     -3.045    -6.687
 31      0.082     0.057
 32     -1.521    -0.073


 33     -0.853    -0.787
 34     -2.456    -0.916
 35     -1.186     1.953
 36     -2.790     1.823
 37     -2.184    -6.281
 38     -3.788    -6.411
 39      0.082     0.057
 40     -1.521    -0.073
 41     -0.853    -0.787
 42     -2.456    -0.916
 43     -1.186     1.953
 44     -2.790     1.823
 45     -2.184    -6.281
 46     -3.788    -6.411
 47      0.082     0.057
 48     -1.521    -0.073
 49     -0.853    -0.787
 50     -2.456    -0.916
 51     -1.186     1.953
 52     -2.790     1.823
 53     -2.184    -6.281
 54     -3.788    -6.411
 55      0.082     0.057
 56     -1.521    -0.073
 57     -0.853    -0.787
 58     -2.456    -0.916
 59     -1.186     1.953
 60     -2.790     1.823
 61     -2.184    -6.281
 62     -3.788    -6.411

This time, we get case coordinates instead of column coordinates. These are not included in the following Quick Graph because the focus of the graph is on the tabular variables and we don’t want to clutter the display. If you want to plot case coordinates, cut and paste them into the editor and plot them directly.

Following is the Quick Graph:

[Quick Graph: correspondence plot of the INJURY$, DRIVER$, and SEATBELT$ category coordinates]


The graph reveals a principal axis of major versus minor injuries. This axis is related to drunk driving and seat belt use.

Computation

All computations are in double precision.

Algorithms

CORAN uses a singular value decomposition of the cross-products matrix computed from the data.

Missing Data

Cases with missing data are deleted from all analyses.

References

Greenacre, M. J. (1984). Theory and applications of correspondence analysis. New York: Academic Press.

Hill, M. O. (1974). Correspondence analysis: A neglected multivariate method. Applied Statistics, 23, 340–354.

Jobson, J. D. (1992). Applied multivariate data analysis, Vol. II: Categorical and multivariate methods. New York: Springer-Verlag.


Chapter 8
Crosstabulation

When variables are categorical, frequency tables (crosstabulations) provide useful summaries. For a report, you may need only the number or percentage of cases falling in specified categories or cross-classifications. At times, you may require a test of independence or a measure of association between two categorical variables. Or, you may want to model relationships among two or more categorical variables by fitting a loglinear model to the cell frequencies.

Both Crosstabs and Loglinear Model can make, analyze, and save frequency tables that are formed by categorical variables (or table factors). The values of the factors can be character or numeric. Both procedures form tables using data read from a cases-by-variables rectangular file or recorded as frequencies (for example, from a table in a report) with cell indices. In Crosstabs, you can request percentages of row totals, column totals, or the total sample size.

Crosstabs (on the Statistics menu) provides three types of frequency tables:

One-way Frequency counts, percentages, and confidence intervals on cell proportions for single table factors or categorical variables

Two-way Frequency counts, percentages, tests, and measures of association for the crosstabulation of two factors

Multiway Frequency counts and percentages for series of two-way tables stratified by all combinations of values of a third, fourth, etc., table factor


Statistical Background

Tables report results as counts or the number of cases falling in specific categories or cross-classifications. Categories may be unordered (democrat, republican, and independent), ordered (low, medium, and high), or formed by defining intervals on a continuous variable like AGE (child, teen, adult, and elderly).

Making Tables

There are many formats for displaying tabular data. Let’s examine basic layouts for counts and percentages.

One-Way Tables

Here is an example of a table showing the number of people of each gender surveyed about depression at UCLA in 1980.

   Female     Male    Total
+------------------+
|    152      104  |   256
+------------------+

The categorical variable producing this table is SEX$. Sometimes, you may define categories as intervals of a continuous variable. Here is an example showing the 256 people broken down by age.

  18 to 30  30 to 45  46 to 60  Over 60    Total
+----------------------------------------+
|     79        80        64       33    |  256
+----------------------------------------+

Two-Way Tables

A crosstabulation is a table that displays one cell for every combination of values on two or more categorical variables. Here is a two-way table that crosses the gender and age distributions of the tables above.

            Female    Male     Total
          +------------------+
18 to 30  |    49      30    |    79
30 to 45  |    48      32    |    80
46 to 60  |    38      26    |    64
Over 60   |    17      16    |    33
          +------------------+
Total         152     104       256


This crosstabulation shows relationships between age and gender, which were invisible in the separate tables. Notice, for example, that the sample contains a large number of females below the age of 46.

Standardizing Tables with Percentages

As with other statistical procedures such as Correlation, it sometimes helps to have numbers standardized on a recognizable scale. Correlations vary between –1 and 1, for example. A convenient scale for table counts is percentage, which varies between 0 and 100.

With tables, you must choose a facet on which to standardize—rows, columns, or the total count in the table. For example, if we are interested in looking at the difference between the genders within age groups, we might want to standardize by rows. Here is that table:

            Female     Male     Total     N
          +-------------------+
18 to 30  |  62.025   37.975  |  100.000    79
30 to 45  |  60.000   40.000  |  100.000    80
46 to 60  |  59.375   40.625  |  100.000    64
Over 60   |  51.515   48.485  |  100.000    33
          +-------------------+
Total        59.375   40.625     100.000
N               152      104                256

Here we see that as age increases, the sample becomes more evenly dispersed across the two genders.

On the other hand, if we are interested in the overall distribution of age for each gender, we might want to standardize within columns:

            Female     Male     Total     N
          +-------------------+
18 to 30  |  32.237   28.846  |   30.859    79
30 to 45  |  31.579   30.769  |   31.250    80
46 to 60  |  25.000   25.000  |   25.000    64
Over 60   |  11.184   15.385  |   12.891    33
          +-------------------+
Total       100.000  100.000    100.000
N               152      104                256

For each gender, the oldest age group appears underrepresented.
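These standardizations are one-line operations outside SYSTAT as well. A minimal numpy sketch using the counts from the table above:

import numpy as np

counts = np.array([[49, 30],
                   [48, 32],
                   [38, 26],
                   [17, 16]], dtype=float)   # age group by gender

row_pct = 100 * counts / counts.sum(axis=1, keepdims=True)
col_pct = 100 * counts / counts.sum(axis=0, keepdims=True)
tot_pct = 100 * counts / counts.sum()
print(np.round(row_pct, 3))   # first row: 62.025  37.975, as in the table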


Significance Tests and Measures of Association

After producing a table, you may want to consider a population model that accounts for the structure you see in the observed table. You should have a population in mind when you make such inferences. Many published statistical analyses of tables do not explicitly deal with the sampling problem.

One-Way Tables

A model for these data might be that the proportion of the males and females is equal in the population. The null hypothesis corresponding to the model is:

H: p_males = p_females

The sampling model for testing this hypothesis requires that a population contains equal numbers of males and females and that each member of the population has an equal chance of being chosen. After choosing each person, we identify the person as male or female. There is no other category possible and one person cannot fit under both categories (exhaustive and mutually exclusive).

There is an exact way to reject our null hypothesis (called a permutation test). We can tally every possible sample of size 256 (including one with no females and one with no males). Then we can sort our samples into two piles: samples in which there are between 40.625% and 59.375% females and samples in which there are not. If the latter pile is extremely small relative to the former, we can reject the null hypothesis.

Needless to say, this would be a tedious undertaking—particularly on a microcomputer. Fortunately, there is an approximation using a continuous probability distribution that works quite well. First, we need to calculate the expected count of males and females, respectively, in a sample of size 256 if p is 0.5. This is 128, or half the sample N. Next, we subtract the observed counts from these expected counts, square them, and divide by the expected:

χ² = (152 − 128)²/128 + (104 − 128)²/128 = 9

If our assumptions about the population and the structure of the table are correct, then this statistic will be distributed as a mathematical chi-square variable. We can look up


the area under the tail of the chi-square statistic beyond the sample value we calculate and if this area is small (say, less than 0.05), we can reject the null hypothesis.

To look up the value, we need a degrees of freedom (df) value. This is the number of independent values being added together to produce the chi-square. In our case, it is 1, since the observed proportion of men is simply 1 minus the observed proportion of women. If there were three categories (men, women, other?), then the degrees of freedom would be 2. Anyway, if you look up the value 9 with one degree of freedom in your chi-square table, you will find that the probability of exceeding this value is exceedingly small. Thus, we reject our null hypothesis that the proportion of males equals the proportion of females in the population.

This chi-square approximation is good only for large samples. A popular rule of thumb is that the expected counts should be greater than 5, although they should be even greater if you want to be comfortable with your test. With our sample, the difference between the approximation and the exact result is negligible. For both, the probability is small.
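As a quick check outside SYSTAT, scipy reproduces this statistic in one line (by default, scipy.stats.chisquare assumes equal expected counts, which is the hypothesis here):

from scipy import stats

# 152 females and 104 males against equal expected counts of 128 each
result = stats.chisquare([152, 104])
print(result.statistic, result.pvalue)   # 9.0 and about 0.003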

Our hypothesis test has an associated confidence interval. You can use SYSTAT to compute this interval on the population data. Here is the result:

95 percent approximate confidence intervals scaled as cell percents

Values for SEX$
    Female      Male
+-------------------+
|   66.150   47.687 |
|   52.064   33.613 |
+-------------------+

The lower limit for each gender is on the bottom; the upper limits are on the top. Notice that these two intervals do not overlap.

Two-Way Tables

The most familiar test available for two-way tables is the Pearson chi-square test for independence of table rows and columns. When the table has only two rows or two columns, the chi-square test is also a test for equality of proportions. The concept of interaction in a two-way frequency table is similar to the one in analysis of variance. It is easiest to see in an example. Schachter (1959) randomly assigned 30 subjects to one of two groups: High Anxiety (17 subjects), who were told that they would be experiencing painful shocks, and Low Anxiety (13 subjects), who were told that they would experience painless shocks. After the assignment, each subject was given the


choice of waiting alone or with the other subjects. The following tables illustrate two possible outcomes of this study.

              No Interaction          Interaction
                   WAIT                   WAIT
              Alone  Together        Alone  Together
ANXIETY High     8       9              5      12
        Low      6       7              9       4

Notice in the table on the left that the number choosing to wait together relative to those choosing to wait alone is similar for both High and Low Anxiety groups. In the table on the right, however, more of the High Anxiety group chose to wait together.

We are interpreting these numbers relatively, so we should compute row percentages to understand the differences better. Here are the same tables standardized by rows:

              No Interaction          Interaction
                   WAIT                   WAIT
              Alone  Together        Alone  Together
ANXIETY High   47.1     52.8          29.4     70.6
        Low    46.1     53.8          69.2     30.8

Now we can see that the percentages are similar in the two rows in the table on the left (No Interaction) and quite different in the table on the right (Interaction). A simple graph reveals these differences even more strongly. In the following figure, the No Interaction row percentages are plotted on the left.

[Figure: line plots of the row percentages for the No Interaction (left) and Interaction (right) tables]


Notice that the lines cross in the Interaction plot, showing that the rows differ. There is almost complete overlap in the No Interaction plot.

Now, in the one-way table example above, we tested the hypothesis that the cell proportions were equal in the population. We can test an analogous hypothesis in this context—that each of the four cells contains 25 percent of the population. The problem with this assumption is that we already know that Schachter randomly assigned more people to the High Anxiety group. In other words, we should take the row marginal percentages (or totals) as fixed when we determine what proportions to expect in the cells from a random model.

Our No Interaction model is based on these fixed marginals. In fact, we can fix either the row or column margins to compute a No Interaction model because the total number of subjects is fixed at 30. You can verify that the row and column sums in the above tables are the same.

Now we are ready to compute our chi-square test of interaction (often called a test of independence) in the two-way table by using the No Interaction counts as expected counts in our chi-square formula above. This time, our degrees of freedom are still 1 because the marginal counts are fixed. If you know the marginal counts, then one cell count determines the remaining three. In general, the degrees of freedom for this test are (rows – 1) times (columns – 1).

Here is the result of our chi-square test. The chi-square is 4.693, with a p of 0.03. On this basis, we reject our No Interaction hypothesis.

ANXIETY (rows) by WAIT$ (columns)

          Alone  Together     Total
       +---------------------+
High   |  5.000    12.000    |  17.000
Low    |  9.000     4.000    |  13.000
       +---------------------+
Total    14.000    16.000      30.000

Test statistic                    Value      df     Prob
Pearson Chi-square                4.693   1.000    0.030
Likelihood ratio Chi-square       4.810   1.000    0.028
McNemar Symmetry Chi-square       0.429   1.000    0.513
Yates corrected Chi-square        3.229   1.000    0.072
Fisher exact test (two-tail)                       0.063

Actually, we cheated. The program computed the expected counts from the observed data. These are not exactly the ones we showed you in the No Interaction table. They differ by rounding error in the first decimal place. You can compute them exactly. The popular method is to multiply the total row count times the total column count corresponding to a cell and dividing by the total sample size. For the upper left cell, this would be 17*14/30 = 7.93.
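Both the expected counts and these test statistics can be verified with scipy; a minimal sketch using the Interaction table (printed values match the output above up to rounding):

import numpy as np
from scipy import stats

table = np.array([[5, 12],
                  [9, 4]])                       # ANXIETY by WAIT$

print(stats.contingency.expected_freq(table))    # upper left: 17*14/30 = 7.93

chi2, p, df, _ = stats.chi2_contingency(table, correction=False)
print(chi2, p)                                   # 4.693, 0.030 (Pearson)

yates, p_yates, _, _ = stats.chi2_contingency(table, correction=True)
print(yates, p_yates)                            # 3.229, 0.072 (Yates)

print(stats.fisher_exact(table)[1])              # about 0.063 (two-tail)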


There is one other interesting problem with these data. The chi-square is only an approximation and it does not work well for small samples. Although these data meet the minimum expected count of 5, they are nevertheless problematic. Look at the Fisher’s exact test result in the output above. Like our permutation test above, which was so cumbersome for large data files, Fisher’s test counts all possible outcomes exactly, including the ones that produce interaction greater than what we observed. The Fisher exact test p value is not significant (0.063). On this basis, we could not reject the null hypothesis of no interaction, or independence.

Yates’ chi-square test in the output is an attempt to adjust the Pearson chi-square statistic for small samples. While it has come into disfavor for being unnecessarily conservative in many instances, nevertheless, the Yates p value is consistent with Fisher’s in this case (0.072). Likelihood-ratio chi-square is an alternative to the Pearson chi-square and is used as a test statistic for log linear models.

Selecting a Test or Measure

Other tests and measures are appropriate for specific table structures and also depend on whether or not the categories of the factor are ordered. We use 2 × 2 to denote a table with two rows and two columns, and r × c for a table with r rows and c columns. The Pearson and likelihood-ratio chi-square statistics apply to r × c tables—categories need not be ordered.

McNemar’s test of symmetry is used for square tables (the number of rows equals the number of columns). This structure arises when the same subjects are measured twice as in a paired comparisons t test (say before and after an event) or when subjects are paired or matched (cases and controls). So the row and column categories are the same, but they are measured at different times or circumstances (like the paired t) or for different groups of subjects (cases and controls). This test ignores the counts along the diagonal of the table and tests whether the counts in cells above the diagonal differ from those below the diagonal. A significant result indicates a greater change in one direction than another. (The counts along the diagonal are for subjects who did not change.)

The table structure for Cohen’s kappa looks like that of McNemar’s in that the row and column categories are the same. But here the focus shifts to the diagonal: Are the counts along the diagonal significantly greater than those expected by chance alone? Because each subject is classified or rated twice, kappa is a measure of interrater agreement.

Another difference between McNemar and Kappa is that the former is a “test” with a chi-square statistic, degrees of freedom, and an associated p value, while the latter is



a measure. Its “size” is judged by using an asymptotic standard error to construct a t statistic (that is, measure divided by standard error) to test whether kappa differs from 0. Values of kappa greater than 0.75 indicate strong agreement beyond chance, between 0.40 and 0.79 means fair to good, and below 0.40 means poor agreement.
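The kappa point estimate itself is short to compute. A minimal Python sketch with a hypothetical 2 × 2 rating table (this computes the measure only, not the asymptotic standard error used for the t statistic):

import numpy as np

def cohen_kappa(table):
    # Cohen's kappa for a square rater-by-rater table of counts
    T = np.asarray(table, dtype=float)
    n = T.sum()
    po = np.trace(T) / n                          # observed agreement
    pe = (T.sum(axis=1) @ T.sum(axis=0)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical ratings: 40 + 35 agreements out of 100 pairs
print(cohen_kappa([[40, 10],
                   [15, 35]]))   # 0.50: fair-to-good agreement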

Phi, Cramér’s V, and contingency are measures suitable for testing independence of table factors as you would with Pearson’s chi-square. They are designed for comparing results of tables with different sample sizes. (Note that the expected value of the Pearson chi-square is proportional to the total table size.) The three measures are scaled differently, but all test the same null hypothesis. Use the probability printed with the Pearson chi-square to test that these measures are zero. For tables with two rows and two columns (a 2 × 2 table), phi and Cramér’s V are the same.

Five of the measures for two-way tables are appropriate when both categorical variables have ordered categories (always, sometimes, never or none, minimal, moderate, severe). These are Goodman-Kruskal’s gamma, Kendall’s tau-b, Stuart’s tau-c, Spearman’s rho, and Somers’ d. The first three measures differ only in how ties are treated; the fourth is like the usual Pearson correlation except that the rank order of each value is used in the computations instead of the value itself. Somers’ d is an asymmetric measure: in SYSTAT, the column variable is considered to be the “dependent” variable.

For 2 × 2 tables, Fisher’s exact test (if n ≤ 50) and Yates’ corrected chi-square are also printed. When expected cell sizes are small in a 2 × 2 table (an expected value less than 5), use Fisher’s exact test as described above.

In larger contingency tables, we do not want to see any expected values less than 1.0 or more than 20% of the values less than 5. For large tables with too many small expected values, there is no remedy except to combine categories or possibly omit a category that has very few observations.

Yule’s Q and Yule’s Y measure dominance in a 2 × 2 table. If either off-diagonal cell is 0, both statistics equal 1 (otherwise, they are less than 1). These statistics are 0 if and only if the chi-square statistic is 0. Therefore, the null hypothesis that the measure is 0 can be tested by the chi-square test.



Crosstabulations in SYSTAT

One-Way Frequency Tables Main Dialog Box

To open the One-Way Frequency Tables dialog box, from the menus choose:

Statistics
  Tables
    Crosstabs
      One-way…

One-way frequency tables provide frequency counts, percentages, tests, etc., for single table factors or categorical variables.

• Tables. Tables can include frequency counts, percentages, and confidence intervals. You can specify any confidence level between 0 and 1.

• Pearson chi-square. Tests the equality of the cell frequencies. This test assumes all categories are equally likely.

• Options. You can include a category for cases with missing data. SYSTAT treats this category in the same fashion as the other categories. In addition, you can display output in a listing format instead of a tabular display. The listing includes counts, cumulative counts, percentages, and cumulative percentages.


• Save last table as data file. Saves the table for the last variable in the Variable(s) list as a SYSTAT data file.

Two-Way Frequency Tables Main Dialog Box

To open the Two-Way Frequency Tables dialog box, from the menus choose:

Statistics
  Tables
    Crosstabs
      Two-way…

Two-way frequency tables crosstabulate one or more categorical row variables with a categorical column variable.

• Row variable(s). The variables displayed in the rows of the crosstabulation. Each row variable is crosstabulated with the column variable.

• Column variable. The variable displayed in the columns of the crosstabulation. The column variable is crosstabulated with each row variable.

• Tables. Tables can include frequency counts, percentages (row, column, or total), expected counts, deviates (Observed-Expected), and standardized deviates (Observed-Expected) / SQR (Expected).


• Options. You can include counts and percentages for cases with missing data. In addition, you can display output in a listing format instead of a tabular display. The listing includes counts, cumulative counts, percentages, and cumulative percentages for each combination of row and column variable categories.

• Save last table as data file. Saves the crosstabulation of the column variable with the last variable in the row variable(s) list as a SYSTAT data file. For each cell of the table, SYSTAT saves a record with the cell frequency and the row and column category values.

Two-Way Frequency Tables Statistics

A wide variety of statistics is available for testing the association between variables in a crosstabulation. Each statistic is appropriate for a particular table structure (rows by columns), and a few assume that categories are ordered (ordinal data).

Pearson chi-square. For tables with any number of rows and columns, tests for independence of the row and column variables.

2 x 2 tables. For tables with two rows and two columns, available tests are:

• Yates’ corrected chi-square. Adjusts the Pearson chi-square statistic for small samples.

• Fisher’s exact test. Counts all possible outcomes exactly. When the expected cell sizes are small (less than 5), use this test as an alternative to the Pearson chi-square.


• Odds ratio. A measure of association in which a value near 1 indicates no relation between the variables.

• Yule’s Q and Y. Measures of association in which values near –1 or +1 indicate a strong relation. Values near 0 indicate no relation. Yule’s Y is less sensitive to differences in the margins of the table than Q.

2 x k tables. For tables with only two rows and any number of ordered column categories (or vice versa), Cochran’s test of linear trend is available to reveal whether proportions increase (or decrease) linearly across the ordered categories.

r x r tables. For square tables, available tests include:

• McNemar’s test for symmetry. Used for paired (or matched) variables. Tests whether the counts above the table diagonal differ from those below the diagonal. Small probability values indicate a greater change in one direction.

• Cohen’s kappa. Commonly used to measure agreement between two judges rating the same objects. Tests whether the diagonal counts are larger than expected. Values of kappa greater than 0.75 indicate strong agreement beyond chance, values between 0.40 and 0.79 indicate fair to good, and values below 0.40 indicate poor agreement.

r x c tables, unordered levels. For tables with any number of rows or columns with no assumed category order, available tests are:

• Phi. A chi-square based measure of association. Values may exceed 1.

• Cramér’s V. A measure of association based on the chi-square. The value ranges between 0 and 1, with 0 indicating independence between the row and column variables and values close to 1 indicating dependence between the variables.

• Contingency coefficient. A measure of association based on the chi-square. Similar to Cramér’s V, but values of 1 cannot be attained.

• Uncertainty coefficient and Goodman-Kruskal’s lambda. Measures of association that indicate the proportional reduction in error when values of one variable are used to predict values of the other variable. Values near 0 indicate that the row variable is no help in predicting the column variable.

• Likelihood-ratio chi-square. An alternative to the Pearson chi-square, primarily used as a test statistic for loglinear models.

r x c tables, ordered levels. For tables with any number of rows or columns in which categories for both variables represent ordered levels (for example, low, medium, high), available tests are:


• Spearman’s rho. Similar to the Pearson correlation coefficient, but uses the ranks of the data rather than the actual values.

• Goodman-Kruskal’s gamma, Kendall’s tau-b, and Stuart’s tau-c. Measures of association between two ordinal variables that range between –1 and +1, differing only in the method of dealing with ties. Values close to 0 indicate little or no relationship.

• Somers’ d. An asymmetric measure of association between two ordinal variables that ranges from –1 to 1. Values close to –1 or +1 indicate a strong relationship between the variables. The column variable is treated as the dependent variable.

Multiway Frequency Tables Main Dialog Box

Multiway frequency tables provide frequency counts and percentages for series of two-way tables stratified by all combinations of values of a third, fourth, etc., table factor.

To open the Multiway Frequency Tables dialog box, from the menus choose:

Statistics
  Tables
    Crosstabs
      Multiway…

• Row variable. The variable displayed in the rows of the crosstabulation.

• Column variable. The variable displayed in the columns of the crosstabulation.


� Strata variable(s). If strata are separate, a separate crosstabulation is produced for each value of each strata variable. If strata are crossed, a separate crosstabulation is produced for each unique combination of strata variable values. For example, if you have two strata variables, each with five categories, Separate will produce 10 tables and Crossed will produce 25 tables.

� Options. You can include counts and percentages for cases with missing data and save the last table produced as a SYSTAT data file. In addition, you can display output in a listing format, including percentages and cumulative percentages, instead of a tabular display.

� Display. You can display frequencies, total percentages, row percentages, and column percentages. Furthermore, you can use the Mantel-Haenszel test for 2 × 2 subtables to test for an association between two binary variables while controlling for another variable.

Using Commands

For one-way tables in XTAB, specify:

XTAB
  USE filename
  PRINT / FREQ CHISQ LIST PERCENT ROWPCT COLPCT
  TABULATE varlist / CONFI=n MISS

For two-way tables in XTAB, specify:

XTAB
  USE filename
  PRINT / FREQ CHISQ LRCHI YATES FISHER ODDS YULE COCHRAN,
          MCNEM KAPPA PHI CRAMER CONT UNCE LAMBDA RHO GAMMA,
          TAUB TAUC SOMERS EXPECT DEVI STAND LIST PERCENT,
          ROWPCT COLPCT
  TABULATE rowvar * colvar / MISS

For multiway tables in XTAB, specify:

XTAB
  USE filename
  PRINT / FREQ MANTEL LIST PERCENT ROWPCT COLPCT
  TABULATE varlist * rowvar * colvar / MISS


Usage Considerations

Types of data. There are two ways to organize data for tables:

� The usual cases-by-variables rectangular data file

� Cell counts with cell identifiers

For example, you may want to analyze the following table reflecting application results by gender for business schools:

             Admitted   Denied
   Male         420        90
   Female       150        25

A cases-by-variables data file has the following form:

   PERSON   GENDER$   STATUS$
      1     female    admit
      2     male      deny
      3     male      admit
    (etc.)
    684     female    deny
    685     male      admit

Instead of entering one case for each of the 685 applicants, you could use the second method to enter four cases:

   GENDER$   STATUS$   COUNT
   male      admit       420
   male      deny         90
   female    admit       150
   female    deny         25

For this method, the cell counts in the third column are identified by designating COUNT as a FREQUENCY variable.

Print options. Three levels of output are available. Statistics produced depend on the dimensionality of the table. PRINT SHORT yields frequency tables for all tables and the Pearson chi-square for one-way and two-way tables. The MEDIUM length yields all statistics appropriate for the dimensionality of a two-way or multiway table. LONG adds expected cell values, deviates, and standardized deviates to the SHORT and MEDIUM output.

Quick Graphs. Frequency tables produce no Quick Graphs.

Saving files. You can save the frequency counts to a file. For two-way tables, cell values, deviates, and standardized deviates are also saved.

BY groups. Use of a BY variable yields separate frequency tables (and corresponding statistics) for each level of the BY variable.

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. XTAB uses the FREQUENCY variable to duplicate cases. This is the preferred method of input when the data are aggregated.

Case weights. WEIGHT is available for frequency tables.

Examples

Example 1 One-Way Tables

This example uses questionnaire data from a community survey (Afifi and Clark, 1984). The SURVEY2 data file includes a record (case) for each of the 256 subjects in the sample. We request frequencies for gender, marital status, and religion. The values of these variables are numbers, so we add character identifiers for the categories. The input is:

USE survey2
XTAB
LABEL sex / 1='Male', 2='Female'
LABEL marital / 1='Never', 2='Married', 3='Divorced', 4='Separated'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish', 4='None', 6='Other'
PRINT NONE / FREQ
TABULATE sex marital religion

If the words male and female were stored in the variable SEX$, you would omit LABEL and tabulate SEX$ directly. If you omit LABEL and specify SEX, the numbers label the output.


� When using the Label dialog box, you can omit quotation marks around category names. With commands, you can omit them if the name has no embedded blanks or symbols (the name, however, is displayed in uppercase letters).

The output follows:

Frequencies
Values for SEX
        Male   Female    Total
      +----------------+
      |  104      152  |   256
      +----------------+

Frequencies
Values for MARITAL
       Never  Married  Divorced  Separated    Total
      +-----------------------------------------+
      |   73      127        43         13     |   256
      +-----------------------------------------+

Frequencies
Values for RELIGION
      Protestant  Catholic  Jewish  None  Other    Total
      +--------------------------------------------+
      |      133        46      23    52      2   |   256
      +--------------------------------------------+

In this sample of 256 subjects, 152 are females, 127 are married, and 133 are Protestants.

List Layout

List layout produces an alternative layout for the same information. Percentages and cumulative percentages are part of the display. The input is:

USE survey2
XTAB
LABEL sex / 1='Male', 2='Female'
LABEL marital / 1='Never', 2='Married', 3='Divorced', 4='Separated'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish', 4='None', 6='Other'
PRINT NONE / LIST
TABULATE sex marital religion
PRINT


You can also use TABULATE varlist / LIST as an alternative to PRINT NONE / LIST. The output follows:

            Cum            Cum
 Count    Count    Pct     Pct   SEX
   104.     104.   40.6    40.6  Male
   152.     256.   59.4   100.0  Female

            Cum            Cum
 Count    Count    Pct     Pct   MARITAL
    73.      73.   28.5    28.5  Never
   127.     200.   49.6    78.1  Married
    43.     243.   16.8    94.9  Divorced
    13.     256.    5.1   100.0  Separated

            Cum            Cum
 Count    Count    Pct     Pct   RELIGION
   133.     133.   52.0    52.0  Protestant
    46.     179.   18.0    69.9  Catholic
    23.     202.    9.0    78.9  Jewish
    52.     254.   20.3    99.2  None
     2.     256.     .8   100.0  Other

Almost 60% (59.4) of the subjects are female, approximately 50% (49.6) are married, and more than half (52%) are Protestants.

Example 2 Two-Way Tables

This example uses the SURVEY2 data to crosstabulate marital status against religion. The input is:

USE survey2
XTAB
LABEL marital / 1='Never', 2='Married', 3='Divorced', 4='Separated'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish', 4='None', 6='Other'
PRINT NONE / FREQ
TABULATE marital * religion

The table follows:

Frequencies
MARITAL (rows) by RELIGION (columns)

             Protestant  Catholic  Jewish  None  Other    Total
           +-------------------------------------------+
 Never     |        29        16       8    20      0  |     73
 Married   |        75        21      11    19      1  |    127
 Divorced  |        21         6       3    13      0  |     43
 Separated |         8         3       1     0      1  |     13
           +-------------------------------------------+
 Total            133        46      23    52      2       256


In the sample of 256 people, 73 never married. Of the people who have never married, 29 are Protestants (the cell in the upper left corner), and none are in the Other category (their religion is not among the first four categories). The totals (or marginals) along the bottom row and down the far right column are the same as the values displayed in the one-way tables.

Omitting Sparse Categories

There are only two counts in the last column, and the counts in the last row are fairly sparse. It is easy to omit rows and/or columns. You can:

� Omit the category codes from the LABEL request.

� Select cases to use.

Note that LABEL and SELECT remain in effect until you turn them off. If you request several different tables, use SELECT to ensure that the same cases are used in all tables. The subset of cases selected via LABEL applies only to those tables that use the variables specified with LABEL. To turn off the LABEL specification for RELIGION, for example, specify:

LABEL religion

We continue from the last table, eliminating the last category codes for MARITAL and RELIGION:

SELECT marital <> 4 AND religion <> 6
TABULATE marital * religion
SELECT

The table is:

Frequencies
MARITAL (rows) by RELIGION (columns)

            Protestant  Catholic  Jewish  None    Total
          +------------------------------------+
 Never    |        29        16       8    20  |     73
 Married  |        75        21      11    19  |    126
 Divorced |        21         6       3    13  |     43
          +------------------------------------+
 Total           125        43      22    52       242


List Layout

Following is the panel for marital status crossed with religious preference:

USE survey2
XTAB
LABEL marital / 1='Never', 2='Married', 3='Divorced'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish', 4='None'
PRINT NONE / LIST
TABULATE marital * religion
PRINT

The listing is:

            Cum            Cum
 Count    Count    Pct     Pct   MARITAL    RELIGION
    29.      29.   12.0    12.0  Never      Protestant
    16.      45.    6.6    18.6  Never      Catholic
     8.      53.    3.3    21.9  Never      Jewish
    20.      73.    8.3    30.2  Never      None
    75.     148.   31.0    61.2  Married    Protestant
    21.     169.    8.7    69.8  Married    Catholic
    11.     180.    4.5    74.4  Married    Jewish
    19.     199.    7.9    82.2  Married    None
    21.     220.    8.7    90.9  Divorced   Protestant
     6.     226.    2.5    93.4  Divorced   Catholic
     3.     229.    1.2    94.6  Divorced   Jewish
    13.     242.    5.4   100.0  Divorced   None

Example 3 Frequency Input

Crosstabs, like other SYSTAT procedures, reads cases-by-variables data from a SYSTAT file. However, if you want to analyze a table from a report or a journal article, you can enter the cell counts directly. This example uses counts from a four-way table of a breast cancer study of 764 women. The data are from Morrison et al. (1973), cited in Bishop, Fienberg, and Holland (1975). There is one record for each of the 72 cells in the table, with the count (NUMBER) of women in the cell and codes or category names to identify their age group (under 50, 50 to 69, and 70 or over), treatment center (Tokyo, Boston, or Glamorgan), survival status (dead or alive), and tumor diagnosis (minimal inflammation and benign, maximum inflammation and benign, minimal inflammation and malignant, and maximum inflammation and malignant). This example illustrates how to form a two-way table of AGE by CENTER$.


The input is:

USE cancer
XTAB
FREQ = number
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over'
TABULATE center$ * age

The resulting two-way table is:

Frequencies
CENTER$ (rows) by AGE (columns)

            Under 50  50 to 69  70 & Over    Total
          +-------------------------------+
 Boston   |      58       122        73    |   253
 Glamorgn |      71       109        41    |   221
 Tokyo    |     151       120        19    |   290
          +-------------------------------+
 Total         280       351       133        764

Test statistic                   Value      df     Prob
Pearson Chi-square              74.039   4.000    0.000

Of the 764 women studied, 290 were treated in Tokyo. Of these women, 151 were in the youngest age group, and 19 were in the 70 or over age group.

Example 4 Missing Category Codes

You can choose whether or not to include a separate category for missing codes. For example, if some subjects did not check “male” or “female” on a form, there would be three categories for SEX$: male, female, and blank (missing). By default, when values of a table factor are missing, SYSTAT does not include a category for missing values.

In the OURWORLD data file, some countries did not report the GNP to the United Nations. In this example, we include a category for missing values and then follow this request with a table that omits the missing category. The input follows:

USE ourworld
XTAB
TABULATE group$ * gnp$ / MISS
LABEL gnp$ / 'D'='Developed', 'U'='Emerging'
TABULATE group$ * gnp$


The tables are:

Frequencies
GROUP$ (rows) by GNP$ (columns)

                      D      U    Total
          +----------------------+
 Europe   |    3     17      0   |   20
 Islamic  |    2      4     10   |   16
 NewWorld |    1     15      5   |   21
          +----------------------+
 Total         6     36     15       57

(The first, unlabeled column counts the cases with a missing GNP$ code.)

Frequencies
GROUP$ (rows) by GNP$ (columns)

            Developed  Emerging    Total
          +---------------------+
 Europe   |     17         0    |    17
 Islamic  |      4        10    |    14
 NewWorld |     15         5    |    20
          +---------------------+
 Total          36        15        51

List Layout

To create a listing of the counts in each cell of the table:

PRINT / LIST
TAB group$ * gnp$
PRINT

The output is:

            Cum            Cum
 Count    Count    Pct     Pct   GROUP$     GNP$
    17.      17.   33.3    33.3  Europe     Developed
     4.      21.    7.8    41.2  Islamic    Developed
    10.      31.   19.6    60.8  Islamic    Emerging
    15.      46.   29.4    90.2  NewWorld   Developed
     5.      51.    9.8   100.0  NewWorld   Emerging

Note that there is no entry for the empty cell.

Example 5 Percentages

Percentages are helpful for describing categorical variables and for interpreting relations between table factors. Crosstabs prints tables of percentages in the same layout as described for frequency counts; that is, each frequency count is replaced by the corresponding percentage. (A small sketch following the list below illustrates the computations.) Percentages are computed by dividing each cell frequency by:


� The total frequency in its row (row percents)

� The total frequency in its column (column percents)

� The total table frequency or sample size (table percents)
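Each of the three percentage tables is a simple rescaling of the frequency table. The following is a minimal Python sketch (not SYSTAT code), using the GROUP$ by GNP$ counts from this example; it reproduces values such as the Islamic row percent of 28.571 shown below.

# Illustrative sketch of table, row, and column percentages.

table = {
    'Europe':   [17,  0],   # Developed, Emerging
    'Islamic':  [ 4, 10],
    'NewWorld': [15,  5],
}

n = sum(sum(row) for row in table.values())
col_tot = [sum(row[j] for row in table.values()) for j in range(2)]

for group, row in table.items():
    for j, count in enumerate(row):
        print(group, j,
              round(100 * count / n, 3),           # table percent
              round(100 * count / sum(row), 3),    # row percent
              round(100 * count / col_tot[j], 3))  # column percent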

In this example, we request all three percentages using the following input:

USE ourworld
XTAB
LABEL gnp$ / 'D'='Developed', 'U'='Emerging'
PRINT NONE / ROWP COLP PERCENT
TABULATE group$ * gnp$

The output is:

Percents of total count
GROUP$ (rows) by GNP$ (columns)

            Developed  Emerging     Total     N
          +---------------------+
 Europe   |   33.333     0.0    |   33.333    17
 Islamic  |    7.843    19.608  |   27.451    14
 NewWorld |   29.412     9.804  |   39.216    20
          +---------------------+
 Total       70.588    29.412     100.000
 N               36        15          51

Row percents
GROUP$ (rows) by GNP$ (columns)

            Developed  Emerging     Total     N
          +---------------------+
 Europe   |  100.000     0.0    |  100.000    17
 Islamic  |   28.571    71.429  |  100.000    14
 NewWorld |   75.000    25.000  |  100.000    20
          +---------------------+
 Total       70.588    29.412    100.000
 N               36        15         51

Column percents
GROUP$ (rows) by GNP$ (columns)

            Developed  Emerging     Total     N
          +---------------------+
 Europe   |   47.222     0.0    |   33.333    17
 Islamic  |   11.111    66.667  |   27.451    14
 NewWorld |   41.667    33.333  |   39.216    20
          +---------------------+
 Total      100.000   100.000     100.000
 N               36        15          51


Missing Categories

Notice how the row percentages change when we include a category for the missing GNP:

PRINT NONE / ROWP
LABEL gnp$ / ' '=Missing, 'D'='Developed', 'U'='Emerging'
TABULATE group$ * gnp$
PRINT

The new table is:

Row percents
GROUP$ (rows) by GNP$ (columns)

            MISSING  Developed  Emerging     Total     N
          +-------------------------------+
 Europe   |  15.000    85.000     0.0     |  100.000    20
 Islamic  |  12.500    25.000    62.500   |  100.000    16
 NewWorld |   4.762    71.429    23.810   |  100.000    21
          +-------------------------------+
 Total       10.526    63.158    26.316     100.000
 N                6        36        15          57

Here we see that 62.5% of the Islamic nations are classified as emerging. However, from the earlier table of row percentages, it might be better to say that among the Islamic nations reporting the GNP, 71.43% are emerging.

Example 6 Multiway Tables

When you have three or more table factors, Crosstabs forms a series of two-way tables stratified by all combinations of values of the third, fourth, and so on, table factors. The order in which you choose the table factors determines the layout. Your input can be the usual cases-by-variables data file or the cell counts with category values.

The input is:

USE cancer
XTAB
FREQ = number
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over'
ORDER center$ / SORT=none
ORDER tumor$ / SORT='MinBengn', 'MaxBengn', 'MinMalig', 'MaxMalig'
TABULATE survive$ * tumor$ * center$ * age


The last two factors selected (CENTER$ and AGE) define the two-way tables. The levels of the first two factors define the strata. After the table is run, we edited the output and moved the four tables for SURVIVE$ = Dead next to those for SURVIVE$ = Alive.

Frequencies
CENTER$ (rows) by AGE (columns)

SURVIVE$ = Alive                                      SURVIVE$ = Dead
TUMOR$ = MinBengn                                     TUMOR$ = MinBengn
          Under 50  50 to 69  70 & Over  Total                  Under 50  50 to 69  70 & Over  Total
         +-------------------------------+                     +-------------------------------+
Tokyo    |    68        46         6     |  120       Tokyo    |     7         9         3     |   19
Boston   |    24        58        26     |  108       Boston   |     7        20        18     |   45
Glamorgn |    20        39        11     |   70       Glamorgn |     7        12         7     |   26
         +-------------------------------+                     +-------------------------------+
Total        112       143        43        298       Total         21        41        28         90

SURVIVE$ = Alive                                      SURVIVE$ = Dead
TUMOR$ = MaxBengn                                     TUMOR$ = MaxBengn
          Under 50  50 to 69  70 & Over  Total                  Under 50  50 to 69  70 & Over  Total
         +-------------------------------+                     +-------------------------------+
Tokyo    |     9         5         1     |   15       Tokyo    |     3         2         0     |    5
Boston   |     0         3         1     |    4       Boston   |     0         2         0     |    2
Glamorgn |     1         4         1     |    6       Glamorgn |     0         0         0     |    0
         +-------------------------------+                     +-------------------------------+
Total         10        12         3         25       Total          3         4         0          7

SURVIVE$ = Alive                                      SURVIVE$ = Dead
TUMOR$ = MinMalig                                     TUMOR$ = MinMalig
          Under 50  50 to 69  70 & Over  Total                  Under 50  50 to 69  70 & Over  Total
         +-------------------------------+                     +-------------------------------+
Tokyo    |    26        20         1     |   47       Tokyo    |     9         9         2     |   20
Boston   |    11        18        15     |   44       Boston   |     6         8         9     |   23
Glamorgn |    16        27        12     |   55       Glamorgn |    16        14         3     |   33
         +-------------------------------+                     +-------------------------------+
Total         53        65        28        146       Total         31        31        14         76

SURVIVE$ = Alive                                      SURVIVE$ = Dead
TUMOR$ = MaxMalig                                     TUMOR$ = MaxMalig
          Under 50  50 to 69  70 & Over  Total                  Under 50  50 to 69  70 & Over  Total
         +-------------------------------+                     +-------------------------------+
Tokyo    |    25        18         5     |   48       Tokyo    |     4        11         1     |   16
Boston   |     4        10         1     |   15       Boston   |     6         3         3     |   12
Glamorgn |     8        10         4     |   22       Glamorgn |     3         3         3     |    9
         +-------------------------------+                     +-------------------------------+
Total         37        38        10         85       Total         13        17         7         37

List Layout

To create a listing of the counts in each cell of the table:

PRINT / LIST
TABULATE survive$ * center$ * age * tumor$

The output follows:

Case frequencies determined by value of variable NUMBER.

            Cum            Cum
 Count    Count    Pct     Pct   SURVIVE$  CENTER$   AGE        TUMOR$
    68.      68.    8.9     8.9  Alive     Tokyo     Under 50   MinBengn
     9.      77.    1.2    10.1  Alive     Tokyo     Under 50   MaxBengn
    26.     103.    3.4    13.5  Alive     Tokyo     Under 50   MinMalig
    25.     128.    3.3    16.8  Alive     Tokyo     Under 50   MaxMalig
    46.     174.    6.0    22.8  Alive     Tokyo     50 to 69   MinBengn
     5.     179.     .7    23.4  Alive     Tokyo     50 to 69   MaxBengn
    20.     199.    2.6    26.0  Alive     Tokyo     50 to 69   MinMalig
    18.     217.    2.4    28.4  Alive     Tokyo     50 to 69   MaxMalig
     6.     223.     .8    29.2  Alive     Tokyo     70 & Over  MinBengn
     1.     224.     .1    29.3  Alive     Tokyo     70 & Over  MaxBengn
     1.     225.     .1    29.5  Alive     Tokyo     70 & Over  MinMalig
     5.     230.     .7    30.1  Alive     Tokyo     70 & Over  MaxMalig
    24.     254.    3.1    33.2  Alive     Boston    Under 50   MinBengn
    11.     265.    1.4    34.7  Alive     Boston    Under 50   MinMalig
     4.     269.     .5    35.2  Alive     Boston    Under 50   MaxMalig
    58.     327.    7.6    42.8  Alive     Boston    50 to 69   MinBengn
     3.     330.     .4    43.2  Alive     Boston    50 to 69   MaxBengn
    18.     348.    2.4    45.5  Alive     Boston    50 to 69   MinMalig
    10.     358.    1.3    46.9  Alive     Boston    50 to 69   MaxMalig
    26.     384.    3.4    50.3  Alive     Boston    70 & Over  MinBengn
     1.     385.     .1    50.4  Alive     Boston    70 & Over  MaxBengn
    15.     400.    2.0    52.4  Alive     Boston    70 & Over  MinMalig
     1.     401.     .1    52.5  Alive     Boston    70 & Over  MaxMalig
    20.     421.    2.6    55.1  Alive     Glamorgn  Under 50   MinBengn
     1.     422.     .1    55.2  Alive     Glamorgn  Under 50   MaxBengn
    16.     438.    2.1    57.3  Alive     Glamorgn  Under 50   MinMalig
     8.     446.    1.0    58.4  Alive     Glamorgn  Under 50   MaxMalig
    39.     485.    5.1    63.5  Alive     Glamorgn  50 to 69   MinBengn
     4.     489.     .5    64.0  Alive     Glamorgn  50 to 69   MaxBengn
    27.     516.    3.5    67.5  Alive     Glamorgn  50 to 69   MinMalig
    10.     526.    1.3    68.8  Alive     Glamorgn  50 to 69   MaxMalig
    11.     537.    1.4    70.3  Alive     Glamorgn  70 & Over  MinBengn
     1.     538.     .1    70.4  Alive     Glamorgn  70 & Over  MaxBengn
    12.     550.    1.6    72.0  Alive     Glamorgn  70 & Over  MinMalig
     4.     554.     .5    72.5  Alive     Glamorgn  70 & Over  MaxMalig
     7.     561.     .9    73.4  Dead      Tokyo     Under 50   MinBengn
     3.     564.     .4    73.8  Dead      Tokyo     Under 50   MaxBengn
     9.     573.    1.2    75.0  Dead      Tokyo     Under 50   MinMalig
     4.     577.     .5    75.5  Dead      Tokyo     Under 50   MaxMalig
     9.     586.    1.2    76.7  Dead      Tokyo     50 to 69   MinBengn
     2.     588.     .3    77.0  Dead      Tokyo     50 to 69   MaxBengn
     9.     597.    1.2    78.1  Dead      Tokyo     50 to 69   MinMalig
    11.     608.    1.4    79.6  Dead      Tokyo     50 to 69   MaxMalig
     3.     611.     .4    80.0  Dead      Tokyo     70 & Over  MinBengn
     2.     613.     .3    80.2  Dead      Tokyo     70 & Over  MinMalig
     1.     614.     .1    80.4  Dead      Tokyo     70 & Over  MaxMalig
     7.     621.     .9    81.3  Dead      Boston    Under 50   MinBengn
     6.     627.     .8    82.1  Dead      Boston    Under 50   MinMalig
     6.     633.     .8    82.9  Dead      Boston    Under 50   MaxMalig
    20.     653.    2.6    85.5  Dead      Boston    50 to 69   MinBengn
     2.     655.     .3    85.7  Dead      Boston    50 to 69   MaxBengn
     8.     663.    1.0    86.8  Dead      Boston    50 to 69   MinMalig
     3.     666.     .4    87.2  Dead      Boston    50 to 69   MaxMalig
    18.     684.    2.4    89.5  Dead      Boston    70 & Over  MinBengn
     9.     693.    1.2    90.7  Dead      Boston    70 & Over  MinMalig
     3.     696.     .4    91.1  Dead      Boston    70 & Over  MaxMalig
     7.     703.     .9    92.0  Dead      Glamorgn  Under 50   MinBengn
    16.     719.    2.1    94.1  Dead      Glamorgn  Under 50   MinMalig
     3.     722.     .4    94.5  Dead      Glamorgn  Under 50   MaxMalig
    12.     734.    1.6    96.1  Dead      Glamorgn  50 to 69   MinBengn
    14.     748.    1.8    97.9  Dead      Glamorgn  50 to 69   MinMalig
     3.     751.     .4    98.3  Dead      Glamorgn  50 to 69   MaxMalig
     7.     758.     .9    99.2  Dead      Glamorgn  70 & Over  MinBengn
     3.     761.     .4    99.6  Dead      Glamorgn  70 & Over  MinMalig
     3.     764.     .4   100.0  Dead      Glamorgn  70 & Over  MaxMalig


The 35 cells for the women who survived are listed first (the cell for Boston women under 50 years old with MaxBengn tumors is empty). In the Cum Pct column, we see that these women make up 72.5% of the sample. Thus, 27.5% did not survive.

Percentages

While list layout provides percentages of the total table count, you might want others. Here we specify COLPCT in Crosstabs to print the percentage surviving within each age-by-center stratum. The input is:

PRINT NONE / COLPCT
TABULATE age * center$ * survive$ * tumor$
PRINT

The tables follow:

Column percents
SURVIVE$ (rows) by TUMOR$ (columns)

AGE = Under 50
CENTER$ = Tokyo
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  90.667    75.000    74.286    86.207   |   84.768   128
 Dead  |   9.333    25.000    25.714    13.793   |   15.232    23
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            75        12        35        29         151

AGE = Under 50
CENTER$ = Boston
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  77.419     0.0      64.706    40.000   |   67.241    39
 Dead  |  22.581     0.0      35.294    60.000   |   32.759    19
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            31         0        17        10          58

AGE = Under 50
CENTER$ = Glamorgn
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  74.074   100.000    50.000    72.727   |   63.380    45
 Dead  |  25.926     0.0      50.000    27.273   |   36.620    26
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            27         1        32        11          71

AGE = 50 to 69
CENTER$ = Tokyo
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  83.636    71.429    68.966    62.069   |   74.167    89
 Dead  |  16.364    28.571    31.034    37.931   |   25.833    31
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            55         7        29        29         120

AGE = 50 to 69
CENTER$ = Boston
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  74.359    60.000    69.231    76.923   |   72.951    89
 Dead  |  25.641    40.000    30.769    23.077   |   27.049    33
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            78         5        26        13         122

AGE = 50 to 69
CENTER$ = Glamorgn
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  76.471   100.000    65.854    76.923   |   73.394    80
 Dead  |  23.529     0.0      34.146    23.077   |   26.606    29
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            51         4        41        13         109

AGE = 70 & Over
CENTER$ = Tokyo
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  66.667   100.000    33.333    83.333   |   68.421    13
 Dead  |  33.333     0.0      66.667    16.667   |   31.579     6
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N             9         1         3         6          19

AGE = 70 & Over
CENTER$ = Boston
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  59.091   100.000    62.500    25.000   |   58.904    43
 Dead  |  40.909     0.0      37.500    75.000   |   41.096    30
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            44         1        24         4          73

AGE = 70 & Over
CENTER$ = Glamorgn
         MinBengn  MaxBengn  MinMalig  MaxMalig     Total     N
       +-----------------------------------------+
 Alive |  61.111   100.000    80.000    57.143   |   68.293    28
 Dead  |  38.889     0.0      20.000    42.857   |   31.707    13
       +-----------------------------------------+
 Total   100.000   100.000   100.000   100.000     100.000
 N            18         1        15         7          41


The percentage of women surviving for each age-by-center combination is reported in the first row of each panel. In the marginal Total down the right column, we see that the younger women treated in Tokyo have the best survival rate (84.77%). This is the row total (128) divided by the total for the stratum (151).

Example 7 Two-Way Table Statistics

For the SURVEY2 data, you study the relationship between marital status and age. This is a general table—while the categories for AGE are ordered, those for MARITAL are not. The usual Pearson chi-square statistic is used to test the association between the two factors. This statistic is the default for Crosstabs.

The data file is the usual cases-by-variables rectangular file with one record for each person. We split the continuous variable AGE into four categories and add names such as 30 to 45 for the output. There are too few separated people to tally, so here we eliminate them and reorder the categories of MARITAL that remain. To supplement the results, we request row percentages. The input is:

USE survey2
XTAB
LABEL age / .. 29='18 to 29', 30 .. 45='30 to 45', 46 .. 60='46 to 60', 60 .. ='Over 60'
LABEL marital / 2='Married', 3='Divorced', 1='Never'
PRINT / ROWPCT
TABULATE age * marital

The output follows:

Frequencies
AGE (rows) by MARITAL (columns)

            Married  Divorced  Never    Total
          +----------------------------+
 18 to 29 |    17        5       53    |    75
 30 to 45 |    48       21        9    |    78
 46 to 60 |    39       12        8    |    59
 Over 60  |    23        5        3    |    31
          +----------------------------+
 Total        127       43       73       243

Row percents
AGE (rows) by MARITAL (columns)

            Married  Divorced   Never      Total     N
          +----------------------------+
 18 to 29 |  22.667     6.667   70.667  |  100.000    75
 30 to 45 |  61.538    26.923   11.538  |  100.000    78
 46 to 60 |  66.102    20.339   13.559  |  100.000    59
 Over 60  |  74.194    16.129    9.677  |  100.000    31
          +----------------------------+
 Total       52.263    17.695   30.041    100.000
 N              127        43       73        243

Test statistic                   Value      df     Prob
Pearson Chi-square              87.761   6.000    0.000


Even though the chi-square statistic is highly significant (87.761; p value < 0.0005), in the Row percentages table, you see that 70.67% of the youngest age group fall into the never-married category. Many of these people may be too young to consider marriage.

Eliminating a Stratum

If you eliminate the subjects in the youngest group, is there an association between marital status and age? To address this question, the input is:

SELECT age > 29
PRINT / CHISQ PHI CRAMER CONT ROWPCT
TABULATE age * marital
SELECT

The resulting output is:

Frequencies
AGE (rows) by MARITAL (columns)

            Married  Divorced  Never    Total
          +----------------------------+
 30 to 45 |    48       21        9    |    78
 46 to 60 |    39       12        8    |    59
 Over 60  |    23        5        3    |    31
          +----------------------------+
 Total        110       38       20       168

Row percents
AGE (rows) by MARITAL (columns)

            Married  Divorced   Never      Total     N
          +----------------------------+
 30 to 45 |  61.538    26.923   11.538  |  100.000    78
 46 to 60 |  66.102    20.339   13.559  |  100.000    59
 Over 60  |  74.194    16.129    9.677  |  100.000    31
          +----------------------------+
 Total       65.476    22.619   11.905    100.000
 N              110        38       20        168

Test statistic                   Value      df     Prob
Pearson Chi-square               2.173   4.000    0.704

Coefficient                      Value   Asymptotic Std Error
Phi                              0.114
Cramer V                         0.080
Contingency                      0.113


The proportion of married people is larger within the Over 60 group than for the 30 to 45 group—74.19% of the former are married while 61.54% of the latter are married. The youngest stratum has the most divorced people. However, you cannot say these proportions differ significantly (chi-square = 2.173, p value = 0.704).

Example 8 Two-Way Table Statistics (Long Results)

This example illustrates LONG results and table input. It uses the AGE by CENTER$ table from the cancer study described in the frequency input example. The input is:

USE cancer
XTAB
FREQ = number
PRINT LONG
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over'
TABULATE center$ * age

The output follows:

Frequencies
CENTER$ (rows) by AGE (columns)

            Under 50  50 to 69  70 & Over    Total
          +-------------------------------+
 Boston   |      58       122        73    |   253
 Glamorgn |      71       109        41    |   221
 Tokyo    |     151       120        19    |   290
          +-------------------------------+
 Total         280       351       133        764

Expected values
CENTER$ (rows) by AGE (columns)

            Under 50  50 to 69  70 & Over
          +-------------------------------+
 Boston   |  92.723   116.234    44.043   |
 Glamorgn |  80.995   101.533    38.473   |
 Tokyo    | 106.283   133.233    50.484   |
          +-------------------------------+

Standardized deviates: (Observed-Expected)/SQR(Expected)
CENTER$ (rows) by AGE (columns)

            Under 50  50 to 69  70 & Over
          +-------------------------------+
 Boston   |  -3.606     0.535     4.363   |
 Glamorgn |  -1.111     0.741     0.407   |
 Tokyo    |   4.338    -1.146    -4.431   |
          +-------------------------------+

Test statistic                   Value      df     Prob
Pearson Chi-square              74.039   4.000    0.000
Likelihood ratio Chi-square     76.963   4.000    0.000
McNemar Symmetry Chi-square     79.401   3.000    0.000

Coefficient                      Value   Asymptotic Std Error
Phi                              0.311
Cramer V                         0.220
Contingency                      0.297
Goodman-Kruskal Gamma           -0.417    0.043
Kendall Tau-B                   -0.275    0.030
Stuart Tau-C                    -0.265    0.029
Cohen Kappa                     -0.113    0.022
Spearman Rho                    -0.305    0.033
Somers D (column dependent)     -0.267    0.030
Lambda (column dependent)        0.075    0.038
Uncertainty (column dependent)   0.049    0.011


The null hypothesis for the Pearson chi-square test is that the table factors are independent. You reject the hypothesis (chi-square = 74.039, p value < 0.0005). We are concerned about the analysis of the full table with four factors in the cancer study because we see an imbalance between AGE and study CENTER. The researchers in Tokyo entered a much larger proportion of younger women than did the researchers in the other cities.

Notice that with LONG, SYSTAT reports all statistics for an r × c table, including those that are appropriate when both factors have ordered categories (gamma, tau-b, tau-c, and Spearman’s rho).
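The expected values and standardized deviates in the LONG output come directly from the marginals. The following Python sketch (not SYSTAT code) reproduces them for the table above; the squared deviates sum to the Pearson chi-square, 74.039.

# Illustrative sketch of expected values and standardized deviates.

from math import sqrt

table = [
    [ 58, 122,  73],   # Boston
    [ 71, 109,  41],   # Glamorgn
    [151, 120,  19],   # Tokyo
]
n = sum(map(sum, table))
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

chi2 = 0.0
for i in range(3):
    for j in range(3):
        expected = row_tot[i] * col_tot[j] / n     # e.g., 92.723
        deviate = (table[i][j] - expected) / sqrt(expected)
        chi2 += deviate ** 2
        print(i, j, round(expected, 3), round(deviate, 3))
print(round(chi2, 3))   # 74.039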

Example 9 Odds Ratios

For a table with cell counts a, b, c, and d:

                  Exposure
                 yes     no
   Disease  yes   a       b
            no    c       d



where, if you designate the Disease yes people sick and the Disease no people well, the odds ratio (or cross-product ratio) equals the odds that a sick person is exposed divided by the odds that a well person is exposed, or:

   odds ratio = (a/b) / (c/d) = (ad) / (bc)

If the odds for the sick and disease-free people are the same, the value of the odds ratio is 1.0.

As an example, use the SURVEY2 file and study the association between gender and depressive illness. Be careful to order your table factors so that your odds ratio is constructed correctly (we use LABEL to do this). The input is:

USE survey2
XTAB
LABEL casecont / 1='Depressed', 0='Normal'
PRINT / FREQ ODDS
TABULATE sex$ * casecont

The output is:

Frequencies
SEX$ (rows) by CASECONT (columns)

           Depressed  Normal    Total
         +---------------------+
 Female  |     36       116    |   152
 Male    |      8        96    |   104
         +---------------------+
 Total         44       212       256

Test statistic                   Value      df     Prob
Pearson Chi-square              11.095   1.000    0.001

Coefficient                      Value   Asymptotic Std Error
Odds Ratio                       3.724
Ln(Odds)                         1.315    0.415

The odds that a female is depressed are 36 to 116, the odds for a male are 8 to 96, and the odds ratio is 3.724. Thus, in this sample, females are almost four times more likely to be depressed than males. But does our sample estimate differ significantly from 1.0? Because the distribution of the odds ratio is very skewed, significance is determined by examining Ln(Odds), the natural logarithm of the ratio, and the standard error of the transformed ratio. Note the symmetry when ratios are transformed:

   Ratio   Ln(Ratio)
     3       ln 3
     2       ln 2
     1         0
    1/2     -ln 2
    1/3     -ln 3

The value of Ln(Odds) here is 1.315 with a standard error of 0.415. Constructing an approximate 95% confidence interval using the statistic plus or minus two times its standard error:

   1.315 ± 2 × 0.415 = 1.315 ± 0.830

results in:

   0.485 < Ln(Odds) < 2.145

Because 0 is not included in the interval, Ln(Odds) differs significantly from 0, and the odds ratio differs from 1.0.

Using the calculator to take antilogs of the limits. You can use SYSTAT’s calculator to take antilogs of the limits, EXP(0.485) and EXP(2.145), and obtain a confidence interval for the odds ratio:

   e^0.485 < odds ratio < e^2.145

that is,

   1.624 < odds ratio < 8.542

For the lower limit, for example, type CALC EXP(0.485).

Notice that the proportion of females who are depressed is 0.2368 (from a table of row percentages not displayed here) and the proportion of males is 0.0769, so you also reject the hypothesis of equality of proportions (chi-square = 11.095, p value = 0.001).
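The entire computation can be cross-checked outside SYSTAT. The following Python sketch assumes the usual large-sample standard error of Ln(Odds), sqrt(1/a + 1/b + 1/c + 1/d) (see, for example, Fleiss, 1981), which matches the 0.415 reported above.

# Illustrative sketch of the odds-ratio confidence interval.

from math import exp, log, sqrt

a, b = 36, 116   # depressed, normal females
c, d = 8, 96     # depressed, normal males

odds_ratio = (a * d) / (b * c)       # 3.724
ln_odds = log(odds_ratio)            # 1.315
se = sqrt(1/a + 1/b + 1/c + 1/d)     # 0.415

low, high = ln_odds - 2 * se, ln_odds + 2 * se
print(round(odds_ratio, 3), round(ln_odds, 3), round(se, 3))
print(round(exp(low), 3), round(exp(high), 3))  # roughly 1.62 to 8.54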


Example 10 Fisher’s Exact Test

Let’s say that you are interested in how salaries of female executives compare with those of male executives at a particular firm. The accountant there will not give you salaries in dollar figures but does tell you whether the executives’ salaries are low or high:

            Low   High
   Male      2      7
   Female    5      1

The sample size is very small. When a table has only two rows and two columns and PRINT=MEDIUM is set as the length, SYSTAT reports results of five additional tests and measures: Fisher’s exact test, the odds ratio (and Ln(Odds)), Yates’ corrected chi-square, and Yule’s Q and Y. By setting PRINT=SHORT, you request three of these: Fisher’s exact test, the chi-square test, and Yates’ corrected chi-square. The input is:

USE salary
XTAB
FREQ = count
LABEL sex / 1='male', 2='female'
LABEL earnings / 1='low', 2='high'
PRINT / FISHER CHISQ YATES
TABULATE sex * earnings

The output follows:

Frequencies
SEX (rows) by EARNINGS (columns)

           low   high    Total
         +---------------+
 male    |   2      7    |    9
 female  |   5      1    |    6
         +---------------+
 Total       7      8       15

WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
         Significance tests computed on this table are suspect.

Test statistic                   Value      df     Prob
Pearson Chi-square               5.402   1.000    0.020
Yates corrected Chi-square       3.225   1.000    0.073
Fisher exact test (two-tail)                      0.041


Notice that SYSTAT warns you that the results are suspect because the counts in the table are too low (sparse). Technically, the message states that more than one-fifth of the cells have expected values (fitted values) of less than 5.

The p value for the Pearson chi-square (0.020) leads you to believe that SEX and EARNINGS are not independent. But there is a warning about suspect results. This warning applies to the Pearson chi-square test but not to Fisher’s exact test. Fisher’s test counts all possible outcomes exactly, including the ones that produce an interaction greater than what you observe. The Fisher exact test p value is also significant. On this basis, you reject the null hypothesis of independence (no interaction between SEX and EARNINGS).
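If you want to verify the exact p value outside SYSTAT, most statistical libraries provide it. Here is a minimal sketch using Python with scipy (not part of SYSTAT):

# Illustrative check of the Fisher exact p value reported above.

from scipy.stats import fisher_exact

table = [[2, 7],   # male:   low, high
         [5, 1]]   # female: low, high

odds_ratio, p_two_tail = fisher_exact(table, alternative='two-sided')
print(round(p_two_tail, 3))   # 0.041, matching the SYSTAT output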

Sensitivity

Results for small samples, however, can be fairly sensitive. One case can matter. What if the accountant forgets one well-paid male executive?

Frequencies
SEX (rows) by EARNINGS (columns)

           low   high    Total
         +---------------+
 male    |   2      6    |    8
 female  |   5      1    |    6
         +---------------+
 Total       7      7       14

WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
         Significance tests computed on this table are suspect.

Test statistic                   Value      df     Prob
Pearson Chi-square               4.667   1.000    0.031
Yates corrected Chi-square       2.625   1.000    0.105
Fisher exact test (two-tail)                      0.103

The results of the Fisher exact test indicate that you cannot reject the null hypothesis of independence. It is too bad that you do not have the actual salaries. Much information is lost when a quantitative variable like salary is dichotomized into LOW and HIGH.

What Is a Small Expected Value?

In larger contingency tables, you do not want to see any expected values less than 1.0 or more than 20% of the values less than 5. For large tables with too many small expected values, there is no remedy but to combine categories or possibly omit a category that has very few observations.


Example 11 Cochran’s Test of Linear Trend

When one table factor is dichotomous and the other has three or more ordered categories (for example, low, medium, and high), Cochran’s test of linear trend is used to test the null hypothesis that the slope of a regression line across the proportions is 0. For example, in studying the relation of depression to education, you form a depression-by-education table for the SURVEY2 data and plot the proportion depressed (the table appears in the output below).

If you regress the proportions on scores 1, 2, 3, and 4 assigned by SYSTAT to the ordered categories, you can test whether the slope is significant.

This is what we do in this example. We also explore the relation of depression to health. The input is:

USE survey2
XTAB
LABEL casecont / 1='Depressed', 0='Normal'
LABEL educatn / 1,2='Dropout', 3='HS grad', 4,5='College', 6,7='Degree +'
LABEL healthy / 1='Excellent', 2='Good', 3,4='Fair/Poor'
PRINT / FREQ COLPCT COCHRAN
TABULATE casecont * educatn
TABULATE casecont * healthy


The output is:

Frequencies
CASECONT (rows) by EDUCATN (columns)

             Dropout  HS grad  College  Degree +    Total
           +-----------------------------------------+
 Depressed |    14       18       11        1       |    44
 Normal    |    36       80       75       21       |   212
           +-----------------------------------------+
 Total          50       98       86       22          256

Column percents
CASECONT (rows) by EDUCATN (columns)

             Dropout  HS grad  College  Degree +     Total     N
           +-----------------------------------------+
 Depressed |  28.000   18.367   12.791    4.545     |  17.187    44
 Normal    |  72.000   81.633   87.209   95.455     |  82.813   212
           +-----------------------------------------+
 Total       100.000  100.000  100.000  100.000       100.000
 N                50       98       86       22           256

Test statistic                   Value      df     Prob
Pearson Chi-square               7.841   3.000    0.049
Cochran’s Linear Trend           7.681   1.000    0.006

As the level of education increases, the proportion of depressed subjects decreases (Cochran’s Linear Trend = 7.681, df = 1, and Prob (p value) = 0.006). Of those not graduating from high school (Dropout), 28% are depressed, and 4.55% of those with advanced degrees are depressed. Notice that the Pearson chi-square is marginally significant (p value = 0.049). It simply tests the hypothesis that the four proportions are equal rather than decreasing linearly.

Frequencies
CASECONT (rows) by HEALTHY (columns)

             Excellent   Good   Fair/Poor    Total
           +-------------------------------+
 Depressed |     16       15       13      |    44
 Normal    |    105       78       29      |   212
           +-------------------------------+
 Total          121       93       42         256

Column percents
CASECONT (rows) by HEALTHY (columns)

             Excellent    Good   Fair/Poor     Total     N
           +-------------------------------+
 Depressed |   13.223   16.129    30.952   |  17.187    44
 Normal    |   86.777   83.871    69.048   |  82.813   212
           +-------------------------------+
 Total       100.000   100.000   100.000    100.000
 N               121        93        42        256

Test statistic                   Value      df     Prob
Pearson Chi-square               7.000   2.000    0.030
Cochran’s Linear Trend           5.671   1.000    0.017


In contrast to education, the proportion of depressed subjects tends to increase linearly as health deteriorates (p value = 0.017). Only 13% of those in excellent health are depressed, whereas 31% of cases with fair or poor health report depression.
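Cochran’s statistic can be computed directly from the cell counts. The following Python sketch (not SYSTAT code) uses the trend chi-square in the form often attributed to Cochran and Armitage, with the scores 1, 2, 3, 4 assigned to the ordered education categories; it reproduces the 7.681 reported above.

# Illustrative sketch of Cochran's test of linear trend.

depressed = [14, 18, 11, 1]     # Dropout, HS grad, College, Degree +
totals    = [50, 98, 86, 22]
scores    = [1, 2, 3, 4]

n = sum(totals)
p_bar = sum(depressed) / n      # overall proportion depressed

# Squared deviation of the score-weighted successes from expectation
num = (sum(t * r for t, r in zip(scores, depressed))
       - p_bar * sum(t * m for t, m in zip(scores, totals))) ** 2
den = p_bar * (1 - p_bar) * (
      sum(m * t * t for t, m in zip(scores, totals))
      - sum(t * m for t, m in zip(scores, totals)) ** 2 / n)

chi2_trend = num / den
print(round(chi2_trend, 3))     # 7.681 on 1 df, as reported above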

Example 12 Tables with Ordered Categories

In this example, we focus on statistics for studies in which both table factors have a few ordered categories. For example, a teacher evaluating the activity level of schoolchildren may feel that she can’t score them from 1 to 20 but that she could categorize the activity of each child as sedentary, normal, or hyperactive. Here you study the relation of health status to age. If the category codes are character-valued, you must indicate the correct ordering (as opposed to the default alphabetical ordering).

For Spearman’s rho, instead of using actual data values, the indices of the categories are used to compute the usual correlation. Gamma measures the probability of getting like (as opposed to unlike) orders of values. Its numerator is identical to that of Kendall’s tau-b and Stuart’s tau-c. The input is:

USE survey2
XTAB
LABEL healthy / 1='Excellent', 2='Good', 3,4='Fair/Poor'
LABEL age / .. 29='18 to 29', 30 .. 45='30 to 45', 46 .. 60='46 to 60', 60 .. ='Over 60'
PRINT / FREQ ROWP GAMMA RHO
TABULATE healthy * age

The output follows:

Frequencies
HEALTHY (rows) by AGE (columns)

             18 to 29  30 to 45  46 to 60  Over 60    Total
           +-----------------------------------------+
 Excellent |    43        48        25        5      |   121
 Good      |    30        23        24       16      |    93
 Fair/Poor |     6         9        15       12      |    42
           +-----------------------------------------+
 Total          79        80        64       33         256

Row percents
HEALTHY (rows) by AGE (columns)

             18 to 29  30 to 45  46 to 60  Over 60     Total     N
           +-----------------------------------------+
 Excellent |  35.537    39.669    20.661    4.132    |  100.000   121
 Good      |  32.258    24.731    25.806   17.204    |  100.000    93
 Fair/Poor |  14.286    21.429    35.714   28.571    |  100.000    42
           +-----------------------------------------+
 Total        30.859    31.250    25.000   12.891      100.000
 N                79        80        64       33          256

Test statistic                   Value      df     Prob
Pearson Chi-square              29.380   6.000    0.000

Coefficient                      Value   Asymptotic Std Error
Goodman-Kruskal Gamma            0.346    0.072
Spearman Rho                     0.274    0.058


Not surprisingly, as age increases, health status tends to deteriorate. In the table of row percentages, notice that among those with EXCELLENT health, 4.13% are in the oldest age group; in the GOOD category, 17.2% are in the oldest group; and in the FAIR/POOR category, 28.57% are in the oldest group.

The value of gamma is 0.346; rho is 0.274. Here are confidence intervals (Value ± 2 * Asymptotic Std Error) for each statistic:

   gamma:   0.202 <= 0.346 <= 0.490
   rho:     0.158 <= 0.274 <= 0.390

Because 0 is in neither interval, you conclude that there is an association between health and age.
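Gamma itself is easy to compute from the concordant and discordant pairs: gamma = (C - D) / (C + D). A minimal Python sketch (not SYSTAT code) for the HEALTHY by AGE table above:

# Illustrative sketch of Goodman-Kruskal's gamma.

table = [
    [43, 48, 25,  5],   # Excellent
    [30, 23, 24, 16],   # Good
    [ 6,  9, 15, 12],   # Fair/Poor
]
r, c = len(table), len(table[0])

C = D = 0
for i in range(r):
    for j in range(c):
        for k in range(i + 1, r):
            # pairs ordered the same way (down and to the right)
            C += table[i][j] * sum(table[k][j + 1:])
            # pairs ordered oppositely (down and to the left)
            D += table[i][j] * sum(table[k][:j])

gamma = (C - D) / (C + D)
print(round(gamma, 3))   # 0.346, matching the output above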

Example 13 McNemar’s Test of Symmetry

In November of 1993, the U.S. Congress approved the North American Free Trade Agreement (NAFTA). Let’s say that two months before the approval and before the televised debate between Vice President Al Gore and businessman Ross Perot, political pollsters queried a sample of 350 people, asking “Are you for, unsure, or against NAFTA?” Immediately after the debate, the pollsters contacted the same people and asked the question a second time. Here are the responses:

                        After
                For   Unsure   Against
         For     51     22       28
 Before  Unsure  46     18       27
         Against 52     49       57

The pollsters wonder, “Is there a shift in opinion about NAFTA?” The study design for the answer is similar to a paired t test—each subject has two responses. The row and column categories of our table are the same variable measured at different points in time.


The file NAFTA contains these data. To test for an opinion shift, the input is:

USE nafta
XTAB
FREQ = count
ORDER before$ after$ / SORT='for','unsure','against'
PRINT / FREQ MCNEMAR CHI PERCENT
TABULATE before$ * after$

We use ORDER to ensure that the row and column categories are ordered the same. The output follows:

Frequencies
BEFORE$ (rows) by AFTER$ (columns)

            for   unsure   against    Total
          +-------------------------+
 for      |  51      22        28   |   101
 unsure   |  46      18        27   |    91
 against  |  52      49        57   |   158
          +-------------------------+
 Total      149      89       112       350

Percents of total count
BEFORE$ (rows) by AFTER$ (columns)

             for    unsure  against     Total     N
          +-------------------------+
 for      | 14.571   6.286   8.000  |   28.857   101
 unsure   | 13.143   5.143   7.714  |   26.000    91
 against  | 14.857  14.000  16.286  |   45.143   158
          +-------------------------+
 Total      42.571  25.429  32.000    100.000
 N             149      89     112        350

Test statistic                   Value      df     Prob
Pearson Chi-square              11.473   4.000    0.022
McNemar Symmetry Chi-square     22.039   3.000    0.000

The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those along the diagonal are not used in the computations). We are investigating the direction of change in opinion. First, how many respondents became more negative about NAFTA?

� Among those who initially responded For, 22 (6.29%) are now Unsure and 28 (8%) are now Against.

� Among those who were Unsure before the debate, 27 (7.71%) answered Against afterwards.


The three cells in the upper right contain counts for those who became more unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in the lower left contain counts for people who became more positive about NAFTA (46, 52, and 49) or 42% of the sample.

The null hypothesis for the McNemar test is that the changes in opinion are equal. The chi-square statistic for this test is 22.039 with 3 df and p < 0.0005. You reject the null hypothesis. The pro-NAFTA shift in opinion is significantly greater than the anti-NAFTA shift.

You also clearly reject the null hypothesis that the row (BEFORE$) and column (AFTER$) factors are independent (chi-square = 11.473; p = 0.022). However, a test of independence does not answer your original question about change of opinion and its direction.
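The McNemar symmetry chi-square is computed from the off-diagonal counts alone. A minimal Python sketch (not SYSTAT code) that reproduces the 22.039 above:

# Illustrative sketch of the McNemar symmetry chi-square:
# sum over pairs i < j of (n_ij - n_ji)^2 / (n_ij + n_ji),
# with r(r-1)/2 degrees of freedom.

table = [
    [51, 22, 28],   # Before: for
    [46, 18, 27],   # Before: unsure
    [52, 49, 57],   # Before: against
]
r = len(table)

chi2 = sum((table[i][j] - table[j][i]) ** 2 / (table[i][j] + table[j][i])
           for i in range(r) for j in range(i + 1, r))
print(round(chi2, 3))   # 22.039 on 3 df, as reported above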

Example 14 Confidence Intervals for One-Way Table Percentages

If your data are binomially or multinomially distributed, you may want confidence intervals on the cell proportions. SYSTAT’s confidence intervals are based on an approximation by Bailey (1980). Crosstabs uses that reference’s approximation number 6 with a continuity correction, which closely fits the real intervals for the binomial on even small samples and performs well when population proportions are near 0 or 1. The confidence intervals are scaled on a percentage scale for compatibility with the other Crosstabs output.

Here is an example using data from Davis (1977) on the number of buses failing after driving a given distance (1 of 10 distances). Print the percentages of the 191 buses failing in each distance category to see the coverage of the intervals. The input follows:

USE buses
XTAB
FREQ = count
PRINT NONE / FREQ PERCENT
TABULATE distance / CONFI=.95


The resulting output is:

Frequencies
Values for DISTANCE
      1    2    3    4    5    6    7    8    9   10    Total
   +--------------------------------------------------+
   |  6   11   16   25   34   46   33   16    2    2  |   191
   +--------------------------------------------------+

Percents of total count
Values for DISTANCE
       1      2      3       4       5       6       7
   +------------------------------------------------------+
   | 3.141  5.759  8.377  13.089  17.801  24.084  17.277  |
   +------------------------------------------------------+
       8      9     10      Total     N
   +----------------------+
   | 8.377  1.047  1.047  |  100.000   191
   +----------------------+

95 percent approximate confidence intervals scaled as cell percents
Values for DISTANCE
       1       2       3       4       5       6       7
   +---------------------------------------------------------+
   | 8.234  11.875  15.259  20.996  26.447  33.420  25.852   |
   | 0.548   1.903   3.552   6.905  10.560  15.737  10.142   |
   +---------------------------------------------------------+
       8       9      10
   +------------------------+
   | 15.259  4.914  4.914  |
   |  3.552  0.0    0.0    |
   +------------------------+

There are 6 buses in the first distance category; this is 3.14% of the 191 buses. The confidence interval for this percentage ranges from 0.55 to 8.23%.

Example 15 Mantel-Haenszel Test

For any k × 2 × 2 table, if the output mode is MEDIUM or if you select the Mantel-Haenszel test, SYSTAT produces the Mantel-Haenszel statistic without continuity correction. This tests the association between two binary variables controlling for a stratification variable. The Mantel-Haenszel test is often used to test the effectiveness of a treatment on an outcome, to test the degree of association between the presence or absence of a risk factor and the occurrence of a disease, or to compare two survival distributions.



A study by Ansfield, et al. (1977) examined the responses of two different groups of patients (colon or rectum cancer and breast cancer) to two different treatments:

   CANCER$        TREAT$   RESPONSE$   NUMBER
   Colon-Rectum   a        Positive    16.000
   Colon-Rectum   b        Positive     7.000
   Colon-Rectum   a        Negative    32.000
   Colon-Rectum   b        Negative    45.000
   Breast         a        Positive    14.000
   Breast         b        Positive     9.000
   Breast         a        Negative    28.000
   Breast         b        Negative    29.000

Here are the data rearranged:

                     Breast Cancer         Colon-Rectum
                  Positive  Negative    Positive  Negative
   Treatment A       14        28          16        32
   Treatment B        9        29           7        45

The odds ratio (cross-product ratio) for the first table is:

   odds (biopsy positive, given treatment A)     14/28
   ----------------------------------------- =   -----
   odds (biopsy positive, given treatment B)      9/29

or

   14/28
   ----- = 1.6
    9/29

Similarly, for the second table, the odds ratio is:

   16/32
   ----- = 3.2
    7/45

If the odds for treatments A and B are identical, the ratios would both be 1.0. For these data, the breast cancer patients on treatment A are 1.6 times more likely to have a positive biopsy than patients on treatment B; for the colon-rectum patients, those on treatment A are 3.2 times more likely to have a positive biopsy than those on treatment B. But can you say these estimates differ significantly from 1.0?


After adjusting for the total frequency in each table, the Mantel-Haenszel statistic combines odds ratios across tables. The input is:

USE ansfield
XTAB
FREQ = number
ORDER response$ / SORT='Positive','Negative'
PRINT / MANTEL
TABULATE cancer$ * treat$ * response$

The stratification variable (CANCER$) must be the first variable listed on TABULATE. The output is:

Frequencies
TREAT$ (rows) by RESPONSE$ (columns)

CANCER$ = Breast
          Positive  Negative     Total
        +---------------------+
   a    |    14        28     |     42
   b    |     9        29     |     38
        +---------------------+
   Total      23        57         80

CANCER$ = Colon-Rectum
          Positive  Negative     Total
        +---------------------+
   a    |    16        32     |     48
   b    |     7        45     |     52
        +---------------------+
   Total      23        77        100

Test statistic                     Value
Mantel-Haenszel statistic =        2.277
Mantel-Haenszel Chi-square =       4.739
Probability =                      0.029

SYSTAT prints a chi-square test for testing whether this combined estimate equals 1.0 (that is, that the odds for A and B are the same). The probability associated with this chi-square is 0.029, so you reject the hypothesis that the odds ratio is 1.0 and conclude that treatment A is less effective—more patients on treatment A have positive biopsies after treatment than patients on treatment B.

One assumption required for the Mantel-Haenszel chi-square test is that the odds ratios are homogeneous across tables. For this example, the second odds ratio is twice as large as the first. You can use loglinear models to test whether a cancer-by-treatment interaction is needed to fit the cells of the three-way table defined by cancer, treatment, and response. The difference between this model and one without the interaction was not significant (a chi-square of 0.36 with 1 df).
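The combined estimate is a weighted ratio of cross-products across strata. A minimal Python sketch (not SYSTAT code) reproducing the 2.277 above:

# Illustrative sketch of the Mantel-Haenszel combined odds ratio:
# sum over strata of (a*d/n) divided by sum over strata of (b*c/n).

strata = [
    # (a, b, c, d) = (A positive, A negative, B positive, B negative)
    (14, 28,  9, 29),   # Breast
    (16, 32,  7, 45),   # Colon-Rectum
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
print(round(num / den, 3))   # 2.277, matching the statistic above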


Computation

All computations are in double precision.

References

Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.: Lifetime Learning.

Ansfield, F., et al. (1977). A phase III study comparing the clinical utility of four regimens of 5-fluorouracil. Cancer, 39, 34–40.

Bailey, B. J. R. (1980). Large sample simultaneous confidence intervals for the multinomial probabilities based on transformations of the cell frequencies. Technometrics, 22, 583–589.

Davis, D. J. (1977). An analysis of some failure data. Journal of the American Statistical Association, 72, 113–150.

Fleiss, J. L. (1981). Statistical methods for rates and proportions. 2nd ed. New York: John Wiley & Sons, Inc.

Morrison, A. S., Black, M. M., Lowe, C. R., MacMahon, B., and Yuasa, S. Y. (1973). Some international differences in histology and survival in breast cancer. International Journal of Cancer, 11, 261–267.

Chapter 9

Descriptive Statistics

Leland Wilkinson and Laszlo Engelman

There are many ways to describe data, although not all descriptors are appropriate for a given sample. Means and standard deviations are useful for data that follow a normal distribution, but are poor descriptors when the distribution is highly skewed or has outliers, subgroups, or other anomalies. Some statistics, such as the mean and median, describe the center of a distribution. These estimates are called measures of location. Others, such as the standard deviation, describe the spread of the distribution.

Before deciding what you want to describe (location, spread, and so on), you should consider what type of variables are present. Are the values of a variable unordered categories, ordered categories, counts, or measurements?

For many statistical purposes, counts are treated as measured variables. Such variables are called quantitative if it makes sense to do arithmetic on their values. Means and standard deviations are appropriate for quantitative variables that follow a normal distribution. Often, however, real data do not meet this assumption of normality. A descriptive statistic is called robust if the calculations are insensitive to violations of the assumption of normality. Robust measures include the median, quartiles, frequency counts, and percentages.

Before requesting descriptive statistics, first scan graphical displays to see if the shape of the distribution is symmetric, if there are outliers, and if the sample has subpopulations. If the latter is true, then the sample is not homogeneous, and the statistics should be calculated for each subgroup separately.

Descriptive Statistics offers the usual mean, standard deviation, and standard error appropriate for data that follow a normal distribution. It also provides the median, minimum, maximum, and range. A confidence interval for the mean and standard errors for skewness and kurtosis can be requested. A stem-and-leaf plot is available for assessing distributional shape and identifying outliers. Moreover, Descriptive Statistics provides stratified analyses—that is, you can request results separately for each level of a grouping variable (such as SEX$) or for each combination of levels of two or more grouping variables.

Statistical Background

Descriptive statistics are numerical summaries of batches of numbers. Inevitably, these summaries are misleading, because they mask details of the data. Without them, however, we would be lost in particulars.

There are many ways to describe a batch of data. Not all are appropriate for every batch, however. Let’s look at the Who’s Who data from Chapter 1 to see what this means. First of all, here is a stem-and-leaf diagram of the ages of 50 randomly sampled people from Who’s Who. A stem-and-leaf diagram is a tally; it shows us the distribution of the AGE values.

Stem and leaf plot of variable: AGE, N = 50

Minimum:      34.000
Lower hinge:  49.000
Median:       56.000
Upper hinge:  66.000
Maximum:      81.000

   3   4
   3   689
   4   14
   4 H 556778999
   5   0011112
   5 M 556688889
   6   0023
   6 H 55677789
   7   04
   7   5668
   8   1

Notice that these data look fairly symmetric and lumpy in the middle. A natural way to describe this type of distribution would be to report its center and the amount of spread.

Location

How do we describe the center, or central location, of the distribution on a scale? One way is to pick the value above which half of the data values fall and, by implication, below which half of the data values fall. This measure is called the median. For our AGE data, the median age is 56 years. Another measure of location is the “center of gravity” of the numbers. Think of turning the stem-and-leaf diagram on its side and balancing it. The balance point would be the mean. For a batch of numbers, the mean is computed by averaging the values. In our sample, the mean age is 56.7 years. It is quite close to the median.

Spread

One way to measure spread is to take the difference between the largest and smallest value in the data. This is called the range. For the age data, the range is 47 years. Another measure, called the interquartile range or midrange, is the difference between the values at the limits of the middle 50% of the data. For AGE, this is 17 years. (Using the statistics at the top of the stem-and-leaf display, subtract the lower hinge from the upper hinge.) Still another way to measure would be to compute the average variability in the values. The standard deviation is the square root of the average squared deviation of values from the mean. For the AGE variable, the standard deviation is 11.62. Following is some output from STATS:

AGE

 N of cases         50
 Mean           56.700
 Standard Dev   11.620

The Normal Distribution

All of these measures of location and spread have their advantages and disadvantages, but the mean and standard deviation are especially useful for describing data that follow a normal distribution. The normal distribution is a mathematical curve with only two parameters in its equation: the mean and standard deviation. As you recall from Chapter 1, a parameter defines a family of mathematical functions, all of which have the same general shape. Thus, if data come from a normal distribution, we can describe them completely (except for random variation) with only a mean and standard deviation.

Let’s see how this works for our AGE data. Shown in the next figure is a histogram of AGE with the normal curve superimposed. The location (center) of this curve is at the mean age of the sample (56.7), and its spread is determined by the standard deviation (11.62).


The fit of the curve to the data looks excellent. Let’s examine the fit in more detail. For a normal distribution, we would expect 68% of the observations to fall between one standard deviation below the mean and one standard deviation above the mean (45.1 to 68.3 years). By counting values in the stem-and-leaf diagram, we find 34 cases—on target. This is not to say that the data follow a normal distribution exactly, however. If we looked further, we would find that the tails of this distribution are slightly shorter than those of a normal distribution, but not by enough to worry about.
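To request summaries like the ones used here, you submit a STATISTICS command. A minimal sketch, assuming the Chapter 1 sample has been saved in a SYSTAT file named WHOSWHO (the file name is our assumption; the statistic options are the ones documented later in this chapter):

STATISTICS
USE whoswho
STATISTICS age / MEAN SD MEDIAN RANGE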

Non-Normal Shape

Before you compute means and standard deviations on everything in sight, however, let’s take a look at some more data: the USDATA data. Following are histograms for the first two variables, ACCIDENT and CARDIO:

[Histograms of ACCIDENT and CARDIO with normal curves superimposed]


Notice that the normal curves fit the distributions poorly. ACCIDENT is positively skewed. That is, it has a long right tail. CARDIO, on the other hand, is negatively skewed. It has a long left tail. The means (44.3 and 398.5) clearly do not fall in the centers of the distributions. Furthermore, if you calculate the medians using the Stem display, you will see that the mean for ACCIDENT is pulled away from the median (41.9) toward the upper tail and the mean for CARDIO is pulled to the left of the median (416.2).

In short, means and standard deviations are not good descriptors for non-normal data. In these cases, you have two alternatives: either transform your data to look normal, or find other descriptive statistics that characterize the data. If you log the values of ACCIDENT, for example, the histogram looks quite normal. If you square the values of CARDIO, the normal fit similarly improves.
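In command terms, such transformations take a single LET statement each. A sketch for the USDATA file (the L10 function is the one used in the stem-and-leaf transformation example later in this chapter; the new variable names logacc and cardsq are our own):

STATISTICS
USE usdata
LET logacc = L10(accident)
LET cardsq = cardio*cardio
STEM logacc cardsq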

If a transformation doesn’t work, then you may be looking at data that come from a different mathematical distribution or are mixtures of subpopulations (see below). The probability plots in SYSTAT can help you identify certain mathematical distributions. There is not room here to discuss parameters for more complex probability distributions. Otherwise, you should turn to distribution-free summary statistics to characterize your data: the median, range, minimum, maximum, midrange, quartiles, and percentiles.

Subpopulations

Sometimes, distributions can look non-normal because they are mixtures of different normal distributions. Let’s look at the Fisher/Anderson IRIS flower measurements. Following is a histogram of PETALLEN (petal length) smoothed by a normal curve:

[Histogram of PETALLEN with a normal curve superimposed]


We forgot to notice that the petal length measurements involve three different flower species. You can see one of them at the left. The other two are blended at the right. Computing a mean and standard deviation on the mixed data is misleading.

The following box plot, split by species, shows how different the subpopulations are:

[Box plot of PETALLEN grouped by SPECIES]

When there are such differences, you should compute basic statistics by group. If you want to go on to test whether the differences in subpopulation means are significant, use analysis of variance.

But first notice that the Setosa flowers (Group 1) have the shortest petals and the smallest spread, while the Virginica flowers (Group 3) have the longest petals and the widest spread. That is, the size of the cell mean is related to the size of the cell standard deviation. This violates the assumption of equal variances necessary for a valid analysis of variance.
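Computing the statistics by group, as suggested above, takes one BY statement. A sketch for the IRIS file (the BY syntax matches the grouped examples later in this chapter; the file and variable names follow the text):

STATISTICS
USE iris
BY species
STATISTICS petallen / MEAN SD
BY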

Here, we log transform the plot scale:

[Box plot of PETALLEN grouped by SPECIES, with the vertical scale log transformed]

The spreads of the three distributions are now more similar. For the analysis, we should log transform the data.

Descriptive Statistics in SYSTAT

Basic Statistics Main Dialog Box

To open the Basic Statistics main dialog box, from the menus choose:

Statistics
  Descriptive Statistics
    Basic Statistics…

The following options are available:

• All Options. Calculate all available statistics.

• N. The number of nonmissing values for the variable.

• Minimum. The smallest nonmissing value.

• Maximum. The largest nonmissing value.

• Sum. The total of all nonmissing values of a variable.

• Mean. The arithmetic mean of a variable—the sum of the values divided by the number of (nonmissing) values.

• SEM. The standard error of the mean is the standard deviation divided by the square root of the sample size. It estimates the average deviation of sample means from the expected value of the variable.



• Median. The median estimates the center of a distribution. If the data are sorted in increasing order, the median is the value above which half of the values fall.

• SD. Standard deviation, a measure of spread, is the square root of the sum of the squared deviations of the values from the mean divided by (n–1).

• CV. The coefficient of variation is the standard deviation divided by the sample mean.

• Range. The difference between the minimum and the maximum values.

• Variance. The mean of the squared deviations of values from the mean. (Variance is the standard deviation squared.)

• Skewness. A measure of the symmetry of a distribution about its mean. If skewness is significantly nonzero, the distribution is asymmetric. A significant positive value indicates a long right tail; a negative value, a long left tail. A skewness coefficient is considered significant if the absolute value of SKEWNESS / SES is greater than 2.

• SES. The standard error of skewness, computed as SQR(6/n).

• Kurtosis. A value of kurtosis significantly greater than 0 indicates that the variable has longer tails than those for a normal distribution; less than 0 indicates that the distribution is flatter than a normal distribution. A kurtosis coefficient is considered significant if the absolute value of KURTOSIS / SEK is greater than 2.

• SEK. The standard error of kurtosis, computed as SQR(24/n).

• Confidence. Confidence level for the confidence interval of the mean. Enter a value between 0 and 1. (0.95 and 0.99 are typical values.)

• CI of Mean. Endpoints for the confidence interval of the mean. You can specify confidence values between 0 and 1.

• N-tiles. Values that divide a sample of data into “N” groups containing (as far as possible) equal numbers of observations.

• Percentiles. Values that divide a sample of data into one hundred groups containing (as far as possible) equal numbers of observations.
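As a quick illustration of the significance rule for skewness (simple arithmetic, not SYSTAT output): with n = 50 cases, SES = SQR(6/50), or about 0.35, so an observed skewness of 0.80 gives 0.80 / 0.35, or about 2.3, which exceeds 2 and would be judged significantly nonzero.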

Saving Basic Statistics to a File

If you are saving statistics to a file, you must select the format in which the statistics are to be saved:

• Variables. Use with a By Groups variable to save selected statistics to a data file. Each selected statistic is a case in the new data file (both the statistic and the group(s) are identified). The file contains the variable STATISTIC$ identifying the statistics.

• Aggregate. Saves aggregate statistics to a data file. For each By Groups category, a record (case) in the new data file contains all requested statistics. Three characters are appended to the first eight letters of the variable name to identify the statistics. The first two characters identify the statistic. The third character represents the order in which the variables are selected. The statistics correspond to the following two-letter combinations:

N of cases   NU        Std. Error      SE
Minimum      MI        Std. Deviation  SD
Maximum      MA        Variance        VA
Range        RA        C.V.            CV
Sum          SU        Skewness        SK
Median       MD        SE Skewness     ES
Mean         ME        Kurtosis        KU
CI Upper     CU        SE Kurtosis     EK
CI Lower     CL

Stem Main Dialog Box

To open the Stem main dialog box, from the menus choose:

Statistics
  Descriptive Statistics
    Stem-and-Leaf…

Stem creates a stem-and-leaf plot for one or more variables. The plot shows the distribution of a variable graphically. In a stem-and-leaf plot, the digits of each number are separated into a stem and a leaf. The stems are listed as a column on the left, and the leaves for each stem are in a row on the right. Stem-and-leaf plots also list the minimum, lower-hinge, median, upper-hinge, and maximum values of the sample.



Unlike histograms, stem-and-leaf plots show actual numeric values to the precision of the leaves.

The stem-and-leaf plot is useful for assessing distributional shape and identifying outliers. Values that are markedly different from the others in the sample are labeled as outside values—that is, the value is more than 1.5 hspreads outside its hinge (the hspread is the distance between the lower and upper hinges, or quartiles). Under normality, this translates into roughly 2.7 standard deviations from the mean.
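The 2.7 figure can be checked directly (a quick derivation, not part of the original text): for a normal distribution, the hinges fall about 0.674 standard deviations on either side of the mean, so the hspread is about 1.349 standard deviations, and the outside-value cutoff is 0.674 + 1.5 × 1.349, or roughly 2.70 standard deviations from the mean.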

The following must be specified to obtain a stem-and-leaf plot:

• Variable(s). A separate stem-and-leaf plot is created for each selected variable.

In addition, you can indicate how many lines (stems) to include in the plot.

Cronbach Main Dialog Box

To open the Cronbach main dialog box, from the menus choose:

Statistics
  Scale
    Cronbach’s Alpha…

Cronbach computes Cronbach’s alpha. This statistic is a lower bound for test reliability and ranges in value from 0 to 1 (negative values can occur when items are negatively correlated). Alpha can be viewed as the correlation between the items (variables) selected and all other possible tests or scales (with the same number of items) constructed to measure the characteristic of interest. The formula used to calculate alpha is:

alpha = [ k × avg(cov) / avg(var) ] / [ 1 + (k – 1) × avg(cov) / avg(var) ]


where k is the number of items, avg(cov) is the average covariance among the items, and avg(var) is the average variance. Note that alpha depends on both the number of items and the correlations among them. Even when the average correlation is small, the reliability coefficient can be large if the number of items is large.
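To see the dependence on the number of items (simple arithmetic based on the formula above): suppose avg(cov)/avg(var) = 0.2. With k = 5 items, alpha = (5 × 0.2) / (1 + 4 × 0.2) = 1.0 / 1.8, or about 0.56; with k = 20 items, alpha = (20 × 0.2) / (1 + 19 × 0.2) = 4.0 / 4.8, or about 0.83, even though the items are no more strongly related.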

The following must be specified to obtain a Cronbach’s alpha:

• Variable(s). To obtain Cronbach’s alpha, at least two variables must be selected.
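In command terms, a sketch (the file and item names here are hypothetical; the CRONBACH syntax is documented under Using Commands below):

STATISTICS
USE mytest
CRONBACH item1 item2 item3 item4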

Using Commands

To generate descriptive statistics, choose your data by typing USE filename, and continue with:

STATISTICS
  STEM varlist / LINES=n
  CRONBACH varlist
  SAVE / AG
  STATISTICS varlist / ALL N MIN MAX SUM MEAN SEM CIM,
        CONFI=n MEDIAN SD CV RANGE VARIANCE,
        SKEWNESS SES KURTOSIS SEK

Usage Considerations

Types of data. STATS uses only numeric data.

Print options. The output is standard for all PRINT options.

Quick Graphs. STATS does not create Quick Graphs.

Saving files. STATS saves basic statistics as either records (cases) or as variables.

BY groups. STATS analyzes data by groups.

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. STATS uses the FREQ variable, if present, to duplicate cases.

Case weights. STATS uses the WEIGHT variable, if present, to weight cases. However, STEM is not affected by the WEIGHT variable.



Examples

Example 1 Basic Statistics

This example uses the OURWORLD data file, containing one record for each of 57 countries, and requests the default set of statistics for BABYMORT (infant mortality), GNP_86 (gnp per capita in 1986), LITERACY (percentage of the population who can read), and POP_1990 (population, in millions, in 1990).

The Statistics procedure knows only that these are numeric variables—it does not know whether the mean and standard deviation are appropriate descriptors for their distributions. In other examples, we learned that the distribution of infant mortality is right-skewed and has distinct subpopulations, that GNP_86 is missing for 12.3% of the countries, that the distribution of LITERACY is left-skewed and has distinct subgroups, and that a log transformation markedly improves the symmetry of the population values. This example ignores those findings.

The input is:

STATISTICS
USE ourworld
STATISTICS babymort gnp_86 literacy pop_1990

Following is the output:

               BABYMORT      GNP_86   LITERACY   POP_1990
N of cases           57          50         57         57
Minimum          5.0000    120.0000    11.6000     0.2627
Maximum        154.0000  17680.0000   100.0000   152.5051
Mean            48.1404   4310.8000    73.5632    22.8003
Standard Dev    47.2355   4905.8773    29.7646    30.3655

For each variable, SYSTAT prints the number of cases (N of cases) with data present. Notice that the sample size for GNP_86 is 50, or 7 less than the total number of observations. For each variable, Minimum is the smallest value and Maximum, the largest. Thus, the lowest infant mortality rate is 5 deaths (per 1,000 live births), and the highest is 154 deaths. In a symmetric distribution, the mean and median are approximately the same. The median for POP_1990 is 10.354 million people (see the stem-and-leaf plot example). Here, the mean is 22.8 million—more than double the median. This estimate of the mean is quite sensitive to the extreme values in the right tail.

Standard Dev, or standard deviation, measures the spread of the values in each distribution. When the data follow a normal distribution, we expect roughly 95% of the values to fall within two standard deviations of the mean.


Example 2 Saving Basic Statistics: One Statistic and One Grouping Variable

For European, Islamic, and New World countries, we save the median infant mortality rate, gross national product, literacy rate, and 1990 population using the OURWORLD data file. The input is:

STATISTICS
USE ourworld
BY group$
SAVE mystats
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN
BY

The text results that appear on the screen are shown below (they can also be sent to a text file):

The following results are for: GROUP$ = Europe

             BABYMORT    GNP_86  LITERACY  POP_1990
N of cases         20        18        20        20
Median          6.000  9610.000    99.000    10.462

The following results are for: GROUP$ = Islamic

             BABYMORT    GNP_86  LITERACY  POP_1990
N of cases         16        12        16        16
Median        113.000   335.000    28.550    16.686

The following results are for: GROUP$ = NewWorld

             BABYMORT    GNP_86  LITERACY  POP_1990
N of cases         21        20        21        21
Median         32.000  1275.000    85.600     7.241

The MYSTATS data file (created in the SAVE step) is shown below:

Case  GROUP$    STATISTIC$   BABYMORT  GNP_86  LITERACY  POP_1990
   1  Europe    N of cases         20      18        20        20
   2  Europe    Median              6    9610        99    10.462
   3  Islamic   N of cases         16      12        16        16
   4  Islamic   Median            113     335    28.550    16.686
   5  NewWorld  N of cases         21      20        21        21
   6  NewWorld  Median             32    1275      85.6     7.241

Use a statement such as this to eliminate the sample size records:

SELECT statistic$ <> "N of cases"


Example 3 Saving Basic Statistics: Multiple Statistics and Grouping Variables

If you want to save two or more statistics for each unique cross-classification of the values of the grouping variables, SYSTAT can write the results in two ways:

• A separate record for each statistic. The values of a new variable named STATISTIC$ identify the statistics.

• One record containing all the requested statistics. SYSTAT generates variable names to label the results.

The first layout is the default; the second is obtained using:

SAVE filename / AG

As examples, we save the median, mean, and standard error of the mean for the cross-classification of type of country with government for the OURWORLD data. The nine cells for which we compute statistics are shown below (the number of countries is displayed in each cell):

            Democracy  Military  One Party
Europe             16         0          4
Islamic             4         7          5
New World          12         6          3

Note the empty cell in the first row. We illustrate both file layouts—a separate record for each statistic and one record for all results.

One record per statistic. The following commands are used to compute and save statistics for the combinations of GROUP$ and GOV$ shown in the table above:

STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats2
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY


The MYSTATS2 file with 32 cases and seven variables is shown below:

Case  GROUP$    GOV$       STATISTC$    BABYMORT    GNP_86  LITERACY  POP_1990
   1  Europe    Democracy  N of Cases     16.000    16.000    16.000    16.000
   2  Europe    Democracy  Mean            6.875  9770.000    97.250    22.427
   3  Europe    Democracy  Std. Error      0.547  1057.226     1.055     5.751
   4  Europe    Democracy  Median          6.000 10005.000    99.000     9.969
   5  Europe    OneParty   N of Cases      4.000     2.000     4.000     4.000
   6  Europe    OneParty   Mean           11.500  2045.000    98.750    20.084
   7  Europe    OneParty   Std. Error      1.708    25.000     0.250     6.036
   8  Europe    OneParty   Median         12.000  2045.000    99.000    15.995
   9  Islamic   Democracy  N of Cases      4.000     4.000     4.000     4.000
  10  Islamic   Democracy  Mean           91.000   700.000    37.300    12.761
  11  Islamic   Democracy  Std. Error     23.083   378.660     9.312     5.315
  12  Islamic   Democracy  Median         97.000   370.000    29.550    12.612
  13  Islamic   OneParty   N of Cases      5.000     3.000     5.000     5.000
  14  Islamic   OneParty   Mean          109.800  1016.667    29.720    15.355
  15  Islamic   OneParty   Std. Error     15.124   787.196     9.786     3.289
  16  Islamic   OneParty   Median        116.000   280.000    18.000    15.862
  17  Islamic   Military   N of Cases      7.000     5.000     7.000     7.000
  18  Islamic   Military   Mean          110.857   458.000    37.886    51.444
  19  Islamic   Military   Std. Error     11.801   180.039     7.779    18.678
  20  Islamic   Military   Median        116.000   350.000    29.000    51.667
  21  NewWorld  Democracy  N of Cases     12.000    12.000    12.000    12.000
  22  NewWorld  Democracy  Mean           44.667  2894.167    85.800    26.490
  23  NewWorld  Democracy  Std. Error      9.764  1085.810     3.143    11.926
  24  NewWorld  Democracy  Median         35.000  1645.000    86.800    15.102
  25  NewWorld  OneParty   N of Cases      3.000     2.000     3.000     3.000
  26  NewWorld  OneParty   Mean           14.667  2995.000    90.500     4.441
  27  NewWorld  OneParty   Std. Error      1.333  2155.000     8.251     3.153
  28  NewWorld  OneParty   Median         16.000  2995.000    98.500     2.441
  29  NewWorld  Military   N of Cases      6.000     6.000     6.000     6.000
  30  NewWorld  Military   Mean           53.167  1045.000    63.000     6.886
  31  NewWorld  Military   Std. Error     13.245   287.573    10.820     1.515
  32  NewWorld  Military   Median         55.000   780.000    60.500     5.726

The average infant mortality rate for European democratic nations is 6.875 (case 2), while the median is 6.0 (case 4).

One record for all statistics. Instead of four records (cases) for each combination of GROUP$ and GOV$, we specify AG (aggregate) to prompt SYSTAT to write one record for each cell:

STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats3 / AG
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY

The MYSTATS3 file, with 8 cases and 18 variables, is shown below. (We separated them into three panels and shortened the variable names):

Case  GROUP$    GOV$       NU1BABYM  ME1BABYM  SE1BABYM  MD1BABYM
   1  Europe    Democracy        16     6.875     0.547       6.0
   2  Europe    OneParty          4    11.500     1.708      12.0
   3  Islamic   Democracy         4    91.000    23.083      97.0
   4  Islamic   OneParty          5   109.800    15.124     116.0
   5  Islamic   Military          7   110.857    11.801     116.0
   6  NewWorld  Democracy        12    44.667     9.764      35.0
   7  NewWorld  OneParty          3    14.667     1.333      16.0
   8  NewWorld  Military          6    53.167    13.245      55.0

NU2GNP_8  ME2GNP_8  SE2GNP_8  MD2GNP_8  NU3LITER  ME3LITER
      16  9770.000  1057.226     10005        16    97.250
       2  2045.000    25.000      2045         4    98.750
       4   700.000   378.660       370         4    37.300
       3  1016.667   787.196       280         5    29.720
       5   458.000   180.039       350         7    37.886
      12  2894.167  1085.810      1645        12    85.800
       2  2995.000  2155.000      2995         3    90.500
       6  1045.000   287.573       780         6    63.000

SE3LITER  MD3LITER  NU4POP_1  ME4POP_1  SE4POP_1  MD4POP_1
   1.055      99.0        16    22.427     5.751     9.969
   0.250      99.0         4    20.084     6.036    15.995
   9.312      29.5         4    12.761     5.315    12.612
   9.786      18.0         5    15.355     3.289    15.862
   7.779      29.0         7    51.444    18.678    51.667
   3.143      86.8        12    26.490    11.926    15.102
   8.251      98.5         3     4.441     3.153     2.441
  10.820      60.5         6     6.886     1.515     5.726

Note that there are no European countries with Military governments, so no record is written.


Example 4 Stem-and-Leaf Plot

We request robust statistics for BABYMORT (infant mortality), POP_1990 (1990 population in millions), and LITERACY (percentage of the population who can read) from the OURWORLD data file. The input is:

STATISTICS
USE ourworld
STEM babymort pop_1990 literacy

The output follows:

Stem and Leaf Plot of variable: BABYMORT, N = 57

Minimum:       5.0000
Lower hinge:   7.0000
Median:       22.0000
Upper hinge:  74.0000
Maximum:     154.0000

  0 H  5666666666677777
  1    00123456668
  2 M  227
  3    028
  4    9
  5
  6    11224779
  7 H  4
  8    77
  9
 10    77
 11    066
 12    559
 13    6
 14    07
 15    4

Stem and Leaf Plot of variable: POP_1990, N = 57

Minimum:       0.2627
Lower hinge:   6.1421
Median:       10.3545
Upper hinge:  25.5665
Maximum:     152.5051

  0    00122333444
  0 H  5556667777788899
  1 M  0000034
  1    556789
  2    14
  2 H  56
  3    23
  3    79
  4
  4
  5    1
       * * * Outside Values * * *
  5    6677
  6    2
 11    48
 15    2

Stem and Leaf Plot of variable: LITERACY, N = 57

Minimum:      11.6000
Lower hinge:  55.0000
Median:       88.0000
Upper hinge:  99.0000
Maximum:     100.0000

  1    1258
  2    035689
  3    1
  4
  5 H  002556
  6    355
  7    0446
  8 M  03558
  9 H  03344457888889999999999999
 10    00

In a stem-and-leaf plot, the digits of each number are separated into a stem and a leaf. The stems are listed as a column on the left, and the leaves for each stem are in a row on the right. For infant mortality (BABYMORT), the Maximum number of babies who die in their first year of life is 154 (out of 1,000 live births). Look for this value at the bottom of the BABYMORT display. The stem for 154 is 15, and the leaf is 4. The Minimum value for this variable is 5—its leaf is 5 with a stem of 0.

The median value of 22 is printed here as the Median in the top panel and marked by an M in the plot. The hinges, marked by H’s in the plot, are 7 and 74 deaths, meaning that 25% of the countries in our sample have a death rate of 7 or less, and another 25% have a rate of 74 or higher. Furthermore, the gaps between 49 and 61 deaths and between 87 and 107 indicate that the sample does not appear homogeneous.

Focusing on the second plot, the median population size is 10.354, or more than 10 million people. One-quarter of the countries have a population of 6.142 million or less. The largest country (Brazil) has more than 152 million people. The largest stem for POP_1990 is 15, like that for BABYMORT. This 15 comes from 152.505, so the 2 is the leaf and the 0.505 is lost.

The plot for POP_1990 is very right-skewed. Notice that a real number line extends from the minimum stem of 0 (0.263) to the stem of 5 for 51 million. The values below Outside Values (stems of 5, 6, 11, and 15 with 8 leaves) do not fall along a number line, so the right tail of this distribution extends further than one would think at first glance.

The median in the final plot indicates that half of the countries in our sample have a literacy rate of 88% or better. The upper hinge is 99%, so more than one-quarter of the countries have a rate of 99% or better. In the country with the lowest rate (Somalia), only 11.6% of the people can read. The stem for 11.6 is 1 (the 10’s digit), and the leaf is 1 (the units’ digit). The 0.6 is not part of the display. For stem 10, there are two leaves that are 0—so two countries have 100% literacy rates (Finland and Norway). Notice the 11 countries (at the top of the plot) with very low rates. Is there a separate subgroup here?


Transformations

Because the distribution of POP_1990 is very skewed, it may not be suited for analyses based on normality. To find out, we transform the population values to log base 10 units using the L10 function. The input is:

STATISTICS
USE ourworld
LET logpop90=L10(pop_1990)
STEM logpop90

Following is the output:

Stem and Leaf Plot of variable: LOGPOP90, N = 57

Minimum:     -0.5806
Lower hinge:  0.7883
Median:       1.0151
Upper hinge:  1.4077
Maximum:      2.1833

 -0    5
       * * * Outside Values * * *
  0    01
  0    33
  0    445
  0 H  6667777
  0    888888899999
  1 M  00000111
  1    2222233
  1 H  445555
  1    777777
  1
  2    001
  2

For the untransformed values of the population, the stem-and-leaf plot identifies eight outliers. Here, there is only one outlier. More important, however, is the fact that the shape of the distribution for these transformed values is much more symmetric.

Subpopulations

Here, we stratify the values of LITERACY for countries grouped as European, Islamic, and New World. The input is:

STATISTICS
USE ourworld
BY group$
STEM babymort pop_1990 literacy
BY


The output follows:

The following results are for: GROUP$ = Europe

Stem and Leaf Plot of variable: LITERACY, N = 20

Minimum:      83.0000
Lower hinge:  98.0000
Median:       99.0000
Upper hinge:  99.0000
Maximum:     100.0000

  83    0
  93    0
  95    0
        * * * Outside Values * * *
  97    0
  98 H  000
  99 M  00000000000
 100    00

The following results are for: GROUP$ = Islamic

Stem and Leaf Plot of variable: LITERACY, N = 16

Minimum:      11.6000
Lower hinge:  19.0000
Median:       28.5500
Upper hinge:  53.5000
Maximum:      70.0000

  1 H  1258
  2 M  05689
  3    1
  4
  5 H  0255
  6    5
  7    0

The following results are for: GROUP$ = NewWorld

Stem and Leaf Plot of variable: LITERACY, N = 21

Minimum:      23.0000
Lower hinge:  74.0000
Median:       85.6000
Upper hinge:  94.0000
Maximum:      99.0000

  2    3
       * * * Outside Values * * *
  5    0
  5    6
  6    3
  6    5
  7 H  44
  7    6
  8    0
  8 M  558
  9 H  03444
  9    8899


The literacy rates for Europe and the Islamic nations do not even overlap. The rates range from 83% to 100% for the Europeans and 11.6% to 70% for the Islamics. Earlier, 11 countries were identified that have rates of 31% or less. From these stratified results, we learn that 10 of the countries are Islamic and 1 (Haiti) is from the New World. The Haitian rate (23%) is identified as an outlier with respect to the values of the other New World countries.

Computation

All computations are in double precision.

Algorithms

SYSTAT uses a one-pass provisional algorithm (Spicer, 1972). Wilkinson and Dallal (1977) summarize the performance of this algorithm versus those used in several statistical packages.
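A provisional algorithm updates the mean and the sum of squared deviations as each case is read, in a single pass through the data. A common form of the updating equations (a sketch of the general approach; see Spicer, 1972, for the exact formulation used):

m(1) = x(1),   s(1) = 0
m(k) = m(k–1) + (x(k) – m(k–1)) / k
s(k) = s(k–1) + (x(k) – m(k–1)) × (x(k) – m(k))

After the last of n cases, the variance is s(n) / (n – 1). Updating in this way avoids the loss of precision that the naive sum-of-squares formula suffers when the mean is large relative to the standard deviation.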

References

Spicer, C. C. (1972). Calculation of power sums of deviations about the mean. Applied Statistics, 21, 226–227.

Wilkinson, L. and Dallal, G. E. (1977). Accuracy of sample moments calculations among widely used statistical programs. The American Statistician, 31, 128–131.

Chapter 10

Design of Experiments

Herb Stenson

Design of Experiments (DOE) generates design matrices for a variety of ANOVA and mixture models. You can use Design of Experiments as both an online library and a search engine for experimental designs, saving any design to a SYSTAT file. You can run the associated experiment, add the values of a dependent variable to the same file, and analyze the experimental data by using General Linear Model (or another SYSTAT statistical procedure).

SYSTAT offers three methods for generating experimental designs: Classic DOE, the DOE Wizard, and the DESIGN command.

• Classic DOE provides a standard dialog interface for generating the most popular complete (full) and incomplete (fractional) factorial designs. Complete factorial designs can have two or three levels of each factor, with two-level designs limited to two to seven factors, and three-level designs limited to two to five factors. Incomplete designs include: Latin square designs with 3 to 12 levels per factor; selected two-level designs described by Box, Hunter, and Hunter (1978) with 3 to 11 factors and from 4 to 128 runs; 13 of the most popular Taguchi (1987) designs; all of the Plackett and Burman (1946) two-level designs with 4 to 100 runs; the 6 three-, five-, and seven-level designs described by Plackett and Burman; and the set of 10 three-level designs described by Box and Behnken (1960) in both their blocked and unblocked versions. In addition, the Lattice, Centroid, Axial, and Screening mixture designs can be generated. The number of factors (components of a mixture) can be as large as your computer’s memory allows.

• The DOE Wizard provides an alternative interface consisting of a series of questions defining the structure of the design. The wizard offers more designs than Classic DOE, including response surface and optimal designs. Optimization methods include the Fedorov, k-exchange, and coordinate exchange algorithms with three optimality criteria available. The coordinate exchange algorithms accommodate both continuous and categorical variables. The search algorithms for fractional factorial designs allow any number of levels for any factor and search for orthogonal, incomplete blocks if requested. The number of factors for factorial, central composite, and optimal designs is restricted only by your computer’s memory.

• The DESIGN command generates all designs found in Classic DOE using SYSTAT’s command language.

Designs can be replicated as many times as you want, and the runs can be randomized.

Statistical Background

The Research Problem

As an investigator interested in solving problems, you are faced with the task of identifying good solutions. You do this by using what you already know about the problem area to make a judgment about the solution(s). If you possess in-depth process knowledge, then there is little work to be done; you simply apply that knowledge to the problem at hand and derive a solution.

More common is the situation in which you have limited knowledge about the factors involved and their interrelationships, so that any conjecture would be quite uncertain and far from optimal. In these situations, the first step would be to enhance your knowledge. This is usually done by empirical investigation—that is, by systematically observing the factors and how they affect the outcome of interest. The results of these observations become the data in your study.

Process problems usually have factors, or variables, that may affect the outcome, and responses that measure the outcome of interest. The basic problem-solving approach is to develop a model that helps you understand the specific relationships between factors and responses. Such a model allows you to predict which factor values will lead to a desired response, or outcome. These empirical data provide the statistical basis used to generate models of your process.


Types of Investigation

You can think of any empirical investigation as falling into one of two broad classes: experiment or observational study. The two classes have different properties and are used to approach different types of problems.

Experiments

Experiments are studies in which the factors are under the direct control of the experimenter. That is, the experimenter assigns certain values of the factors to each run, or observation. The response(s) are recorded for each chosen combination of factor levels.

Because the factors are being manipulated by the experimenter, the experimenter can make inferences about causality. If assigning a certain temperature leads to a decrease in the output of a chemical process, you can be fairly certain that temperature really did cause the decrease because you assigned the temperature value while holding other factors constant.

Unfortunately, experiments do have a drawback in that there are some situations in which it is either impossible or impractical, or even unethical, to exercise control over the factors of interest. In those situations, an observational study must be used.

Observational Studies

Observational studies use only minimal, if any, intervention by the observer on the process. The observer merely observes and records changes in the response as the factors undergo their natural variation. No attempt is made to control the factors.

Because the factors are not under the control of the experimenter, observational studies are very limited in their ability to explain causal relationships. For example, if you observe that shoe size and scholastic achievement show a strong relationship among school children, can you infer that larger feet cause achievement? Of course not. The truth of the matter is that both variables are most likely caused by a third (unmeasured) variable—age. Older students have larger feet, and they have been in school longer. If you could have some control over shoe size, you could make sure that shoe sizes were evenly distributed across students of different ages, and you would be in a much better position to make inferences about the causal relationship between shoe size and achievement.


But of course it’s silly to speak of controlling shoe size, since you can’t change the size of people’s feet. This illustrates the strength of observational studies—they can be employed where true experimental studies are impossible, for either ethical or practical reasons.

Because the focus of this chapter is the design and analysis of experimental studies, further references to observational studies will be minimal.

The Importance of Having a Strategy

Controlling the factors in an experiment is only the beginning of effective experimental research. Once you determine that you have a problem that can be addressed by experimentation, you need to answer other crucial questions: What will your experiment look like? What levels of which factors will you measure? How will you analyze the results to convert your data to knowledge? These are the questions that SYSTAT can help you answer.

Careful planning of your experiment will give you many advantages over a poorly designed, haphazard approach to data collection. As Box, Hunter, and Hunter (1978) point out,

Frequently conclusions are easily drawn from a well-designed experiment, even when rather elementary methods of analysis are employed. Conversely, even the most sophisticated statistical analysis cannot salvage a badly designed experiment. (p. vii.)

Completeness

By using a well-designed experiment, you will be able to discover the most important relationships in your process. Lack of planning can lead to incomplete designs that leave certain questions unanswered, confounding that causes confusion of two or more effects so that they become statistically indistinguishable, and poor precision of estimates.

Efficiency

Carefully planned experiments allow you to get the information you need at a fraction of the cost of a poorly planned design. Content knowledge can be applied to select specific effects of interest, and your experimental runs can be targeted to answer just those effects. Runs are not wasted on testing effects you already understand well.


Insight

A well-designed experiment allows you to see patterns in the data that would be difficult to spot in a simple table of hastily collected values. The mathematical model you build based on your observations will be more reliable, more accurate, and more informative if you use well-chosen run points from an appropriate experimental design.

The Role of Experimental Design in Research

Experimental design is the interface between your question and the “real world.” The design tells you how much data you will need to collect, what factor levels to use for the run points, and how to analyze the results to get a useful model of the process. The model you derive from your experiment can then be applied to the problem at hand, enhancing your knowledge and allowing you to confidently formulate a solution.

The figure below illustrates the flow of knowledge in experimental research. Notice that the diagram is circular—you start with some knowledge, formulate a research question, and perform the research; then the knowledge you gained from the research is used to formulate new research questions, new designs, and so on. As you go through the iterations, you should find that your information increases in both quantity and quality.

[Figure: circular flow from Prior Knowledge to Research Question, Experimental Design, Data Collection, Analysis, Interpretation, and New Knowledge, feeding back into new research questions]

Types of Experimental Designs

There is a wide variety of experimental designs, each of which addresses a different type of research problem. These designs tend to fall into broad classes, which can be summarized as follows:

• Factorial designs. These designs are used to identify important effects in your process.


• Response surface designs. These designs are useful when you want to find the combination of factor values that gives the highest (or lowest) response.

• Mixture designs. These designs are useful when you want to find the ideal proportions of ingredients for a mixture process. Mixture designs take into account the fact that all the component proportions must sum to 1.0.

• Optimal designs. These designs are useful when you have enough information available to give a very detailed specification of the model you want to test. Because optimal designs are very flexible, you can use them in situations where no standard design is available. Optimal designs are also useful when you want to have explicit control over the type of efficiency maximized by the design.

Factorial Designs

In investigating the factors that affect a certain process, the basic building blocks of your investigation are observations of the system under different conditions. You vary the factors under your control and measure what happens to the outcome of the process. The naive inquirer might use a haphazard, trial-and-error approach to testing the factors. Of course, this approach can take a long time and many observations, or runs, to give reasonable results (if it does at all), and, in fact, it may fail to reveal important effects because of the lack of an investigative strategy.

Someone more familiar with scientific methodology might make systematic comparisons of various levels of each factor, holding the others constant. However, while this approach is more reliable than the trial-and-error approach, it can still cause you to overlook important effects. Consider the following hypothetical response plot. The contours indicate points of equal response.

[Contour plot of a hypothetical response surface over factors X1 and X2; the contours mark points of equal response]


If you tried the one-at-a-time approach, your ability to accurately measure the effects of the variables would depend on the initial settings you chose. For example, if you chose the point indicated by the horizontal line as your fixed starting value for x2 as you varied x1, you would conclude that the maximum response occurs when x1 = 47. Then, you would fix x1 at 47 and vary x2, concluding that the maximum response occurs when x2 = 98. The two following figures illustrate this problem. However, it is clear from the previous contours that the maximum effect occurs where x1 = 100 and x2 = 220, or perhaps even somewhere outside the range that you’ve measured.

[Line plots of the response along X1 (with X2 held constant at 98.0) and along X2 (with X1 held constant at 47.0)]

This illustrates the importance of considering the factors simultaneously. The only way to find the true effects of the factors on the response variable is to take measurements at carefully planned combinations of the factor levels, as shown below. Such designs are called factorial designs. A factorial design that could be used to explore the hypothetical process would take measurements at high, medium, and low levels of each factor, with all combinations of levels used in the design.

[Contour plot of the response surface with factorial design points at all combinations of low, medium, and high levels of X1 and X2]


Factorial designs can be classified into two broad types: full (or complete) factorials and fractional factorials. Full factorials use observations at all combinations of all factor levels. Full factorials give a lot of insight into the effects of the factors, particularly interactions, or joint effects of variables. Unfortunately, they often require a large number of runs, which means that they can be expensive. Fractional factorials use only some combinations of factor levels. This means that they are efficient, requiring fewer runs than their full factorial counterpart. However, to gain this efficiency, they sacrifice some (or all) of their ability to measure interaction effects. This makes them ill-suited to exploring the details of complex processes.
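To make the cost concrete (simple arithmetic, not from the original text): a full factorial with two three-level factors needs 3 × 3 = 9 runs, but five three-level factors need 3^5 = 243 runs, while a half-fraction of a two-level, seven-factor design needs only 2^7 / 2 = 64 runs.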

Fractional Factorial Design Types

The following types of fractional factorial designs can be generated:



• Homogeneous fractional. These are fractional designs in which all factors have the same number of levels.

• Mixed-level fractional. These are fractional designs in which factors have different numbers of levels.

• Box-Hunter. This is a set of fractional designs for two-level factors that can be specified based on the number of factors and the number of runs (as a power of 2).

• Plackett-Burman. These designs are saturated (or nearly saturated) fractional factorial designs based on orthogonal arrays. They are very efficient for estimating main effects but rely on the absence of two-factor interactions.

• Taguchi. These designs are orthogonal arrays allowing for a maximum number of main effects to be estimated from a minimum number of runs in the experiment while allowing for differences in the number of factor levels.

• Latin square. These designs are useful when there are restrictions on randomization, where you need to isolate the effects of one or more blocking (or "nuisance") factors. In Latin square designs, all factors must have the same number of levels. Graeco-Latin squares and hyper-Graeco-Latin squares can also be generated when you need to isolate the effects of more than one "nuisance" variable.

Analysis of Factorial Designs

Factorial designs are usually analyzed as linear models. The models available for a design depend on the number of factors and their levels and whether the design is full or fractional.

The simplest models are main-effects models. A main-effects model is summarized by the following equation:

y = µ + αi + βj + … + ε

where y is the response variable and αi, βj, … represent the treatment effects of the factors. This model assumes that all interactions are negligible. These models are useful for describing very simple processes and for analyzing fractional designs of low resolution. They are also useful for analyzing screening designs, where the goal is not necessarily to model all effects realistically but merely to identify influential factors for further study.

The next level of model complexity, the second-order model, involves adding two-factor interaction terms to the equation. Following is an example for a two-factor model:


yijk = µ + αi + βj + (αβ)ij + εijk

This model allows you to explore joint effects of factors taken in pairs. For example, the (αβ)ij term allows you to see whether the effect of the α factor on y depends on the level of β. If this term is significant, you can conclude that the effect of α does indeed depend on the level of β.

Response Surface Designs

There are many situations in which it is not enough to know simply which factors affect a process. You need to know exactly what combination of values for the factors produces the desired result. In other words, you want to optimize your process in terms of the outcome of interest. For example, you may want to find the best combination of temperature and pressure for a chemical process, or you may want to identify the ideal soak time and developer concentration for a photographic development process.

This is typically done by calculating a model of the response based on the factors of interest. The shape of the surface is examined in order to identify the point of maximum response (or minimum response for minimization problems). Such a model is called a response surface, and experimental designs for finding such models are called response surface designs.

In many cases, the response surface must be considered in parts because when you consider all of the possible values for the factors involved, the surface can be quite complex. Because of this complexity, it is often not possible to build a mathematical model that truly reflects the shape of the surface. Fortunately, when you look at restricted portions of the response surface, they can usually be modeled successfully with relatively simple equations.

To take advantage of this, experimenters often use a two-stage approach to modeling response surfaces. In the first stage, a “neighborhood” in the space defined by the factors is chosen and a simple linear model is constructed. If the linear model fits the data in that neighborhood, the model is used to find a direction of steepest ascent (or descent for minimization problems). The factor limits that define the neighborhood are then adjusted in the appropriate direction, defining a new neighborhood, and another linear design is used. This continues until the simple design no longer fits the data in that region. Then a more complex model is calculated, and an estimate of the maximum (or minimum) response point can be found. (Occasionally, it may happen that the surface is linear up to the boundary of your factor space, in which case you simply use the linear model to choose the boundary point that maximizes your response.)

Variance of Estimates and Rotatability

In most cases, the purpose of building a mathematical model of a process is to make predictions about what would happen given a particular set of conditions that you have not measured directly. This is particularly true in the case of a response surface experiment—the surface you calculate is essentially a set of predictions for all possible combinations within the limits of your factor measurements. With an adequate model and careful measurements, you can usually do a reasonably good job of predicting response throughout the response surface neighborhood of interest.

When you make such predictions, however, you must accept the fact that the model is not perfect—there are often imperfections in your measurements, and the mathematical model almost never fits the true response function exactly. Thus, if you were to conduct the experiment repeatedly, you would get slightly different answers each time. The degree to which your predictions are expected to differ across multiple experiments is known as the variance of prediction, or V(ŷ). The value of V(ŷ) depends on the design used and on where in the factor space you are calculating a prediction. V(ŷ) increases as you get further from the observed data points. Of course, you would like the portion of the design that produces the most precise predictions to be near the optimum that you are trying to locate. Unfortunately, you usually don’t really know where the optimal value is (or in what direction it lies) when you start.

To deal with the fact that you don’t know exactly where the optimum is, you can use designs in which the variance of prediction depends only on the distance from the center of the design, not on the direction from the center. Such designs are called rotatable designs. First-order (linear) orthogonal models are always rotatable. Some central composite designs are rotatable. (In SYSTAT, the distance from the center is automatically chosen to ensure rotatability for unblocked designs. However, for blocked designs, the distance is chosen to ensure orthogonality of blocks, which may lead to nonrotatable designs.) In addition, some Box-Behnken designs are rotatable, and most are nearly rotatable (meaning that directional differences in prediction variance exist, but they are small). In general, three-level factorial response surface designs are not rotatable. This means that care should be used before employing such a design—the precision of your predictions may depend on the direction in which the optimum lies.

Response Surface Design Types

Two types of response surface designs are available:

Central composite. These designs combine a 2^k factorial design (or a fraction thereof) with a set of 2k axial points and one or more center points, which allow quadratic surfaces to be modeled. These designs are efficient, requiring fewer runs than the corresponding full factorial design. However, they require each factor to be measured at five different levels.

Box-Behnken. These are second-order designs (which allow estimation of quadratic effects) based on combining a two-level factorial design with an incomplete block design. For these designs, factors need to be measured at only three levels. Box-Behnken designs are also quite efficient in requiring relatively few runs.
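For example (arithmetic based on the descriptions above): with k = 3 factors, a central composite design uses 2^3 = 8 factorial points, 2 × 3 = 6 axial points, and one or more center points, roughly 15 runs in all, even though each factor is measured at five levels; a full factorial in which each factor took all five levels would need 5^3 = 125 runs.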

Analysis of Response Surface Designs

Response surface designs are analyzed with either a linear or a quadratic model, depending on the purpose of the design. If the purpose is hill-climbing, a linear model is usually adequate. If the purpose is to locate the optimum, then a quadratic model is needed.

The linear model takes the form

y = β0 + β1x1 + … + βkxk

where k is the number of factors in the design. Similarly, the quadratic model is expressed as

y = β0 + β1x1 + … + βkxk + β11x1² + β12x1x2 + … + β(k–1)k x(k–1)xk + βkkxk²

In either case, the estimated equation defines the response surface. This surface is often plotted, either as a 3-D surface plot or a 2-D contour plot, to help the investigator visualize the shape of the response surface.

Mixture Designs

Suppose that you are trying to determine the best blend of ingredients or components for your product. Initially, this appears to be a problem that can be addressed with a straightforward response surface design. However, upon closer examination, you discover that there is an additional consideration in this problem—the amounts of the ingredients are inextricably linked with each other. For example, suppose that you are trying to determine the best combination of pineapple and orange juices for a fruit punch. Increasing the amount of orange juice means that the amount of pineapple juice must be decreased, relative to the whole. (Of course, you could add more pineapple juice as you add more orange juice, but this would simply increase the total amount of punch. It would not alter the fundamental quality of the punch.) The problem is shown in the following plot.


By specifying that the components are ingredients in a mixture, you limit the values that the amounts of the components can take. All of the points corresponding to one-gallon blends lie on the line shown in the plot. You can describe the constraint with the equation OJ + PJ = 1 gallon.

Now, suppose that you decide to add a third type of juice, watermelon juice, to the blend. Of course, you still want the total amount of juice to be one gallon, but with three factors you have a bit more flexibility in the mixtures. For example, if you want to increase the amount of orange juice, you can decrease the amount of pineapple juice, the amount of watermelon juice, or both. The constraint now becomes OJ + PJ + WJ = 1 gallon. The combinations of juice amounts that satisfy this constraint lie in a triangular plane within the unconstrained factor space.

[Two figures: feasible one-gallon blends of Orange Juice (OJ) and Pineapple Juice (PJ) lying on the line OJ + PJ = 1; and, with Watermelon Juice (WJ) added, feasible blends lying on the triangular plane OJ + PJ + WJ = 1]


The feasible values for a mixture comprise a (k – 1)-dimensional region within the k-dimensional factor space (indicated by the shaded triangle). This region is called a simplex. The pure mixtures (made of only one component) are at the corners of the simplex, and binary mixtures (mixtures of only two components) are along the edges. The concept of the mixture simplex extends to higher-dimensional problems as well—the simplex for a four-component problem is a three-dimensional regular tetrahedron and so on.

To generalize, you measure component amounts as proportions of the whole rather than as absolute amounts. When you take this approach, it is clear that increasing the proportion of one ingredient necessarily decreases the proportion(s) of one or more of the others. There is a constraint that the sum of the ingredient proportions must equal the whole. In the case of proportions, the whole would be denoted by 1.0, and the constraint is expressed as

x1 + x2 + … + xk = 1.0

where x1, …, xk are the proportions of each of the k components in the mixture. Because of this constraint, such problems require a special approach. This approach includes using a special class of experimental designs, called mixture designs. These designs take into account the fact that the component amounts must sum to 1.0.

Unconstrained Mixture Designs

Unconstrained mixture designs allow factor levels to vary from the minimum to the maximum value for the mixture. Four unconstrained designs are available. See Cornell (1990) for more information on each.

Lattice. Lattice designs allow you to specify the number of levels or the number of values that each component (factor) assumes, including 0 and 1. The selection of levels has no effect for the other three types of designs available because the number of factors determines the number of levels for each of them. As Cornell (1990) points out, the vast majority of mixture research employs lattice models; however, the other three types included here are useful in specific situations.

Centroid. Centroid designs consist of every (non-empty) subset of the components, but only with mixtures in which the components appear in equal proportions. Thus, if we asked for a centroid design with four factors (components), the mixtures in the model would consist of all permutations of the set (1,0,0,0), all permutations of the set (1/2,1/2,0,0), all permutations of the set (1/3,1/3,1/3,0), and the set (1/4,1/4,1/4,1/4).



Thus, the number of distinct points is 1 less than 2 raised to the q power, where q is the number of components. Centroid designs are useful for investigating mixtures where incomplete mixtures (with at least one component absent) are of primary importance.

Axial. In an axial design with m components, each run consists of at least (m - 1) equal proportions of the components. These designs include: mixtures composed of one component; mixtures composed of (m - 1) components in equal proportions; and mixtures with equal proportions of all components. Thus, if we asked for an axial design with four factors (components), the mixtures in the model would consist of all permutations of the set (1,0,0,0), all permutations of the set (5/8,1/8,1/8,1/8), all permutations of the set (0,1/3,1/3,1/3), and the set (1/4,1/4,1/4,1/4).

Screen. Screening designs are reduced axial designs, omitting the mixtures that contain all but one component. Thus, if we asked for a screening design with four factors (components), the mixtures in the model would consist of all permutations of the set (1,0,0,0), all permutations of the set (5/8,1/8,1/8,1/8), and the set (1/4,1/4,1/4,1/4). Screening designs enable you to single out unimportant components from an array of many potential components.
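As a concrete check of the centroid counting rule above: for q = 3 components, the design consists of the three permutations of (1, 0, 0), the three permutations of (1/2, 1/2, 0), and the single point (1/3, 1/3, 1/3), for a total of 2^3 – 1 = 7 distinct points.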

Constrained Mixture Designs

You can also consider mixture problems with additional constraints on the mixture values. For example, suppose that orange juice is much cheaper than the other kinds of juice, and you therefore decide that your punch must contain at least 50% orange juice. However, you also want to make sure that your punch is sufficiently distinct from pure orange juice, so you place another restriction: orange juice can make up no more than 75% of the punch. These criteria place an additional constraint on your mixture, specifically 0.5 <= OJ <= 0.75. This restricts the range of feasible solutions in the simplex, as shown below by the outlined area.

[Figure: triangular (simplex) plot of the punch components OJ, PJ, and WJ, each axis scaled from 0.0 to 1.0, with the feasible region outlined.]


Analysis of Mixture Designs

In mixture experiments, you are usually trying to find an optimal mixture, according to some criterion. In this sense, mixture models are related to response surface models. However, the constraint on the sum of the component values takes away one degree of freedom from the model. This can be accommodated by reparameterizing the linear model so that there is no intercept term. (This is also known as a Scheffé model.) Thus, the linear model is specified as

y = β1x1 + β2x2 + … + βkxk

and the quadratic form is

y = β1x1 + β2x2 + … + βkxk + β12x1x2 + … + β(k−1)k x(k−1)xk

for mixtures with k components. Notice that the quadratic form does not include squared terms. Such terms would be redundant, since the square of a component can be reexpressed as a function of the linear and cross-product terms. For example,

x1^2 = x1(1 − x2 − … − xk) = x1 − x1x2 − … − x1xk

The model is estimated using standard general linear modeling techniques. The parameters can be tested (with a sufficient number of observations), and they can be used to define the response function. The plot of this function can give visual insights into the process under investigation and allow you to select the optimal combination of components for your mixture.
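Because the Scheffé model is an ordinary linear model without an intercept, it can be fit by least squares once the cross-product columns are formed. The following numpy sketch is our own illustration, with made-up proportions and hypothetical responses, not SYSTAT code or output:

import numpy as np

# Hypothetical mixture data: proportions of three components and a response.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5],
              [1/3, 1/3, 1/3]])
y = np.array([11.0, 9.0, 16.0, 10.0, 15.0, 14.0, 13.0])

# Quadratic Scheffe design matrix: linear terms plus all cross-products,
# with no intercept column and no squared terms.
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
D = np.column_stack([X] + [X[:, i] * X[:, j] for i, j in pairs])

# No-intercept least squares gives the beta estimates.
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
print(np.round(beta, 3))   # beta1, beta2, beta3, beta12, beta13, beta23

Plotting the fitted surface over the simplex then gives the kind of visual insight described above.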

Optimal Designs

In going through the process of designing experiments, you might ask yourself, “What is the advantage of a designed experiment over a more haphazard approach to collecting data?” The answer is that a carefully designed experiment will allow you to estimate the specific model you have in mind for the process, and it will allow you to do so efficiently. Efficiency in this context means that the model can be estimated with high (or at least adequate) precision, using a manageable number of runs.

Through the years, statisticians have identified useful classes of research problems and developed efficient experimental designs for each. Such classes of problems include identifying important effects within a set of two-level factors (Box-Hunter designs), optimizing a process using a quadratic surface (central composite or Box-Behnken designs), or optimizing a mixture process (mixture designs).

One of the standard designs may be appropriate for your research needs. Sometimes, however, your research problem doesn’t quite fit into the mold of these standard designs. Perhaps you have specific ideas about which terms you want to include in your model, or perhaps you can’t afford the number of runs called for in the standard design. The standard design’s efficiency is based on assumptions about the model to be estimated, the number of runs to be collected, and so on. When you try to run experiments that violate these assumptions, you lose some of the efficiency of the design.

You may now be asking yourself, “Well, then, how do I find a design for my idiosyncratic experiment? Isn’t there a way that I can specify exactly what I want and get an efficient design to test it?” The answer is yes—this is where the techniques of optimal experimental design (often abbreviated to optimal design) come in. Optimal design methods allow you to specify your model exactly (including number of runs) and to choose a criterion for measuring efficiency. The design problem is then solved by mathematical programming to find a design that maximizes the efficiency of the design, given your specifications. The use of the word optimal to describe designs generated in this manner means that we are optimizing the design for maximum efficiency relative to the desired efficiency criterion.


Optimization Methods

First, you need to choose an optimization method. Different mathematical methods (algorithms) are available for finding the design that optimizes the efficiency criterion. Some of these methods require a candidate set of design points from which to choose the points for the optimal design. Other methods do not require such a candidate set.

Three optimization methods are available:

Fedorov method. This method requires a predefined candidate set. It starts with an initial design, and at each step it identifies a pair of points—one from the design and one from the candidate set—to be exchanged. That is, the candidate point replaces the selected design point to form a new design. The pair exchanged is the pair that shows the greatest reduction in the optimality criterion when exchanged. This process repeats until the algorithm converges.

K-exchange method. This method starts with a set of candidate points and an initial design and exchanges the worst k points at each iteration in order to minimize the objective function. Candidate points must come from a previously generated design.

Coordinate exchange method. This method does not require a candidate set. It starts with an initial design based on a random starting point. At each iteration, k design points are identified for exchange, and the coordinates of these points are adjusted one by one to minimize the objective function. The fact that this method does not require a candidate set makes it useful for problems with large factor spaces. Another advantage of this method is that one can use either continuous or categorical variables, or a mixture of both, in the model.

For the designs that require a candidate set, that set must be defined before you generate your optimal design. The set of points must be in a file that was generated and saved by the Design Wizard. You may eliminate undesirable rows before using the file in the Fedorov or k-exchange method to generate an optimal design based on the candidate design. The same requirements hold for any so-called starting design in a file that is submitted by the user.

It is important to remember that these methods are iterative, based on starting designs with a random component to them. Therefore, they will not always converge on a design that is absolutely optimal—they may fall into a local minimum or saddle point, or they may simply fail to converge within the allowed number of iterations. That is why each method allows you to generate a design multiple times based on different starting designs.
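To make the idea concrete, here is a toy Python sketch of coordinate exchange for a D-optimal main-effects design. It is our own simplified illustration, not SYSTAT's algorithm: factor levels are restricted to ±1, so "adjusting a coordinate" reduces to trying the other level.

import numpy as np

rng = np.random.default_rng(131)

def model_matrix(design):
    # Intercept plus main effects for a +/-1 coded design.
    return np.column_stack([np.ones(len(design)), design])

def log_det(design):
    # log|X'X|: larger is better (equivalent to a smaller D-criterion).
    X = model_matrix(design)
    sign, val = np.linalg.slogdet(X.T @ X)
    return val if sign > 0 else -np.inf

def coordinate_exchange(n_runs, n_factors, n_passes=50):
    design = rng.choice([-1.0, 1.0], size=(n_runs, n_factors))
    while not np.isfinite(log_det(design)):       # avoid a singular start
        design = rng.choice([-1.0, 1.0], size=(n_runs, n_factors))
    best = log_det(design)
    for _ in range(n_passes):
        improved = False
        for i in range(n_runs):                   # visit each design point
            for j in range(n_factors):            # and each coordinate
                design[i, j] = -design[i, j]      # try the other level
                trial = log_det(design)
                if trial > best:
                    best, improved = trial, True  # keep the exchange
                else:
                    design[i, j] = -design[i, j]  # revert
        if not improved:
            break                                 # converged
    return design, best

design, crit = coordinate_exchange(n_runs=6, n_factors=3)
print(design)
print(crit)

Note how the random starting design and the greedy, pass-by-pass improvement mirror the convergence caveats described above: a different seed can end in a different local optimum.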


Efficiency Criteria for Optimal Designs

You may have noticed that no explicit mathematical definition of efficiency was given in the discussion above. This is because there are several different ways of defining and measuring efficiency of designs. Because the object of optimal design is to minimize a specific efficiency criterion, the values used to measure efficiency are also called optimality criteria in this context. You can choose from three optimality criteria:

D-optimality. This criterion measures the generalized variance of the parameter estimates in the model. The generalized variance is the determinant of the parameter dispersion matrix (X′X)⁻¹, where X is the design matrix. The square root of this value is proportional to the volume of the confidence ellipsoid about the parameter estimates. The design is generated to minimize D. (The D stands for determinant.)

A-optimality. This criterion measures the average (or, equivalently, the sum) of the variances for the parameter estimates. Minimizing this criterion, measured as the trace of the parameter dispersion matrix, A = trace[(X′X)⁻¹], yields the design with the smallest average variance for the parameter estimates. The design is generated to minimize A. (The A stands for average.)

G-optimality. This criterion focuses on the variance of predicted response values rather than the variance of the parameter estimates. The variance of predictions varies across the factor space (that is, as different levels of the factors are examined). This criterion specifically measures the maximum variance of prediction within the factor space, and seeks to minimize this maximum value, G = max over x in χ of v(x), where v(x) is the variance of the prediction at design point x. (The G stands for global.)

In most circumstances, these criteria will lead to similar designs. G-optimality can take more time to compute, since each iteration involves both a maximization and a minimization. In many situations, D-optimality is a good choice because it is fast and invariant to linear transformations. A-optimality is especially sensitive to the scale of continuous factors, so factors with very different scales may lead to problems in generating a design.
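All three criteria are simple functions of the model matrix, so they are easy to compute for any candidate design. A minimal numpy sketch (our own illustration; the G-criterion here is evaluated over the design points themselves as a stand-in for the full factor space χ):

import numpy as np

def criteria(X):
    # D-, A-, and G-criteria for a model matrix X (rows = runs,
    # columns = model terms).
    M = np.linalg.inv(X.T @ X)              # parameter dispersion matrix
    D = np.linalg.det(M)                    # generalized variance
    A = np.trace(M)                         # sum of parameter variances
    v = np.einsum('ij,jk,ik->i', X, M, X)   # v(x) = x'(X'X)^-1 x per run
    G = v.max()                             # worst prediction variance
    return D, A, G

# Example: full 2^3 factorial with an intercept and three main effects.
base = np.array([[a, b, c] for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)])
X = np.column_stack([np.ones(8), base])
print(criteria(X))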

Analysis of Optimal Designs

Analysis of optimal designs closely parallels the analysis of other experimental designs. The general linear model (GLM) is used to build an equation for the model and estimate and test effects.


There is one important difference, however. For an optimal experiment, you specify the model for the experiment before you generate the design. This is necessary to ensure that the design is optimized for your particular model, rather than an assumed model (such as a complete factorial or a full quadratic model). This means that for optimal designs, the form of the equation to be estimated is an integral part of the experimental design.

Let’s consider a simple example: suppose that you have three two-level factors (call them A, B, and C), and you want to perform tests of the following effects: A, B, C, AB, and AC. You could use the usual 2^3 factorial design, which would give you the following runs:

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

Now, suppose that you want to estimate the model in only six runs. There is no standard design for this, so you must use an optimal design. Using the coordinate exchange method with the D-optimality criterion yields the following design:

A B C
1 1 0
0 0 0
1 1 1
1 0 0
0 0 1
0 1 0


However, if we change the form of the model slightly, so that we are asking for the A, B, C, AB, and BC effects, we get a slightly different design:

A B C
1 0 1
0 1 1
0 1 0
0 0 0
0 0 1
1 1 0

In general, a design generated based on one model will not be a good design for a different model. The implication of this is that the model used to generate the design places limits on the model that you estimate in your data analysis. In most cases, the two models will be the same, although you may sometimes want to omit terms from the analysis that were in the original model used to generate the design.

Choosing a Design

Deciding which design to use is an important part of the experimental design process. The answer will depend on various aspects of the research problem at hand, such as:

• What type of knowledge do you want to gain from the research?

• How much background knowledge can you bring to bear on the question?

• How many factors are involved in the process of interest?

• How many different ways do you want to measure the outcome?

• What are the constraints, if any, on your factors?

• What is your criterion for the “best design”?

• Will you have to use the results of the experiment to convince others of your conclusions? What will they find convincing?

• What are the constraints on your research process in terms of time, money, human resources, and so forth?


Defining the Question

Successful research depends on how well the problem is formulated. It does no good to run an elaborate, highly efficient experiment if it gives you the answer to the wrong question. Spend some time and effort carefully considering your problem. Doing so will help to ensure that your experimental design will give you the information you need to solve your problem.

Identifying Candidates for Factors and Responses

In most cases, it is most efficient to focus on only the important factors and ignore the inconsequential ones. However, you shouldn’t be too eager to eliminate factors from your experiment. Leaving out even one crucial factor can seriously hinder your ability to find true effects and can lead to highly misleading results. If there is any doubt about a factor, it is usually best to include it. Once you have empirical confirmation that its effect really is negligible, you can delete it from subsequent models.

If there is not much background knowledge available to help in your factor selection, you should consider employing a screening design. These designs allow you to test for main effects with a small number of runs. Such designs allow you to examine a large number of candidate factors without exhausting your resources. Once you have identified a set of interesting factors, you can move on to a fuller design to test for more complex effects.

Setting Priorities

Consider what is really important in your study. Do you need the highest precision possible, regardless of what it takes? Or are you more concerned about controlling costs, even if it means settling for an approximate model? Would the cost of overlooking an effect be greater than the cost of including the effect in your model? Giving careful thought to questions like these will help you choose a design that satisfies your criteria and helps you accomplish your research goals.


Design of Experiments in SYSTAT

Design of Experiments Wizard

To access the Design of Experiments Wizard, from the menus choose:

Statistics
Design of Experiments
Wizard…

The Design of Experiments Wizard offers nine different design types: General Factorial, Box-Hunter, Latin Square, Taguchi, Plackett-Burman, Box-Behnken, Central Composite, Optimal, and Mixture Model. After selecting a design type, a series of dialogs prompts for design specifications before generating a final design matrix. These specifications typically include the number of factors involved, as well as the number of levels for each factor.

Replications. For any design created by the Design Wizard, replications can be saved to a file. By default, SYSTAT saves the design without replications. If you request n copies of a design, the complete design will be repeated n times in the saved file (global replication). If local replications are desired, simply sort the saved file on the variable named RUN to group replications by run number. Replications do not appear on the output screen.

Note: It is not necessary to have a data file open to use Design of Experiments.


Classic Design of Experiments

To access the classic Design of Experiments dialog box, from the menus choose:

Statistics
Design of Experiments
Classic…

Classic DOE offers a subset of the designs available using the Design Wizard, including factorial, Box-Hunter, Latin Square, Taguchi, Plackett, Box-Behnken, and mixture designs. In contrast to the wizard, classic DOE uses a single dialog to define all design settings. The following options are available:

Levels. For factorial, Latin, and mixture designs, this is the number of levels for the factors. Factorial designs are limited to either two or three levels per factor.

Factors. For factorial, BoxHunter, BoxBehnken, and lattice mixture designs, this is the number of factors, or independent variables.

Runs. For Plackett and BoxHunter designs, this is the number of runs.

Replications. For all designs except BoxBehnken and mixture, this is the number of replications.

Mixture type. For mixture designs, you can specify a mixture type from the drop-down list. Select either Centroid, Lattice, Axial, or Screen.

Taguchi type. For Taguchi designs, you can select a Taguchi type from the drop-down list.


Save file. This option saves the design to a file.

Print Options. The following two options are available:

• Use letters for labels. Labels the design factors with letters instead of numbers.

• Print Latin square. For Latin square designs, you can print the Latin square.

Design Options. The following two options are available:

• Randomize. Randomizes the order of experimentation.

• Include blocking factor. For BoxBehnken designs, you can include a blocking factor.

Using Commands

With commands:

DESIGN
SAVE filename
FACTORIAL / FACTORS=n REPS=n LETTERS RAND, LEVELS = 2 or 3
BOXHUNTER / FACTORS=n RUNS=n REPS=n LETTERS RAND
LATIN / LEVELS=n SQUARE REPS=n LETTERS RAND
TAGUCHI / TYPE=design REPS=n LETTERS RAND
PLACKETT / RUNS=n REPS=n LETTERS RAND
BOXBEHNKEN / FACTORS=n BLOCK LETTERS RAND
MIXTURE / TYPE=LATTICE or CENTROID or AXIAL or SCREEN, FACTORS=n LEVELS=n RAND LETTERS

Note: Some designs generated by the Design Wizard cannot be created using commands.

Usage Considerations

Types of data. No data file is needed to use Design of Experiments.

Print options. For Box-Hunter designs, using PRINT=LONG in Classic DOE yields a listing of the generators (confounded effects) for the design. For Taguchi designs, a table defining the interaction is available.

Quick Graphs. No Quick Graphs are produced.

Saving files. The design can be saved to a file.



BY groups. Analysis by groups is not available.

Bootstrapping. Bootstrapping is not available in this procedure.

Case weights. Case weighting is not available in Design of Experiments.

Examples

Example 1 Full Factorial Designs

The DOE Wizard input for a (2 x 2 x 2) design is:


Wizard Prompt Response

Design Type General Factorial

Choose a type of design: Full Factorial Design

Divide the design into incomplete blocks? No

Enter the number of factors desired: 3

Is the number of levels to be the same for all factors? Yes

Enter number of levels: 2

Display the factors for this design? Yes

Save the design to a file? No

The output is:

Factorial Design: 3 Factors, 8 Runs

Factor

RUN A B C

1 0 0 0

2 0 0 1

3 0 1 0

4 0 1 1

5 1 0 0

6 1 0 1

7 1 1 0

8 1 1 1


To generate this design using commands, the input is:

DESIGN
FACTORIAL / FACTORS=3 LEVELS=2

Example 2 Fractional Factorial Design

The DOE Wizard input for a (2 x 2 x 2 x 2) fractional factorial design in which the two-way interactions A*B and A*C must be estimable is:

Wizard Prompt Response

Design Type General Factorial

Choose a type of design: Fractional Factorial Design

Divide the design into incomplete blocks? No

Enter the number of factors desired: 4

Is the number of levels to be the same for all factors? Yes

Enter number of levels: 2

Please choose: Automatically find the smallest design consistent with my criteria

Choose a Search Criterion Require that specific effects be estimable

May main effects be confounded with 2-factor interactions? Yes

Are there any specific effects to be estimated other than the effects already cited? Yes

List them by using the appropriate factor letters separated by asterisks for interactions. A*B A*C

Are there any effects that are not to be estimated, but yet should not be confounded with effects that are to be estimated?

Yes

List them by using the appropriate factor letters separated by asterisks for interactions. A*D

Display the factors for this design? Yes

Save the design to a file? No

Display another fraction of this design? No

Find another design with same parameters? No


The output is:

Complete Defining Relation

Identity =

B * C * D

The design resolution is 3

Design Generators

Identity =

B * C * D

Fractional Factorial Design: 4 Factors, 8 Runs

Factor

Run A B C D

1 0 0 0 0

2 0 0 1 1

3 0 1 0 1

4 0 1 1 0

5 1 0 0 0

6 1 0 1 1

7 1 1 0 1

8 1 1 1 0

SYSTAT assumes that the main effects of any design should always be estimated. Notice, however, that the defining relation avoids confounding the interaction of A with any of the other factors, as requested by specifying the effects to be estimated (A*B, A*C) and effects that should not be confounded even though they are not to be estimated (A*D).


Example 3 Box-Hunter Fractional Factorial Design

To generate a (2 x 2 x 2) Box-Hunter fractional factorial, the input is:

Wizard Prompt Response

Design Type Box-Hunter

Enter the number of factors desired: 3

Enter the total number of cells for the entire design: 4

Display the factors for this design? Yes

Save the design to a file? No

The resulting output is:

Complete Defining Relation

Identity =

A * B * C

The design resolution is 3

Design Generators

Identity =

A * B * C

Box-Hunter Design: 4 Runs, 3 Factors

Factor

RUN A B C

1 -1 -1 1

2 -1 1 -1

3 1 -1 -1

4 1 1 1

To generate this design using commands, enter the following:

DESIGN

BOXHUNTER / FACTORS=3


Aliases

For 7 two-level factors, the number of cells (runs) for a complete factorial is 2^7 = 128. The following example shows the smallest fractional factorial for estimating main effects. The design codes for the first three factors generate the last four. The input is:


Wizard Prompt Response

Design Type Box-Hunter

Enter the number of factors desired: 7

Enter the total number of cells for the entire design: 8

Display the factors for this design? Yes

Save the design to a file? No

The output is:

Complete Defining Relation

Identity =

A * B * D =

A * C * E =

B * C * D * E =

B * C * F =

A * C * D * F =

A * B * E * F =

D * E * F =

A * B * C * G =

C * D * G =

B * E * G =

A * D * E * G =

A * F * G =

B * D * F * G =

C * E * F * G =

A * B * C * D * E * F * G

The design resolution is 3

Design Generators

Identity =
A * B * D =
A * C * E =
B * C * F =
A * B * C * G

Box-Hunter Design: 8 Runs, 7 Factors

Factor

RUN A B C D E F G
1 -1 -1 -1 1 1 1 -1
2 -1 -1 1 1 -1 -1 1
3 -1 1 -1 -1 1 -1 1
4 -1 1 1 -1 -1 1 -1
5 1 -1 -1 -1 -1 1 1
6 1 -1 1 -1 1 -1 -1
7 1 1 -1 1 -1 -1 -1
8 1 1 1 1 1 1 1


The main effect for factor D is confounded with the interaction between factors A and B; the main effect for factor E is confounded with the interaction between factors A and C; and so on.
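You can check these confoundings directly from the design generators: in ±1 coding, each generated factor column is the elementwise product of the basic columns, so a generated main effect is literally the same column as the interaction that generates it. A short numpy sketch (our own illustration) rebuilding the design above:

import numpy as np
from itertools import product

# Full 2^3 factorial in the basic factors A, B, C, in +/-1 coding.
A, B, C = np.array(list(product([-1, 1], repeat=3))).T

# Generators from the output above: D = A*B, E = A*C, F = B*C, G = A*B*C.
D, E, F, G = A * B, A * C, B * C, A * B * C
design = np.column_stack([A, B, C, D, E, F, G])
print(design)

# The D column is, by construction, identical to the A-by-B interaction
# column, so the main effect of D cannot be separated from A*B.
print(np.array_equal(D, A * B))   # True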

Example 4 Latin Squares

To generate a Latin square when each factor has four levels, enter the following DOE Wizard responses:


Wizard Prompt Response

Design Type Latin Square

The types available are: Ordinary Latin Square

Number of levels: 4

Randomize the design? No

Display the square? Yes

Display the factors for this design? Yes

Save the design to a file? No


The output is:

Latin Square: 4 levels.

A B C D
B C D A
C D A B
D A B C

Latin Square Design: 4 Levels.

Factor

RUN A B C
1 0 0 0
2 0 1 1
3 0 2 2
4 0 3 3
5 1 0 1
6 1 1 2
7 1 2 3
8 1 3 0
9 2 0 2
10 2 1 3
11 2 2 0
12 2 3 1
13 3 0 3
14 3 1 0
15 3 2 1
16 3 3 2

To generate this design using commands, enter the following:

DESIGN
LATIN / LEVELS=4 SQUARE LETTERS

Omitting SQUARE prevents the Latin square from appearing in the output.



Permutations

To randomly assign the factors to the cells, the input is:

Wizard Prompt Response

Design Type Latin Square
The types available are: Ordinary Latin Square
Number of levels: 4
Randomize the design? Yes
Display the square? Yes
Display the factors for this design? No
Save the design to a file? No

The resulting output is:

Latin Square: 4 levels.

D C A B
C A B D
B D C A
A B D C

Using commands:

DESIGN
LATIN / LEVELS=4 SQUARE LETTERS RAND

Example 5 Taguchi Design

To obtain a Taguchi L12 design with 11 factors, the DOE Wizard input is:


Wizard Prompt Response

Design Type Taguchi

Taguchi Design Type: L12

Display the factors for this design? Yes

Save the design to a file? No

Display confounding matrix? No


The output is:

Taguchi L12 Design (12 Runs, 11 Factors, 2 Levels Each)

Factor

RUN A B C D E F G H I J K
1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 2 2 2 2 2 2
3 1 1 2 2 2 1 1 1 2 2 2
4 1 2 1 2 2 1 2 2 1 1 2
5 1 2 2 1 2 2 1 2 1 2 1
6 1 2 2 2 1 2 2 1 2 1 1
7 2 1 2 2 1 1 2 2 1 2 1
8 2 1 2 1 2 2 2 1 1 1 2
9 2 1 1 2 2 2 1 2 2 1 1
10 2 2 2 1 1 1 1 2 2 1 2
11 2 2 1 2 1 2 1 1 1 2 2
12 2 2 1 1 2 1 2 1 2 2 1

To generate this design using commands, enter the following:

DESIGN
TAGUCHI / TYPE=L12

Design L16 with 15 Two-Level Factors Plus Aliases

To obtain a Taguchi L16 design with 15 factors, the input is:


Wizard Prompt Response

Design Type Taguchi

Taguchi Design Type: L16

Display the factors for this design? Yes

Save the design to a file? No

Display confounding matrix? Yes


The output is:

Taguchi L16 Design (16 Runs, 15 Factors, 2 Levels Each)

Factor

RUN A B C D E F G H I J K L M N O

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2

3 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2

4 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1

5 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2

6 1 2 2 1 1 2 2 2 2 1 1 2 2 1 1

7 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1

8 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2

9 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

10 2 1 2 1 2 1 2 2 1 2 1 2 1 2 1

11 2 1 2 2 1 2 1 1 2 1 2 2 1 2 1

12 2 1 2 2 1 2 1 2 1 2 1 1 2 1 2

13 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1

14 2 2 1 1 2 2 1 2 1 1 2 2 1 1 2

15 2 2 1 2 1 1 2 1 2 2 1 2 1 1 2

16 2 2 1 2 1 1 2 2 1 1 2 1 2 2 1

Confoundings For Each Pairwise Interaction

(Note that partial confoundings do not appear.)

FACTOR

FACTOR 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

1 1

2 3 2

3 2 1 3

4 5 6 7 4

5 4 7 6 1 5

6 7 4 5 2 3 6

7 6 5 4 3 2 1 7

8 9 10 11 12 13 14 15 8

9 8 11 10 13 12 15 14 1 9

10 11 8 9 14 15 12 13 2 3 10

11 10 9 8 15 14 13 12 3 2 1 11

12 13 14 15 8 9 10 11 4 5 6 7 12

13 12 15 14 9 8 11 10 5 4 7 6 1 13

14 15 12 13 10 11 8 9 6 7 4 5 2 3 14

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 15


This design can also be generated by the following commands:

DESIGN
PRINT=LONG
TAGUCHI / TYPE=L16

The matrix of confoundings identifies the factor pattern associated with the interaction between the row and column factors. For example, the factor pattern for the interaction between factors 6 and 8 is identical to the pattern for factor 14 (N).

Example 6 Plackett-Burman Design

To generate a Plackett-Burman design consisting of 11 two-level factors, the DOE Wizard input is:


Wizard Prompt Response

Design Type Plackett-Burman

Number of levels in design: 2

Runs per replication 12

Display the factors for this design? Yes

Save the design to a file? No

The output follows:

Plackett-Burman Design: 12 Runs, 11 Factors

Factor

RUN A B C D E F G H I J K

1 1 1 0 1 1 1 0 0 0 1 0

2 1 0 1 1 1 0 0 0 1 0 1

3 0 1 1 1 0 0 0 1 0 1 1

4 1 1 1 0 0 0 1 0 1 1 0

5 1 1 0 0 0 1 0 1 1 0 1

6 1 0 0 0 1 0 1 1 0 1 1

7 0 0 0 1 0 1 1 0 1 1 1

8 0 0 1 0 1 1 0 1 1 1 0

9 0 1 0 1 1 0 1 1 1 0 0

10 1 0 1 1 0 1 1 1 0 0 0

11 0 1 1 0 1 1 1 0 0 0 1

12 0 0 0 0 0 0 0 0 0 0 0


To generate this design using commands, the input is:

DESIGN
PLACKETT / RUNS=12

Example 7 Box-Behnken Design

Each factor in this example has three levels. The DOE Wizard input is:


Wizard Prompt Response

Design Type Box-Behnken

Number of Factors 3

Display the factors for this design? Yes

Save the design to a file? No

The output is:

Box-Behnken Design: 3 Factors, 15 Runs

Factor

RUN A B C

1 -1 -1 0

2 1 -1 0

3 -1 1 0

4 1 1 0

5 -1 0 -1

6 1 0 -1

7 -1 0 1

8 1 0 1

9 0 -1 -1

10 0 1 -1

11 0 -1 1

12 0 1 1

13 0 0 0

14 0 0 0

15 0 0 0


To generate this design using commands, the input is:

DESIGN
BOXBEHNKEN / FACTORS=3

Example 8 Mixture Design

We illustrate a lattice mixture design in which each of the three factors has five levels; that is, each component of the mixture is 0%, 25%, 50%, 75%, or 100% of the mixture for a given run, subject to the restriction that the sum of the percentages is 100. To generate the design for this situation using the DOE Wizard, enter the following responses at the corresponding prompt:


Wizard Prompt Response

Design Type Mixture Model

Are there to be constraints for any component(s)? No

The possible kinds of unconstrained design are: Lattice

Enter the number of mixture components: 3

Enter the number of levels for each component: 5

Display the factors for this design? Yes

Save the design to a file? No

The resulting mixture design follows:

Lattice Design: 3 Factors, 15 Runs, 5 Levels

Component

RUN A B C

1 1.000 .000 .000

2 .000 1.000 .000

3 .000 .000 1.000

4 .750 .250 .000

5 .750 .000 .250

6 .000 .750 .250

7 .500 .500 .000

8 .500 .000 .500

9 .000 .500 .500

10 .250 .750 .000

11 .250 .000 .750
12 .000 .250 .750
13 .500 .250 .250
14 .250 .500 .250
15 .250 .250 .500


To generate this design using commands, the input is:

DESIGN
MIXTURE / TYPE=LATTICE FACTORS=3, LEVELS=5

After collecting your data, you may want to display it in a triangular scatterplot.

Example 9 Mixture Design with Constraints

This example is adapted from an experiment reported in Cornell (1990, p. 265). The problem concerns the mixture of three plasticizers in the production of vinyl for car seats. We know that the combination of plasticizers must make up 79.5% of the mixture. There are further constraints on each of the plasticizers:

32.5% <= P1 <= 67.5%
0% <= P2 <= 20.0%
12.0% <= P3 <= 21.8%

Because we are interested in only the plasticizers, we can model them separately from the other components in the overall process. Taking this approach, we can reparameterize the components by dividing by 79.5%, giving

0.409 <= A <= 0.849
0 <= B <= 0.252
0.151 <= C <= 0.274
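The rescaled bounds are simply the original percentage bounds divided by 79.5; a quick check in Python (our own illustration):

# Divide each plasticizer's percentage bounds by the 79.5% total to get
# proportions of the plasticizer-only mixture.
bounds_pct = {'P1': (32.5, 67.5), 'P2': (0.0, 20.0), 'P3': (12.0, 21.8)}
for name, (lo, hi) in bounds_pct.items():
    print(name, round(lo / 79.5, 3), round(hi / 79.5, 3))
# P1 0.409 0.849, P2 0.0 0.252, P3 0.151 0.274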



We want to be sure that the design points span the feasible region adequately. To generate the design using the DOE Wizard, the responses to the prompts follow:

Wizard Prompt Response

Design Type Mixture Model

Are there to be constraints for any component(s)? Yes

The possible kind of constrained design are: Extreme vertices plus centroids

Enter the number of mixture components: 3

Enter the maximum dimension to be used to compute centroids: 1

How many such constraints do you wish to have? 5

Constraint 1: Enter the coefficient for factor 1: 1

Constraint 1: Enter the coefficient for factor 2: 0

Constraint 1: Enter the coefficient for factor 3: 0

Constraint 1: Enter an additive constant: -.409

Constraint 2: Enter the coefficient for factor 1: -1

Constraint 2: Enter the coefficient for factor 2: 0

Constraint 2: Enter the coefficient for factor 3: 0

Constraint 2: Enter an additive constant: .849

Constraint 3: Enter the coefficient for factor 1: 0

Constraint 3: Enter the coefficient for factor 2: -1

Constraint 3: Enter the coefficient for factor 3: 0

Constraint 3: Enter an additive constant: .252

Constraint 4: Enter the coefficient for factor 1: 0

Constraint 4: Enter the coefficient for factor 2: 0

Constraint 4: Enter the coefficient for factor 3: 1

Constraint 4: Enter an additive constant: -.151

Constraint 5: Enter the coefficient for factor 1: 0

Constraint 5: Enter the coefficient for factor 2: 0

Constraint 5: Enter the coefficient for factor 3: -1

Constraint 5: Enter an additive constant: .274

Specify the tolerance for checking constraints and duplication of points: .00001

Display the factors for this design? Yes

Save the design to a file? No


The constrained mixture design output follows:

The following are index numbers of input constraints found to be redundant:

1
2

Extreme Vertices + Centroids Design: 3 Factors, 9 Runs, 4 Vertices

Component

RUN A B C

1 .849 .000 .151

2 .597 .252 .151

3 .726 .000 .274

4 .474 .252 .274

5 .787 .000 .213

6 .535 .252 .213

7 .723 .126 .151

8 .600 .126 .274

9 .661 .126 .213

The design contains nine runs: four points at the extreme vertices of the feasible region, four points at the edge centroids, and one point at the overall centroid. The following plot displays the constrained region for the mixture as a blue parallelogram with the actual design points represented as red filled circles.

[Figure: triangular plot of components A, B, and C, each axis scaled from 0.0 to 1.0, showing the constrained region as a parallelogram with the nine design points marked.]


Example 10 Central Composite Response Surface Design

In an industrial experiment reported by Aia et al. (1961), the authors investigated the response surface of a chemical process for producing dihydrated calcium hydrogen orthophosphate (CaHPO4 • 2H2O). The factors of interest are the ratio of NH3 to CaCl2 in the calcium chloride solution, the addition time of the NH3-CaCl2 mixture, and the beginning pH of the NH4H2PO4 solution used. We will now see how this experiment would be designed using the DOE Wizard.

For efficiency and rotatability, we use a central composite design with three factors. The central composite design consists of a 2^k factorial (or fraction thereof), a set of 2k axial (or “star”) points on the axes of the design space, and some number of center points. The distance between the axial points and the center of the design determines important properties of the design. In SYSTAT, the distance used ensures rotatability for unblocked designs. For blocked designs, the distance ensures orthogonality of blocks.

The choice of number of center points hinges on desired properties of the design. Orthogonal designs (designs in which the factors are uncorrelated) minimize the average variance of prediction of the response surface equation. However, in some cases, you may decide that it is more important to have the variance of predictions be nearly constant throughout most of the experimental region, even if the overall variance of predictions is increased somewhat. In such situations, we can use designs in which the variance of predictions is the same at the center of the design as it is at any point one unit distant from the center. This property of equal variance between the center of the design and points one unit from the center is called uniform precision. In this example, we sacrifice orthogonality in favor of uniform precision. Therefore, we use six center points instead of the nine points required to make the design nearly orthogonal. (A table of orthogonal and uniform precision designs with appropriate numbers of center points can be found in Montgomery, 1991, p. 546.)

The input to generate the central composite design follows:

Wizard Prompt Response

Design Type Central Composite

Enter the number of factors desired: 3

Are the cube and star portions of the design to be separate blocks? No

Enter number of center points desired: 6

Display the factors for this design? Yes

Save the design to a file? No


The resulting design is:

Second-order Composite Design: 3 Factors, 20 Runs

Factor

RUN A B C
1 -1.000 -1.000 -1.000
2 -1.000 -1.000 1.000
3 -1.000 1.000 -1.000
4 -1.000 1.000 1.000
5 1.000 -1.000 -1.000
6 1.000 -1.000 1.000
7 1.000 1.000 -1.000
8 1.000 1.000 1.000
9 -1.682 .000 .000
10 1.682 .000 .000
11 .000 -1.682 .000
12 .000 1.682 .000
13 .000 .000 -1.682
14 .000 .000 1.682
15 .000 .000 .000
16 .000 .000 .000
17 .000 .000 .000
18 .000 .000 .000
19 .000 .000 .000
20 .000 .000 .000

In the central composite design, each factor is measured at five different levels. The runs with no zeros for the factors are the factorial (“cube”) points, the runs with only one nonzero factor are the axial (“star”) points, and the runs with all zeros are the center points.
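The ±1.682 axial values in this design are the rotatable axial distance. For a central composite design whose cube portion is a full factorial, the standard choice is the fourth root of the number of cube points (a textbook rotatability result, not a quote from the SYSTAT documentation); a quick check in Python:

# Rotatable axial ("star") distance when the cube portion is a full
# 2^k factorial: the fourth root of the number of cube points.
k = 3
alpha = (2 ** k) ** 0.25
print(round(alpha, 3))   # 1.682, matching the axial runs above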

After collecting data according to this design, fit a response surface to analyze the results.

Example 11 Optimal Designs: Coordinate Exchange

Consider a situation in which you want to compute a response surface but your resources are very limited. Assume that you have three continuous factors but can afford only 12 runs. This number of runs is not enough for any of the standard response surface models. However, you can generate a design with 12 runs that will allow you to estimate the effects of interest using an optimal design.

To generate the design using the DOE Wizard, the responses to the prompts follow:

Wizard Prompt Response

Design Type Optimal

Choose the method to use: Coordinate Exchange

Choose the type of optimality desired: D-optimality

Specify the number of points to replace in a single iteration: 1

Specify the maximum number of iterations within a trial: 100

Specify the relative convergence tolerance: .00001

Specify the number of trials to be run: 3

Random number seed: 131

The starting design is to be: Generated by the program.

Enter the number of factors desired: 3

How many points (runs) are desired? 12

The variables in the design are: All continuous

Limits for factor A: lower limit = -1, upper limit = 1
Limits for factor B: lower limit = -1, upper limit = 1
Limits for factor C: lower limit = -1, upper limit = 1

Does the model for your desired design contain an additive constant? Yes

Define other effects to be included in the model:

A*A B*B C*C A*B A*C B*C A*B*C

Display the factors for this design? Yes

Save the design to a file? No

Display the factors for this design? Yes

Save the design to a file? No

Display the factors for this design? Yes

Save the design to a file? No


The design that was output on the third trial follows:

Design from Coordinate-exchange Algorithm: 12 Runs, 3 Factors, k = 1

Factor

RUN A B C
1 -1.000 1.000 -0.046
2 -0.038 -0.000 -1.000
3 -1.000 1.000 -1.000
4 1.000 1.000 -1.000
5 1.000 1.000 1.000
6 -1.000 -1.000 -1.000
7 1.000 -1.000 -1.000
8 -1.000 1.000 1.000
9 1.000 -0.046 0.001
10 -1.000 -1.000 1.000
11 0.081 -1.000 -0.036
12 1.000 -1.000 1.000

The points shown here were generated from a particular run of the algorithm. Since the initial design depends on a randomly chosen starting point, your design may vary slightly from the design shown here. Your design should share several characteristics with this one, however. First, notice that most values appear to be very close to one of three values: –1, 0, or +1. For the purposes of conceptual discussion, we can act as if the values were rounded to the nearest integer. We can see that the design includes the eight corners of the design space (the runs where all values are either –1 or +1). The design also includes three points that are face centers (runs where two values are near 0), and one edge point (where only one value is near 0).

This design will allow you to estimate all first- and second-order effects in your model. Of course, you will not have as much precision as you would if you had used a Box-Behnken or central composite design, because you don’t have as much information to work with. You also lose some of the other advantages of the standard designs, such as rotatability. However, because the design is optimized with respect to generalized variance of parameter estimates, you will be getting as much information as you can out of your 12 runs.



References

Aia, M. A., Goldsmith, R. L., and Mooney, R. W. (1961). Precipitating stoichiometric CaHPO4·2H2O. Industrial and Engineering Chemistry, 53, 55–57.

Box, G. E. P., and Behnken, D. W. (1960). Some new three level designs for the study of quantitative variables. Technometrics, 2, 455–476.

Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters. New York: Wiley.

Cochran, W. G., and Cox, G. M. (1957). Experimental Designs, 2nd ed. New York: John Wiley & Sons, Inc.

Cornell, J. A. (1990). Experiments with Mixtures. New York: Wiley.

Fedorov, V. V. (1972). Theory of Optimal Experiments. New York: Academic Press.

Galil, Z., and Kiefer, J. (1980). Time- and space-saving computer methods, related to Mitchell's DETMAX, for finding D-optimum designs. Technometrics, 21, 301–313.

John, P. W. M. (1971). Statistical Design and Analysis of Experiments. New York: Macmillan.

John, P. W. M. (1990). Statistical Methods in Engineering and Quality Assurance. New York: Wiley.

Johnson, M. E., and Nachtsheim, C. J. (1983). Some guidelines for constructing exact D-optimal designs on convex design spaces. Technometrics, 25, 271–277.

Meyer, R. K., and Nachtsheim, C. J. (1995). The coordinate-exchange algorithm for constructing exact optimal designs. Technometrics, 37, 60–69.

Montgomery, D. C. (1991). Design and Analysis of Experiments. New York: Wiley.

Plackett, R. L., and Burman, J. P. (1946). The design of optimum multifactorial experiments. Biometrika, 33, 305–325.

Schneider, A. M., and Stockett, A. L. (1963). An experiment to select optimum operating conditions on the basis of arbitrary preference ratings. Chemical Engineering Progress Symposium Series, No. 42, Vol. 59.

Taguchi, G. (1986). Introduction to Quality Engineering. Tokyo: Asian Productivity Organization.

Taguchi, G. (1987). System of Experimental Design (2 volumes). New York: UNIPUB/Kraus International Publications.

Chapter 11

Discriminant Analysis

Laszlo Engelman

Discriminant Analysis performs linear and quadratic discriminant analysis, providing linear or quadratic functions of the variables that “best” separate cases into two or more predefined groups. The variables in the linear function can be selected in a forward or backward stepwise manner, either interactively by the user or automatically by SYSTAT. For the latter, at each step, SYSTAT enters the variable that contributes most to the separation of the groups (or removes the variable that is the least useful).

The command language allows you to emphasize the difference between specific groups; contrasts can be used to guide variable selection. Cases can be classified even if they are not used in the computations.

Discriminant analysis is related to both multivariate analysis of variance and multiple regression. The cases are grouped in cells like a one-way multivariate analysis of variance and the predictor variables form an equation like that for multiple regression. In discriminant analysis, Wilks’ lambda, the same test statistic used in multivariate ANOVA, is used to test the equality of group centroids. Discriminant analysis can be used not only to test multivariate differences among groups, but also to explore:

• Which variables are most useful for discriminating among groups

• Whether one subset of variables performs as well as another

• Which groups are most alike and most different


Statistical Background

When we have categorical variables in a model, it is often because we are trying to classify cases; that is, what group does someone or something belong to? For example, we might want to know whether someone with a grade point average (GPA) of 3.5 and an Advanced Psychology Test score of 600 is more like the group of graduate students successfully completing a Ph.D. or more like the group that fails. Or, we might want to know whether an object with a plastic handle and no concave surfaces is more like a wrench or a screwdriver.

Once we attempt to classify, our attention turns from parameters (coefficients) in a model to the consequences of classification. We now want to know what proportion of subjects will be classified correctly and what proportion incorrectly. Discriminant analysis is one method for answering these questions.

Linear Discriminant Model

If we know that our classifying variables are normally distributed within groups, we can use a classification procedure called linear discriminant analysis (Fisher, 1936). Before we present the method, however, we should warn you that the procedure requires you to know that the groups share a common covariance matrix and you must know what the covariance matrix values are. We have not found an example of discriminant analysis in the social sciences where this was true. The most appropriate applications we have found are in engineering, where a covariance matrix can be deduced from physical measurements. Discriminant analysis is used, for example, in automated vision systems for detecting objects on moving conveyer belts.

Why do we need to know the covariance matrix? We are going to use it to calculate Mahalanobis distances (developed by the Indian statistician Prasanta C. Mahalanobis). These distances are calculated between cases we want to classify and the center of each group in a multidimensional space. The closer a case is to the center of one group (relative to its distance to other groups), the more likely it is to be classified as belonging to that group. The figure on p. 277 shows what we are doing.

The borders of this graph comprise the two predictors GPA and GRE. The two “hills” are centered at the mean values of the two groups (No Ph.D. and Ph.D.). Most of the data in each group are supposed to be under the highest part of each hill. The hills, in other words, mathematically represent the concentration of data values in the scatterplot beneath.


The shape of the hills was computed from a bivariate normal distribution using the covariance matrix averaged within groups. We’ve plotted this figure this way to show you that this model is like pie-in-the-sky if you use the information in the data below to compute the shape of these hills. As you can see, there is a lot of smoothing of the data going on, and if one or two data values in the scatterplot influence unduly the shape of the hills above, you will have an unrepresentative model when you try to use it on new samples.

How do we classify a new case into one group or another? Look at the figure again. The “new case” could belong to one or the other group. It’s more likely to belong to the closer group, however. The simple way to find how far this case is from the center of each group would be to take a direct walk from the new case to the center of each group in the data plot.


Instead of walking in sample data space below, however, we must climb the hills of our theoretical model above when using the normal classification model. In other words, we will use our theoretical model to calculate distances. The covariance matrix we used to draw the hills in the figure makes distances depend on the direction we are heading. The distance to a group is thus proportional to the altitude (not the horizontal distance) we must climb to get to the top of the corresponding hill.

Because these hills can be oblong in shape, it is possible to be quite far from the top of the hill as the crow flies, yet have little altitude to cover in a climb. Conversely, it is possible to be close to the center of the hill and have a steep climb to get to the top. Discriminant analysis adjusts for the covariance that causes these eccentricities in hill shape. That is why we need the covariance matrix in the first place.
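In matrix terms, the squared Mahalanobis distance from a case x to a group centroid m is (x − m)′S⁻¹(x − m), where S is the pooled within-group covariance matrix. A minimal numpy sketch (our own illustration), using the pooled covariance and group means from the SYSTAT output shown later in this chapter:

import numpy as np

# Pooled within-group covariance and group means, taken from the SYSTAT
# output shown later in this chapter (order: GPA, then GRE).
S = np.array([[0.095,    1.543],
              [1.543, 4512.409]])
means = {'No Ph.D.': np.array([4.423, 590.490]),
         'Ph.D.':    np.array([4.639, 643.448])}
S_inv = np.linalg.inv(S)

def mahalanobis_sq(x, m):
    d = x - m
    return d @ S_inv @ d

x = np.array([4.5, 620.0])   # a hypothetical applicant
for group, m in means.items():
    print(group, round(mahalanobis_sq(x, m), 3))
# With equal priors, classify into the group with the smaller distance.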

So much for the geometric representation. What do the numbers look like? Let’s look at how to set up the problem with SYSTAT. The input is:

The output is:

DISCRIM
USE ADMIT
PRINT LONG
MODEL PHD = GRE,GPA
ESTIMATE

Group frequencies
-----------------
                   1        2
Frequencies       51       29

Group means
-----------
                   1        2
GPA            4.423    4.639
GRE          590.490  643.448

Pooled within covariance matrix -- DF= 78
-----------------------------------------
             GPA       GRE
GPA        0.095
GRE        1.543  4512.409

Within correlation matrix
-------------------------
             GPA       GRE
GPA        1.000
GRE        0.075     1.000

Total covariance matrix -- DF= 79
---------------------------------
             GPA       GRE
GPA        0.104
GRE        4.201  5111.610

Total correlation matrix
------------------------
             GPA       GRE
GPA        1.000
GRE        0.182     1.000


Between groups F-matrix -- df = 2 77
------------------------------------
             1        2
1          0.0
2        9.469      0.0

Wilks lambda
Lambda = 0.8026   df = 2 1 78
Approx. F = 9.4690   df = 2 77   prob = 0.0002

Classification functions
------------------------
                  1          2
Constant   -133.910   -150.231
GPA          44.818     46.920
GRE           0.116      0.127

Classification matrix (cases in row categories classified into columns)
------------------------------------------------------------------------
            1     2   %correct
1          38    13         75
2           7    22         76
Total      45    35         75

Jackknifed classification matrix
--------------------------------
            1     2   %correct
1          37    14         73
2           7    22         76
Total      44    36         74

Eigen      Canonical      Cumulative proportion
values     correlations   of total dispersion
------     ------------   ---------------------
0.246      0.444          1.000

Wilks lambda = 0.803            Approx. F = 9.469   DF = 2, 77   p-tail = 0.0002
Pillai’s trace = 0.197          Approx. F = 9.469   DF = 2, 77   p-tail = 0.0002
Lawley-Hotelling trace = 0.246  Approx. F = 9.469   DF = 2, 77   p-tail = 0.0002

Canonical discriminant functions
--------------------------------
                  1
Constant    -15.882
GPA           2.064
GRE           0.011

Canonical discriminant functions -- standardized by within variances
---------------------------------------------------------------------
             1
GPA      0.635
GRE      0.727

Canonical scores of group means
-------------------------------
1    -0.369
2     0.649

There’s a lot to follow on this output. The counts and means per group are shown first. Next comes the Pooled within covariance matrix, computed by averaging the separate-group covariance matrices, weighting by group size. The Total covariance matrix ignores the groups. It includes variation due to the group separation. These are the same matrices found in the MANOVA output with PRINT=LONG. The Between groups F-matrix shows the F value for testing the difference between each pair of groups on all the variables (GPA and GRE). The Wilks’ lambda is for the multivariate test of dispersion among all the groups on all the variables, just as in MANOVA. Each case is classified by our model into the group whose classification function yields the largest score. Each function is like a regression equation. We compute the predicted value of each equation for a case’s values on GPA and GRE and classify the case into the group whose function yields the largest value.

Next come the separate F statistics for each variable and the Classification matrix. The goodness of classification is comparable to that for the PROBIT model. We did a little worse with the No Ph.D. group and a little better with the Ph.D. The Jackknifed classification matrix is an attempt to approximate cross-validation. It will tend to be somewhat optimistic, however, because it uses only information from the current sample, leaving out single cases to classify the remainder. There is no substitute for trying the model on new data.

Finally, the program prints the same information produced in a MANOVA by SYSTAT’s MGLH (GLM and ANOVA). The multivariate test statistics show the groups are significantly different on GPA and GRE taken together.

Linear Discriminant Function

We mentioned in the last section that the canonical coefficients are like a regression equation for computing distances up the hills. Let’s look more closely at these coefficients. The following figure shows the plot underlying the surface in the last figure. Superimposed at the top of the GRE axis are two normal distributions centered at the means for the two groups. The standard deviations of these normal distributions are computed within groups. The within-group standard deviation is the square root of the diagonal GRE variance element of the residual covariance matrix (4512.409). The same is done for GPA on the right, using square root of the within-groups variance (0.095) for the standard deviation and the group means for centering the normals.

[Figure: scatterplot of GPA (3.5 to 5.5) against GRE (400 to 800) with cases marked Y (Ph.D.) and N (No Ph.D.); normal curves for each variable are drawn along the axes, and the canonical discriminant axis with its perpendicular dashed cutting line is superimposed.]


Either of these variables separates the groups somewhat. The diagonal line underlying the two diagonal normal distributions represents a linear combination of these two variables. It is computed using the canonical discriminant functions in the output. These are the same as the canonical coefficients produced by MGLH. Before applying these coefficients, the variables must be standardized by the within-group standard deviations. Finally, the dashed line perpendicular to this diagonal cuts the observations into two groups: those to the left and those to the right of the dashed line.

You can see that this new canonical variable and its perpendicular dashed line are an orthogonal (right-angle-preserving) rotation of the original axes. The separation of the two groups using normal distributions drawn on the rotated canonical variable is slightly better than that for either variable alone. To classify on the linear discriminant axis, make the mean on this new variable 0 (halfway between the two diagonal normal curves). Then add a scale along the diagonal, running from negative to positive. If we do this, then any observations with negative scores on this diagonal scale will be classified into the No Ph.D. group (to the left of the dashed perpendicular bisector) and those with positive scores into the Ph.D. (to the right). All Y’s to the left of the dashed line and N’s to the right are misclassifications. Try rotating these axes any other way to get a better count of correctly classified cases (watch out for ties). The linear discriminant function is the best rotation.

Using this linear discriminant function variable, we get the same classifications we got with the Mahalanobis distance method. Before computers, this was the preferred method for classifying because the computations are simpler.


We just use the equation:

Fz = 0.635*ZGPA + 0.727*ZGRE

The two Z variables are the raw scores minus the overall mean divided by the within-groups standard deviations. If Fz is less than 0, classify No Ph.D.; otherwise, classify Ph.D.
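As a sketch in Python (our own illustration): the within-group standard deviations are the square roots of the pooled variances printed earlier (0.095 for GPA, 4512.409 for GRE), and the overall means below are back-calculated by us from the reported group means and group sizes, so treat them as an approximation:

import numpy as np

# Pooled within-group SDs: square roots of the variances printed earlier.
sd_gpa, sd_gre = np.sqrt(0.095), np.sqrt(4512.409)

# Overall means derived from the group means and group sizes (51 and 29);
# these are our back-calculated values, not figures from the output.
mean_gpa = (51 * 4.423 + 29 * 4.639) / 80
mean_gre = (51 * 590.490 + 29 * 643.448) / 80

def fz(gpa, gre):
    z_gpa = (gpa - mean_gpa) / sd_gpa
    z_gre = (gre - mean_gre) / sd_gre
    return 0.635 * z_gpa + 0.727 * z_gre

score = fz(4.7, 650.0)   # a hypothetical applicant
print('Ph.D.' if score > 0 else 'No Ph.D.', round(score, 3))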

As we mentioned, the Mahalanobis method and the linear discriminant function method are equivalent. This is somewhat evident in the figure. The intersection of the two hills is a straight line running from the northwest to the southeast corner in the same orientation as the dashed line. Any point to the left of this line will be closer to the top of the left hill, and any point to the right will be closer to the top of the right hill.

Prior Probabilities

Our sample contained fewer Ph.D.s than No Ph.D.s. If we want to use our discriminant model to classify new cases and if we believe that this difference in sample sizes reflects proportions in the population, then we can adjust our formula to favor No Ph.D.s. In other words, we can make the prior probabilities (assuming we know nothing about GRE and GPA scores) favor a No Ph.D. classification. We can do this by adding the option

PRIORS = 0.625, 0.375

to the MODEL command. Do not be tempted to use this method as a way of improving your classification table. If the probabilities you choose do not reflect real population differences, then new samples will on average be classified worse. It would make sense in our case because we happen to know that more people in our department tend to drop out than stay for the Ph.D.

You might have guessed that the default setting is for prior probabilities to be equal (both 0.5). In the last figure, this makes the dashed line run halfway between the means of the two groups on the discriminant axis. By changing the priors, we move this dashed line (the normal distributions stay in the same place).
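In the standard normal-theory formulation of linear discriminant analysis, priors enter each group's classification function as an additive log-prior term (a textbook result; we have not verified that SYSTAT's internal computation is organized exactly this way). A sketch:

import math

# Classification-function constants from the output shown earlier,
# shifted by log(prior); the relative shift is what moves the boundary.
constants = {'No Ph.D.': -133.910, 'Ph.D.': -150.231}
priors = {'No Ph.D.': 0.625, 'Ph.D.': 0.375}
adjusted = {g: c + math.log(priors[g]) for g, c in constants.items()}
print(adjusted)   # the No Ph.D. function gains about 0.51 over Ph.D.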

Multiple Groups

The discriminant model generalizes to more than two groups. Imagine, for example, three hills in the first figure. All the distances and classifications are computed in the

Page 303: Statistics I

I-283

Discriminant Analysis

same manner. The posterior probabilities for classifying cases are computed by comparing three distances rather than two.

The multiple group (canonical) discriminant model yields more than one discriminant axis. For three groups, we get two sets of canonical discriminant coefficients. For four groups, we get three. If we have fewer variables than groups, then we get only as many sets as there are variables. The group classification function coefficients are handy for classifying new cases with the multiple group model. Simply multiply each coefficient times each variable and add in the constant. Then assign the case to the group whose set yields the largest value.
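For the two-group example, this argmax rule looks as follows in plain Python (our own illustration, using the classification-function coefficients printed in the output earlier); exactly the same logic extends to three or more groups:

# Classification functions from the output: constant, GPA and GRE weights.
functions = {
    'No Ph.D.': (-133.910, 44.818, 0.116),
    'Ph.D.':    (-150.231, 46.920, 0.127),
}

def classify(gpa, gre):
    scores = {g: c0 + c1 * gpa + c2 * gre
              for g, (c0, c1, c2) in functions.items()}
    return max(scores, key=scores.get), scores

group, scores = classify(4.5, 620.0)   # a hypothetical case
print(group, scores)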

Discriminant Analysis in SYSTAT

Discriminant Analysis Main Dialog Box

To open the Discriminant Analysis dialog box, from the menus choose:

Statistics
  Classification
    Discriminant Analysis...

The following options can be specified:

Quadratic. The Quadratic check box requests quadratic discriminant analysis. If not selected, linear discriminant analysis is performed.

Save. For each case, Distances saves the Mahalanobis distances to each group centroid and the posterior probability of the membership in each group. Scores saves the canonical variable scores. Scores/Data and Distances/Data save scores and distances along with the data.

Discriminant Analysis Options

SYSTAT includes several controls for stepwise model building and tolerance. To access these options, click Options in the main dialog box.

The following can be specified:

Tolerance. The tolerance sets the matrix inversion tolerance limit. Tolerance = 0.001 is the default.

Two estimation options are available:

• Complete. All variables are used in the model.

• Stepwise. Variables can be selected in a forward or backward stepwise manner, either interactively by the user or automatically by SYSTAT.

If you select stepwise estimation, you can specify the direction in which the estimation should proceed, whether SYSTAT should control variable entry and elimination, and any desired criteria for variable entry and elimination.

• Backward. In backward stepping, all variables are entered, irrespective of their F-to-enter values (if a variable fails the Tolerance limit, however, it is excluded). F-to-remove and F-to-enter values are reported. When Backward is selected along with Automatic, at each step, SYSTAT removes the variable with the lowest F-to-remove value that passes the Remove limit of the F statistic (or reenters the variable with the largest F-to-enter above the Remove limit of the F statistic).

• Forward. In forward stepping, variables are entered into the model one at a time. F-to-enter values are reported for all candidate variables, and F-to-remove values are reported for forced variables. When Forward is selected along with Automatic, at each step, SYSTAT enters the variable with the highest F-to-enter that passes the Enter limit of the F statistic (or removes the variable with the lowest F-to-remove below the Remove limit of the F statistic).

• Automatic. SYSTAT enters or removes variables automatically. F-to-enter and F-to-remove limits are used (a sketch of this rule follows the criteria list below).

• Interactive. Variables are interactively removed from and/or added to the model at each step. In the Command pane, type a STEP command to enter and remove variables interactively.

Variables are added to or eliminated from the model based on one of two possible criteria.

• Probability. Variables with probability (F-to-enter) smaller than the Enter probability are entered into the model if Tolerance permits. The default Enter value is 0.15. For highly correlated predictors, you may want to set Enter = 0.01. Variables with probability (F-to-remove) larger than the Remove probability are removed from the model. The default Remove value is 0.15.

• F-statistic. Variables with F-to-enter values larger than the Enter F value are entered into the model if Tolerance permits. The default Enter value is 4. Variables with F-to-remove values smaller than the Remove F value are removed from the model. The default Remove value is 3.9.
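A minimal sketch of the automatic rule under the F-statistic criterion; f_to_enter is a hypothetical stand-in for the F statistic SYSTAT computes internally:

# Automatic forward stepping: enter the candidate with the largest
# F-to-enter as long as it passes the Enter limit (default F = 4).
def auto_forward(candidates, in_model, f_to_enter, f_enter=4.0):
    while candidates:
        best = max(candidates, key=lambda v: f_to_enter(v, in_model))
        if f_to_enter(best, in_model) <= f_enter:
            break                          # no candidate passes; stepping stops
        candidates.remove(best)
        in_model.append(best)              # enter the winning candidate
    return in_model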

STEP One variable is entered into or removed from the model (based on the Enter and Remove limits of the F statistic).

STEP + Variable with the largest F-to-enter is entered into the model (irrespective of the Enter limit of the F statistic).

STEP – Variable with the smallest F-to-remove is removed from the model (irrespective of the Remove limit of the F statistic).

STEP c, e Variables named c and e are stepped into/out of the model (irrespective of the Enter and Remove limits of the F statistic).

STEP 3, 5 Third and fifth variables are stepped into/out of the model (irrespective of the Enter and Remove limits of the F statistic).

STEP/NUMBER = 3 Three variables are entered into or removed from the model.

STOP Stops the stepping and generates final output (classification matrices, eigenvalues, canonical variables, etc.).


You can also specify variables to include in the model, regardless of whether they meet the criteria for entry into the model. In the Force text box, enter the number of variables, in the order in which they appear in the Variables list, to force into the model (for example, Force = 2 means include the first two variables on the Variables list in the main dialog box). Force = 0 is the default.

Discriminant Analysis Statistics

You can select any desired output elements by clicking Statistics in the main dialog box.

All selected statistics will be displayed in the output. Depending on the specified length of your output, you may also see additional statistics. By default, the print length is set to Short (you will see all of the statistics on the Short Statistics list). To change the length of your output, choose Options from the Edit menu. Select Short, Medium, or Long from the Length drop-down list. Again, all selected statistics will be displayed in the output, regardless of the print setting.

Short Statistics. Options for Short Statistics are FMatrix (between-groups F matrix), FStats (F-to-enter/remove statistics), Eigen (eigenvalues and canonical correlation), CMeans (canonical scores of group means), and Sum (summary panel).

Medium Statistics. Options for Medium Statistics are those for Short Statistics plus Means (group frequencies and means), Wilks (Wilks’ lambda and approximate F), CFunc (discriminant functions), Traces (Lawley-Hotelling and Pillai and Wilks’ traces), CDFunc (canonical discriminant functions), SCDFunc (standardized canonical discriminant functions), Class (classification matrix), and JClass (Jackknifed classification matrix).

Long Statistics. Options for Long Statistics are those for Medium Statistics plus WCov (within covariance matrix), WCorr (within correlation matrix), TCov (total covariance matrix), TCorr (total correlation matrix), GCov (groupwise covariance matrix), and GCorr (groupwise correlation matrix).

Mahalanobis distances, posterior probabilities (Mahal), and canonical scores (CScore) for each case must be specified individually.

Using Commands

Select your data by typing USE filename and continue as follows:

In addition to indicating a length for the PRINT output, you can select elements not included in the output for the specified length. Elements for each length include:

Basic
DISCRIM
  MODEL grpvar = varlist / QUADRATIC PRIORS=n1,n2,…
  CONTRAST [matrix]
  PRINT / length element
  SAVE / DATA SCORES DISTANCES
  ESTIMATE / TOL=n

Stepwise (Instead of ESTIMATE, specify START)

START / FORWARD TOL=n ENTER=p REMOVE=p FENTER=n FREMOVE=n FORCE=n
        BACKWARD
STEP    no argument, or / NUMBER=n AUTO ENTER=p REMOVE=p FENTER=n FREMOVE=n,
        or + or -, or varlist, or nvari, nvarj, … (sequence of STEPs)
STOP

Length Element

SHORT FMATRIX FSTATS EIGEN CMEANS SUM CLASS JCLASS

MEDIUM MEANS WILKS CFUNC TRACES CDFUNC SCDFUNC

LONG WCOV WCOR TCOV TCOR GCOV GCOR


MAHAL and CSCORE must be specified individually. No length specification includes these statistics.

Usage Considerations

Types of data. DISCRIM uses rectangular data only.

Print options. Print options allow the user to select panels of output to display, including group means, variances, covariances, and correlations.

Quick Graphs. For two canonical variables, SYSTAT produces a canonical scores plot, in which the axes are the canonical variables and the points are the canonical variable scores. This plot includes confidence ellipses for each group. For analyses involving more than two canonical variables, SYSTAT displays a SPLOM of the first three canonical variables.

Saving files. You can save the Mahalanobis distances to each group centroid (with the posterior probability of the membership in each group) or the canonical variable scores.

BY groups. DISCRIM analyzes data by groups.

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. DISCRIM uses a FREQ variable to increase the number of cases.

Case weights. You can weight each case in a discriminant analysis using a weight variable. Use a binary weight variable coded 0 and 1 for cross-validation. Cases that have a zero weight do not influence the estimation of the discriminant functions but are classified into groups.
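A minimal sketch of that cross-validation device, with scikit-learn's linear discriminant analysis standing in for DISCRIM:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Cases with weight 0 do not influence estimation but are still classified.
def classify_with_binary_weights(X, y, w):
    train = (w == 1)
    model = LinearDiscriminantAnalysis().fit(X[train], y[train])
    return model.predict(X)                # zero-weight cases classified too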

Examples

Example 1 Discriminant Analysis Using Complete Estimation

In this example, we examine measurements made on 150 iris flowers: sepal length, sepal width, petal length, and petal width (in centimeters). The data are from Fisher (1936) and are grouped by species: Setosa, Versicolor, and Virginica (coded as 1, 2, and 3, respectively).

The goal of the discriminant analysis is to find a linear combination of the four measures that best classifies or discriminates among the three species (groups of flowers). Here is a SPLOM of the four measures with within-group bivariate confidence ellipses and normal curves. The input is:

The plot follows:

Let’s see what a default analysis tells us about the separation of the groups and the usefulness of the variables for the classification. The input is:

Note the shortcut notation (..) in the MODEL statement for listing consecutive variables in the file (otherwise, simply list each variable name separated by a space).

DISCRIM
USE iris
SPLOM sepallen..petalwid / HALF GROUP=species ELL DENSITY=NORM OVERLAY

USE iris
LABEL species / 1="Setosa", 2="Versicolor", 3="Virginica"
DISCRIM
MODEL species = sepallen .. petalwid
PRINT / MEANS
ESTIMATE

[SPLOM of SEPALLEN, SEPALWID, PETALLEN, and PETALWID with within-group confidence ellipses and normal curves, grouped by SPECIES]


The output follows:

Group frequencies
-----------------
                Setosa   Versicolor   Virginica
 Frequencies        50           50          50

Group means
-----------
 SEPALLEN       5.0060       5.9360      6.5880
 SEPALWID       3.4280       2.7700      2.9740
 PETALLEN       1.4620       4.2600      5.5520
 PETALWID       0.2460       1.3260      2.0260

Between groups F-matrix -- df = 4 144
----------------------------------------------
                 Setosa   Versicolor   Virginica
 Setosa             0.0
 Versicolor    550.1889          0.0
 Virginica    1098.2738     105.3127         0.0

Variable   F-to-remove  Tolerance  |  Variable   F-to-enter  Tolerance
-------------------------------------+-------------------------------------
 2 SEPALLEN       4.72   0.347993  |
 3 SEPALWID      21.94   0.608859  |
 4 PETALLEN      35.59   0.365126  |
 5 PETALWID      24.90   0.649314  |

Classification matrix (cases in row categories classified into columns)
---------------------
              Setosa  Versicolo  Virginica  %correct
 Setosa           50          0          0       100
 Versicolor        0         48          2        96
 Virginica         0          1         49        98
 Total            50         49         51        98

Jackknifed classification matrix
--------------------------------
              Setosa  Versicolo  Virginica  %correct
 Setosa           50          0          0       100
 Versicolor        0         48          2        96
 Virginica         0          1         49        98
 Total            50         49         51        98

 Eigen     Canonical     Cumulative proportion
 values    correlations  of total dispersion
 --------- ------------  ---------------------
  32.192      0.985            0.991
   0.285      0.471            1.000

Canonical scores of group means
-------------------------------
 Setosa          7.608     .215
 Versicolor     -1.825    -.728
 Virginica      -5.783     .513


Group Frequencies

The Group frequencies panel shows the count of flowers within each group and the means for each variable. If the group code or one or more measures are missing, the case is not used in the analysis.

Between Groups F-Matrix

For each pair of groups, use these F statistics to test the equality of group means. These values are proportional to distance measures and are computed from Mahalanobis D² statistics. Thus, the centroids for Versicolor and Virginica are closest (105.3); those for Setosa and Virginica (1098.3) are farthest apart. If you explore differences among several pairs, don’t use the probabilities associated with these F’s as a test because of the simultaneous inference problem. Compare the relative size of these values with the distances between group means in the canonical variable plot.

F Statistics and Tolerance

Use F-to-remove statistics to determine the relative importance of variables included in the model. The numerator degrees of freedom for each F is the number of groups minus 1, and the denominator df is (total sample size) – (number of groups) – (number of variables in the model) + 1; for example, for these data, 3 – 1 and 150 – 3 – 4 + 1, or 2 and 144. Because you may be scanning F’s for several variables, do not use the probabilities from the usual F tables for a test. Here we conclude that SEPALLEN is least helpful for discriminating among the species (F = 4.72).

Classification Tables

In the Classification matrix, each case is classified into the group where the value of its classification function is largest. For Versicolor (row name), 48 flowers are classified correctly and 2 are misclassified (classified as Virginica)—96% of the Versicolor flowers are classified correctly. Overall, 98% of the flowers are classified correctly (see the last row of the table). The results in the first table can be misleading because we evaluated the classification rule using the same cases used to compute it. They may provide an overly optimistic estimate of the rule’s success. The Jackknifed classification matrix attempts to remedy the problem by using functions computed from all of the data except the case being classified. The method of leaving out one case at a time is called the jackknife and is one form of cross-validation.

For these data, the results are the same. If the percentage for correct classification is considerably lower in the Jackknifed panel than in the first matrix, you may have too many predictors in your model.
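A minimal sketch of the leave-one-out idea, with scikit-learn's linear discriminant analysis standing in for SYSTAT's computations (y holds integer group codes):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Refit the rule with each case left out, classify that case, and then
# cross-tabulate the predictions against the true groups.
def jackknife_predictions(X, y):
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                 # leave case i out
        model = LinearDiscriminantAnalysis().fit(X[keep], y[keep])
        preds[i] = model.predict(X[i:i + 1])[0]       # classify held-out case
    return preds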

Eigenvalues, Canonical Correlations, Cumulative Proportion of Total Dispersion, and Canonical Scores of Group Means

The first canonical variable is the linear combination of the variables that best discriminates among the groups, the second canonical variable is orthogonal to the first and is the next best combination of variables, and so on. For our data, the first eigenvalue (32.2) is very large relative to the second, indicating that the first canonical variable captures most of the difference among the groups—at the right of this panel, notice that it accounts for more than 99% of the total dispersion of the groups.

The Canonical correlation between the first canonical variable and a set of two dummy variables representing the groups is 0.985; the correlation between the second canonical variable and the dummy variables is 0.471. (The number of dummy variables is the number of groups minus 1.) Finally, the canonical variables are evaluated at the group means. That is, in the canonical variable plot, the centroid for the Setosa flowers is (7.608, 0.215), Versicolor is (–1.825, –0.728), and so on, where the first canonical variable is the x coordinate and the second, the y coordinate.


Canonical Scores Plot

The axes of this Quick Graph are the first two canonical variables, and the points are the canonical variable scores. The confidence ellipses are centered on the centroid of each group. The Setosa flowers are well differentiated from the others. There is some overlap between the other two groups. Look for outliers in these displays because they can affect your analysis.

Example 2 Discriminant Analysis Using Automatic Forward Stepping

Our problem for this example is to derive a rule for classifying countries as European, Islamic, or New World. We know that strong correlations exist among the candidate predictor variables, so we are curious about just which subset will be useful. Here are the candidate predictors:

Because the distributions of the economic variables are skewed with long right tails, we log transform GDP_CAP and take the square root of EDUC, HEALTH, and MIL.

Alternatively, you could also use shortcut notation to request the square root transformations:

URBAN      Percentage of the population living in cities
BIRTH_RT   Births per 1000 people in 1990
DEATH_RT   Deaths per 1000 people in 1990
B_TO_D     Ratio of births to deaths in 1990
BABYMORT   Infant deaths during the first year per 1000 live births
GDP_CAP    Gross domestic product per capita (in U.S. dollars)
LIFEEXPM   Years of life expectancy for males
LIFEEXPF   Years of life expectancy for females
EDUC       U.S. dollars spent per person on education in 1986
HEALTH     U.S. dollars spent per person on health in 1986
MIL        U.S. dollars spent per person on the military in 1986
LITERACY   Percentage of the population who can read

LET gdp_cap = L10(gdp_cap)
LET educ = SQR(educ)
LET health = SQR(health)
LET mil = SQR(mil)

LET (educ, health, mil) = SQR(@)


We use automatic forward stepping in an effort to identify the best subset of predictors. After stepping stops, you need to type STOP to ask SYSTAT to produce the summary table, classification matrices, and information about canonical variables. The input is:

Notice that the initial results appear after START / FORWARD is specified. STEP / AUTO and STOP are selected later, as indicated in the output that follows:

DISCRIM
USE ourworld
LET gdp_cap = L10(gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL group = urban birth_rt death_rt babymort,
              gdp_cap educ health mil b_to_d,
              lifeexpm lifeexpf literacy
PRINT / MEANS
START / FORWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
STOP

Group frequencies
-----------------
               Europe    Islamic   NewWorld
 Frequencies       19         15         21

Group means
-----------
 URBAN        68.7895    30.0667    56.3810
 BIRTH_RT     12.5789    42.7333    26.9524
 DEATH_RT     10.1053    13.4000     7.4762
 BABYMORT      7.8947   102.3333    42.8095
 GDP_CAP       4.0431     2.7640     3.2139
 EDUC         21.5275     6.4156     8.9619
 HEALTH       21.9537     3.1937     6.8898
 MIL          15.9751     7.5431     6.0903
 B_TO_D        1.2658     3.5472     3.9509
 LIFEEXPM     72.3684    54.4000    66.6190
 LIFEEXPF     79.5263    57.1333    71.5714
 LITERACY     97.5263    36.7333    79.9571

Variable   F-to-remove  Tolerance  |  Variable   F-to-enter  Tolerance
-------------------------------------+-------------------------------------
                                     |   6 URBAN        23.20   1.000000
                                     |   8 BIRTH_RT    103.50   1.000000
                                     |  10 DEATH_RT     14.41   1.000000
                                     |  12 BABYMORT     53.62   1.000000
                                     |  16 GDP_CAP      59.12   1.000000
                                     |  19 EDUC         27.12   1.000000
                                     |  21 HEALTH       49.62   1.000000
                                     |  23 MIL          19.30   1.000000
                                     |  34 B_TO_D       31.54   1.000000
                                     |  30 LIFEEXPM     37.08   1.000000
                                     |  31 LIFEEXPF     50.30   1.000000
                                     |  32 LITERACY     63.64   1.000000


Using commands, type STEP / AUTO.

**************** Step 1 -- Variable BIRTH_RT Entered **************** Between groups F-matrix -- df = 1 52---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 206.5877 0.0 NewWorld 55.8562 59.0625 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 8 BIRTH_RT 103.50 1.000000 | 6 URBAN 1.26 0.724555 | 10 DEATH_RT 19.41 0.686118 | 12 BABYMORT 2.13 0.443802 | 16 GDP_CAP 4.56 0.581395 | 19 EDUC 5.12 0.831381 | 21 HEALTH 9.52 0.868614 | 23 MIL 8.55 0.907501 | 34 B_TO_D 14.94 0.987994 | 30 LIFEEXPM 4.31 0.437850 | 31 LIFEEXPF 3.58 0.371618 | 32 LITERACY 10.32 0.324635

**************** Step 2 -- Variable DEATH_RT Entered **************** Between groups F-matrix -- df = 2 51---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 120.1297 0.0 NewWorld 59.7595 29.7661 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 8 BIRTH_RT 118.41 0.686118 | 6 URBAN 0.07 0.694384 10 DEATH_RT 19.41 0.686118 | 12 BABYMORT 1.83 0.279580 | 16 GDP_CAP 7.88 0.520784 | 19 EDUC 5.03 0.812622 | 21 HEALTH 6.47 0.864170 | 23 MIL 13.21 0.789555 | 34 B_TO_D 0.82 0.186108 | 30 LIFEEXPM 3.34 0.158185 | 31 LIFEEXPF 5.20 0.120507 | 32 LITERACY 2.22 0.265285 **************** Step 3 -- Variable MIL Entered **************** Between groups F-matrix -- df = 3 50---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 80.7600 0.0 NewWorld 55.6502 24.6740 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 8 BIRTH_RT 77.85 0.683054 | 6 URBAN 3.87 0.509585 10 DEATH_RT 25.39 0.596945 | 12 BABYMORT 1.02 0.258829 23 MIL 13.21 0.789555 | 16 GDP_CAP 0.67 0.304330 | 19 EDUC 0.01 0.534243 | 21 HEALTH 1.24 0.652294 | 34 B_TO_D 0.81 0.186064 | 30 LIFEEXPM 0.28 0.135010 | 31 LIFEEXPF 1.34 0.091911 | 32 LITERACY 3.51 0.252509


When using commands, type STOP.

Variable     F-to-enter   Number of
entered or   or           variables  Wilks'      Approx.
removed      F-to-remove  in model   lambda      F-value      df1  df2   p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
BIRTH_RT        103.495          1     0.2008    103.4953     2    52  0.00000
DEATH_RT         19.406          2     0.1140     50.0200     4   102  0.00000
MIL              13.212          3     0.0746     44.3576     6   100  0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
             Europe  Islamic  NewWorld  %correct
 Europe          19        0         0       100
 Islamic          0       13         2        87
 NewWorld         2        2        17        81
 Total           21       15        19        89

Jackknifed classification matrix
--------------------------------
             Europe  Islamic  NewWorld  %correct
 Europe          19        0         0       100
 Islamic          0       13         2        87
 NewWorld         2        3        16        76
 Total           21       16        18        87

 Eigen     Canonical     Cumulative proportion
 values    correlations  of total dispersion
 --------- ------------  ---------------------
   5.247      0.916            0.821
   1.146      0.731            1.000

Canonical scores of group means
-------------------------------
 Europe        -2.938     .409
 Islamic        2.481    1.243
 NewWorld        .886   -1.258

Canonical Scores Plot

[Canonical scores plot: FACTOR(1) by FACTOR(2), both axes running from -4 to 4, with confidence ellipses for the Europe, Islamic, and NewWorld groups]


From the panel of Group means, note that, on the average, the percentage of the population living in cities (URBAN) is 68.8% in Europe, 30.1% in Islamic nations, and 56.4% in the New World. The LITERACY rates for these same groups are 97.5%, 36.7%, and 80.0%, respectively.

After the group means, you will find the F-to-enter statistics for each variable not in the functions. When no variables are in the model, each F is the same as that for a one-way analysis of variance. Thus, group differences are the strongest for BIRTH_RT (F = 103.5) and weakest for DEATH_RT (F = 14.41). At later steps, each F corresponds to the F for a one-way analysis of covariance where the covariates are the variables already included.

At step 1, SYSTAT enters BIRTH_RT because its F-to-enter is largest in the last panel and now displays the same F in the F-to-remove panel. BIRTH_RT is correlated with several candidate variables, so notice how their F-to-enter values drop when BIRTH_RT enters (for example, for GDP_CAP, from 59.1 to 4.6). DEATH_RT now has the highest F-to-enter, so SYSTAT will enter it at step 2. From the between-groups F-matrix, note that when BIRTH_RT is used alone, Europe and Islamic countries are the groups that differ most (206.6), and Europe and the New World are the groups that differ least (55.9).

After DEATH_RT enters, the F-to-enter for MIL (money spent per person on the military) is largest, so SYSTAT enters it at step 3. The SYSTAT default limit for F-to-enter values is 4.0. No variable has an F-to-enter above the limit, so the stepping stops. Also, all F-to-remove values are greater than 3.9, so no variables are removed.

The summary table contains one row for each variable moved into the model. The F-to-enter (F-to-remove) is printed for each, along with Wilks’ lambda and its approximate F statistic, numerator and denominator degrees of freedom, and tail probability.

After the summary table, SYSTAT prints the classification matrices. From the biased estimate in the first matrix, our three-variable rule classifies 89% of the countries correctly. For the jackknifed results, this percentage drops to 87%. All of the European nations are classified correctly (100%), while almost one-fourth of the New World countries are misclassified (two as Europe and three as Islamic). These countries can be identified by using MAHAL—the posterior probability for each case belonging to each group is printed. You will find, for example, that Canada is misclassified as European and that Haiti and Bolivia are misclassified as Islamic.

If you focus on the canonical results, you notice that the first canonical variable accounts for 82.1% of the dispersion, and in the Canonical scores of group means panel, the groups are ordered from left to right: Europe, New World, and then Islamic. The second canonical variable contrasts Islamic versus New World (1.243 versus –1.258).


In the canonical variable plot, the European nations (on the left) are well separated from the other groups. The plus sign (+) next to the European confidence ellipse is Canada. If you are unsure about which ellipse corresponds to what group, look at the Canonical scores of group means.

Example 3 Discriminant Analysis Using Automatic Backward Stepping

It is possible that classification rules for other subsets of the variables perform better than that found using forward stepping—especially when there are correlations among the variables. We try backward stepping. The input is:

Notice that we request STEP after an initial report and PRINT and STOP later.

The output follows:

DISCRIM
USE ourworld
LET gdp_cap = L10(gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL group = urban birth_rt death_rt babymort,
              gdp_cap educ health mil b_to_d,
              lifeexpm lifeexpf literacy
PRINT SHORT / CFUNC
IDVAR = country$
START / BACKWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
PRINT / TRACES CDFUNC SCDFUNC
STOP

Between groups F-matrix -- df = 12 41
----------------------------------------------
             Europe    Islamic   NewWorld
 Europe         0.0
 Islamic    25.3059        0.0
 NewWorld   18.0596     7.3754        0.0

Classification functions
----------------------
                Europe      Islamic     NewWorld
 Constant   -4408.4004   -4396.8904   -4408.5297
 URBAN         -2.4175      -2.3572      -2.2871
 BIRTH_RT      41.9790      43.1675      43.1322
 DEATH_RT      50.0202      48.1539      48.1950
 BABYMORT       9.3190       9.3806       9.3461
 GDP_CAP      243.6686     234.5165     237.0805
 EDUC           2.0078       4.0450       3.4276
 HEALTH       -17.9706     -19.8527     -19.3068
 MIL           -9.8420     -10.1746     -10.6076
 B_TO_D       -59.6547     -62.2446     -61.8195
 LIFEEXPM      -9.8216      -9.1537      -9.4952
 LIFEEXPF      93.5933      93.0934      93.4108
 LITERACY       7.5909       7.5834       7.7178


Using commands, type STEP / AUTO.

Variable   F-to-remove  Tolerance  |  Variable   F-to-enter  Tolerance
-------------------------------------+-------------------------------------
  6 URBAN         2.17   0.436470  |
  8 BIRTH_RT      2.01   0.059623  |
 10 DEATH_RT      2.26   0.091463  |
 12 BABYMORT      0.10   0.083993  |
 16 GDP_CAP       0.62   0.143526  |
 19 EDUC          6.12   0.065095  |
 21 HEALTH        5.36   0.083198  |
 23 MIL           7.11   0.323519  |
 34 B_TO_D        0.55   0.136148  |
 30 LIFEEXPM      0.26   0.036088  |
 31 LIFEEXPF      0.07   0.012280  |
 32 LITERACY      1.45   0.177756  |

**************** Step 1 -- Variable LIFEEXPF Removed **************** Between groups F-matrix -- df = 11 42---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 28.2000 0.0 NewWorld 20.1693 8.2086 0.0 Classification functions---------------------- Europe Islamic NewWorld Constant -2135.2865 -2147.9924 -2144.2709 URBAN -0.8690 -0.8170 -0.7416 BIRTH_RT 20.1471 21.4523 21.3429 DEATH_RT 29.3876 27.6314 27.6026 BABYMORT 3.7505 3.8419 3.7885 GDP_CAP 292.1240 282.7130 285.4413 EDUC -3.8832 -1.8145 -2.4518 HEALTH -5.8347 -7.7816 -7.1945 MIL -6.9769 -7.3247 -7.7480 B_TO_D -13.7461 -16.5811 -16.0004 LIFEEXPM 32.7200 33.1607 32.9634 LITERACY 5.5340 5.5374 5.6648 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 6 URBAN 2.45 0.466202 | 31 LIFEEXPF 0.07 0.012280 8 BIRTH_RT 3.04 0.077495 | 10 DEATH_RT 2.45 0.100658 | 12 BABYMORT 0.41 0.140589 | 16 GDP_CAP 0.68 0.144854 | 19 EDUC 6.71 0.066537 | 21 HEALTH 6.78 0.092071 | 23 MIL 7.39 0.328943 | 34 B_TO_D 0.70 0.148030 | 30 LIFEEXPM 0.24 0.077817 | 32 LITERACY 1.48 0.185492 |


(We omit the output for steps 2 through 6.)

**************** Step 7 -- Variable URBAN Removed **************** Between groups F-matrix -- df = 5 48---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 61.5899 0.0 NewWorld 40.9350 15.6004 0.0

Classification functions---------------------- Europe Islamic NewWorld Constant -22.4825 -38.4306 -17.6982 BIRTH_RT 0.3003 1.3372 0.9382 DEATH_RT 1.4220 0.6592 0.2591 EDUC -0.1787 1.3011 0.8506 HEALTH 0.7483 -0.8816 -0.3976 MIL 0.7537 0.4181 0.1794 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 8 BIRTH_RT 27.89 0.622699 | 6 URBAN 3.65 0.504724 10 DEATH_RT 15.51 0.583392 | 12 BABYMORT 1.12 0.243722 19 EDUC 5.20 0.083925 | 16 GDP_CAP 1.20 0.171233 21 HEALTH 6.67 0.102470 | 34 B_TO_D 1.24 0.180347 23 MIL 7.42 0.501019 | 30 LIFEEXPM 0.02 0.123573 | 31 LIFEEXPF 0.49 0.076049 | 32 LITERACY 3.42 0.250341

Variable     F-to-enter   Number of
entered or   or           variables  Wilks'      Approx.
removed      F-to-remove  in model   lambda      F-value      df1  df2   p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
LIFEEXPF          0.068         11     0.0405     15.1458    22    84  0.00000
LIFEEXPM          0.237         10     0.0410     16.9374    20    86  0.00000
BABYMORT          0.219          9     0.0414     19.1350    18    88  0.00000
B_TO_D            0.849          8     0.0430     21.4980    16    90  0.00000
GDP_CAP           1.429          7     0.0457     24.1542    14    92  0.00000
LITERACY          2.388          6     0.0505     27.0277    12    94  0.00000
URBAN             3.655          5     0.0583     30.1443    10    96  0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
             Europe  Islamic  NewWorld  %correct
 Europe          19        0         0       100
 Islamic          0       13         2        87
 NewWorld         1        2        18        86
 Total           20       15        20        91

Jackknifed classification matrix
--------------------------------
             Europe  Islamic  NewWorld  %correct
 Europe          19        0         0       100
 Islamic          0       13         2        87
 NewWorld         1        2        18        86
 Total           20       15        20        91

 Eigen     Canonical     Cumulative proportion
 values    correlations  of total dispersion
 --------- ------------  ---------------------
   6.984      0.935            0.859
   1.147      0.731            1.000


Using commands, type PRINT / TRACES CDFUNC SCDFUNC, then STOP.

Wilks' lambda=           0.058   Approx.F= 30.144   df= 10, 96   p-tail= 0.0000
Pillai's trace=          1.409   Approx.F= 23.360   df= 10, 98   p-tail= 0.0000
Lawley-Hotelling trace=  8.131   Approx.F= 38.215   df= 10, 94   p-tail= 0.0000

Canonical discriminant functions
--------------------------------
                    1          2
 Constant      -1.9836    -5.4022
 URBAN          .          .
 BIRTH_RT       0.1603     0.0414
 DEATH_RT      -0.1588     0.2771
 BABYMORT       .          .
 GDP_CAP        .          .
 EDUC           0.2358     0.0063
 HEALTH        -0.2604    -0.0015
 MIL           -0.0736     0.1497
 B_TO_D         .          .
 LIFEEXPM       .          .
 LIFEEXPF       .          .
 LITERACY       .          .

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
                    1          2
 URBAN          .          .
 BIRTH_RT       0.9737     0.2512
 DEATH_RT      -0.5188     0.9050
 BABYMORT       .          .
 GDP_CAP        .          .
 EDUC           1.5574     0.0413
 HEALTH        -1.5572    -0.0091
 MIL           -0.3910     0.7952
 B_TO_D         .          .
 LIFEEXPM       .          .
 LIFEEXPF       .          .
 LITERACY       .          .

Canonical scores of group means
-------------------------------
 Europe        -3.389     .410
 Islamic        2.864    1.243
 NewWorld       1.020   -1.259


Before stepping starts, SYSTAT uses all candidate variables to compute classification functions. The output includes the coefficients for these functions used to classify cases into groups. A variable is omitted only if it fails the Tolerance limit. For each case, SYSTAT computes three functions. The first is:

–4408.4 – 2.417*urban + 41.979*birth_rt + ... + 7.591*literacy

Each case is assigned to the group with the largest value.

Tolerance measures the correlation of a candidate variable with the variables included in the model, and its values range from 0 to 1.0. If a variable is highly correlated with one or more of the others, the value of Tolerance is very small and the resulting estimates of the discriminant function coefficients may be very unstable. To avoid a loss of accuracy in the matrix inversion computations, rarely should you set the value of this limit to a lower value (the default is 0.001). LIFEEXPF, female life expectancy, has a very low Tolerance value, so it may be redundant or highly correlated with another variable or a linear combination of other variables. The Tolerance value of LIFEEXPM, male life expectancy, is also low; these two measures of life expectancy may be highly correlated with one another. Notice also that the value for BIRTH_RT is very low (0.059623) and its F-to-remove value is 2.01; its F-to-enter at step 0 in the forward stepping example was 103.5.

At step 7, no variable has an F-to-remove value less than 3.9, so the stepping stops. The final model found by backward stepping includes five variables: BIRTH_RT, DEATH_RT, EDUC, HEALTH, and MIL. We are not happy, however, with the low Tolerance values for two of these variables. The model found via automatic forward stepping did not include EDUC or HEALTH (their F-to-enter statistics at step 3 are 0.01 and 1.24, respectively). URBAN and LITERACY appear more likely candidates, but their F’s are still less than 4.0.

In both classification matrices, 91% of the countries are classified correctly using the five-variable discriminant functions. This is a slight improvement over the three-variable model from the forward stepping example, where the percentages were 89% for the first matrix and 87% for the jackknifed results. The improvement from 87% to 91% is because two New World countries are now classified correctly. We add two variables and gain two correct classifications.

Wilks’ lambda (or U statistic), a multivariate analysis of variance statistic that varies between 0 and 1, tests the equality of group means for the variables in the discriminant functions. Wilks’ lambda is transformed to an approximate F statistic for comparison with the F distribution. Here, the associated probability is less than 0.00005, indicating a highly significant difference among the groups. The Lawley-Hotelling trace and its F approximation are documented in Morrison (1976). When there are only two groups, it and Wilks’ lambda are equivalent. Pillai’s trace and its F approximation are taken from Pillai (1960).

The canonical discriminant functions list the coefficients of the canonical variables computed first for the data as input and then for the standardized values. For the unstandardized data, the first canonical variable is:

–1.984 + 0.160*birth_rt – 0.159*death_rt + 0.236*educ – 0.260*health – 0.074*mil

The coefficients are adjusted so that the overall mean of the corresponding scores is 0 and the pooled within-group variances are 1. After standardizing, the first canonical variable is:

0.974*birth_rt – 0.519*death_rt + 1.557*educ – 1.557*health – 0.391*mil

Usually, one uses the latter set of coefficients to interpret what variables “drive” each canonical variable. Here, EDUC and HEALTH, the variables with low tolerance values, have the largest coefficients, and they appear to cancel one another. Also, in the final model, the size of their F-to-remove values indicates they are the least useful variables in the model. This indicates that we do not have an optimum set of variables. These two variables contribute little alone, while together they enhance the separation of the groups. This suggests that the difference between EDUC and HEALTH could be a useful variable (for example, LET diff = educ – health). We did this, and the following is the first canonical variable for standardized values (we omit the constant):

1.024*birth_rt – 0.539*death_rt – 0.480*mil + 0.553*diff
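A minimal sketch of the standardization: multiplying each raw coefficient by the pooled within-groups standard deviation of its variable approximately reproduces the standardized panel above. The variances are the diagonal of the pooled within covariance matrix printed in the quadratic example later in this chapter; the small discrepancies come from the degrees-of-freedom convention used in the divisor.

import numpy as np

# Order: birth_rt, death_rt, educ, health, mil
raw = np.array([0.1603, -0.1588, 0.2358, -0.2604, -0.0736])
within_var = np.array([36.2044, 10.4790, 42.8231, 35.0939, 27.7095])
print(raw * np.sqrt(within_var))   # approx. 0.96 -0.51 1.54 -1.54 -0.39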


From the Canonical scores of group means for the first canonical variable, the groups line up with Europe first, then New World in the middle, and Islamic on the right. In the second dimension, DEATH_RT and MIL (military expenditures) appear to separate Islamic and New World countries.

Mahalanobis Distances and Posterior Probabilities

Even if you have already specified PRINT=LONG, you must type PRINT / MAHAL to obtain Mahalanobis distances and posterior probabilities. The output is:

Mahalanobis distance-square from group means and Posterior probabilities for group membership Priors = .333 .333 .333 Europe Islamic NewWorld Europe------------Ireland 3.0 1.00 33.7 .00 13.6 .00Austria 4.0 1.00 37.7 .00 19.8 .00Belgium * .3 1.00 42.7 .00 26.0 .00Denmark 9.1 1.00 37.6 .00 24.9 .00Finland 2.1 1.00 40.5 .00 22.3 .00France 2.3 1.00 45.5 .00 29.1 .00Greece 5.7 1.00 48.6 .00 28.3 .00Switzerland 11.9 1.00 71.7 .00 48.3 .00Spain 3.6 1.00 42.8 .00 18.9 .00UK 2.1 1.00 42.8 .00 29.9 .00Italy .6 1.00 44.7 .00 23.0 .00Sweden 4.3 1.00 51.7 .00 35.9 .00Portugal 3.6 1.00 40.4 .00 18.8 .00Netherlands 2.1 1.00 43.9 .00 24.2 .00WGermany 6.0 1.00 65.8 .00 45.5 .00Norway 5.3 1.00 38.5 .00 28.4 .00Poland 2.7 .99 29.5 .00 12.5 .01Hungary 4.4 1.00 39.8 .00 24.3 .00EGermany 8.0 1.00 42.4 .00 31.9 .00Czechoslov 1.8 1.00 40.9 .00 25.1 .00 Islamic------------Gambia 43.2 .00 2.9 1.00 15.3 .00Iraq 71.3 .00 23.5 1.00 41.7 .00Pakistan 38.7 .00 .5 .98 8.6 .02Bangladesh 37.2 .00 2.0 .91 6.8 .09Ethiopia 40.5 .00 1.1 .99 10.0 .01Guinea 41.2 .00 8.0 1.00 24.1 .00Malaysia --> 36.6 .00 7.7 .17 4.5 .83Senegal 42.8 .00 .9 .98 9.1 .02Mali 49.3 .00 5.5 1.00 23.5 .00Libya 60.3 .00 15.6 1.00 30.1 .00Somalia 50.0 .00 1.1 1.00 13.1 .00Afghanistan * . . . . . .Sudan 43.8 .00 .3 .99 10.1 .01Turkey --> 25.0 .00 7.2 .05 1.5 .95Algeria 43.1 .00 4.1 .79 6.7 .21Yemen 57.4 .00 3.1 1.00 23.2 .00


For each case (up to 250 cases), the Mahalanobis distance squared (D²) is computed to each group mean. The closer a case is to a particular mean, the more likely it belongs to that group. The posterior probability for a case's membership in a group is the ratio of EXP(–0.5 * D²) for that group divided by the sum of EXP(–0.5 * D²) over all groups (prior probabilities, if specified, affect these computations); a small sketch of this computation follows the output below.

An arrow (-->) marks incorrectly classified cases, and an asterisk (*) flags cases with missing values. New World countries Bolivia and Haiti are classified as Islamic, and Canada is classified as Europe. Note that even though an asterisk marks Belgium, results are printed—the value of the unused candidate variable URBAN is missing. No results are printed for Afghanistan because MIL, a variable in the final model, is missing.

You can identify cases with all large distances as outliers. A case can have a 1.0 probability of belonging to a particular group but still have a large distance. Look at Iraq. It is correctly classified as Islamic, but its distance is 23.5. The distances in this panel are distributed approximately as a chi-square with degrees of freedom equal to the number of variables in the function.

NewWorld------------Argentina 11.5 .03 19.8 .00 4.4 .97Barbados 16.4 .00 20.9 .00 4.7 1.00Bolivia --> 27.7 .00 3.4 .56 3.8 .44Brazil 27.4 .00 11.5 .00 .6 1.00Canada --> 6.7 1.00 35.9 .00 19.3 .00Chile 21.1 .00 15.7 .00 1.5 1.00Colombia 35.2 .00 13.9 .00 1.9 1.00CostaRica 34.8 .00 21.1 .00 5.5 1.00Venezuela 41.2 .00 13.4 .01 4.6 .99DominicanR. 26.0 .00 13.2 .00 1.3 1.00Uruguay 13.6 .07 22.9 .00 8.6 .93Ecuador 32.8 .00 8.6 .02 1.0 .98ElSalvador 35.3 .00 7.5 .07 2.5 .93Jamaica 25.6 .00 19.1 .00 1.9 1.00Guatemala 37.6 .00 4.5 .33 3.1 .67Haiti --> 37.9 .00 2.0 .99 10.6 .01Honduras 39.8 .00 6.4 .27 4.5 .73Trinidad 34.1 .00 11.4 .03 4.1 .97Peru 20.2 .00 10.5 .02 2.4 .98Panama 23.8 .00 16.5 .00 2.4 1.00Cuba 12.0 .03 18.5 .00 5.1 .97 --> case misclassified * case not used in computation

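A minimal sketch of these computations restricted to two variables, using the BIRTH_RT and DEATH_RT group means and pooled within covariances printed in this chapter; with only two variables, the distances will not reproduce the five-variable panel above:

import numpy as np

means = {"Europe":   np.array([12.5789, 10.1053]),
         "Islamic":  np.array([42.7333, 13.4000]),
         "NewWorld": np.array([26.9524,  7.4762])}
pooled = np.array([[36.2044, 10.8948],
                   [10.8948, 10.4790]])
inv = np.linalg.inv(pooled)

def classify(x, priors=np.ones(3) / 3):
    d2 = np.array([(x - m) @ inv @ (x - m) for m in means.values()])
    w = priors * np.exp(-0.5 * d2)            # prior * EXP(-0.5 * D2)
    return dict(zip(means, d2.round(1))), dict(zip(means, (w / w.sum()).round(2)))

print(classify(np.array([30.0, 8.0])))        # a New World-like profile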


Example 4 Discriminant Analysis Using Interactive Stepping

Automatic forward and backward stepping can produce different sets of predictor variables, and still other subsets of the variables may perform equally well or possibly better. Here we use interactive stepping to explore alternative sets of variables.

Using the OURWORLD data, let’s say you decide not to include birth and death rates in the model because the rates are changing rapidly for several nations (that is, we omit these variables from the model). We also add the difference between EDUC and HEALTH as a candidate variable.

SYSTAT provides several ways to specify which variables to move into (or out of) the model. The input is:

After interpreting these commands and printing the output below, SYSTAT waits for us to enter STEP instructions.

DISCRIM
USE ourworld
LET gdp_cap = L10(gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LABEL group / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL group = urban birth_rt death_rt babymort,
              gdp_cap educ health mil b_to_d,
              lifeexpm lifeexpf literacy diffrnce
PRINT SHORT / SCDFUNC GRAPH=NONE
START / BACK

Between groups F-matrix -- df = 12 41---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 25.3059 0.0 NewWorld 18.0596 7.3754 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 6 URBAN 2.17 0.436470 | 40 DIFFRNCE 0000000.00 0.000000 8 BIRTH_RT 2.01 0.059623 | 10 DEATH_RT 2.26 0.091463 | 12 BABYMORT 0.10 0.083993 | 16 GDP_CAP 0.62 0.143526 | 19 EDUC 6.12 0.065095 | 21 HEALTH 5.36 0.083198 | 23 MIL 7.11 0.323519 | 34 B_TO_D 0.55 0.136148 | 30 LIFEEXPM 0.26 0.036088 | 31 LIFEEXPF 0.07 0.012280 | 32 LITERACY 1.45 0.177756 |


A summary of the STEP arguments (variable numbers are visible in the output) follows:

Notice that the seventh STEP specification (g) removes EDUC and HEALTH and enters DIFFRNCE. Remember, after the last step, type STOP for the canonical variable results and other summaries.

Steps 1 and 2

Input:

Output:

a. STEP birth_rt death_rt      Remove two variables
b. STEP lifeexpf               Remove one variable
c. STEP –                      Remove lifeexpm
d. STEP –                      Remove babymort
e. STEP –                      Remove urban
f. STEP –                      Remove gdp_cap
g. STEP educ health diffrnce   Remove educ and health; add diffrnce
h. STEP +                      Enter gdp_cap

STOP

STEP birth_rt death_rt

**************** Step 1 -- Variable BIRTH_RT Removed **************** Between groups F-matrix -- df = 11 42---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 26.3672 0.0 NewWorld 18.0391 8.2404 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 6 URBAN 2.64 0.437926 | 8 BIRTH_RT 2.01 0.059623 10 DEATH_RT 2.00 0.092765 | 40 DIFFRNCE 0000.00 0.000000 12 BABYMORT 0.14 0.091364 | 16 GDP_CAP 1.40 0.150944 | 19 EDUC 5.99 0.065824 | 21 HEALTH 4.24 0.090886 | 23 MIL 5.92 0.384992 | 34 B_TO_D 0.35 0.329976 | 30 LIFEEXPM 0.42 0.036548 | 31 LIFEEXPF 0.96 0.015962 | 32 LITERACY 1.79 0.292005 |


Step 3

Input:

Output:

**************** Step 2 -- Variable DEATH_RT Removed **************** Between groups F-matrix -- df = 10 43---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 27.8162 0.0 NewWorld 18.1733 9.2794 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 6 URBAN 2.20 0.452548 | 8 BIRTH_RT 1.75 0.060472 12 BABYMORT 0.23 0.108992 | 10 DEATH_RT 2.00 0.092765 16 GDP_CAP 1.14 0.153540 | 40 DIFFRNCE 0.00 0.000000 19 EDUC 6.52 0.065850 | 21 HEALTH 6.28 0.093470 | 23 MIL 6.69 0.385443 | 34 B_TO_D 6.48 0.651944 | 30 LIFEEXPM 0.51 0.036592 | 31 LIFEEXPF 0.28 0.019231 | 32 LITERACY 1.89 0.312350 |

STEP lifeexpf

**************** Step 3 -- Variable LIFEEXPF Removed **************** Between groups F-matrix -- df = 9 44---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 31.1645 0.0 NewWorld 20.4611 10.4752 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 6 URBAN 2.27 0.472161 | 8 BIRTH_RT 1.88 0.086049 12 BABYMORT 0.79 0.147553 | 10 DEATH_RT 1.31 0.111768 16 GDP_CAP 1.80 0.171189 | 31 LIFEEXPF 0.28 0.019231 19 EDUC 7.51 0.066995 | 40 DIFFRNCE 00000.00 0.000000 21 HEALTH 7.37 0.095626 | 23 MIL 6.88 0.389511 | 34 B_TO_D 6.49 0.683545 | 30 LIFEEXPM 0.28 0.151179 | 32 LITERACY 2.44 0.338715 |


Steps 4 through 7

Input:

Output:

(We omit steps 5, 6, and 7. Each step corresponds to a STEP –.)

Steps 8, 9, and 10

Input:

Output:

STEP -

**************** Step 4 -- Variable LIFEEXPM Removed **************** Between groups F-matrix -- df = 8 45---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 35.3422 0.0 NewWorld 23.3116 11.9720 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 6 URBAN 2.48 0.486188 | 8 BIRTH_RT 0.68 0.138508 12 BABYMORT 0.52 0.249802 | 10 DEATH_RT 1.38 0.182210 16 GDP_CAP 1.71 0.173599 | 30 LIFEEXPM 0.28 0.151179 19 EDUC 7.32 0.069441 | 31 LIFEEXPF 0.04 0.079455 21 HEALTH 7.18 0.099905 | 40 DIFFRNCE 000.00 0.000000 23 MIL 7.05 0.391379 | 34 B_TO_D 9.06 0.769167 | 32 LITERACY 2.40 0.346292 |

STEP educ health diffrnce

**************** Step 8 -- Variable EDUC Removed **************** Between groups F-matrix -- df = 4 49---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 49.9302 0.0 NewWorld 34.1490 20.8722 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 21 HEALTH 2.44 0.652730 | 6 URBAN 2.32 0.520120 23 MIL 6.67 0.601236 | 8 BIRTH_RT 3.24 0.248104 34 B_TO_D 16.14 0.887452 | 10 DEATH_RT 0.40 0.241846 32 LITERACY 33.24 0.761872 | 12 BABYMORT 2.09 0.326834 | 16 GDP_CAP 1.12 0.277122 | 19 EDUC 5.14 0.083616 | 30 LIFEEXPM 0.88 0.313546 | 31 LIFEEXPF 2.03 0.250043 | 40 DIFFRNCE 5.14 0.743192


**************** Step 9 -- Variable HEALTH Removed **************** Between groups F-matrix -- df = 3 50---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 61.6708 0.0 NewWorld 41.4085 28.1939 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 23 MIL 14.70 0.771975 | 6 URBAN 2.55 0.523182 34 B_TO_D 27.09 0.914822 | 8 BIRTH_RT 3.91 0.248706 32 LITERACY 52.35 0.805675 | 10 DEATH_RT 0.42 0.241913 | 12 BABYMORT 3.11 0.337422 | 16 GDP_CAP 3.02 0.391015 | 19 EDUC 0.33 0.538428 | 21 HEALTH 2.44 0.652730 | 30 LIFEEXPM 1.58 0.327654 | 31 LIFEEXPF 3.33 0.269779 | 40 DIFFRNCE 6.98 0.772114 **************** Step 10 -- Variable DIFFRNCE Entered **************** Between groups F-matrix -- df = 4 49---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 60.8974 0.0 NewWorld 38.7925 22.4751 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 23 MIL 16.65 0.683968 | 6 URBAN 2.50 0.522963 34 B_TO_D 13.97 0.900149 | 8 BIRTH_RT 3.89 0.246110 32 LITERACY 47.38 0.792219 | 10 DEATH_RT 0.41 0.241913 40 DIFFRNCE 6.98 0.772114 | 12 BABYMORT 3.26 0.333341 | 16 GDP_CAP 4.30 0.372308 | 19 EDUC 0.94 0.514966 | 21 HEALTH 0.94 0.628279 | 30 LIFEEXPM 0.98 0.326826 | 31 LIFEEXPF 2.40 0.269658


Step 11

Input:

Output:

Final Model

Input:

Output:

STEP +

**************** Step 11 -- Variable GDP_CAP Entered **************** Between groups F-matrix -- df = 5 48---------------------------------------------- Europe Islamic NewWorld Europe 0.0 Islamic 57.5419 0.0 NewWorld 35.7426 18.6879 0.0 Variable F-to-remove Tolerance | Variable F-to-enter Tolerance-------------------------------------+------------------------------------- 16 GDP_CAP 4.30 0.372308 | 6 URBAN 2.72 0.513543 23 MIL 5.88 0.478530 | 8 BIRTH_RT 1.04 0.189556 34 B_TO_D 9.46 0.887953 | 10 DEATH_RT 1.00 0.215879 32 LITERACY 12.31 0.609614 | 12 BABYMORT 0.71 0.256567 40 DIFFRNCE 8.37 0.735173 | 19 EDUC 0.36 0.324618 | 21 HEALTH 0.36 0.396047 | 30 LIFEEXPM 0.04 0.259888 | 31 LIFEEXPF 0.24 0.180725

STOP

Variable     F-to-enter   Number of
entered or   or           variables  Wilks'      Approx.
removed      F-to-remove  in model   lambda      F-value      df1  df2   p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
BIRTH_RT          2.011         11     0.0444     14.3085    22    84  0.00000
DEATH_RT          2.002         10     0.0486     15.2053    20    86  0.00000
LIFEEXPF          0.275          9     0.0492     17.1471    18    88  0.00000
LIFEEXPM          0.277          8     0.0498     19.5708    16    90  0.00000
BABYMORT          0.524          7     0.0510     22.5267    14    92  0.00000
URBAN             2.615          6     0.0568     25.0342    12    94  0.00000
GDP_CAP           3.583          5     0.0655     27.9210    10    96  0.00000
EDUC              5.143          4     0.0795     31.1990     8    98  0.00000
HEALTH            2.438          3     0.0874     39.7089     6   100  0.00000
DIFFRNCE          6.983          4     0.0680     34.7213     8    98  0.00000
GDP_CAP           4.299          5     0.0577     30.3710    10    96  0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
             Europe  Islamic  NewWorld  %correct
 Europe          19        0         0       100
 Islamic          0       14         1        93
 NewWorld         1        1        19        90
 Total           20       15        20        95


A summary of results for the models estimated by forward, backward, and interactive stepping follows:

Notice that the largest difference between the two classification methods (95% versus 87%) occurs for the last model, which includes the most variables.

Jackknifed classification matrix
--------------------------------
             Europe  Islamic  NewWorld  %correct
 Europe          19        0         0       100
 Islamic          0       12         3        80
 NewWorld         1        3        17        81
 Total           20       15        20        87

 Eigen     Canonical     Cumulative proportion
 values    correlations  of total dispersion
 --------- ------------  ---------------------
   6.319      0.929            0.822
   1.369      0.760            1.000

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
                    1          2
 URBAN          .          .
 BIRTH_RT       .          .
 DEATH_RT       .          .
 BABYMORT       .          .
 GDP_CAP        0.6868     0.0377
 EDUC           .          .
 HEALTH         .          .
 MIL            0.0676     0.8395
 B_TO_D        -0.4461    -0.5037
 LIFEEXPM       .          .
 LIFEEXPF       .          .
 LITERACY       0.3903    -0.8573
 DIFFRNCE      -0.6378    -0.0291

Canonical scores of group means
-------------------------------
 Europe         3.162     .535
 Islamic       -2.890    1.281
 NewWorld       -.796   -1.399

Model                                         % Correct  % Correct
                                              (Class)    (Jackknife)
Forward (automatic)
 1. BIRTH_RT DEATH_RT MIL                        89          87
Backward (automatic)
 2. BIRTH_RT DEATH_RT MIL EDUC HEALTH            91          91
Interactive (ignoring BIRTH_RT and DEATH_RT)
 3. MIL B_TO_D LITERACY                          84          84
 4. MIL B_TO_D LITERACY EDUC HEALTH              91          89
 5. MIL B_TO_D LITERACY DIFFRNCE                 91          89
 6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP         95          87


A difference like this one (8%) can indicate overfitting of correlated candidate variables. Since the jackknifed results can still be overly optimistic, cross-validation should be considered.

Example 5 Contrasts

Contrasts are available with commands only. When you have specific hypotheses about differences among particular groups, you can specify one or more contrasts to direct the entry (or removal) of variables in the model.

According to the jackknifed classification results in the stepwise examples, the European countries are always classified correctly (100% correct). All of the misclassifications are New World countries classified as Islamic or vice versa. In order to maximize the difference between the second (Islamic) and third groups (New World), we specify contrast coefficients with commands:

If we want to specify linear and quadratic contrasts across four groups, we could specify:

or

Here, we use the first contrast and request interactive forward stepping. The input is:

CONTRAST [0 -1 1]

CONTRAST [-3 -1 1 3; -1 1 1 -1]

CONTRAST [-3 -1 1 3 -1 1 1 -1]

DISCRIM
USE ourworld
LET gdp_cap = L10(gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL group = urban birth_rt death_rt babymort,
              gdp_cap educ health mil b_to_d,
              lifeexpm lifeexpf literacy
CONTRAST [0 -1 1]
PRINT / SHORT
START / FORWARD
STEP literacy
STEP mil
STEP urban
STOP


After viewing the results, remember to cancel the contrast if you plan to do other discriminant analyses:

The output follows:

(We omit results for steps 1, 2, and 3.)

CONTRAST / CLEAR

Variable   F-to-remove  Tolerance  |  Variable   F-to-enter  Tolerance
-------------------------------------+-------------------------------------
                                     |   6 URBAN        21.87   1.000000
                                     |   8 BIRTH_RT     59.06   1.000000
                                     |  10 DEATH_RT     28.79   1.000000
                                     |  12 BABYMORT     44.12   1.000000
                                     |  16 GDP_CAP      14.32   1.000000
                                     |  19 EDUC          1.30   1.000000
                                     |  21 HEALTH        3.34   1.000000
                                     |  23 MIL           0.65   1.000000
                                     |  34 B_TO_D        1.12   1.000000
                                     |  30 LIFEEXPM     35.00   1.000000
                                     |  31 LIFEEXPF     43.16   1.000000
                                     |  32 LITERACY     64.84   1.000000

**************** Step 1 -- Variable LITERACY Entered ****************

Variable     F-to-enter   Number of
entered or   or           variables  Wilks'      Approx.
removed      F-to-remove  in model   lambda      F-value      df1  df2   p-tail
------------ ----------- --------- ----------- ----------- ---- ----- ---------
LITERACY         64.844          1     0.4450     64.8444     1    52  0.00000
MIL               9.963          2     0.3723     42.9917     2    51  0.00000
URBAN             2.953          3     0.3515     30.7433     3    50  0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
             Europe  Islamic  NewWorld  %correct
 Europe          18        0         1        95
 Islamic          0       14         1        93
 NewWorld         2        3        16        76
 Total           20       17        18        87

Jackknifed classification matrix
--------------------------------
             Europe  Islamic  NewWorld  %correct
 Europe          18        0         1        95
 Islamic          0       14         1        93
 NewWorld         2        3        16        76
 Total           20       17        18        87

 Eigen     Canonical     Cumulative proportion
 values    correlations  of total dispersion
 --------- ------------  ---------------------
   1.845      0.805            1.000

Canonical scores of group means
-------------------------------
 Europe         .882
 Islamic      -2.397
 NewWorld       .914


Compare the F-to-enter values with those in the forward stepping example. The statistics here indicate that for the economic variables (GDP_CAP, EDUC, HEALTH, and MIL), differences between the second and third groups are much smaller than those when European countries are included.

The Jackknifed classification matrix indicates that when LITERACY, MIL, and URBAN are used, 87% of the countries are classified correctly. This is the same percentage correct as in the forward stepping example for the model with BIRTH_RT, DEATH_RT, and MIL. Here, however, one fewer Islamic country is misclassified, and one European country is now classified incorrectly.

When you look at the canonical results, you see that because a single contrast has one degree of freedom, only one dimension is defined—that is, there is only one eigenvalue and one canonical variable.

Example 6 Quadratic Model

One of the assumptions necessary for linear discriminant analysis is equality of covariance matrices. Within-group scatterplot matrices (SPLOM’s) provide a picture of how measures co-vary. Here we add 85% ellipses of concentration to enhance our view of the bivariate relations. Since our sample sizes do not differ markedly (15 to 21 countries per group), the ellipses for each pair of variables should have approximately the same shape and tilt across groups if the equality of covariance assumption holds. The input is:

DISCRIM
USE ourworld
LET (educ, health, mil) = SQR(@)
STAND
SPLOM birth_rt death_rt educ health mil / HALF ROW=1,
      GROUP=group$ ELL=.85 DENSITY=NORMAL

[Within-group SPLOMs of BIRTH_RT, DEATH_RT, EDUC, HEALTH, and MIL with 85% ellipses and normal densities; one panel each for Europe, Islamic, and NewWorld]

Because the length, width, and tilt of the ellipses for most pairs of variables vary markedly across groups, the assumption of equal covariance matrices has not been met.

Fortunately, the quadratic model does not require equality of covariances. However, it has a different problem: it requires a larger minimum sample size than that needed for the linear model. For five variables, for example, the linear and quadratic models, respectively, for each group are:

f = a + b*x1 + c*x2 + d*x3 + e*x4 + f*x5

f = a + b*x1 + c*x2 + d*x3 + e*x4 + f*x5 + g*x1*x2 + ... + p*x4*x5 + q*x1² + ... + u*x5²

So the linear model has six parameters to estimate for each group, and the quadratic has 21. These parameters aren’t all independent, so we don’t require as many as 3 × 21 cases for a quadratic fit.
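A minimal sketch of the counting for p variables:

# Linear: constant plus one slope per variable. Quadratic adds every
# squared term and every cross product.
def n_params(p, quadratic=False):
    n = 1 + p
    if quadratic:
        n += p + p * (p - 1) // 2
    return n

print(n_params(5), n_params(5, quadratic=True))    # 6 21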

In this example, we fit a quadratic model using the subset of variables identified in the backward stepping example. Following this, we examine results for the subset identified in the interactive stepping example before EDUC and HEALTH are removed. The input is:

DISCRIM
USE ourworld
LET (educ, health, mil) = SQR(@)
LABEL group / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL group = birth_rt death_rt educ health mil / QUAD
PRINT SHORT / GCOV WCOV GCOR CFUNC MAHAL
IDVAR = country$
ESTIMATE

MODEL group = educ health mil b_to_d literacy / QUAD
ESTIMATE



Output for the first model follows:

Pooled within covariance matrix -- df= 53
------------------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT     36.2044
DEATH_RT     10.8948   10.4790
EDUC        -16.1749   -7.2497   42.8231
HEALTH      -12.9261   -4.9333   36.5504   35.0939
MIL          -9.6390   -7.7297   22.0789   16.9130   27.7095

Group Europe covariance matrix
------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT      1.7342
DEATH_RT      0.0184    1.8184
EDUC          2.0051    1.3359   47.1696
HEALTH        1.3943   -0.3625   44.2594   47.3538
MIL           0.8255    1.2689   15.2891   14.7387   15.7686

Group Europe correlation matrix
------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT      1.0000
DEATH_RT      0.0104    1.0000
EDUC          0.2217    0.1442    1.0000
HEALTH        0.1539   -0.0391    0.9365    1.0000
MIL           0.1579    0.2370    0.5606    0.5394    1.0000

Ln( Det(COV of group Europe) ) = 8.67105970

Group Europe discriminant function coefficients
-------------------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT     -0.1588
DEATH_RT     -0.0487   -0.2038
EDUC          0.0498    0.1140   -0.0627
HEALTH       -0.0408   -0.1196    0.1162   -0.0617
MIL           0.0104    0.0367    0.0011    0.0144   -0.0249
Constant      4.1354    4.3332   -1.6504    1.7008   -0.0468

            Constant
Constant    -51.1780

Group Islamic covariance matrix
------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT     48.6381
DEATH_RT     27.5429   25.5429
EDUC        -19.8729  -20.3689   33.7508
HEALTH      -10.9262  -10.6192   18.8309   10.8603
MIL         -15.5902  -28.4991   36.6788   19.3235   66.0183

Group Islamic correlation matrix
------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT      1.0000
DEATH_RT      0.7814    1.0000
EDUC         -0.4905   -0.6937    1.0000
HEALTH       -0.4754   -0.6376    0.9836    1.0000
MIL          -0.2751   -0.6940    0.7770    0.7217    1.0000

Ln( Det(COV of group Islamic) ) = 10.34980794


Group Islamic discriminant function coefficients
-------------------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT     -0.0236
DEATH_RT      0.0703   -0.0751
EDUC          0.0099   -0.0726   -0.3578
HEALTH       -0.0424    0.1331    1.0933   -0.8951
MIL           0.0261   -0.0469    0.0485   -0.0360   -0.0190
Constant      0.9492   -0.5959    1.2818   -0.9994   -0.3956

            Constant
Constant    -20.4487

Group NewWorld covariance matrix
------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT     60.2476
DEATH_RT      9.5738    8.1619
EDUC        -30.8573   -6.2226   45.0446
HEALTH      -27.9303   -5.2955   41.6304   40.4104
MIL         -15.4143   -1.7399   18.3092   17.2913   12.2372

Group NewWorld correlation matrix
------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT      1.0000
DEATH_RT      0.4317    1.0000
EDUC         -0.5923   -0.3245    1.0000
HEALTH       -0.5661   -0.2916    0.9758    1.0000
MIL          -0.5677   -0.1741    0.7798    0.7776    1.0000

Ln( Det(COV of group NewWorld) ) = 11.46371023

Group NewWorld discriminant function coefficients
-------------------------------------------------
            BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL
BIRTH_RT     -0.0077
DEATH_RT      0.0121   -0.0401
EDUC         -0.0079   -0.0213   -0.1260
HEALTH        0.0040    0.0114    0.2418   -0.1331
MIL          -0.0115    0.0196    0.0225    0.0210   -0.0580
Constant      0.4354    0.2643    0.8264   -0.6543    0.5229

            Constant
Constant    -13.3124

Ln( Det(Pooled covariance matrix) ) = 13.05914566

Test for equality of covariance matrices
Chisquare = 139.5799   df = 30   prob = 0.0000

Between groups F-matrix -- df = 5 49
---------------------------------------------
            Europe    Islamic   NewWorld
Europe      0.0
Islamic     64.4526   0.0
NewWorld    43.1437   15.9199   0.0

Mahalanobis distance-square from group means and
Posterior probabilities for group membership
Priors = .333 .333 .333

                 Europe       Islamic      NewWorld


(We omit the eigenvalues, etc.)

Look at the quadratic function displayed at the beginning of this example. For our data, the coefficients for the European group are:

a = –51.178, b = 4.135, c = 4.333, d = –1.650, e = 1.701, f = –0.047, g = –0.049, …, p = 0.014, q = –0.159, …, and u = –0.025

or

f = –51.178 + 4.135*birth_rt + … – 0.049*birth_rt*death_rt + … – 0.159*birth_rt² + … – 0.025*mil²

Similar functions exist for the other two groups.
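To make the layout of the printed coefficient panels concrete, here is a minimal Python sketch (ours, not SYSTAT output) that assembles the Europe function from the panel above and evaluates it; the test vector x is hypothetical. A case would be assigned to the group whose function value is largest.

import numpy as np

# Lower-triangular quadratic coefficients for group Europe, in the order
# BIRTH_RT, DEATH_RT, EDUC, HEALTH, MIL (copied from the panel above)
Q = np.array([
    [-0.1588,  0.0,     0.0,     0.0,     0.0   ],
    [-0.0487, -0.2038,  0.0,     0.0,     0.0   ],
    [ 0.0498,  0.1140, -0.0627,  0.0,     0.0   ],
    [-0.0408, -0.1196,  0.1162, -0.0617,  0.0   ],
    [ 0.0104,  0.0367,  0.0011,  0.0144, -0.0249],
])
b = np.array([4.1354, 4.3332, -1.6504, 1.7008, -0.0468])  # the "Constant" row
a = -51.1780                                              # the constant term

def f_europe(x):
    # Diagonal entries multiply x_i**2; off-diagonal entries multiply x_i*x_j
    quad = sum(Q[i, j] * x[i] * x[j] for i in range(5) for j in range(i + 1))
    return a + b @ x + quad

x = np.array([12.0, 10.0, 6.0, 6.0, 4.0])  # hypothetical (transformed) values
print(f_europe(x))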

(We omit the distances and probabilities for the Europe and Islamic groups.)

 NewWorld
------------
Argentina        48.1  .00    45.2  .00     3.8 1.00
Barbados         31.8  .00    65.0  .00     6.5 1.00
Bolivia     --> 369.3  .00     4.1  .65     4.2  .35
Brazil          133.3  .00     9.4  .03     1.1  .97
Canada      -->  14.5  .88   533.6  .00    15.7  .12
Chile            66.6  .00    16.6  .00     1.8 1.00
Colombia        161.1  .00     9.2  .04     1.8  .96
CostaRica       181.6  .00    93.2  .00     7.8 1.00
Venezuela       180.9  .00    16.6  .01     6.0  .99
DominicanR.     175.3  .00    21.5  .00     2.3 1.00
Uruguay          23.1  .00    38.4  .00     5.8 1.00
Ecuador         212.2  .00     5.8  .13      .8  .87
ElSalvador      312.9  .00    10.0  .03     2.0  .97
Jamaica          73.8  .00    20.2  .00     2.5 1.00
Guatemala       404.9  .00     6.0  .17     1.7  .83
Haiti       --> 792.1  .00     3.9  .99    11.2  .01
Honduras        395.9  .00    16.1  .00     4.1 1.00
Trinidad        164.1  .00    38.0  .00     5.6 1.00
Peru            167.6  .00    18.9  .00     4.9 1.00
Panama          133.9  .00    97.7  .00     3.4 1.00
Cuba             33.6  .00    39.7  .00     6.8 1.00

 --> case misclassified
 *   case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
           Europe  Islamic  NewWorld  %correct
Europe         20        0         0       100
Islamic         0       14         1        93
NewWorld        1        2        18        86
Total          21       16        19        93

Jackknifed classification matrix
--------------------------------
           Europe  Islamic  NewWorld  %correct
Europe         20        0         0       100
Islamic         0       13         2        87
NewWorld        1        2        18        86
Total          21       15        20        91


The output also includes the chi-square test for equality of covariance matrices. The results are highly significant (p < 0.00005). Thus, we reject the hypothesis of equal covariance matrices.

The Mahalanobis distances reveal that only four cases are misclassified: Turkey as a New World country, Canada as European, and Haiti and Bolivia as Islamic.
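Those posterior probabilities can be reproduced, at least approximately, from the squared Mahalanobis distances and the Ln(Det(COV)) values printed earlier. A minimal NumPy sketch, assuming equal priors and the usual normal-theory quadratic rule (our inference about the computation; SYSTAT's exact implementation may differ):

import numpy as np

def quad_posteriors(d2, ln_det):
    """Posterior group probabilities from squared Mahalanobis distances.

    Under the quadratic (unequal covariance) normal model with equal priors,
    P(g | x) is proportional to |S_g|^(-1/2) * exp(-D2_g / 2).
    """
    d2, ln_det = np.asarray(d2), np.asarray(ln_det)
    log_post = -0.5 * (d2 + ln_det)
    log_post -= log_post.max()          # stabilize the exponentials
    p = np.exp(log_post)
    return p / p.sum()

# Bolivia's distances to Europe, Islamic, NewWorld and the Ln(Det(COV)) values
print(quad_posteriors([369.3, 4.1, 4.2], [8.671, 10.350, 11.464]))
# -> approximately [0.00, 0.65, 0.35], matching the panel above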

The classification matrix indicates that 93% of the countries are correctly classified; using the jackknifed results, the percentage drops to 91%. The latter percentage agrees with that for the linear model using the same variables.

The output for the second model follows:

Between groups F-matrix -- df = 5 49
---------------------------------------------
            Europe    Islamic   NewWorld
Europe      0.0
Islamic     51.5154   0.0
NewWorld    33.6025   17.9915   0.0

Mahalanobis distance-square from group means and
Posterior probabilities for group membership
Priors = .333 .333 .333

                 Europe       Islamic      NewWorld
 NewWorld
------------
Argentina        30.9  .00    48.3  .00     4.3 1.00
Barbados         35.5  .00    68.7  .00     7.4 1.00
Bolivia         186.2  .00    10.1  .08     2.2  .92
Brazil          230.8  .00     8.1  .13     1.2  .87
Canada      -->  19.4  .74   524.3  .00    16.3  .26
Chile           144.3  .00    17.2  .00     1.6 1.00
Colombia        475.1  .00    29.8  .00     1.9 1.00
CostaRica       834.5  .00   190.5  .00    10.3 1.00
Venezuela       932.5  .00    83.6  .00     8.8 1.00
DominicanR.     267.4  .00    18.6  .00     2.0 1.00
Uruguay          15.2  .04    60.5  .00     3.9  .96
Ecuador         276.0  .00    11.5  .02     1.0  .98
ElSalvador      498.0  .00    17.6  .00     1.7 1.00
Jamaica         312.0  .00    15.5  .00      .7 1.00
Guatemala       501.3  .00     7.9  .24     2.5  .76
Haiti       --> 648.4  .00     4.6  .99    10.2  .01
Honduras        688.1  .00    31.8  .00     4.0 1.00
Trinidad        315.4  .00    43.1  .00     4.6 1.00
Peru            179.9  .00    16.3  .02     5.1  .98
Panama          411.0  .00   109.7  .00     3.6 1.00
Cuba             54.7  .00    54.5  .00     6.8 1.00

 --> case misclassified
 *   case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
           Europe  Islamic  NewWorld  %correct
Europe         20        0         0       100
Islamic         0       15         0       100
NewWorld        1        1        19        90
Total          21       16        19        96

Jackknifed classification matrix
--------------------------------
           Europe  Islamic  NewWorld  %correct
Europe         19        0         1        95
Islamic         0       14         1        93
NewWorld        1        1        19        90
Total          20       15        21        93

Eigen      Canonical     Cumulative proportion
values     correlations  of total dispersion
---------  ------------  ---------------------
5.585      0.921         0.801
1.391      0.763         1.000

Canonical scores of group means
-------------------------------
Europe      -2.916    .501
Islamic      2.725   1.322
NewWorld      .831  -1.422


This model does slightly better than the first one—the classification matrices here show that 96% and 93%, respectively, are classified correctly. This is because Turkey and Bolivia are classified correctly here and misclassified with the first model.

Example 7 Cross-Validation

At the end of the interactive stepping example, we reported the percentage of correct classification for six models. The same sample was used to compute the estimates and evaluate the success of the rules. We also reported results for the jackknifed classification procedure that removes and replaces one case at a time. This approach, however, may still give an overly optimistic picture. Ideally, we should try the rules on a new sample and compare results with those for the original data. Since this usually isn’t practical, researchers often use a cross-validation procedure—that is, they randomly split the data into two samples, use the first sample to estimate the classification functions, and then use the resulting functions to classify the second sample. The first sample is often called the learning sample and the second, the test sample. The proportion of correct classification for the test sample is an empirical measure for the success of the discrimination.

Cross-validation is easy to implement in discriminant analysis. Cases assigned a weight of 0 are not used to estimate the discriminant functions but are classified into groups. In this example, we generate a uniform random number (values range from 0 to 1.0) for each case, and when it is less than 0.65, the value 1.0 is stored in a new weight variable named CASE_USE. If the random number is equal to or greater than 0.65, a 0 is placed in the weight variable. So, approximately 65% of the cases have a weight of 1.0, and 35%, a weight of 0.
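The same split-sample logic is easy to mimic outside SYSTAT. Below is a minimal NumPy sketch; the data are synthetic stand-ins for OURWORLD, and the classifier is a plain pooled-covariance Mahalanobis rule rather than SYSTAT's exact implementation:

import numpy as np

rng = np.random.default_rng(11)

# Synthetic stand-ins: X is cases x variables, g holds group codes 1, 2, 3
X = rng.normal(size=(56, 5))
g = rng.integers(1, 4, size=56)

learn = rng.random(len(g)) < 0.65          # ~65% of cases get weight 1

groups = np.unique(g)
means = {k: X[learn & (g == k)].mean(axis=0) for k in groups}
# Pooled within-group covariance, estimated from the learning sample only
dev = np.vstack([X[learn & (g == k)] - means[k] for k in groups])
S_inv = np.linalg.inv(dev.T @ dev / (learn.sum() - len(groups)))

def classify(x):
    # Assign to the group with the smallest Mahalanobis distance
    d2 = {k: (x - m) @ S_inv @ (x - m) for k, m in means.items()}
    return min(d2, key=d2.get)

pred = np.array([classify(x) for x in X])
print("learning sample accuracy:", (pred[learn] == g[learn]).mean())
print("test sample accuracy:    ", (pred[~learn] == g[~learn]).mean())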



We now request a cross-validation for each of the following six models using the OURWORLD data:

1. BIRTH_RT DEATH_RT MIL
2. BIRTH_RT DEATH_RT MIL EDUC HEALTH
3. MIL B_TO_D LITERACY
4. MIL B_TO_D LITERACY EDUC HEALTH
5. MIL B_TO_D LITERACY DIFFRNCE
6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP

Use interactive forward stepping to “toggle” variables in and out of the model subsets. The input is:

DISCRIM
USE ourworld
LET gdp_cap = L10(gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LET case_use = URN < .65
WEIGHT = case_use
LABEL group / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL group = urban birth_rt death_rt babymort,
      gdp_cap educ health mil b_to_d,
      lifeexpm lifeexpf literacy diffrnce
PRINT NONE / FSTATS CLASS JCLASS
GRAPH NONE
START / FORWARD
STEP birth_rt death_rt mil
STEP educ health
STEP birth_rt death_rt educ health b_to_d literacy
STEP educ health
STEP educ health diffrnce
STEP gdp_cap
STOP

Here are the results from the first STEP after MIL enters:


Variable       F-to-remove  Tolerance |  Variable       F-to-enter  Tolerance
--------------------------------------+--------------------------------------
  8 BIRTH_RT       57.86    0.640126  |   6 URBAN           7.41    0.415097
 10 DEATH_RT       24.56    0.513344  |  12 BABYMORT        0.20    0.234804
 23 MIL            13.43    0.760697  |  16 GDP_CAP         3.22    0.394128
                                      |  19 EDUC            2.00    0.673136
                                      |  21 HEALTH          4.68    0.828565
                                      |  34 B_TO_D          0.16    0.209796
                                      |  30 LIFEEXPM        0.42    0.136526
                                      |  31 LIFEEXPF        0.83    0.104360
                                      |  32 LITERACY        1.54    0.244547
                                      |  40 DIFFRNCE        5.23    0.784797


Three classification matrices result. The first presents results for the learning sample, the cases with CASE_USE values of 1.0. Overall, 95% of these countries are classified correctly. The sample size is 13 + 9 + 16 = 38—or 67.9% of the original sample of 56 countries. The second classification table reflects those cases not used to compute estimates, the test sample. The percentage of correct classification drops to 76% for these 17 countries. The final classification table presents the jackknifed results for the learning sample. Notice that the percentages of correct classification are closer to those for the learning sample than for the test sample.

Now we add the variables EDUC and HEALTH, with the following results:

Classification matrix (cases in row categories classified into columns)
---------------------
           Europe  Islamic  NewWorld  %correct
Europe         13        0         0       100
Islamic         0        8         1        89
NewWorld        0        1        15        94
Total          13        9        16        95

Classification of cases with zero weight or frequency
-----------------------------------------------------
           Europe  Islamic  NewWorld  %correct
Europe          6        0         0       100
Islamic         0        4         2        67
NewWorld        2        0         3        60
Total           8        4         5        76

Jackknifed classification matrix
--------------------------------
           Europe  Islamic  NewWorld  %correct
Europe         13        0         0       100
Islamic         0        8         1        89
NewWorld        1        1        14        88
Total          14        9        15        92

Variable       F-to-remove  Tolerance |  Variable       F-to-enter  Tolerance
--------------------------------------+--------------------------------------
  8 BIRTH_RT       21.13    0.588377  |   6 URBAN           6.50    0.414511
 10 DEATH_RT       16.52    0.508827  |  12 BABYMORT        0.07    0.221475
 19 EDUC            2.24    0.103930  |  16 GDP_CAP         3.06    0.242491
 21 HEALTH          4.88    0.127927  |  34 B_TO_D          0.32    0.198963
 23 MIL             5.68    0.567128  |  30 LIFEEXPM        0.05    0.117494
                                      |  31 LIFEEXPF        0.04    0.080161
                                      |  32 LITERACY        1.75    0.238831
                                      |  40 DIFFRNCE        0.00    0.000000

Classification matrix (cases in row categories classified into columns)
---------------------
           Europe  Islamic  NewWorld  %correct
Europe         13        0         0       100
Islamic         0        8         1        89
NewWorld        0        1        15        94
Total          13        9        16        95

Classification of cases with zero weight or frequency
-----------------------------------------------------
           Europe  Islamic  NewWorld  %correct
Europe          6        0         0       100
Islamic         0        5         1        83
NewWorld        1        0         4        80
Total           7        5         5        88

Jackknifed classification matrix
--------------------------------
           Europe  Islamic  NewWorld  %correct
Europe         13        0         0       100
Islamic         0        8         1        89
NewWorld        0        1        15        94
Total          13        9        16        95


After we add EDUC and HEALTH, the results here for the learning sample do not differ from those for the previous model. However, for the test sample, the addition of EDUC and HEALTH increases the percentage correct from 76% to 88%.

We continue by issuing the STEP specifications listed above, each time noting the total percentage correct as well as the percentages for the Islamic and New World groups. After scanning the classification results from both the test sample and the learning sample jackknifed panel, we conclude that model 2 (BIRTH_RT, DEATH_RT, MIL, EDUC, and HEALTH) is best and that model 1 performs the worst.

Classification of New Cases

Group membership is known in the current example. What if you have cases where the group membership is unknown? For example, you might want to apply the rules developed for one sample to a new sample.

When the value of the grouping variable is missing, SYSTAT still classifies the case. For example, we set the group code for New World countries to missing

IF group = 3 THEN LET group = .

and request automatic forward stepping for the model containing BIRTH_RT, DEATH_RT, MIL, EDUC, and HEALTH:

DISCRIM
USE ourworld
LET gdp_cap = L10(gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
IF group = 3 THEN LET group = .
LABEL group / 1="Europe", 2="Islamic"
MODEL group = urban birth_rt death_rt babymort,
      gdp_cap educ health mil b_to_d,
      lifeexpm lifeexpf literacy diffrnce
IDVAR = country$
PRINT / MAHAL
START / FORWARD
STEP / AUTO
STOP



The following are the Mahalanobis distances and posterior probabilities for the countries with missing group codes and also the classification matrix. The weight variable is not used here.

Argentina, Barbados, Canada, Uruguay, and Cuba are classified as European; the other 15 countries are classified as Islamic.


Mahalanobis distance-square from group means and
Posterior probabilities for group membership
Priors = .500 .500

                    Europe       Islamic
 Not Grouped
------------
Argentina   *    28.6 1.00    59.6  .00
Barbados    *    25.9 1.00    71.9  .00
Bolivia     *   120.7  .00     2.7 1.00
Brazil      *   115.7  .00    10.0 1.00
Canada      *     2.1 1.00   124.1  .00
Chile       *    63.2  .00    35.5 1.00
Colombia    *   204.0  .00    22.4 1.00
CostaRica   *   306.5  .00    60.8 1.00
Venezuela   *   297.0  .00    49.2 1.00
DominicanR. *   129.9  .00    10.8 1.00
Uruguay     *    12.5 1.00    91.4  .00
Ecuador     *   149.3  .00     8.4 1.00
ElSalvador  *   183.7  .00    10.1 1.00
Jamaica     *   100.2  .00    32.7 1.00
Guatemala   *   155.3  .00     5.5 1.00
Haiti       *   136.8  .00     1.4 1.00
Honduras    *   216.6  .00    13.2 1.00
Trinidad    *   132.6  .00    14.0 1.00
Peru        *    99.4  .00     7.4 1.00
Panama      *   160.5  .00    18.9 1.00
Cuba        *    19.4 1.00    70.7  .00

 --> case misclassified
 *   case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
              Europe  Islamic  %correct
Europe            19        0       100
Islamic            0       15       100
Total             19       15       100

Not Grouped        5       16


References

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.

Hill, M. A. and Engelman, L. (1992). Graphical aids for nonlinear regression and discriminant analysis. Computational Statistics, Vol. 2, Y. Dodge and J. Whittaker, eds. Proceedings of the 10th Symposium on Computational Statistics Physica-Verlag, 111–126.

Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.

Pillai, K. C. S. (1960). Statistical table for tests of multivariate hypotheses. Manila: The Statistical Center, University of the Philippines.

Chapter 12

Factor Analysis

Herb Stenson and Leland Wilkinson

Factor analysis provides principal components analysis and common factor analysis (maximum likelihood and iterated principal axis). SYSTAT has options to rotate, sort, plot, and save factor loadings. With the principal components method, you can also save the scores and coefficients. Orthogonal methods of rotation include varimax, equamax, quartimax, and orthomax. A direct oblimin method is also available for oblique rotation. Users can explore other rotations by interactively rotating a 3-D Quick Graph plot of the factor loadings. Various inferential statistics (for example, confidence intervals, standard errors, and chi-square tests) are provided, depending on the nature of the analysis that is run.

Statistical Background

Principal components (PCA) and common factor (MLA for maximum likelihood and IPA for iterated principal axis) analyses are methods of decomposing a correlation or covariance matrix. Although principal components and common factor analyses are based on different mathematical models, they can be used on the same data and both usually produce similar results. Factor analysis is often used in exploratory data analysis to:

- Study the correlations of a large number of variables by grouping the variables in “factors” so that variables within each factor are more highly correlated with variables in that factor than with variables in other factors.

- Interpret each factor according to the meaning of the variables.

- Summarize many variables by a few factors. The scores from the factors can be used as input data for t tests, regression, ANOVA, discriminant analysis, and so on.


Often the users of factor analysis are overwhelmed by the gap between theory and practice. In this chapter, we try to offer practical hints. It is important to realize that you may need to make several passes through the procedure, changing options each time, until the results give you the necessary information for your problem.

If you understand the component model, you are on the way toward understanding the factor model, so let’s begin with the former.

A Principal Component

What is a principal component? The simplest way to see is through real data. The following data consist of Graduate Record Examination verbal and quantitative scores. These scores are from 25 applicants to a graduate psychology department.

VERBAL   QUANTITATIVE
  590        530
  620        620
  640        620
  650        550
  620        610
  610        660
  560        570
  610        730
  600        650
  740        790
  560        580
  680        710
  600        540
  520        530
  660        650
  750        710
  630        640
  570        660
  600        650
  570        570
  600        550
  690        540
  770        670
  610        660
  600        640


Now, we could decide to try linear regression to predict verbal scores from quantitative. Or, we could decide to predict quantitative from verbal by the same method. The data don’t suggest which is a dependent variable; either will do. What if we aren’t interested in predicting either one separately but instead want to know how both variables hang together jointly? This is what a principal component does. Karl Pearson, who developed principal component analysis in 1901, described a component as a “line of closest fit to systems of points in space.” In short, the regression line indicates best prediction, and the component line indicates best association.

The following figure shows the regression and component lines for our GRE data. The regression of y on x is the line with the smallest slope (flatter than diagonal). The regression of x on y is the line with the largest slope (steeper than diagonal). The component line is between the other two. Interestingly, when most people are asked to draw a line relating two variables in a scatterplot, they tend to approximate the component line. It takes a lot of explaining to get them to realize that this is not the best line for predicting the vertical axis variable (y) or the horizontal axis variable (x).

Notice that the slope of the component line is approximately 1, which means that the two variables are weighted almost equally (assuming the axis scales are the same). We could make a new variable called GRE that is the sum of the two tests:

GRE = VERBAL + QUANTITATIVE

This new variable could summarize, albeit crudely, the information in the other two. If the points clustered almost perfectly around the component line, then the new component variable could summarize almost perfectly both variables.

[Scatterplot of Quantitative GRE Score against Verbal GRE Score (both axes scaled 500 to 800), showing the regression line of y on x, the regression line of x on y, and the component line between them]


Multiple Principal Components

The goal of principal components analysis is to summarize a multivariate data set as accurately as possible using a few components. So far, we have seen only one component. It is possible, however, to draw a second component perpendicular to the first. The first component will summarize as much of the joint variation as possible. The second will summarize what’s left. If we do this with the GRE data, of course, we will have as many components as original variables—not much of a saving. We usually seek fewer components than variables, so that the variation left over is negligible.

Component Coefficients

In the above equation for computing the first principal component on our test data, we made both coefficients equal. In fact, when you run the sample covariance matrix using factor analysis in SYSTAT, the coefficients are as follows:

GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE

They are indeed nearly equal. Their magnitude is considerably less than 1 because principal components are usually scaled to conserve variance. That is, once you compute the components with these coefficients, the total variance on the components is the same as the total variance on the original variables.

Component Loadings

Most researchers want to know the relation between the original variables and the components. Some components may be nearly identical to an original variable; in other words, their coefficients may be nearly 0 for all variables except one. Other components may be a more even amalgam of several original variables.

Component loadings are the covariances of the original variables with the components. In our example, these loadings are 51.085 for VERBAL and 62.880 for QUANTITATIVE. You may have noticed that these are proportional to the coefficients; they are simply scaled differently. If you square each of these loadings and add them up separately for each component, you will have the variance accounted for by each component.
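Both the coefficients quoted above and these loadings can be reproduced, up to rounding, with a few lines of NumPy; this is our sketch, not SYSTAT output:

import numpy as np

# GRE scores from the table above
verbal = np.array([590, 620, 640, 650, 620, 610, 560, 610, 600, 740, 560, 680,
                   600, 520, 660, 750, 630, 570, 600, 570, 600, 690, 770, 610,
                   600])
quant = np.array([530, 620, 620, 550, 610, 660, 570, 730, 650, 790, 580, 710,
                  540, 530, 650, 710, 640, 660, 650, 570, 550, 540, 670, 660,
                  640])

S = np.cov(verbal, quant)            # sample covariance matrix
vals, vecs = np.linalg.eigh(S)
lam, v = vals[-1], vecs[:, -1]       # largest eigenvalue and its eigenvector
if v[0] < 0:
    v = -v                           # fix the (arbitrary) sign

print(v * np.sqrt(lam))   # loadings (covariances with the component),
                          #   roughly (51.1, 62.9) as in the text
print(v / np.sqrt(lam))   # coefficients producing unit-variance scores,
                          #   roughly (0.008, 0.01)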


Correlations or Covariances

Most researchers prefer to analyze the correlation rather than covariance structure among their variables. Sample correlations are simply covariances of sample standardized variables. Thus, if your variables are measured on very different scales or if you feel the standard deviations of your variables are not theoretically significant, you will want to work with correlations instead of covariances. In our test example, working with correlations yields loadings of 0.879 for each variable instead of 51.085 and 62.880. When you factor the correlation instead of the covariance matrix, then the loadings are the correlations of each component with each original variable.

For our test data, loadings of 0.879 mean that if you created a GRE component by standardizing VERBAL and QUANTITATIVE and adding them together weighted by the coefficients, you would find the correlation between these component scores and the original VERBAL scores to be 0.879. The same would be true for QUANTITATIVE.

Signs of Component Loadings

The signs of loadings within components are arbitrary. If a component (or factor) has more negative than positive loadings, you may change minus signs to plus and plus to minus. SYSTAT does this automatically for components that have more negative than positive loadings, and thus will occasionally produce components or factors that have different signs from those in other computer programs. This sometimes confuses users. In mathematical terms, Ax = λx and A(−x) = λ(−x) are equivalent: if x is an eigenvector, so is −x.
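A quick numerical illustration of this equivalence (our sketch, with a made-up loading matrix): flipping the signs of a loading column leaves the fitted covariances unchanged.

import numpy as np

A = np.array([[0.9, 0.2],
              [0.8, 0.3],
              [0.1, 0.9]])        # a small, made-up loading matrix

A_flipped = A.copy()
A_flipped[:, 1] *= -1             # reverse the signs of the second factor

# Both sign choices reproduce exactly the same fitted covariances
print(np.allclose(A @ A.T, A_flipped @ A_flipped.T))   # True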

Factor Analysis

We have seen how principal components analysis is a method for computing new variables that summarize variation in a space parsimoniously. For our test variables, the equation for computing the first component was:

GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE

This component equation is linear, of the form:

Component = Linear combination of {Observed variables}

Factor analysts turn this equation around:

Observed variable = Linear combination of {Factors} + Error



This model was presented by Spearman near the turn of the century in the context of a single intelligence factor and extended to multiple mental measurement factors by Thurstone several decades later. Notice that the factor model makes observed variables a function of unobserved factors. Even though this looks like a linear regression model, none of the graphical and analytical techniques used for regression can be applied to the factor model because there is no unique, observable set of factor scores or residuals to examine.

Factor analysts are less interested in prediction than in decomposing a covariance matrix. This is why the fundamental equation of factor analysis is not the above linear model, but rather its quadratic form:

Observed covariances = Factor covariances + Error covariances

The covariances in this equation are usually expressed in matrix form, so that the model decomposes an observed covariance matrix into a hypothetical factor covariance matrix plus a hypothetical error covariance matrix. The diagonals of these two hypothetical matrices are known, respectively, as communalities and specificities.

In ordinary language, then, the factor model expresses variation within and relations among observed variables as partly common variation among factors and partly specific variation among random errors.
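To see the decomposition numerically, here is a small sketch (ours) that uses the two unrotated MLA loadings for HEIGHT and ARM_SPAN reported in the maximum likelihood example later in this chapter; the factor part alone approximately reproduces the observed correlation of 0.846:

import numpy as np

# Unrotated MLA loadings on two factors (from Example 2 later in this chapter)
height   = np.array([0.8797, 0.2375])
arm_span = np.array([0.8735, 0.3604])

fitted = height @ arm_span          # factor part of the observed correlation
print(fitted)                       # about 0.854, versus the observed 0.846
print(1 - height @ height)          # HEIGHT's specificity (error variance), ~0.17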

Estimating Factors

Factor analysis involves several steps:

- First, the correlation or covariance matrix is computed from the usual cases-by-variables data file, or it is input as a matrix.

- Second, the factor loadings are estimated. This is called initial factor extraction. Extraction methods are described in this section.

- Third, the factors are rotated to make the loadings more interpretable—that is, rotation methods make the loadings for each factor either large or small, not in-between. These methods are described in the next section.

Factors must be estimated iteratively in a computer. There are several methods available. The most popular approach, available in SYSTAT, is to modify the diagonal of the observed covariance matrix and calculate factors the same way components are computed. This procedure is repeated until the communalities reproduced by the factor covariances are indistinguishable from the diagonal of the modified matrix.


Rotation

Usually the initial factor extraction does not give interpretable factors. One of the purposes of rotation is to obtain factors that can be named and interpreted. That is, if you can make the large loadings larger than before and the smaller loadings smaller, then each variable is associated with a minimal number of factors. Hopefully, the variables that load strongly together on a particular factor will have a clear meaning with respect to the subject area at hand.

It helps to study plots of loadings for one factor against those for another. Ideally, you want to see clusters of loadings at extreme values for each factor: like what A and C are for factor 1, and B and D are for factor 2 in the left plot, and not like E and F in the middle plot.

In the middle plot, the loadings in groups E and F are sizeable for both factors 1 and 2. However, if you lift the plot axes away from E and F, rotating them 45 degrees, and then set them down as on the right, you achieve the desired effect. Sounds easy for two factors. For three factors, imagine that the loadings are balls floating in a room and that you rotate the floor and walls so that each loading is as close to the floor or a wall as it can be. This concept generalizes to more dimensions.

Researchers let the computer do the rotation automatically. There are many criteria for achieving a simple structure among component loadings, although Thurstone’s are most widely cited. For p variables and m components:

� Each component should have at least m near-zero loadings.

� Few components should have nonzero loadings on the same variable.

SYSTAT provides five methods of rotating loadings: varimax, equamax, quartimax, orthomax, and oblimin.

[Three plots of factor loadings: on the left, clusters a, b, c, and d lie at the extremes of the two factor axes; in the middle, clusters e and f sit between the axes; on the right, the same e and f after the axes are rotated 45 degrees]
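The orthogonal rotations SYSTAT offers belong to the orthomax family indexed by Gamma. Here is a compact Python sketch of that family (ours, not SYSTAT's algorithm, which may differ in details such as row normalization); gamma = 1 gives varimax and gamma = 0 gives quartimax:

import numpy as np

def orthomax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a p x m loading matrix L; gamma=1 is varimax, gamma=0 quartimax."""
    p, m = L.shape
    R = np.eye(m)                          # accumulated rotation
    obj = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient of the orthomax criterion (standard SVD formulation)
        B = L.T @ (Lr**3 - (gamma / p) * Lr @ np.diag((Lr**2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(B)
        R = U @ Vt                         # nearest orthogonal rotation
        if s.sum() < obj * (1 + tol):      # stop when improvement is negligible
            break
        obj = s.sum()
    return L @ R

Applied to an unrotated loading matrix such as the one in the principal components example below, this should give loadings similar to a varimax panel.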


Principal Components versus Factor Analysis

SYSTAT can perform both principal components and common factor analysis. Some view principal components analysis as a method of factor analysis, although there is a theoretical distinction between the two. Principal components are weighted linear composites of observed variables. Common factors are unobserved variables that are hypothesized to account for the intercorrelations among observed variables.

One significant practical difference is that common factor scores are indeterminate, whereas principal component scores are not. There are no sufficient estimators of scores for subjects on common factors (rotated or unrotated, maximum likelihood, or otherwise). Some computer models provide “regression” estimates of factor scores, but these are not estimates in the usual statistical sense. This problem arises not because factors can be arbitrarily rotated (so can principal components), but because the common factor model is based on more unobserved parameters than observed data points, an unusual circumstance in statistics.

In recent years, “maximum likelihood” factor analysis algorithms have been devised to estimate common factors. The implementation of these algorithms in popular computer packages has led some users to believe that the factor indeterminacy problem does not exist for “maximum likelihood” factor estimates. It does.

Mathematicians and psychometricians have known about the factor indeterminacy problem for decades. For a historical review of the issues, see Steiger (1979); for a general review, see Rozeboom (1982). For further information on principal components, consult Harman (1976), Mulaik (1978), Gnanadesikan (1977), or Mardia, Kent, and Bibby (1979).

Because of the indeterminacy problem, SYSTAT computes subjects’ scores only for the principal components model where subjects’ scores are a simple linear transformation of scores on the factored variables. SYSTAT does not save scores from a common factor model.

Applications and Caveats

While there is not room here to discuss more statistical issues, you should realize that there are several myths about factors versus components:

Myth. The factor model allows hypothesis testing; the component model doesn’t.

Fact. Morrison (1967) and others present a full range of formal statistical tests for components.


Myth. Factor loadings are real; principal component loadings are approximations.

Fact. This statement is too ambiguous to have any meaning. It is easy to define things so that factors are approximations of components.

Myth. Factor analysis is more likely to uncover lawful structure in your data; principal components are more contaminated by error.

Fact. Again, this statement is ambiguous. With further definition, it can be shown to be true for some data and false for others. It is true that, in general, factor solutions will have lower dimensionality than corresponding component solutions. This can be an advantage when searching for simple structure among noisy variables, as long as you compare the result to a principal components solution to avoid being fooled by the sort of degeneracies illustrated above.

Factor Analysis in SYSTAT

Factor Analysis Main Dialog Box

For factor analysis, from the menus choose:

Statistics
  Data Reduction
    Factor Analysis…


The following options are available:

Model variables. Variables used to create factors.

Method. SYSTAT offers three estimation methods:

- Principal components analysis (PCA) is the default method of analysis.

- Iterated principal axis (IPA) provides an iterative method to extract common factors by starting with the principal components solution and iteratively solving for communalities.

- Maximum likelihood analysis (MLA) iteratively finds communalities and common factors.

Display. You can sort factor loadings by size or display extended results. Selecting Extended results displays all possible Factor output.

Sample size for matrix input. If your data are in the form of a correlation or covariance matrix, you must specify the sample size on which the input matrix is based so that inferential statistics (available with extended results) can be computed.

Matrix for extraction. You can factor a correlation matrix or a covariance matrix. Most frequently, the correlation matrix is used. You can also delete missing cases pairwise instead of listwise. Listwise deletes any case with missing data for any variable in the list. Pairwise examines each pair of variables and uses all cases with both values present.

Extraction parameters. You can limit the results by specifying extraction parameters.

- Minimum eigenvalue. Specify the smallest eigenvalue to retain. The default is 1.0 for PCA and IPA (not available with maximum likelihood). Incidentally, if you specify 0, factor analysis ignores components with negative eigenvalues (which can occur with pairwise deletion).

- Number of factors. Specify the number of factors to compute. If you specify both the number of factors and the minimum eigenvalue, factor analysis uses whichever criterion results in the smaller number of components.

- Iterations. Specify the number of iterations SYSTAT should perform (not available for principal components). The default is 25.

- Convergence. Specify the convergence criterion (not available for principal components). The default is 0.001.


Rotation Parameters

This dialog box specifies the factor rotation method.

The following methods are available:

- No rotation. Factors are not rotated.

- Varimax. An orthogonal rotation method that minimizes the number of variables that have high loadings on each factor. It simplifies the interpretation of the factors.

- Equamax. A rotation method that is a combination of the varimax method, which simplifies the factors, and the quartimax method, which simplifies the variables. The number of variables that load highly on a factor and the number of factors needed to explain a variable are minimized.

- Quartimax. A rotation method that minimizes the number of factors needed to explain each variable. It simplifies the interpretation of the observed variables.

- Orthomax. Specifies families of orthogonal rotations. Gamma specifies the member of the family to use. Varying Gamma changes maximization of the variances of the loadings from columns (Varimax) to rows (Quartimax).

- Oblimin. Specifies families of oblique (non-orthogonal) rotations. Gamma specifies the member of the family to use. For Gamma, specify 0 for moderate correlations, positive values to allow higher correlations, and negative values to restrict correlations.


Save

You can save factor analysis results for further analyses.

For the maximum likelihood and iterated principal axis methods, you can save only loadings. For the principal components method, select from these options:

- Do not save results. Results are not saved.

- Factor scores. Standardized factor scores.

- Residuals. Residuals for each case. For a correlation matrix, the residual is the actual z score minus the predicted z score using the factor scores times the loadings to get the predicted scores. For a covariance matrix, the residuals are from unstandardized predictions. With an orthogonal rotation, Q and PROB are also saved. Q is the sum of the squared residuals, and PROB is its probability.

- Principal components. Unstandardized principal components scores with mean 0 and variance equal to the eigenvalue for the factor (only for PCA without rotation).

- Factor coefficients. Coefficients that produce standardized scores. For a correlation matrix, multiply the coefficients by the standardized variables; for a covariance matrix, use the original variables.

- Eigenvectors. Eigenvectors (only for PCA without a rotation). Use to produce unstandardized scores.

- Factor loadings. Factor loadings.

- Save data with scores. Saves the selected item and all the variables in the working data file as a new data file. Use with options for scores (not loadings, coefficients, or other similar options).


If you save scores, the variables in the file are labeled FACTOR(1), FACTOR(2), and so on. Any observations with missing values on any of the input variables will have missing values for all scores. The scores are normalized to have zero mean and, if the correlation matrix is used, unit variance. If you use the covariance matrix and perform no rotations, SYSTAT does not standardize the component scores. The sum of their variances is the same as for the original data.

If you want to use the score coefficients to get component scores for new data, multiply the coefficients by the standardized data. SYSTAT does this when it saves scores. Another way to do cross-validation is to assign a zero weight to those cases not used in the factoring and to assign a unit weight to those cases used. The zero-weight cases are not used in the factoring, but scores are computed for them.
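As a sketch of that multiplication (ours; the means, standard deviations, and coefficient matrix below are hypothetical placeholders for values from the original factoring sample):

import numpy as np

# Hypothetical pieces: means/SDs of the ORIGINAL factoring sample and the
# saved p x m factor coefficient matrix from a correlation-matrix factoring
mean = np.array([601.0, 622.0])        # hypothetical
sd   = np.array([64.0, 71.0])          # hypothetical
coef = np.array([[0.57], [0.57]])      # hypothetical coefficients

x_new = np.array([[640.0, 600.0]])     # new cases, one row per case
z = (x_new - mean) / sd                # standardize with the original mean/sd
scores = z @ coef                      # standardized component scores
print(scores)
# For a covariance-matrix factoring, multiply coef by the original
# (unstandardized) variables instead.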

When Factor scores or Principal components is requested, T2 and PROB are also saved. The former is the Hotelling T2 statistic, which squares the standardized distance from each case to the centroid of the factor space (that is, the sum of the squared, standardized factor scores). PROB is the upper-tail probability of T2. Use this statistic to identify outliers within the factor space. T2 is not computed with an oblique rotation.

Using Commands

After selecting a data file with USE filename, continue with:

FACTOR
MODEL varlist
SAVE filename / SCORES DATA LOAD COEF VECTORS PC RESID
ESTIMATE / METHOD = PCA or IPA or MLA ,
          LISTWISE or PAIRWISE N=n CORR or COVA ,
          NUMBER=n EIGEN=n ITER=n CONV=n SORT ,
          ROTATE = VARIMAX or EQUAMAX or QUARTIMAX ,
                or ORTHOMAX or OBLIMIN GAMMA=n

Usage Considerations

Types of data. Data for factor analysis can be a cases-by-variables data file, a correlation matrix, or a covariance matrix.

Print options. Factor analysis offers three categories of output: short (the default), medium, and long. Each has specific output panels associated with it.

For Short, the default, panels are: Latent roots or eigenvalues (not MLA), initial and final communality estimates (not PCA), component loadings (PCA) or factor pattern (MLA, IPA), variance explained by components (PCA) or factors (MLA, IPA), percentage of total variance explained, change in uniqueness and log likelihood at each iteration (MLA only), and canonical correlations (MLA only). When a rotation is requested: rotated loadings (PCA) or pattern (MLA, IPA) matrix, variance explained by rotated components, percentage of total variance explained, and correlations among oblique components or factors (oblimin only).

By specifying Medium, you get the panels listed for Short, plus: the matrix to factor, the chi-square test that all eigenvalues are equal (PCA only), the chi-square test that last k eigenvalues are equal (PCA only), and differences of original correlations or covariances minus fitted values. For covariance matrix input (not MLA or IPA): asymptotic 95% confidence limits for the eigenvalues and estimates of the population eigenvalues with standard errors.

With Long, you get the panels listed for Short and Medium, plus: latent vectors (eigenvectors) with standard errors (not MLA) and the chi-square test that the number of factors is k (MLA only). With an oblimin rotation: direct and indirect contribution of factors to variances and the rotated structure matrix.

Quick Graphs. Factor analysis produces a scree plot and a factor loadings plot.

Saving files. You can save factor scores, residuals, principal components, factor coefficients, eigenvectors, or factor loadings as a new data file. For the iterated principal axis and maximum likelihood methods, you can save only factor loadings. You can save only eigenvectors and principal components for unrotated solutions using the principal components method.

BY groups. Factor analysis produces separate analyses for each level of any BY variables.

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. Factor analysis uses FREQUENCY variables to duplicate cases for rectangular data files.

Case weights. For rectangular data, you can weight cases using a WEIGHT variable.


Examples

Example 1 Principal Components

Principal components (PCA, the default method) is a good way to begin a factor analysis (and possibly the only method you may need). If one variable is a linear combination of the others, the program will not stop (MLA and IPA both require a nonsingular correlation or covariance matrix). The PCA output can also provide indications that:

- One or more variables have little relation to the others and, therefore, are not suited for factor analysis—so in your next run, you might consider omitting them.

- The final number of factors may be three or four and not double or triple this number.

To illustrate this method of factor extraction, we borrow data from Harman (1976), who borrowed them from a 1937 unpublished thesis by Mullen. This classic data set is widely used in the literature. For example, Jackson (1991) reports loadings for the PCA, MLA, and IPA methods. The data are measurements recorded for 305 youth aged seven to seventeen: height, arm span, length of forearm, length of lower leg, weight, bitrochanteric diameter (the upper thigh), girth, and width. Because the units of these measurements differ, we analyze a correlation matrix:

          Height  Arm_Span  Forearm  Lowerleg  Weight  Bitro  Girth  Width
Height     1.000
Arm_Span   0.846     1.000
Forearm    0.805     0.881    1.000
Lowerleg   0.859     0.826    0.801     1.000
Weight     0.473     0.376    0.380     0.436   1.000
Bitro      0.398     0.326    0.319     0.329   0.762  1.000
Girth      0.301     0.277    0.237     0.327   0.730  0.583  1.000
Width      0.382     0.415    0.345     0.365   0.629  0.577  0.539  1.000


The correlation matrix is stored in the YOUTH file. SYSTAT knows that the file contains a correlation matrix, so no special instructions are needed to read the matrix. The input is:

FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX

Notice the shortcut notation (..) for listing consecutive variables in a file.

The output follows:


Latent Roots (Eigenvalues)
        1        2        3        4        5
   4.6729   1.7710   0.4810   0.4214   0.2332
        6        7        8
   0.1867   0.1373   0.0965

Component loadings
                  1         2
HEIGHT       0.8594    0.3723
ARM_SPAN     0.8416    0.4410
LOWERLEG     0.8396    0.3953
FOREARM      0.8131    0.4586
WEIGHT       0.7580   -0.5247
BITRO        0.6742   -0.5333
WIDTH        0.6706   -0.4185
GIRTH        0.6172   -0.5801

Variance Explained by Components
        1        2
   4.6729   1.7710

Percent of Total Variance Explained
        1        2
  58.4110  22.1373

Rotated Loading Matrix ( VARIMAX, Gamma = 1.0000)
                  1         2
ARM_SPAN     0.9298    0.1955
FOREARM      0.9191    0.1638
HEIGHT       0.8998    0.2599
LOWERLEG     0.8992    0.2295
WEIGHT       0.2507    0.8871
BITRO        0.1806    0.8404
GIRTH        0.1068    0.8403
WIDTH        0.2509    0.7496

"Variance" Explained by Rotated Components
        1        2
   3.4973   2.9465

Percent of Total Variance Explained
        1        2
  43.7165  36.8318


Notice that we did not specify how many factors we wanted. For PCA, the default is to compute as many factors as there are eigenvalues greater than 1.0—so, in this run, you study results for two factors. After examining the output, you may want to specify a different minimum eigenvalue or, very rarely, a fixed number of factors.

Unrotated loadings (and orthogonally rotated loadings) are correlations of the variables with the principal components (factors). They are also the eigenvectors of the correlation matrix multiplied by the square roots of the corresponding eigenvalues. Usually these loadings are not useful for interpreting the factors. For some industrial applications, researchers prefer to examine the eigenvectors alone.
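That relationship between eigenvectors, eigenvalues, and loadings is easy to verify with the YOUTH correlation matrix shown above; this is our NumPy sketch, not SYSTAT output:

import numpy as np

# Lower triangle of the YOUTH correlation matrix, filled out symmetrically
R = np.array([
    [1.000, 0.846, 0.805, 0.859, 0.473, 0.398, 0.301, 0.382],
    [0.846, 1.000, 0.881, 0.826, 0.376, 0.326, 0.277, 0.415],
    [0.805, 0.881, 1.000, 0.801, 0.380, 0.319, 0.237, 0.345],
    [0.859, 0.826, 0.801, 1.000, 0.436, 0.329, 0.327, 0.365],
    [0.473, 0.376, 0.380, 0.436, 1.000, 0.762, 0.730, 0.629],
    [0.398, 0.326, 0.319, 0.329, 0.762, 1.000, 0.583, 0.577],
    [0.301, 0.277, 0.237, 0.327, 0.730, 0.583, 1.000, 0.539],
    [0.382, 0.415, 0.345, 0.365, 0.629, 0.577, 0.539, 1.000],
])

vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]                 # descending eigenvalues
vals, vecs = vals[order], vecs[:, order]

loadings = vecs[:, :2] * np.sqrt(vals[:2])     # eigenvectors times sqrt(roots)
print(vals[:2])      # roughly 4.6729 and 1.7710, the first two latent roots
print(loadings)      # matches the Component loadings panel (up to sign)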

The Variance explained for each component is the eigenvalue for the factor. The first factor accounts for 58.4% of the variance; the second, 22.1%. The Total Variance is the sum of the diagonal elements of the correlation (or covariance) matrix. By summing the Percent of Total Variance Explained for the two factors (58.411 + 22.137 = 80.548), you can say that more than 80% of the variance of all eight variables is explained by the first two factors.

In the Rotated Loading Matrix, the rows of the display have been sorted, placing the loadings > 0.5 for factor 1 first, and so on. These are the coefficients of the factors after rotation, so notice that large values for the unrotated loadings are larger here and the small values are smaller. The sums of squares of these coefficients (for each factor or column) are printed below under the heading Variance Explained by Rotated Components. Together, the two rotated factors explain more than 80% of the variance. Factor analysis offers five types of rotation. Here, by default, the orthogonal varimax method is used.

"Variance" Explained by Rotated Components 1 2 3.4973 2.9465 Percent of Total Variance Explained 1 2 43.7165 36.8318

[Scree Plot: the eigenvalues plotted against the number of factors]

[Factor Loadings Plot: ARM_SPAN, FOREARM, HEIGHT, and LOWERLEG at the FACTOR(1) extreme; GIRTH, BITRO, WEIGHT, and WIDTH at the FACTOR(2) extreme]



To interpret each factor, look for variables with high loadings. The four variables that load highly on factor 1 can be said to measure “lankiness”; while the four that load highly on factor 2, “stockiness.” Other data sets may include variables that do not load highly on any specific factor.

In the factor scree plot, the eigenvalues are plotted against their order (or associated component). Use this display to identify large values that separate well from smaller eigenvalues. This can help to identify a useful number of factors to retain. Scree is the rubble at the bottom of a cliff; the large retained roots are the cliff, and the deleted ones are the rubble.

The points in the factor loadings plot are variables, and the coordinates are the rotated loadings. Look for clusters of loadings at the extremes of the factors. The four variables at the right of the plot load highly on factor 1 and all reflect length. The variables at the top of the plot load highly on factor 2 and reflect width.

Example 2 Maximum Likelihood

This example uses maximum likelihood for initial factor extraction and 2 as the number of factors. Other options remain as in the principal components example. The input is:

FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=MLA N=305 NUMBER=2 SORT ROTATE=VARIMAX


The output follows:

Initial Communality Estimates
        1        2        3        4        5
   0.8162   0.8493   0.8006   0.7884   0.7488
        6        7        8
   0.6041   0.5622   0.4778

Iterative Maximum Likelihood Factor Analysis: Convergence = 0.001000.

Iteration   Maximum Change in    Negative log of
Number      SQRT(uniqueness)     Likelihood
    1          0.722640            0.384050
    2          0.243793            0.273332
    3          0.051182            0.253671
    4          0.010359            0.253162
    5          0.000493            0.253162

Final Communality Estimates
        1        2        3        4        5
   0.8302   0.8929   0.8338   0.8006   0.9109
        6        7        8
   0.6363   0.5837   0.4633

Canonical Correlations
        1        2
   0.9823   0.9489

Factor pattern
                  1         2
HEIGHT       0.8797    0.2375
ARM_SPAN     0.8735    0.3604
LOWERLEG     0.8551    0.2633
FOREARM      0.8458    0.3442
WEIGHT       0.7048   -0.6436
BITRO        0.5887   -0.5383
WIDTH        0.5743   -0.3653
GIRTH        0.5265   -0.5536

Variance Explained by Factors
        1        2
   4.4337   1.5179

Percent of Total Variance Explained
        1        2
  55.4218  18.9742


The first panel of output contains the communality estimates. The communality of a variable is its theoretical squared multiple correlation with the factors extracted. For MLA (and IPA), the initial communality estimate for each variable is its observed squared multiple correlation with all the other variables.

Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)
                  1         2
ARM_SPAN     0.9262    0.1873
FOREARM      0.8942    0.1853
HEIGHT       0.8628    0.2928
LOWERLEG     0.8569    0.2576
WEIGHT       0.2268    0.9271
BITRO        0.1891    0.7750
GIRTH        0.1289    0.7530
WIDTH        0.2734    0.6233

"Variance" Explained by Rotated Factors
        1        2
   3.3146   2.6370

Percent of Total Variance Explained
        1        2
  41.4331  32.9628

Percent of Common Variance Explained
        1        2
  55.6927  44.3073

[Factor Loadings Plot of the rotated MLA loadings: ARM_SPAN, FOREARM, HEIGHT, and LOWERLEG on FACTOR(1); GIRTH, BITRO, WEIGHT, and WIDTH on FACTOR(2)]


The canonical correlations are the largest multiple correlations for successive orthogonal linear combinations of factors with successive orthogonal linear combinations of variables. These values are comfortably high. If, for other data, some of the factors have values that are much lower, you might want to request fewer factors.

The loadings and amount of variance explained are similar to those found in the principal components example. In addition, maximum likelihood reports the percentage of common variance explained. Common variance is the sum of the communalities. If A is the unrotated MLA factor pattern matrix, common variance is the trace of A'A.
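A quick check of that identity against the unrotated factor pattern above (our sketch, assuming NumPy):

import numpy as np

# Unrotated MLA factor pattern from the panel above (8 variables x 2 factors)
A = np.array([[0.8797, 0.2375], [0.8735, 0.3604], [0.8551, 0.2633],
              [0.8458, 0.3442], [0.7048, -0.6436], [0.5887, -0.5383],
              [0.5743, -0.3653], [0.5265, -0.5536]])

print(np.trace(A.T @ A))        # about 5.95 = 4.4337 + 1.5179, the
                                # variance explained by the two factors
print((A**2).sum(axis=1))       # row sums: the final communality estimates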

Number of Factors

In this example, we specified two factors to extract. If you were to omit this specification and rerun the example, SYSTAT adds this report to the output:

The Maximum Number of Factors for Your Data is 4

SYSTAT will also report this message if you request more than four factors for these data. This result is due to a theorem by Ledermann and indicates that the degrees of freedom allow estimates of loadings and communalities for only four factors.

If we set the print length to long, SYSTAT reports:

Chi-square Test that the Number of Factors is 4
CSQ = 4.3187   P = 0.1154   DF = 2.00

The results of this chi-square test indicate that you do not reject the hypothesis that there are four factors (p value > 0.05). Technically, the hypothesis is that “no more than four factors are required.” This, of course, does not negate 2 as the right number. For the YOUTH data, here are rotated loadings for four factors:


Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)
                  1         2         3         4
ARM_SPAN     0.9372    0.1984   -0.2831    0.0465
LOWERLEG     0.8860    0.2142    0.1878    0.1356
HEIGHT       0.8776    0.2819    0.1134   -0.0077
FOREARM      0.8732    0.1957   -0.0851   -0.0065
WEIGHT       0.2414    0.8830    0.1077    0.1080
BITRO        0.1823    0.8233    0.0163   -0.0784
GIRTH        0.1133    0.7315   -0.0048    0.5219
WIDTH        0.2597    0.6459   -0.1400    0.0819


The loadings for the last two factors do not make sense. Possibly, the fourth factor has one variable, GIRTH, but GIRTH still has a healthier loading on factor 2. This test is based on an assumption of multivariate normality (as is MLA itself); if that assumption does not hold, the test is invalid.

Example 3 Iterated Principal Axis

This example continues with the YOUTH data described in the principal components example, this time using the IPA (iterated principal axis) method to extract factors. The input is:

FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=IPA SORT ROTATE=VARIMAX

The output is:

Initial Communality Estimates
        1        2        3        4        5
   0.8162   0.8493   0.8006   0.7884   0.7488
        6        7        8
   0.6041   0.5622   0.4778

Iterative Principal Axis Factor Analysis: Convergence = 0.001000.

Iteration   Maximum Change in
Number      SQRT(communality)
    1          0.308775
    2          0.039358
    3          0.017077
    4          0.008751
    5          0.004934
    6          0.002923
    7          0.001776
    8          0.001093
    9          0.000677

Final Communality Estimates
        1        2        3        4        5
   0.8381   0.8887   0.8205   0.8077   0.8880
        6        7        8
   0.6403   0.5835   0.4921


Latent Roots (Eigenvalues)
        1        2        3        4        5
   4.4489   1.5100   0.1016   0.0551   0.0150
        6        7        8
  -0.0374  -0.0602  -0.0743

Factor pattern
                  1         2
HEIGHT       0.8561    0.3244
ARM_SPAN     0.8482    0.4114
LOWERLEG     0.8309    0.3424
FOREARM      0.8082    0.4090
WEIGHT       0.7500   -0.5706
BITRO        0.6307   -0.4924
WIDTH        0.6074   -0.3509
GIRTH        0.5688   -0.5098

Variance Explained by Factors
        1        2
   4.4489   1.5100

Percent of Total Variance Explained
        1        2
  55.6110  18.8753

Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)
                  1         2
ARM_SPAN     0.9203    0.2045
FOREARM      0.8874    0.1815
HEIGHT       0.8724    0.2775
LOWERLEG     0.8639    0.2478
WEIGHT       0.2334    0.9130
BITRO        0.1884    0.7777
GIRTH        0.1291    0.7529
WIDTH        0.2581    0.6523

"Variance" Explained by Rotated Factors
        1        2
   3.3150   2.6439

Percent of Total Variance Explained
        1        2
  41.4377  33.0485

Percent of Common Variance Explained
        1        2
  55.6314  44.3686

[Factor Loadings Plot of the rotated IPA loadings]


Before the first iteration, the communality of a variable is its squared multiple correlation with the remaining variables. At each iteration, communalities are re-estimated from the loadings matrix A, whose number of columns is the number of factors: each variable's communality is the sum of its squared loadings across the factors. Iterations continue until the largest change in any communality is less than that specified with Convergence. Replacing the diagonal of the correlation (or covariance) matrix with these final communality estimates and computing the eigenvalues yields the latent roots in the next panel.
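Here is a compact sketch of that loop (ours, not SYSTAT's code), using the same convergence rule on the square roots of the communalities that the iteration history above reports; run on the YOUTH correlation matrix with two factors, it should come close to the Final Communality Estimates panel:

import numpy as np

def iterated_principal_axis(R, n_factors, conv=0.001, max_iter=25):
    """Iterated principal axis extraction (a sketch, not SYSTAT's code)."""
    R = np.asarray(R, dtype=float)
    # Initial communalities: each variable's squared multiple correlation
    # with the remaining variables, 1 - 1/diag(R^-1)
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        Rh = R.copy()
        np.fill_diagonal(Rh, h2)              # "reduced" correlation matrix
        vals, vecs = np.linalg.eigh(Rh)
        idx = np.argsort(vals)[::-1][:n_factors]
        A = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))
        h2_new = (A**2).sum(axis=1)           # new communality estimates
        change = np.max(np.abs(np.sqrt(h2_new) - np.sqrt(h2)))
        h2 = h2_new
        if change < conv:                     # SQRT(communality) convergence
            break
    return A, h2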

Example 4 Rotation

Let’s compare the unrotated and orthogonally rotated loadings from the principal components example with those from an oblique rotation. The input is:

FACTOR
USE youth
PRINT = LONG
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT

MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX

MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=OBLIMIN



We focus on the output directly related to the rotations:

Component loadings
                  1         2
HEIGHT       0.8594    0.3723
ARM_SPAN     0.8416    0.4410
LOWERLEG     0.8396    0.3953
FOREARM      0.8131    0.4586
WEIGHT       0.7580   -0.5247
BITRO        0.6742   -0.5333
WIDTH        0.6706   -0.4185
GIRTH        0.6172   -0.5801

Variance Explained by Components
        1        2
   4.6729   1.7710

Percent of Total Variance Explained
        1        2
  58.4110  22.1373

Rotated Loading Matrix ( VARIMAX, Gamma = 1.0000)
                1        2
ARM_SPAN     0.9298   0.1955
FOREARM      0.9191   0.1638
HEIGHT       0.8998   0.2599
LOWERLEG     0.8992   0.2295
WEIGHT       0.2507   0.8871
BITRO        0.1806   0.8404
GIRTH        0.1068   0.8403
WIDTH        0.2509   0.7496

"Variance" Explained by Rotated Components
        1        2
   3.4973   2.9465

Percent of Total Variance Explained
        1        2
  43.7165  36.8318

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0)
                1        2
ARM_SPAN     0.9572  -0.0166
FOREARM      0.9533  -0.0482
LOWERLEG     0.9157   0.0276
HEIGHT       0.9090   0.0604
WEIGHT       0.0537   0.8975
GIRTH       -0.0904   0.8821
BITRO       -0.0107   0.8642
WIDTH        0.0876   0.7487


"Variance" Explained by Rotated Components 1 2 3.5273 2.9166 Percent of Total Variance Explained 1 2 44.0913 36.4569 Direct and Indirect Contributions of Factors To Variance 1 2 1 3.5087 2 0.0186 2.8979

Rotated Structure Matrix

                1        2
ARM_SPAN     0.9350   0.4523
FOREARM      0.9500   0.3962
LOWERLEG     0.9277   0.4225
HEIGHT       0.9325   0.3629
WEIGHT       0.4407   0.9206
GIRTH        0.3620   0.8596
BITRO        0.4104   0.7865
WIDTH        0.2900   0.8431

[Factor Loadings Plots: three panels of FACTOR(2) against FACTOR(1), labeled No rotation, Varimax, and Oblimin. In each panel GIRTH, BITRO, WEIGHT, and WIDTH cluster together, as do ARM_SPAN, FOREARM, HEIGHT, and LOWERLEG.]


The values in Direct and Indirect Contributions of Factors to Variance are useful for determining if part of a factor’s contribution to “Variance” Explained is due to its correlation with another factor. Notice that

3.5087 + 0.0186 = 3.5273

is the "Variance" Explained for factor 1, and

2.8979 + 0.0186 = 2.9165

is the "Variance" Explained for factor 2 (the difference from 2.9166 in the last digit is due to rounding).

Think of the values in the Rotated Structure Matrix as correlations of the variable with the factors. Here we see that the first four variables are highly correlated with the first factor. The remaining variables are highly correlated with the second factor.

The factor loading plots illustrate the effects of the rotation methods. While the unrotated factor loadings form two distinct clusters, they both have strong positive loadings for factor 1. The “lanky” variables have moderate positive loadings on factor 2 while the “stocky” variables have negative loadings on factor 2. With the varimax rotation, the “lanky” variables load highly on factor 1 with small loadings on factor 2; the “stocky” variables load highly on factor 2. The oblimin rotation does a much better job of centering each cluster at 0 on its minor factor.

Example 5 Factor Analysis Using a Covariance Matrix

Jackson (1991) describes a project in which the maximum thrust of ballistic missiles was measured. For a specific measure called "total impulse," it is necessary to calculate the area under a curve. Originally a planimeter was used to obtain the area; later an electronic device performed the integration directly, although unreliably in its early use. As data, two strain gauges were attached to each of 40 Nike rockets, and both types of measurements were recorded in parallel (making four measurements per rocket). The covariance matrix of the measures is stored in the MISSLES file.

In this example, we illustrate features associated with covariance matrix input (asymptotic 95% confidence limits for the eigenvalues, estimates of the population eigenvalues with standard errors, and latent vectors (eigenvectors or characteristic vectors) with standard errors).


The input is:

FACTOR
USE missles
MODEL integra1 planmtr1 integra2 planmtr2
PRINT = LONG
ESTIMATE / METHOD=PCA COVA N=40

The output is:

Latent Roots (Eigenvalues)
         1         2         3         4
  335.3355   48.0344   29.3305   16.4096

Empirical upper bound for the first Eigenvalue = 398.0000

Asymptotic 95% Confidence Limits for the Eigenvalues, N = 40

Upper Limits:
         1         2         3         4
  596.9599   85.5102   52.2138   29.2122

Lower Limits:
         1         2         3         4
  233.1534   33.3975   20.3930   11.4093

Unbiased Estimates of Population Eigenvalues
         1         2         3         4
  332.6990   46.9298   31.0859   18.3953

Unbiased Estimates of Standard Errors of Eigenvalues
         1         2         3         4
   74.9460   10.1768    5.7355    3.2528

Chi-Square Test that all Eigenvalues are Equal, N = 40
CSQ = 110.6871   P = 0.0000   df = 9.00

Latent Vectors (Eigenvectors)
                1        2        3        4
INTEGRA1     0.4681   0.6215   0.5716   0.2606
PLANMTR1     0.6079   0.1788  -0.7595   0.1473
INTEGRA2     0.4590  -0.1387   0.1677  -0.8614
PLANMTR2     0.4479  -0.7500   0.2615   0.4104

Standard Error for Each Eigenvector Element
                1        2        3        4
INTEGRA1     0.0532   0.1879   0.2106   0.1773
PLANMTR1     0.0412   0.2456   0.0758   0.2066
INTEGRA2     0.0342   0.1359   0.2366   0.0519
PLANMTR2     0.0561   0.1058   0.2633   0.1276


SYSTAT performs a test to determine whether all eigenvalues are equal. The null hypothesis is that all eigenvalues are equal; the alternative is that at least one root differs. The results here indicate that you reject the null hypothesis (p < 0.00005): at least one of the eigenvalues differs from the others.

Component loadings
                 1        2        3        4
INTEGRA1      8.5727   4.3072   3.0954   1.0559
PLANMTR1     11.1325   1.2389  -4.1131   0.5965
INTEGRA2      8.4051  -0.9616   0.9084  -3.4893
PLANMTR2      8.2017  -5.1983   1.4165   1.6625

Variance Explained by Components
         1         2         3         4
  335.3355   48.0344   29.3305   16.4096

Percent of Total Variance Explained
         1         2         3         4
   78.1467   11.1940    6.8352    3.8241

Differences: Original Minus Fitted Correlations or Covariances
           INTEGRA1  PLANMTR1  INTEGRA2  PLANMTR2
INTEGRA1     0.0000
PLANMTR1     0.0000    0.0000
INTEGRA2     0.0000    0.0000    0.0000
PLANMTR2     0.0000    0.0000    0.0000    0.0000

The size and sign of the loadings reflect how the factors and variables are related. The first factor has fairly similar loadings for all four variables. You can interpret this factor as an overall average of the area under the curve across the four measures.

[Scree Plot: eigenvalue against number of factors.]

[Factor Loadings Plot: SPLOM of the loadings on FACTOR(1) through FACTOR(4).]


The second factor represents gauge differences because the signs differ between the two gauges. The third factor is primarily a comparison between the first planimeter and the first integration device. The last factor has no simple interpretation.

When there are four or more factors, the Quick Graph of the loadings is a SPLOM. The first component represents 78% of the variability of the product, so plots of loadings for factors 2 through 4 convey little information (notice that values in the stripe displays along the diagonal concentrate around 0, while those for factor 1 fall to the right).

Example 6 Factor Analysis Using a Rectangular File

Begin this analysis from the OURWORLD cases-by-variables data file. Each case contains information for one of 57 countries. We will study the interrelations among a subset of 13 variables including economic measures (gross domestic product per capita and U.S. dollars spent per person on education, health, and the military), birth and death rates, population estimates for 1983, 1986, and 1990 plus predictions for 2020, and the percentages of the population who can read and who live in cities.

We request principal components extraction with an oblique rotation. As a first step, SYSTAT computes the correlation matrix. Correlations measure linear relations. However, plots of the economic measures and population values as recorded indicate a lack of linearity, so you use base 10 logarithms to transform six variables, and you use square roots to transform two others. The input is:

FACTOR
USE ourworld
LET (gdp_cap, gnp_86, pop_1983, pop_1986, pop_1990, pop_2020) = L10(@)
LET (mil, educ) = SQR(@)
MODEL urban birth_rt death_rt gdp_cap gnp_86 mil,
      educ b_to_d literacy pop_1983 pop_1986,
      pop_1990 pop_2020
PRINT = MEDIUM
SAVE pcascore / SCORES
ESTIMATE / METHOD=PCA SORT ROTATE=OBLIMIN


The output is:

Matrix to be factored
            URBAN  BIRTH_RT  DEATH_RT   GDP_CAP    GNP_86
URBAN      1.0000
BIRTH_RT  -0.8002    1.0000
DEATH_RT  -0.5126    0.5110    1.0000
GDP_CAP    0.7636   -0.9189   -0.4012    1.0000
GNP_86     0.7747   -0.8786   -0.4518    0.9736    1.0000
MIL        0.6453   -0.7547   -0.1482    0.8657    0.8514
EDUC       0.6238   -0.7528   -0.2151    0.8996    0.9207
B_TO_D    -0.3074    0.5106   -0.4340   -0.5293   -0.4411
LITERACY   0.7997   -0.9302   -0.6601    0.8337    0.8404
POP_1983   0.2133   -0.0836    0.0152    0.0583    0.0090
POP_1986   0.1898   -0.0523    0.0291    0.0248   -0.0215
POP_1990   0.1700   -0.0252    0.0284   -0.0015   -0.0447
POP_2020   0.0054    0.1880    0.0743   -0.2116   -0.2484

              MIL      EDUC    B_TO_D  LITERACY  POP_1983
MIL        1.0000
EDUC       0.8869    1.0000
B_TO_D    -0.6184   -0.5252    1.0000
LITERACY   0.6421    0.6869   -0.2737    1.0000
POP_1983   0.2206   -0.0062   -0.1526   -0.0050    1.0000
POP_1986   0.1942   -0.0306   -0.1358   -0.0327    0.9984
POP_1990   0.1727   -0.0513   -0.1070   -0.0534    0.9966
POP_2020  -0.0339   -0.2555    0.0617   -0.2360    0.9531

          POP_1986  POP_1990  POP_2020
POP_1986    1.0000
POP_1990    0.9992    1.0000
POP_2020    0.9605    0.9673    1.0000

Latent Roots (Eigenvalues)
        1        2        3        4        5
   6.3950   4.0165   1.6557   0.4327   0.2390
        6        7        8        9       10
   0.0966   0.0812   0.0403   0.0251   0.0110
       11       12       13
   0.0054   0.0012   0.0002

Empirical upper bound for the first Eigenvalue = 7.4817

Chi-Square Test that all Eigenvalues are Equal, N = 49
CSQ = 1542.2903   P = 0.0000   df = 78.00

Chi-Square Test that the Last 10 Eigenvalues Are Equal
CSQ = 636.4350   P = 0.0000   df = 59.89


Component loadings
                1        2        3
GDP_CAP      0.9769  -0.0366  -0.0606
GNP_86       0.9703  -0.0846   0.0040
BIRTH_RT    -0.9512   0.0136  -0.0774
LITERACY     0.8972  -0.1008   0.3004
EDUC         0.8927  -0.0857  -0.2296
MIL          0.8770   0.1501  -0.2909
URBAN        0.8393   0.1425   0.2300
B_TO_D      -0.5166  -0.1225   0.7762
POP_1990     0.0382   0.9972   0.0394
POP_1986     0.0636   0.9966   0.0253
POP_1983     0.0945   0.9940   0.0248
POP_2020    -0.1796   0.9748   0.1002
DEATH_RT    -0.4533   0.0820  -0.8662

Variance Explained by Components
        1        2        3
   6.3950   4.0165   1.6557

Percent of Total Variance Explained
        1        2        3
  49.1924  30.8964  12.7361

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0)
                1        2        3
GDP_CAP      0.9779  -0.0399   0.0523
GNP_86       0.9714  -0.0816  -0.0146
BIRTH_RT    -0.9506   0.0040   0.0843
EDUC         0.8961  -0.1049   0.2194
LITERACY     0.8956  -0.0700  -0.3112
MIL          0.8777   0.1242   0.2924
URBAN        0.8349   0.1658  -0.2285
B_TO_D      -0.5224  -0.0501  -0.7787
POP_1990     0.0236   0.9977   0.0095
POP_1986     0.0491   0.9958   0.0234
POP_1983     0.0801   0.9932   0.0235
POP_2020    -0.1945   0.9805  -0.0510
DEATH_RT    -0.4459  -0.0011   0.8730

"Variance" Explained by Rotated Components 1 2 3 6.3946 4.0057 1.6669 Percent of Total Variance Explained 1 2 3 49.1895 30.8129 12.8225

Correlations among Oblique Factors or Components
            1        2        3
1      1.0000
2      0.0127   1.0000
3     -0.0020   0.0452   1.0000


By default, SYSTAT extracts three factors because three eigenvalues are greater than 1.0. On factor 1, seven or eight variables have high loadings. The eighth, B_TO_D (a ratio of birth to death rate), has a higher loading on factor 3. With the exception of BIRTH_RT, the other variables are economic measures, so let's identify this as the "economic" factor. Clearly, the second factor can be named "population," and the third, less clearly, "death rates."

The economic and population factors account for 80% (49.19 + 30.81) of the total variance, so a plot of the scores for these factors should be useful for characterizing differences among the countries. The third factor accounts for 13% of the total variance, a much smaller amount than the other two factors. Notice, too, that only 7% of the total variance is not accounted for by these three factors.

Revisiting the Correlation Matrix

Let's examine the correlation matrix for these variables. In an effort to group the variables contributing to each factor, we order the variables according to their factor loadings for the factor on which they load the highest. The input is:

CORR
USE ourworld
LET (gdp_cap, gnp_86, pop_1983, pop_1986, pop_1990, pop_2020) = L10(@)
LET (mil, educ) = SQR(@)
PEARSON gdp_cap gnp_86 birth_rt educ literacy mil urban,
        pop_1990 pop_1986 pop_1983 pop_2020 b_to_d death_rt

[Factor Loadings Plot: loadings of the 13 variables on the three factors.]


The resulting matrix is:

Pearson correlation matrix
           GDP_CAP   GNP_86 BIRTH_RT     EDUC LITERACY      MIL    URBAN
GDP_CAP      1.000
GNP_86       0.974    1.000
BIRTH_RT    -0.919   -0.879    1.000
EDUC         0.900    0.921   -0.753    1.000
LITERACY     0.834    0.840   -0.930    0.687    1.000
MIL          0.866    0.851   -0.755    0.887    0.642    1.000
URBAN        0.764    0.775   -0.800    0.624    0.800    0.645    1.000
-------------------------------------------------------------------------
POP_1990    -0.002   -0.045   -0.025   -0.051   -0.053    0.173    0.170
POP_1986     0.025   -0.021   -0.052   -0.031   -0.033    0.194    0.190
POP_1983     0.058    0.009   -0.084   -0.006   -0.005    0.221    0.213
POP_2020    -0.212   -0.248    0.188   -0.255   -0.236   -0.034    0.005
B_TO_D      -0.529   -0.441    0.511   -0.525   -0.274   -0.618   -0.307
DEATH_RT    -0.401   -0.452    0.511   -0.215   -0.660   -0.148   -0.513

          POP_1990 POP_1986 POP_1983 POP_2020   B_TO_D DEATH_RT
POP_1990     1.000
POP_1986     0.999    1.000
POP_1983     0.997    0.998    1.000
POP_2020     0.967    0.960    0.953    1.000
--------------------------------------------------------------
B_TO_D      -0.107   -0.136   -0.153    0.062    1.000
DEATH_RT     0.028    0.029    0.015    0.074   -0.434    1.000

(The dotted lines were inserted with an editor to group the variables.) The top triangle of the matrix shows the correlations of the variables within the "economic" factor. BIRTH_RT has strong negative correlations with the other variables. Correlations of the population variables with the economic variables are displayed in the four rows below this top portion, and correlations of the death-rate variables with the economic variables are in the next two rows. Correlations within the population factor are displayed in the top triangle of the bottom panel. The correlation between the variables in factor 3 (B_TO_D and DEATH_RT) is -0.434, smaller in absolute value than any of the other within-factor correlations.

Factor Scores

Look at the scores just stored in PCASCORE. First, merge the name of each country and the grouping variable GROUP$ with the scores. The values of GROUP$ identify each country as Europe, Islamic, or New World. Next, plot factor 2 against factor 1 (labeling points with country names) and factor 3 against factor 1 (labeling points with the first letter of their group membership). Finally, use SPLOMs to display the scores, adding 75% confidence ellipses for each subgroup in the plots and normal curves for the univariate distributions. Repeat the latter using kernel density estimators.



The input is:

MERGE "C:\SYSTAT\PCASCORE.SYD" (FACTOR(1) FACTOR(2) FACTOR(3)),
      "C:\SYSTAT\DATA\OURWORLD.SYD" (GROUP$ COUNTRY$)
PLOT FACTOR(2)*FACTOR(1) / XLABEL='Economic',
     YLABEL='Population' SYMBOL=4,2,3,
     SIZE=1.250 LABEL=COUNTRY$ CSIZE=1.250
PLOT FACTOR(3)*FACTOR(1) / XLABEL='Economic',
     YLABEL='Death Rate' COLOR=2,1,10,
     SYMBOL=GROUP$ SIZE=1.250,1.250,1.250
SPLOM FACTOR(1) FACTOR(2) FACTOR(3) / GROUP=GROUP$ OVERLAY,
      DENSITY=NORMAL ELL=0.750,
      COLOR=2,1,10 SYMBOL=4,2,3 DASH=1,1,4
SPLOM FACTOR(1) FACTOR(2) FACTOR(3) / GROUP=GROUP$ OVERLAY,
      DENSITY=KERNEL COLOR=2,1,10,
      SYMBOL=4,2,3 DASH=1,1,4

The output is:

[Plot of Population factor scores against Economic factor scores, with points labeled by country (from Mali, Gambia, Bangladesh, and Ethiopia at the low end of the economic scale to Sweden, Norway, and Canada at the high end). Plot of Death Rate factor scores against Economic factor scores, with points labeled E (Europe), I (Islamic), or N (New World).]

[SPLOMs of FACTOR(1), FACTOR(2), and FACTOR(3) scores grouped by GROUP$ (Europe, Islamic, NewWorld): one with normal curves and 75% confidence ellipses, one with kernel density estimates and contours.]


High scores on the "economic" factor identify countries that are strong economically (Germany, Canada, Netherlands, Sweden, Switzerland, Denmark, and Norway) relative to those with low scores (Bangladesh, Ethiopia, Mali, and Gambia). Not surprisingly, the population factor identifies Barbados as the smallest and Bangladesh, Pakistan, and Brazil as the largest. The questionable third factor (death rate) does help to separate the New World countries from the others.

In each SPLOM, the dashed lines marking curves, ellipses, and kernel contours identify New World countries. The kernel contours in the plot of factor 3 against factor 1 identify a pocket of Islamic countries within the New World group.

Computation

Algorithms

Provisional methods are used for computing covariance or correlation matrices (see Correlations for references). Components are computed by using a Householder tridiagonalization and implicit QL iterations. Rotations are computed with a variant of Kaiser’s iterative algorithm, described in Mulaik (1972).

Missing Data

Ordinarily, Factor Analysis and other multivariate procedures delete all cases having missing values on any variable selected for analysis. This is listwise deletion. For data with many missing values, you may end up with too few complete cases for analysis. Select Pairwise deletion if you want covariances or correlations computed separately for each pair of variables selected for analysis. Pairwise deletion takes more time than the standard listwise deletion because all possible pairs of variances and covariances are computed. The same option is offered for Correlations, should you decide to create a symmetric matrix for use in factor analysis that way. Also notice that Correlation provides an EM algorithm for estimating correlation or covariance matrices when data are missing.

Be careful. When you use pairwise deletion, you can end up with negative eigenvalues for principal components or be unable to compute common factors at all. With either method, it is desirable that the pattern of missing data be random. Otherwise, the factor structure you compute will be influenced systematically by the pattern of how values are missing.


References

Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.: Lifetime Learning Publications.

Clarkson, D. B. and Jennrich, R. I. (1988). Quartic rotation criteria and algorithms, Psychometrika, 53, 251–259.

Dixon, W. J. et al. (1985). BMDP statistical software manual. Berkeley: University of California Press.

Gnanadesikan, R. (1977). Methods for statistical data analysis of multivariate observations. New York: John Wiley & Sons, Inc.

Harman, H. H. (1976). Modern factor analysis, 3rd ed. Chicago: University of Chicago Press.

Jackson, J. E. (1991). A user’s guide to principal components. New York: John Wiley & Sons, Inc.

Jennrich, R. I. and Robinson, S. M. (1969). A Newton-Raphson algorithm for maximum likelihood factor analysis. Psychometrika, 34, 111–123.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.

Morrison, D. F. (1976). Multivariate statistical methods, 2nd ed. New York: McGraw-Hill.

Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.

Rozeboom, W. W. (1982). The determinacy of common factors in large item domains. Psychometrika, 47, 281–295.

Steiger, J. H. (1979). Factor indeterminacy in the 1930's and 1970's: Some interesting parallels. Psychometrika, 44, 157–167.


Chapter 13

Linear Models

Each chapter in this manual normally has its own statistical background section. In this part, however, Regression, ANOVA, and General Linear Models are grouped together. There are two reasons for doing this. First, while some introductory textbooks treat regression and analysis of variance as distinct, statisticians know that they are based on the same underlying mathematical model. When you study what these procedures do, therefore, it is helpful to understand that model and learn the common terminology underlying each method. Second, although SYSTAT has three commands (REGRESS, ANOVA, and GLM) and menu settings, it is a not-so-well-guarded secret that these all lead to the same program, originally called MGLH (for Multivariate General Linear Hypothesis). Having them organized this way means that SYSTAT can use tools designed for one approach (for example, dummy variables in ANOVA) in another (such as computing within-group correlations in multivariate regression). This synergy is not usually available in packages that treat these models independently.

Simple Linear Models

Linear models are models based on lines. More generally, they are based on linear surfaces, such as lines, planes, and hyperplanes. Linear models are widely applied because lines and planes often appear to describe well the relations among variables measured in the real world. We will begin by examining the equation for a straight line, and then move to more complex linear models.


Equation for a Line

A linear model looks like this:

    y = a + bx

This is the equation for a straight line that you learned in school. The quantities in this equation are:

y    a dependent variable
x    an independent variable

Variables are quantities that can vary (have different numerical values) in the same equation. The remaining quantities are called parameters. A parameter is a quantity that is constant in a particular equation, but that can be varied to produce other equations in the same general family. The parameters are:

a    The value of y when x is 0. This is sometimes called a y-intercept (where a line intersects the y axis in a graph when x is 0).
b    The slope of the line, or the number of units y changes when x changes by one unit.

Let's look at an example. Here are some data showing the yearly earnings a partner should theoretically get in a certain large law firm, based on annual personal billings over quota (both in thousands of dollars):

EARNINGS   BILLINGS
    60         20
    70         40
    80         60
    90         80
   100        100
   120        140
   140        180
   150        200
   175        250
   190        280


We can plot these data with EARNINGS on the vertical axis (dependent variable) and BILLINGS on the horizontal (independent variable). Notice in the following figure that all the points lie on a straight line.

What is the equation for this line? Look at the vertical axis value on the sloped line where the independent variable has a value of 0. Its value is 50. A lawyer is paid $50,000 even when billing nothing. Thus, a is 50 in our equation. What is b? Notice that the line rises by $10,000 when billings change by $20,000. The line rises half as fast as it runs. You can also look at the data and see that the earnings change by $1 as billing changes by $2. Thus, b is 0.5, or a half, in our equation.
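Using the first and last rows of the table to make the arithmetic explicit:

    b = (190 - 60) / (280 - 20) = 130 / 260 = 0.5
    a = 60 - 0.5 × 20 = 50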

Why bother with all these calculations? We could use the table to determine a lawyer’s compensation, but the formula and the line graph allow us to determine wages not found in the table. For example, we now know that $30,000 in billings would yield earnings of $65,000:

    EARNINGS = 50,000 + 0.5 × 30,000 = 65,000

When we do this, however, we must be sure that we can use the same equation on these new values. We must be careful when interpolating, or estimating, wages for billings between the ones we have been given. Does it make sense to compute earnings for $25,000 in billings, for example? It probably does. Similarly, we must be careful when extrapolating, or estimating from units outside the domain of values we have been given. What about negative billings, for example? Would we want to pay an embezzler? Be careful. Equations and graphs usually are meaningful only within or close to the range of y values and domain of x values in the data.


Regression

Data are seldom this clean unless we design them to be that way. Law firms typically fine tune their partners’ earnings according to many factors. Here are the real billings and earnings for our law firm (these lawyers predate Reagan, Bush, Clinton, and Gates):

EARNINGS   BILLINGS
    86         20
    67         40
    95         60
   105         80
    86        100
    82        140
   140        180
   145        200
   144        250
   184        280

Our techniques for computing a linear equation won't work with these data. Look at the following graph. There is no way to draw a straight line through all the data.

Given the irregularities in our data, the line drawn in the figure is a compromise. How do we find a best-fitting line? If we are interested in predicting earnings from the billing data values rather well, a reasonable method would be to place a line through the points so that the vertical deviations between the points and the line (errors in predicting earnings) are as small as possible.


In other words, these deviations (absolute discrepancies, or residuals) should be small, on average, for a good-fitting line.

The procedure of fitting a line or curve to data such that residuals on the dependent variable are minimized in some way is called regression. Because we are minimizing vertical deviations, the regression line often appears to be more horizontal than we might place it by eye, especially when the points are fairly scattered. It “regresses” toward the mean value of y across all the values of x, namely, a horizontal line through the middle of all the points. The regression line is not intended to pass through as many points as possible. It is for predicting the dependent variable as accurately as possible, given each value of the independent variable.

Least Squares

There are several ways to draw the line so that, on average, the deviations are small. We could minimize the mean, the median, or some other measure of the typical behavior of the absolute values of the residuals. Or we can minimize the sum (or mean) of the squared residuals, which yields almost the same line in most cases. Using squared instead of absolute residuals gives more influence to points whose y value is farther from the average of all y values. This is not always desirable, but it makes the mathematics simpler. This method is called ordinary least squares.

By specifying EARNINGS as the dependent variable and BILLINGS as the independent variable in a MODEL statement, we can compute the ordinary least-squares regression y-intercept as $62,800 and the slope as 0.375. These values do not predict any single lawyer’s earnings exactly. They describe the whole firm well, in the sense that, on the average, the line predicts a given earnings value fairly closely from a given billings value.

Estimation and Inference

We often want to do more with such data than draw a line on a picture. In order to generalize, formulate a policy, or test a hypothesis, we need to make an inference. Making an inference implies that we think a model describes a more general population from which our data have been randomly sampled. In the present example, this population is all possible lawyers who might work for this firm. To make an inference about compensation, we need to construct a linear model for our population that includes a parameter for random error. In addition, we need to change our notation to avoid confusion later. We are going to use Greek to denote parameters and italic Roman letters for variables. The error parameter is usually called ε.

    y = α + βx + ε

Notice that ε is a random variable. It varies like any other variable (for example, x), but it varies randomly, like the tossing of a coin. Since ε is random, our model forces y to be random as well because adding fixed values (α and βx) to a random variable produces another random variable. In ordinary language, we are saying with our model that earnings are only partly predictable from billings. They vary slightly according to many other factors, which we assume are random.

We do not know all of the factors governing the firm’s compensation decisions, but we assume:

- All the salaries are derived from the same linear model.
- The error in predicting a particular salary from billings using the model is independent of (not in any way predictable from) the error in predicting other salaries.
- The errors in predicting all the salaries come from the same random distribution.

Our model for predicting in our population contains parameters, but unlike our perfect straight line example, we cannot compute these parameters directly from the data. The data we have are only a small sample from a much larger population, so we can only estimate the parameter values using some statistical method on our sample data. Those of you who have heard this story before may not be surprised that ordinary least squares is one reasonable method for estimating parameters when our three assumptions are appropriate. Without going into all the details, we can be reasonably assured that if our population assumptions are true and if we randomly sample some cases (that is, each case has an equal chance of being picked) from the population, the least-squares estimates of α and β will, on average, be close to their values in the population.

So far, we have done what seems like a sleight of hand. We delved into some abstruse language and came up with the same least-squares values for the slope and intercept as before. There is something new, however. We have now added conditions that define our least-squares values as sample estimates of population values. We now regard our sample data as one instance of many possible samples. Our compensation model is like Plato’s cave metaphor; we think it typifies how this law firm makes compensation decisions about any lawyer, not just the ones we sampled. Before, we were computing descriptive statistics about a sample. Now, we are computing inferential statistics about a population.


Page 391: Statistics I

I-371

Linear Models

Standard Errors

There are several statistics relevant to the estimation of α and β. Perhaps most important is a measure of how variable we could expect our estimates to be if we continued to sample data from our population and used least squares to get our estimates. A statistic calculated by SYSTAT shows what we could expect this variation to be. It is called, appropriately, the standard error of estimate, or Std Error in the output. The standard error of the y-intercept, or regression constant, is in the first row of the coefficients: 10.440. The standard error of the billing coefficient, or slope, is 0.065. Look for these numbers in the following output:

Dep Var: EARNINGS   N: 10   Multiple R: 0.897   Squared multiple R: 0.804
Adjusted squared multiple R: 0.779   Standard error of estimate: 17.626

Effect       Coefficient   Std Error   Std Coef   Tolerance       t   P(2 Tail)
CONSTANT       62.838       10.440       0.0          .        6.019     0.000
BILLINGS        0.375        0.065       0.897      1.000      5.728     0.000

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio      P
Regression     10191.109       1    10191.109     32.805    0.000
Residual        2485.291       8      310.661

Hypothesis Testing

From these standard errors, we can construct hypothesis tests on these coefficients. Suppose a skeptic approached us and said, "Your estimates look as if something is going on here, but in this firm, salaries have nothing to do with billings. You just happened to pick a sample that gives the impression that billings matter. It was the luck of the draw that provided you with such a misleading picture. In reality, β is 0 in the population because billings play no role in determining earnings."

We can reply, "If salaries had nothing to do with billings but are really just a mean value plus random error for any billing level, then would it be likely for us to find a coefficient estimate for β at least this different from 0 in a sample of 10 lawyers?"

To represent these alternatives as a bet between us and the skeptic, we must agree on some critical level for deciding who will win the bet. If the likelihood of a sample result at least this extreme occurring by chance is less than or equal to this critical level (say, five times out of a hundred), we win; otherwise, the skeptic wins.

This logic might seem odd at first because, in almost every case, our skeptic's null hypothesis would appear ridiculous, and our alternative hypothesis (that the skeptic is wrong) seems plausible. Two scenarios are relevant here, however.



The first is the lawyer's. We are trying to make a case here. The only way we will prevail is if we convince our skeptical jury beyond a reasonable doubt. In statistical practice, that reasonable doubt level is relatively liberal: fewer than five times in a hundred. The second scenario is the scientist's. We are going to stake our reputation on our model. If someone sampled new data and failed to find nonzero coefficients, much less coefficients similar to ours, few would pay attention to us in the future.

To compute probabilities, we must count all possibilities or refer to a mathematical probability distribution that approximates these possibilities well. The most widely used approximation is the normal curve, which we reviewed briefly in Chapter 1. For large samples, the regression coefficients will tend to be normally distributed under the assumptions we made above. To allow for smaller samples, however, we will add the following condition to our list of assumptions:

- The errors in predicting the salaries come from a normal distribution.

If we estimate the standard errors of the regression coefficients from the data instead of knowing them in advance, then we should use the t distribution instead of the normal. The two-tail value for the probability represents the area under the theoretical t probability curve corresponding to coefficient estimates whose absolute values are more extreme than the ones we obtained. For both parameters in the model of lawyers’ earnings, these values (given as P(2 tail)) are less than 0.001, leading us to reject our null hypothesis at well below the 0.05 level.

At the bottom of our output, we get an analysis of variance table that tests the goodness of fit of our entire model. The null hypothesis corresponding to the F ratio (32.805) and its associated p value is that the billing variable coefficient is equal to 0. This test overwhelmingly rejects the null hypothesis that both α and β are 0.

Multiple Correlation

In the same output is a statistic called the squared multiple correlation. This is the proportion of the total variation in the dependent variable (EARNINGS) accounted for by the linear prediction using BILLINGS. The value here (0.804) tells us that approximately 80% of the variation in earnings can be accounted for by a linear prediction from billings. The rest of the variation, as far as this model is concerned, is random error. The square root of this statistic is called, not surprisingly, the multiple correlation. The adjusted squared multiple correlation (0.779) is what we would expect the squared multiple correlation to be if we used the model we just estimated on a new sample of 10 lawyers in the firm. It is smaller than the squared multiple correlation because the coefficients were optimized for this sample rather than for the new one.
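This value can be verified from the analysis of variance table shown earlier, as the ratio of the regression sum of squares to the total sum of squares:

    R² = 10191.109 / (10191.109 + 2485.291) = 0.804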



Regression Diagnostics

We do not need to understand the mathematics of how a line is fitted in order to use regression. You can fit a line to any x-y data by the method of least squares. The computer doesn’t care where the numbers come from. To have a model and estimates that mean something, however, you should be sure the assumptions are reasonable and that the sample data appear to be sampled from a population that meets the assumptions.

The sample analogues of the errors in the population model are the residuals—the differences between the observed and predicted values of the dependent variable. There are many diagnostics you can perform on the residuals. Here are the most important ones:

The errors are normally distributed. Draw a normal probability plot (PPLOT) of the residuals.

[Normal probability plot: RESIDUAL against expected value for normal distribution.]

The residuals should fall approximately on a diagonal straight line in this plot. When the sample size is small, as in our law example, the line may be quite jagged. It is difficult to tell by any method whether a small sample is from a normal population. You can also plot a histogram or stem-and-leaf diagram of the residuals to see if they are lumpy in the middle with thin, symmetric tails.

The errors have constant variance. Plot the residuals against the estimated values. The following plot shows studentized residuals (STUDENT) against estimated values (ESTIMATE). Studentized residuals are the true "external" kind discussed in Velleman and Welsch (1981). Use these statistics to identify outliers in the dependent variable space. Under normal regression assumptions, they have a t distribution with (N – p – 1) degrees of freedom, where N is the total sample size and p is the number of predictors (including the constant). Large values (greater than 2 or 3 in absolute magnitude) indicate possible problems.

[Plot of studentized residuals (STUDENT) against estimated values (ESTIMATE).]




Our residuals should be arranged in a horizontal band within two or three units around 0 in this plot. Again, since there are so few observations, it is difficult to tell whether they violate this assumption in this case. There is only one particularly large residual, and it is toward the middle of the values. This lawyer billed $140,000 and is earning only $80,000. He or she might have a gripe about supporting a higher share of the firm’s overhead.

The errors are independent. Several plots can be done. Look at the plot of residuals against estimated values above. Make sure that the residuals are randomly scattered above and below the 0 horizontal and that they do not track in a snaky way across the plot. If they look as if they were shot at the plot by a horizontally moving machine gun, then they are probably not independent of each other. You may also want to plot residuals against other variables, such as time, orientation, or other ways that might influence the variability of your dependent measure. ACF PLOT in SERIES measures whether the residuals are serially correlated. Here is an autocorrelation plot:

[Autocorrelation plot of the residuals, with confidence bands.]



All the bars should be within the confidence bands if each residual is not predictable from the one preceding it, and the one preceding that, and the one preceding that, and so on.

All the members of the population are described by the same linear model. Plot Cook's distance (COOK) against the estimated values.

[Plot of Cook's distance (COOK) against estimated values (ESTIMATE).]

Cook's distance measures the influence of each sample observation on the coefficient estimates. Observations that are far from the average of all the independent variable values or that have large residuals tend to have a large Cook's distance value (say, greater than 2). Cook's D actually follows closely an F distribution, so aberrant values depend on the sample size. As a rule of thumb, under the normal regression assumptions, COOK can be compared to an F distribution with p and N – p degrees of freedom. We don't want to find a large Cook's D value for an observation because it would mean that the coefficient estimates would change substantially if we deleted that observation.



While none of the COOK values are extremely large in our example, could it be that the largest one in the upper right corner is the founding partner in the firm? Despite large billings, this partner is earning more than the model predicts.

Another diagnostic statistic useful for assessing the model fit is leverage, discussed in Belsley, Kuh, and Welsch (1980) and Velleman and Welsch (1981). Leverage helps to identify outliers in the independent variable space. Leverage has an average value of p/N, where p is the number of estimated parameters (including the constant) and N is the number of cases. What is a high value of leverage? In practice, it is useful to examine the values in a stem-and-leaf plot and identify those that stand apart from the rest of the sample. However, various rules of thumb have been suggested. For example, values of leverage less than 0.2 appear to be safe; between 0.2 and 0.5, risky; and above 0.5, to be avoided. Another says that if p > 6 and (N – p) > 12, use 3p/N as a cutoff. SYSTAT uses an F approximation to determine this value for warnings (Belsley, Kuh, and Welsch, 1980).
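In the simple earnings model, for example, p = 2 (the constant and the BILLINGS coefficient) and N = 10, so the average leverage is:

    p/N = 2/10 = 0.2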

In conclusion, keep in mind that all our diagnostic tests are themselves a form of inference. We can assess theoretical errors only through the dark mirror of our observed residuals. Despite this caveat, testing assumptions graphically is critically important. You should never publish regression results until you have examined these plots.
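Commands along the following lines would produce the plots discussed in this section. This is only a sketch: it assumes the residuals and diagnostics have already been saved to a file (hypothetically named RESID here) containing variables named RESIDUAL, ESTIMATE, STUDENT, and COOK, the names that appear on the plot axes above; the saving mechanics are covered with the Linear Regression procedure.

USE resid
PPLOT residual
PLOT student*estimate
PLOT cook*estimate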

Multiple Regression

A multiple linear model has more than one independent variable; that is:

    y = a + bx + cz

This is the equation for a plane in three-dimensional space. The parameter a is still an intercept term. It is the value of y when x and z are 0. The parameters b and c are still slopes. One gives the slope of the plane along the x dimension; the other, along the z dimension.

The statistical model has the same form:

    y = α + βx + γz + ε


Before we run out of letters for independent variables, let's switch to a more frequently used notation:

    y = β0 + β1x1 + β2x2 + ε

Notice that we are still using Greek letters for unobservables and Roman letters for observables.

Now, let's look at our law firm data again. We have learned that there is another variable that appears to determine earnings: the number of hours billed per year by each lawyer. Here is an expanded listing of the data:

EARNINGS   BILLINGS   HOURS
    86         20      1771
    67         40      1556
    95         60      1749
   105         80      1754
    86        100      1594
    82        140      1400
   140        180      1780
   145        200      1737
   144        250      1645
   184        280      1863

For our model, β1 is the coefficient for BILLINGS, and β2 is the coefficient for HOURS. Let's look first at its graphical representation. The following figure shows the plane fit by least squares to the points representing each lawyer. Notice how the plane slopes upward on both variables. BILLINGS and HOURS both contribute positively to EARNINGS in our sample.
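The commands to fit this two-predictor model parallel the earlier ones. A sketch, assuming the variables above are stored in a data file (the name LAWDATA here is hypothetical, since the EARNBILL file shown later does not include HOURS):

REGRESS
USE lawdata
MODEL earnings = CONSTANT + billings + hours
ESTIMATE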


Fitting this model involves no more work than fitting the simple regression model. We specify one dependent and two independent variables and estimate the model as before. Here is the result:

Dep Var: EARNINGS   N: 10   Multiple R: 0.998   Squared multiple R: 0.996
Adjusted squared multiple R: 0.995   Standard error of estimate: 2.678

Variable    Coefficient   Std Error   Std Coef   Tolerance         t   P(2 tail)
CONSTANT      -139.925      11.116      0.000        .         -12.588    0.000
BILLINGS         0.333       0.010      0.797     0.9510698     32.690    0.000
HOURS            0.124       0.007      0.449     0.9510698     18.429    0.000

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square    F-ratio      P
Regression     12626.210       2     6313.105     880.493    0.000
Residual          50.190       7        7.170

This time, we have one more row in our regression table, for HOURS. Notice that its coefficient (0.124) is smaller than that for BILLINGS (0.333). This is due partly to the different scales of the variables: HOURS are measured in larger numbers than BILLINGS. If we wish to compare the influence of each independent variable free of the effects of scale, we should look at the standardized coefficients. Here, we still see that BILLINGS (0.797) plays a greater role in predicting EARNINGS than does HOURS (0.449). Notice also that both coefficients are highly significant and that our overall model is highly significant, as shown in the analysis of variance table.



Variable Selection

In applications, you may not know which subset of predictor variables in a larger set constitute a “good” model. Strategies for identifying a good subset are many and varied: forward selection, backward elimination, stepwise (either a forward or backward type), and all subsets. Forward selection begins with the “best” predictor, adds the next “best” and continues entering variables to improve the fit. Backward selection begins with all candidate predictors in an equation and removes the least useful one at a time as long as the fit is not substantially “worsened.” Stepwise begins as either forward or backward, but allows “poor” predictors to be removed from the candidate model or “good” predictors to re-enter the model at any step. Finally, all subsets methods compute all possible subsets of predictors for each model of a given size (number of predictors) and choose the “best” one.

Bias and variance tradeoff. Submodel selection is a tradeoff between bias and variance. Decreasing the number of parameters in the model enhances its predictive capability, because the variance of the parameter estimates decreases. On the other hand, bias may increase because the "true" model may have a higher dimension. So we would like to balance smaller variance against increased bias. There are two aspects to variable selection: selecting the dimensionality of the submodel (how many variables to include) and evaluating the model selected. After you determine the dimension, there may be several alternative subsets that perform equally well. Then, knowledge of the subject matter, how accurately individual variables are measured, and what a variable "communicates" may guide selection of the model to report.

A strategy. If you are in an exploratory phase of research, you might try this version of backwards stepping. First, fit a model using all candidate predictors. Then identify the least “useful” variable, remove it from the model list, and fit a smaller model. Evaluate your results and select another variable to remove. Continue removing variables. For a given size model, you may want to remove alternative variables (that is, first remove variable A, evaluate results, replace A and remove B, etc.).

Entry and removal criteria. Decisions about which variable to enter or remove should be based on statistics and diagnostics in the output, especially graphical displays of these values, and your knowledge of the problem at hand.

You can specify your own alpha-to-enter and alpha-to-remove values (do not make alpha-to-remove less than alpha-to-enter), or you can cycle variables in and out of the equation (stepping automatically stops if this happens). The default values for these options are Enter = 0.15 and Remove = 0.15.


These values are appropriate for predictor variables that are relatively independent. If your predictor variables are highly correlated, you should consider lowering the Enter and Remove values well below 0.05.

When there are high correlations among the independent variables, the estimates of the regression coefficients can become unstable. Tolerance is a measure of this condition. It is 1 – R², that is, one minus the squared multiple correlation between a predictor and the other predictors included in the model. (Note that the dependent variable is not used.) By setting a minimum tolerance value, variables highly correlated with others already in the model are not allowed to enter.
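For example, in the two-predictor earnings model fit earlier, the tolerance printed for BILLINGS is 0.951, so the squared multiple correlation between BILLINGS and the other predictor (HOURS) is:

    1 – 0.951 = 0.049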

As a rough guideline, consider models that include only variables that have absolute t values well above 2.0 and “tolerance” values greater than 0.1. (We use quotation marks here because t and other statistics do not have their usual distributions when you are selecting subset models.)

Evaluation criteria. There is no one test to identify the dimensionality of the best submodel. Recent research by Leo Breiman emphasizes the usefulness of cross-validation techniques involving 80% random subsamples. Sample 80% of your file, fit a model, use the resulting coefficients on the remaining 20% to obtain predicted values, and then compute R² for this smaller sample. In over-fitting situations, the discrepancy between the R² for the 80% sample and the R² for the 20% sample can be dramatic.

A warning. If you do not have extensive knowledge of your variables and expect this strategy to help you to find a “true” model, you can get into a lot of trouble. Automatic stepwise regression programs cannot do your work for you. You must be able to examine graphics and make intelligent choices based on theory and prior knowledge; otherwise, you will be arriving at nonsense.

Moreover, if you are thinking of testing hypotheses after automatically fitting a subset model, don’t bother. Stepwise regression programs are the most notorious source of “pseudo” p values in the field of automated data analysis. Statisticians seem to be the only ones who know these are not “real” p values. The automatic stepwise option is provided to select a subset model for prediction purposes. It should never be used without cross-validation.

If you still want some sort of confidence estimate on your subset model, you might look at tables in Wilkinson (1979), Rencher and Pun (1980), and Wilkinson and Dallal (1982). These tables provide null hypothesis values for selected subsets given the number of candidate predictors and final subset size. If you don't know this literature already, you will be surprised at how large multiple correlations from stepwise regressions on random data can be. For a general summary of these and other problems, see Hocking (1983).



For more specific discussions of variable selection problems, see the previous references and Flack and Chang (1987), Freedman (1983), and Lovell (1983). Stepwise regression is probably the most abused computerized statistical technique ever devised. If you think you need automated stepwise regression to solve a particular problem, it is almost certain that you do not. Professional statisticians rarely use automated stepwise regression because it does not necessarily find the "best" fitting model, the "real" model, or alternative "plausible" models. Furthermore, the order in which variables enter or leave a stepwise program is usually of no theoretical significance. You are always better off thinking about why a model could generate your data and then testing that model.

Using an SSCP, a Covariance, or a Correlation Matrix as Input

Normally for a regression analysis, you use a cases-by-variables data file. You can, however, use a covariance or correlation matrix saved (from Correlations) as input. If you use a matrix as input, specify the sample size that generated the matrix; the number you type must be an integer greater than 2.

You can enter an SSCP, a covariance, or a correlation matrix by typing it into the Data Editor Worksheet, by using BASIC, or by saving it in a SYSTAT file. Be sure to include the dependent as well as independent variables.

SYSTAT needs the sample size to calculate degrees of freedom, so you need to enter the original sample size. Linear Regression determines the type of matrix (SSCP, covariance, etc.) and adjusts appropriately. With a correlation matrix, the raw and standardized coefficients are the same. Therefore, the Include constant option is disabled when using SSCP, covariance, or correlation matrices. Because these matrices are centered, the constant term has already been removed.

The following two analyses of the same data file produce identical results (except that you don’t get residuals with the second). In the first, we use the usual cases-by-variables data file. In the second, we use the CORR command to save a covariance matrix and then analyze that matrix file with the REGRESS command.

Here are the usual instructions for a regression analysis:

REGRESS
USE filename
MODEL Y = CONSTANT + X(1) + X(2) + X(3)
ESTIMATE


Here, we compute a covariance matrix and use it in the regression analysis:

CORR
USE filename1
SAVE filename2
COVARIANCE X(1) X(2) X(3) Y

REGRESS
USE filename2
MODEL Y = X(1) + X(2) + X(3) / N=40
ESTIMATE

The triangular matrix input facility is useful for “meta-analysis” of published data and missing-value computations. There are a few warnings, however. First, if you input correlation matrices from textbooks or articles, you may not get the same regression coefficients as those printed in the source. Because of round-off error, printed and raw data can lead to different results. Second, if you use pairwise deletion with CORR, the degrees of freedom for hypotheses will not be appropriate. You may not even be able to estimate the regression coefficients because of singularities.

In general, when an incomplete data procedure is used to estimate the correlation matrix, the estimate of regression coefficients and hypothesis tests produced from it are optimistic. You can correct for this by specifying a sample size smaller than the number of actual observations (preferably, set it equal to the smallest number of cases used for any pair of variables), but this is a crude guess that you could refine only by doing Monte Carlo simulations. There is no simple solution. Beware, especially, of multivariate regressions (or MANOVA, etc.) with missing data on the dependent variables. You can usually compute coefficients, but results from hypothesis tests are particularly suspect.

Analysis of Variance

Often, you will want to examine the influence of categorical variables (such as gender, species, country, and experimental group) on continuous variables. The model equations for this case, called analysis of variance, are equivalent to those used in linear regression. However, in the latter, you have to figure out a numerical coding for categories so that you can use the codes in an equation as the independent variable(s).



Effects Coding

The following data file, EARNBILL, shows the breakdown of lawyers sampled by sex. Because SEX is a categorical variable (numerical values assigned to MALE or FEMALE are arbitrary), a code variable with the values 1 or –1 is used. It doesn’t matter which group is assigned –1, as long as the other is assigned 1.

EARNINGS   SEX      CODE
    86     female    -1
    67     female    -1
    95     female    -1
   105     female    -1
    86     female    -1
    82     male       1
   140     male       1
   145     male       1
   144     male       1
   184     male       1

There is nothing wrong with plotting earnings against the code variable, as long as you realize that the slope of the line is arbitrary because it depends on how you assign your codes. By changing the values of the code variable, you can change the slope. Here is a plot with the least-squares regression line superimposed.


Let's do a regression on the data using these codes. Here are the coefficients as computed by ANOVA:

Variable    Coefficient
Constant      113.400
Code           25.600

Notice that Constant (113.4) is the mean of all the data. It is also the regression intercept because the codes are symmetrical about 0. The coefficient for Code (25.6) is the slope of the line. It is also one half the difference between the means of the groups. This is because the codes are exactly two units apart. This slope is often called an effect in the analysis of variance because it represents the amount that the categorical variable SEX affects EARNINGS. In other words, the effect of SEX can be represented by the amount that the mean for males differs from the overall mean.
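To verify this interpretation with the coefficients above:

    male mean   = 113.4 + 25.6 × (1)  = 139.0
    female mean = 113.4 + 25.6 × (-1) =  87.8

These are exactly the group means recovered by the means model below.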

Means Coding

The effects coding model is useful because the parameters (constant and slope) can be interpreted as an overall level and as the effect(s) of treatment, respectively. Another model, however, that yields the means of the groups directly is called the means model. Here are the codes for this model:

    EARNINGS   SEX      CODE1   CODE2
        86     female     1       0
        67     female     1       0
        95     female     1       0
       105     female     1       0
        86     female     1       0
        82     male       0       1
       140     male       0       1
       145     male       0       1
       144     male       0       1
       184     male       0       1

Notice that CODE1 is nonzero for all females, and CODE2 is nonzero for all males. To estimate a regression model with these codes, you must leave out the constant. With only two groups, only two distinct pieces of information are needed to distinguish them. Here are the coefficients for these codes in a model without a constant:

    Variable     Coefficient
    Code1         87.800
    Code2        139.000

Notice that the coefficients are now the means of the groups.
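Again as an illustrative sketch (Python with numpy, not SYSTAT), regressing on the two indicator columns without a constant returns the group means directly:

    import numpy as np

    earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], dtype=float)
    female = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # CODE1
    male = 1.0 - female                                             # CODE2

    # No constant column: the two indicators already span the group structure
    X = np.column_stack([female, male])
    coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)

    print(coef)  # [ 87.8  139. ] -- the female and male means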

Models

Let’s look at the algebraic models for each of these codings. Recall that the regression model looks like this:

    y = β0 + β1x1 + ε

For the effects model, it is convenient to modify this notation as follows:

    yj = µ + αj + ε

When x (the code variable) is –1, αj is equivalent to α1; when x is 1, αj is equivalent to α2. This shorthand will help you later when dealing with models with many categories. For this model, the µ parameter stands for the grand (overall) mean, and the α parameter stands for the effect. In this model, our best prediction of the score of a group member is derived from the grand mean plus or minus the deviation of that group from this grand mean.

The means model looks like this:

    yj = µj + ε

In this model, our best prediction of the score of a group member is the mean of that group.



Hypotheses

As with regression, we are usually interested in testing hypotheses concerning the parameters of the model. Here are the hypotheses for the two models:

H0: α1 = α2 = 0   (effects model)
H0: µ1 = µ2       (means model)

The tests of this hypothesis compare variation between the means to variation within each group, which is mathematically equivalent to testing the significance of coefficients in the regression model. In our example, the F ratio in the analysis of variance table tells you that the coefficient for SEX is significant at p = 0.019, which is less than the conventional 0.05 value. Thus, on the basis of this sample and the validity of our usual regression assumptions, you can conclude that women earn significantly less than men in this firm.

    Dep Var: earnings   N: 10   Multiple R: .719   Squared Multiple R: .517

    Analysis of Variance
    Source    Sum-of-Squares   df   Mean-Square   F-ratio      P
    Sex          6553.600       1     6553.600     8.563    0.019
    Error        6122.800       8      765.350
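To see where the table's numbers come from, here is a short sketch (Python with numpy; these are the standard one-way ANOVA formulas, not SYSTAT code) that reproduces the sums of squares and the F ratio for the earnings data:

    import numpy as np

    earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], dtype=float)
    groups = [earnings[:5], earnings[5:]]          # female, male
    grand = earnings.mean()

    # Between-groups SS: group sizes times squared deviations of group means
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    # Within-groups (error) SS: squared deviations around each group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1                   # 1
    df_within = len(earnings) - len(groups)        # 8
    f_ratio = (ss_between / df_between) / (ss_within / df_within)

    print(ss_between, ss_within)  # 6553.6  6122.8
    print(f_ratio)                # 8.563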

The nice thing about realizing that ANOVA is specially-coded regression is that the usual assumptions and diagnostics are appropriate in this context. You can plot residuals against estimated values, for example, to check for homogeneity of variance.

Multigroup ANOVA

When there are more groups, the coding of categories becomes more complex. For the effects model, you need one fewer coding variable than the number of categories. For two categories, you need only one coding variable; for three categories, you need two coding variables:

    Category      Code
       1         1    0
       2         0    1
       3        –1   –1



For the means model, the extension is straightforward:

    Category       Code
       1         1   0   0
       2         0   1   0
       3         0   0   1

For multigroup ANOVA, the models have the same form as for the two-group ANOVA above. The corresponding hypotheses for testing whether there are differences between means are:

H0: α1 = α2 = α3 = 0   (effects model)
H0: µ1 = µ2 = µ3       (means model)

You do not need to know how to produce coding variables to do ANOVA. SYSTAT does this for you automatically. All you need is a single variable that contains different values for each group. SYSTAT translates these values into different codes. It is important to remember, however, that regression and analysis of variance are not fundamentally different models. They are both instances of the general linear model.

Factorial ANOVA

It is possible to have more than one categorical variable in ANOVA. When this happens, you code each categorical variable exactly the same way as you do with multi-group ANOVA. The coded design variables are then added as a full set of predictors in the model.

ANOVA factors can interact. For example, a treatment may enhance bar pressing by male rats, yet suppress bar pressing by female rats. To test for this possibility, you can add (to your model) variables that are the product of the main effect variables already coded. This is similar to what you do when you construct polynomial models. For example, this is a model without an interaction:

    y = CONSTANT + treat + sex

This is a model that contains an interaction:

    y = CONSTANT + treat + sex + treat*sex



If the hypothesis test of the coefficients for the TREAT*SEX term is significant, then you must qualify your conclusions by referring to the interaction. You might say, “It works one way for males and another for females.”

Data Screening and Assumptions

Most analyses have assumptions. If your data do not meet the necessary assumptions, then the resulting probabilities for the statistics may be suspect. Before an ANOVA, look for:

• Violations of the equal variance assumption. Your groups should have the same dispersion or spread (their shapes do not differ markedly).

• Symmetry. The mean of each group should fall roughly in the middle of the spread (the within-group distributions are not extremely skewed).

• Independence of the group means and standard deviations (the size of the group means is not related to the size of their standard deviations).

• Gross outliers (no values stand apart from the others in the batch).

Graphical displays are useful for checking assumptions. For analysis of variance, try dit plots, box-and-whisker displays, or bar charts with standard error bars.

Levene Test

Analysis of variance assumes that the data within cells are independent and normally distributed with equal variances. This is the ANOVA equivalent of the regression assumptions for residuals. When the homogeneous-variance part of the assumptions is false, it is sometimes possible to adjust the degrees of freedom to produce approximately F-distributed test statistics.

Levene (1960) proposed a test for unequal variances. You can use this test to determine whether you need an unequal variance F test. Simply fit your model in ANOVA and save residuals. Then transform the residuals into their absolute values. Merge these with your original grouping variable(s). Then redo your ANOVA on the absolute residuals. If it is significant, then you should consider using the separate variances test.

Before doing all this work, you should do a box plot by groups to see whether the distributions differ. If you see few differences in the spread of the boxes, Levene’s test is unlikely to be significant.


Pairwise Mean Comparisons

The results in an ANOVA table serve only to indicate whether means differ significantly or not. They do not indicate which mean differs from another.

To report which pairs of means differ significantly, you might think of computing a two-sample t test for each pair; however, do not do this. The probability associated with the two-sample t test assumes that only one test is performed. When several means are tested pairwise, the probability of finding one significant difference by chance alone increases rapidly with the number of pairs. If you use a 0.05 significance level to test that means A and B are equal and to test that means C and D are equal, the overall acceptance region is now 0.95 x 0.95, or 0.9025. Thus, the acceptance region for two independent comparisons carried out simultaneously is about 90%, and the critical region is 10% (instead of the desired 5%). For six pairs of means tested at the 0.05 significance level, the probability of a difference falling in the critical region is not 0.05 but

1 – (0.95)^6 = 0.265

For 10 pairs, this probability increases to 0.40. The result of following such a strategy is to declare differences as significant when they are not.
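The inflation is easy to compute. A small sketch (Python, illustrative only) for k independent tests at the 0.05 level, together with the Bonferroni-adjusted per-test level discussed below:

    # Familywise error rate for k independent tests at alpha = 0.05,
    # and the Bonferroni per-test level that restores 0.05 overall.
    alpha = 0.05
    for k in (2, 6, 10):
        familywise = 1 - (1 - alpha) ** k
        print(k, round(familywise, 3), alpha / k)
    # 2   0.098  0.025
    # 6   0.265  0.00833...
    # 10  0.401  0.005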

As an alternative to the situation described above, SYSTAT provides four techniques to perform pairwise mean comparisons: Bonferroni, Scheffe, Tukey, and Fisher’s LSD. The first three methods provide protection for multiple tests. To determine significant differences, simply look for pairs with probabilities below your critical value (for example, 0.05 or 0.01).

There is an abundance of literature covering multiple comparisons (see Miller, 1985); however, a few points are worth noting here:

• If you have a small number of groups, the Bonferroni pairwise procedure will often be more powerful (sensitive). For more groups, consider the Tukey method. Try all the methods in ANOVA (except Fisher’s LSD) and pick the best one.

• All possible pairwise comparisons are a waste of power. Think about a meaningful subset of comparisons and test this subset with Bonferroni levels. To do this, divide your critical level, say 0.05, by the number of comparisons you are making. You will almost always have more power than with any other pairwise multiple comparison procedure.


• Some popular multiple comparison procedures are not found in SYSTAT. Duncan’s test, for example, does not maintain its claimed protection level. Other stepwise multiple range tests, such as Newman-Keuls, have not been conclusively demonstrated to maintain overall protection levels for all possible distributions of means.

Linear and Quadratic Contrasts

Contrasts are used to test relationships among means. A contrast is a linear combination of means µi with coefficients αi:

    α1µ1 + α2µ2 + … + αkµk

where α1 + α2 + … + αk = 0. In SYSTAT, hypotheses can be specified about contrasts and tests performed. Typically, the hypothesis has the form:

    H0: α1µ1 + α2µ2 + … + αkµk = 0

The test statistic for a contrast is similar to that for a two-sample t test; the result of the contrast (a relation among means, such as mean A minus mean B) is in the numerator of the test statistic, and an estimate of within-group variability (the pooled variance estimate or the error term from the ANOVA) is part of the denominator.

You can select contrast coefficients to test:

• Pairwise comparisons (test for a difference between two particular means)

• A linear combination of means that is meaningful to the study at hand (compare two treatments versus a control mean)

• Linear, quadratic, or similar increases (decreases) across a set of ordered means (for example, you might test for a linear increase in sales by comparing people with no training, those with moderate training, and those with extensive training)

Many experimental design texts place coefficients for linear and quadratic contrasts for three groups, four groups, and so on, in a table. SYSTAT allows you to type your contrasts or select a polynomial option. A polynomial contrast of order 1 is linear; of order 2, quadratic; of order 3, cubic; and so on.
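The test statistic described above is simple to compute by hand. Here is a sketch (Python with numpy, illustrative only; the group means, group sizes, and error mean square are made-up numbers, not from any SYSTAT example) of a linear contrast across three ordered groups:

    import numpy as np

    # Hypothetical mean sales for three ordered training groups (made-up data)
    means = np.array([10.0, 14.0, 19.0])     # none, moderate, extensive
    n = np.array([8, 8, 8])                  # cases per group (made-up)
    mse = 9.0                                # error mean square from the ANOVA (made-up)
    df_error = n.sum() - len(means)          # 21

    # Linear polynomial contrast coefficients for three ordered groups
    c = np.array([-1.0, 0.0, 1.0])
    assert abs(c.sum()) < 1e-12              # contrast coefficients sum to zero

    estimate = (c * means).sum()             # the contrast: mean3 - mean1
    se = np.sqrt(mse * (c ** 2 / n).sum())   # pooled-variance standard error
    t = estimate / se
    print(estimate, se, t)                   # compare t to a t distribution, df = 21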


Unbalanced Designs

An unbalanced factorial design occurs when the numbers of cases in cells are unequal and not proportional across rows or columns. The following is an example of a 2 × 2 design:

            B1           B2
    A1      1 2          5 3 4
    A2      6 7 9 8 4    2 1 5 3

Unbalanced designs require a least-squares procedure like the General Linear Model because the usual maximum likelihood method of adding up sums of squared deviations from cell means and the grand mean does not yield maximum likelihood estimates of effects. The General Linear Model adjusts for unbalanced designs when you get an ANOVA table to test hypotheses.

However, the estimates of effects in the unbalanced design are no longer orthogonal (and thus statistically independent) across factors and their interactions. This means that the sum of squares associated with one factor depends on the sum of squares for another or its interaction.

Analysts accustomed to using multiple regression have no problem with this situation because they assume that their independent variables in a model are correlated. Experimentalists, however, often have difficulty speaking of a main effect conditioned on another. Consequently, there is extensive literature on hypothesis testing methodology for unbalanced designs (for example, Speed and Hocking, 1976, and Speed, Hocking, and Hackney, 1978), and there is no consensus on how to test hypotheses with non-orthogonal designs.

Some statisticians advise you to do a series of hierarchical tests beginning with interactions. If the highest-order interactions are insignificant, drop them from the model and recompute the analysis. Then, examine the lower-order interactions. If they are insignificant, recompute the model with main effects only. Some computer programs automate this process and print sums of squares and F tests according to the hierarchy (ordering of effects) you specify in the model. SAS and SPSS GLM, for example, call these Type I sums of squares.



This procedure is analogous to stepwise regression in which hierarchical subsets of models are tested. This example assumes you have specified the following model:

    Y = CONSTANT + a + b + c + a*b + a*c + b*c + a*b*c

The hierarchical approach tests the following models:

    Y = CONSTANT + a + b + c + a*b + a*c + b*c + a*b*c
    Y = CONSTANT + a + b + c + a*b + a*c + b*c
    Y = CONSTANT + a + b + c + a*b + a*c
    Y = CONSTANT + a + b + c + a*b
    Y = CONSTANT + a + b + c
    Y = CONSTANT + a + b
    Y = CONSTANT + a

The problem with this approach, however, is that plausible subsets of effects are ignored if you examine only one hierarchy. The following model, which may be the best fit to the data, is never considered:

    Y = CONSTANT + a + b + a*b

Furthermore, if you decide to examine all the other plausible subsets, you are really doing all possible subsets regression, and you should use Bonferroni confidence levels before rejecting a null hypothesis. The example above has 127 possible subset models (excluding ones without a CONSTANT). Interactive stepwise regression allows you to explore subset models under your control.

If you have done an experiment and have decided that higher-order effects (interactions) are of enough theoretical importance to include in your model, you should condition every test on all other effects in the model you selected. This is the classical approach of Fisher and Yates. It amounts to using the default F values on the ANOVA output, which are the same as the SAS and SPSS Type III sums of squares.

Probably the most important reason to stay with one model is that if you eliminate a series of effects that are not quite significant (for example, p = 0.06), you could end up with an incorrect subset model because of the dependencies among the sums of squares. In summary, if you want other sums of squares, compute them. You can supply the mean square error to customize sums of squares by using a hypothesis test in GLM, selecting MSE, and specifying the mean square error and degrees of freedom.



Repeated Measures

In factorial ANOVA designs, each subject is measured once. For example, the assumption of independence would be violated if a subject is measured first as a control group member and later as a treatment group member. However, in a repeated measures design, the same variable is measured several times for each subject (case). A paired-comparison t test is the simplest form of a repeated measures design (for example, each subject has a before and after measure).

Usually, it is not necessary for you to understand how SYSTAT carries out calculations; however, repeated measures is an exception. It is helpful to understand the quantities SYSTAT derives from your data. First, remember how to calculate a paired-comparison t test by hand:

• For each subject, compute the difference between the two measures.

• Calculate the average of the differences.

• Calculate the standard deviation of the differences.

• Calculate the test statistic using this mean and standard deviation.
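Those four steps translate directly into a few lines. A sketch follows (Python with numpy, with made-up before/after scores; this illustrates the hand calculation, not SYSTAT's output):

    import numpy as np

    # Made-up before/after measurements for eight subjects
    before = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0])
    after  = np.array([14.0, 16.0, 12.0, 17.0, 13.0, 18.0, 14.0, 16.0])

    d = after - before                        # step 1: per-subject differences
    mean_d = d.mean()                         # step 2: average difference
    sd_d = d.std(ddof=1)                      # step 3: SD of the differences
    t = mean_d / (sd_d / np.sqrt(len(d)))     # step 4: the paired t statistic
    print(mean_d, sd_d, t)                    # compare t to a t distribution, n - 1 df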

SYSTAT derives similar values from your repeated measures and uses them in analysis-of-variance computations to test changes across the repeated measures (within subjects) as well as differences between groups of subjects (between subjects). Tests of the within-subjects values are called polynomial tests of order 1, 2, ..., up to k, where k is one less than the number of repeated measures. The first polynomial is used to test linear changes (for example, do the repeated responses increase (or decrease) around a line with a significant slope?). The second polynomial tests if the responses fall along a quadratic curve, and so on.

For each case, SYSTAT uses orthogonal contrast coefficients to derive one number for each polynomial. For the coefficients of the linear polynomial, SYSTAT uses (–1, 0, 1) when there are three measures; (–3, –1, 1, 3) when there are four measures; and so on. When there are three repeated measures, SYSTAT multiplies the first by –1, the second by 0, and the third by 1, and sums these products (this sum is then multiplied by a constant to make the sum of squares of the coefficients equal to 1). Notice that when the responses are the same, the result of the polynomial contrast is 0; when the responses fall closely along a line with a steep slope, the polynomial differs markedly from 0.

For the coefficients of the quadratic polynomial, SYSTAT uses (1, –2, 1) when there are three measures; (1, –1, –1, 1) when there are four measures; and so on. The cubic and higher-order polynomials are computed in a similar way.
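A sketch of this derivation (Python with numpy, illustrative; the normalization follows the description above, and the rat weights are made-up numbers) for three repeated measures:

    import numpy as np

    # Orthogonal polynomial coefficients for three repeated measures
    linear = np.array([-1.0, 0.0, 1.0])
    quadratic = np.array([1.0, -2.0, 1.0])
    linear /= np.linalg.norm(linear)        # scale so squared coefficients sum to 1
    quadratic /= np.linalg.norm(quadratic)

    # Made-up monthly body weights for four rats (rows = subjects)
    weights = np.array([[210.0, 230.0, 251.0],
                        [200.0, 218.0, 239.0],
                        [215.0, 216.0, 214.0],
                        [190.0, 208.0, 225.0]])

    lin_scores = weights @ linear           # one linear component per subject
    quad_scores = weights @ quadratic       # one quadratic component per subject
    totals = weights.sum(axis=1)            # used for the between-groups test
    print(lin_scores, quad_scores, totals)
    # A subject whose weight is flat (row 3) gets a near-zero linear score;
    # steadily growing subjects get large positive ones.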


Let’s continue the discussion for a design with three repeated measures. Assume that you record body weight once a month for three months for rats grouped by diet. (Diet A includes a heavy concentration of alcohol and Diet B consists of normal lab chow.) For each rat, SYSTAT computes a linear component and a quadratic component. SYSTAT also sums the weights to derive a total response. These derived values are used to compute two analysis of variance tables:

• The total response is used to test between-group differences; that is, the total is used as the dependent variable in the usual factorial ANOVA computations. In the example, this test compares total weight for Diet A against that for Diet B. This is analogous to a two-sample t test using total weight as the dependent variable.

• The linear and quadratic components are used to test changes across the repeated measures (within subjects) and also to test the interaction of the within factor with the grouping factor. If the test for the linear component is significant, you can report a significant linear increase in weight over the three months. If the test for the quadratic component is also significant (but much less so than the linear component), you might report that growth is predominantly linear, but there is a significant curve in the upward trend.

• A significant interaction between Diet (the between-group factor) and the linear component across time might indicate that the slopes for Diet A and Diet B differ. This test may be the most important one for the experiment.

Assumptions in Repeated Measures

SYSTAT computes both univariate and multivariate statistics. Like all standard ANOVA procedures, the univariate repeated measures approach requires that the distributions within cells be normal. The univariate repeated measures approach also requires that the covariances between all possible pairs of repeated measures be equal. (Actually, the requirement is slightly less restrictive, but this difference is of little practical importance.) Of course, the usual ANOVA requirement that all variances within cells are equal still applies; thus, the covariance matrix of the measures should have a constant diagonal and equal elements off the diagonal. This assumption is called compound symmetry.

The multivariate analysis does not require compound symmetry. It requires that the covariance matrices within groups (there is only one group in this example) be equivalent and that they be based on multivariate normal distributions. If the classical assumptions hold, then you should generally ignore the multivariate tests at the bottom of the output and stay with the classical univariate ANOVA table, because the multivariate tests will generally be less powerful.

There is a middle approach. The Greenhouse-Geisser and Huynh-Feldt statistics are used to adjust the probability for the classical univariate tests when compound symmetry fails. (Huynh-Feldt is a more recent adjustment to the conservative Greenhouse-Geisser statistic.) If the Huynh-Feldt p values are substantially different from those under the column directly to the right of the F statistic, then you should be aware that compound symmetry has failed. In this case, compare the adjusted p values under Huynh-Feldt to those for the multivariate tests.

If all else fails, single degree-of-freedom polynomial tests can always be trusted. If there are several to examine, however, remember that you may want to use Bonferroni adjustments to the probabilities; that is, divide the normal value (for example, 0.05) by the number of polynomial tests you want to examine. You need to make a Bonferroni adjustment only if you are unable to use the summary univariate or multivariate tests to protect the overall level; otherwise, you can examine the polynomials without penalty if the overall test is significant.

Issues in Repeated Measures Analysis

Repeated measures designs can be generated in SYSTAT with a single procedure. You need not worry about weighting cases in unbalanced designs or selecting error terms. The program does this automatically; however, you should keep the following in mind:

• The sums of squares for the univariate F tests are pooled across subjects within groups and their interactions with trials. This means that the traditional analysis method has highly restrictive assumptions. You must assume that the variances within cells are homogeneous and that the covariances across all pairs of cells are equivalent (compound symmetry). There are some mathematical exceptions to this requirement, but they rarely occur in practice. Furthermore, the compound symmetry assumption rarely holds for real data.

• Compound symmetry is not required for the validity of the single degree-of-freedom polynomial contrasts. These polynomials partition sums of squares into orthogonal components. You should routinely examine the magnitude of these sums of squares relative to the hypothesis sum of squares for the corresponding univariate repeated measures F test when your trials are ordered on a scale.

• Think of the repeated measures output as an expanded traditional ANOVA table. The effects are printed in the same order as they appear in Winer (1971) and other texts, but they include the single degree-of-freedom and multivariate tests to protect you from false conclusions. If you are satisfied that both are in agreement, you can delete the additional lines in the output file.

• You can test any hypothesis after you have estimated a repeated measures design and examined the output. For example, you can use polynomial contrasts to test single degree-of-freedom components in an unevenly spaced design. You can also use difference contrasts to do post hoc tests on adjacent trials.

Types of Sums of Squares

Some other statistics packages print several types of sums of squares for testing hypotheses. The following names for these sums of squares are not statistical terms, but they were popularized originally by SAS GLM.

Type I. Type I sums of squares are computed from the difference between the residual sums of squares of two different models. The particular models needed for the computation depend on the order of the variables in the MODEL statement. For example, if the model is

    MODEL y = CONSTANT + a + b + a*b

then the sum of squares for A*B is produced from the difference between SSE (sum of squared error) in the two following models:

    MODEL y = CONSTANT + a + b
    MODEL y = CONSTANT + a + b + a*b

Similarly, the Type I sums of squares for B in this model are computed from the difference in SSE between the following models:

    MODEL y = CONSTANT + a
    MODEL y = CONSTANT + a + b

Finally, the Type I sums of squares for A are computed from the difference in residual sums of squares for the following:

    MODEL y = CONSTANT
    MODEL y = CONSTANT + a

In summary, to compute sums of squares, move from right to left and construct models which differ by the right-most term only.
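This right-to-left recipe can be mimicked directly: fit each nested model by least squares and difference the residual sums of squares. The following sketch (Python with numpy, using a small made-up two-factor data set) illustrates the idea; it is not SYSTAT's algorithm:

    import numpy as np

    def sse(X, y):
        """Residual sum of squares from a least-squares fit of y on X."""
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return ((y - X @ coef) ** 2).sum()

    # Made-up unbalanced two-factor layout (a and b are effects-coded factors)
    a = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
    b = np.array([-1, -1, 1, 1, 1, -1, -1, -1, 1], dtype=float)
    y = np.array([3.0, 4.0, 6.0, 7.0, 6.0, 5.0, 4.0, 6.0, 9.0])

    const = np.ones_like(y)
    m0 = np.column_stack([const])                    # CONSTANT
    m1 = np.column_stack([const, a])                 # + a
    m2 = np.column_stack([const, a, b])              # + b
    m3 = np.column_stack([const, a, b, a * b])       # + a*b

    ss_a = sse(m0, y) - sse(m1, y)      # Type I SS for a
    ss_b = sse(m1, y) - sse(m2, y)      # Type I SS for b
    ss_ab = sse(m2, y) - sse(m3, y)     # Type I SS for a*b
    print(ss_a, ss_b, ss_ab)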



Type II. Type II sums of squares are computed similarly to Type I except that main effects and interactions determine the ordering of differences instead of the MODEL statement order. For the above model, Type II sums of squares for the interaction are computed from the difference in residual sums of squares for the following models:

    MODEL y = CONSTANT + a + b
    MODEL y = CONSTANT + a + b + a*b

For the B effect, difference the following models:

    MODEL y = CONSTANT + a + b
    MODEL y = CONSTANT + a

For the A effect, difference the following (this is not the same as for Type I):

    MODEL y = CONSTANT + a + b
    MODEL y = CONSTANT + b

In summary, include interactions of the same order as well as all lower order interactions and main effects when differencing to get an interaction. When getting sums of squares for a main effect, difference against all other main effects only.

Type III. Type III sums of squares are the default for ANOVA and are much simpler to understand. Simply difference from the full model, leaving out only the term in question. For example, the Type III sum of squares for A is taken from the following two models:

    MODEL y = CONSTANT + b + a*b
    MODEL y = CONSTANT + a + b + a*b

Type IV. Type IV sums of squares are designed for missing cells designs and are not easily presented in the above terminology. They are produced by balancing over the means of nonmissing cells not included in the current hypothesis.

SYSTAT’s Sums of Squares

Printing more than one sum of squares in a table is potentially confusing to users. There is a strong temptation to choose the most significant sum of squares without understanding the hypothesis being tested.

A Type I test is produced by first estimating the full model and noting the error term. Then, each effect is entered sequentially and tested with the error term from the full model. Later effects are conditioned on earlier effects, but earlier effects are not conditioned on later effects. A Type II test is produced most easily with interactive stepping (STEP). Type III is printed in the regression and ANOVA table. Finally, Type IV is produced by the careful use of SPECIFY in testing means models. The advantage of this approach is that the user is always aware that sums of squares depend on explicit mathematical models rather than additions and subtractions of dimensionless quantities.


Chapter 14

Linear Models I: Linear Regression

Leland Wilkinson and Mark Coward

The model for simple linear regression is:

    y = β0 + β1x + ε

where y is the dependent variable, x is the independent variable, and the β’s are the regression parameters (the intercept and the slope of the line of best fit). The model for multiple linear regression is:

    y = β0 + β1x1 + β2x2 + … + βpxp + ε

Both Regression and General Linear Model can estimate and test simple and multiple linear regression models. Regression is easier to use than General Linear Model when you are doing simple regression, multiple regression, or stepwise regression because it has fewer options. To include interaction terms in your model or for mixture models, use General Linear Model. With Regression, all independent variables must be continuous; in General Linear Model, you can identify categorical independent variables and SYSTAT will generate a set of design variables for each. Both General Linear Model and Regression allow you to save residuals. In addition, you can test a variety of hypotheses concerning the regression coefficients using General Linear Model.

The ability to do stepwise regression is available in three ways: use the default values, specify your own selection criteria, or at each step, interactively select a variable to add or remove from the model.

For each model you fit in REGRESS, SYSTAT reports R², adjusted R², the standard error of the estimate, and an ANOVA table for assessing the fit of the model.



For each variable in the model, the output includes the estimate of the regression coefficient, the standard error of the coefficient, the standardized coefficient, tolerance, and a t statistic for measuring the usefulness of the variable in the model.

Linear Regression in SYSTAT

Regression Main Dialog Box

To obtain a regression analysis, from the menus choose:

Statistics
  Regression
    Linear…

The following options can be specified:

Include constant. Includes the constant in the regression equation. Deselect this option to remove the constant. You almost never want to remove the constant, and you should be familiar with no-constant regression terminology before considering it.

Cases. If your data are in the form of a correlation matrix, enter the number of cases used to compute the correlation matrix.

Save. You can save residuals and other data to a new data file. The following alternatives are available:

• Residuals. Saves predicted values, residuals, Studentized residuals, leverage for each observation, Cook’s distance measure, and the standard error of predicted values.

• Residuals/Data. Saves the residual statistics given by Residuals plus all the variables in the working data file, including any transformed data values.


• Partial. Saves partial residuals. Suppose your model is:

      Y = CONSTANT + X1 + X2 + X3

  The saved file contains:

      YPARTIAL(1): Residual of Y = CONSTANT + X2 + X3
      XPARTIAL(1): Residual of X1 = CONSTANT + X2 + X3
      YPARTIAL(2): Residual of Y = CONSTANT + X1 + X3
      XPARTIAL(2): Residual of X2 = CONSTANT + X1 + X3
      YPARTIAL(3): Residual of Y = CONSTANT + X1 + X2
      XPARTIAL(3): Residual of X3 = CONSTANT + X1 + X2

• Partial/Data. Saves partial residuals plus all the variables in the working data file, including any transformed data values.

• Model. Saves statistics given in Residuals and the variables used in the model.

• Coefficients. Saves the estimates of the regression coefficients.

Regression Options

To open the Options dialog box, click Options in the Regression dialog box.

You can specify a tolerance level, select complete or stepwise entry, and specify entry and removal criteria.



Tolerance. Prevents the entry of a variable that is highly correlated with the independent variables already included in the model. Enter a value between 0 and 1. Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the correlation required to exclude a variable.

Estimation. Controls the method used to enter and remove variables from the equation.

• Complete. All independent variables are entered in a single step.

• Mixture model. Constrains the independent variables to sum to a constant.

• Stepwise. Variables are entered or removed from the model one at a time.

The following alternatives are available for stepwise entry and removal:

• Backward. Begins with all candidate variables in the model. At each step, SYSTAT removes the variable with the largest Remove value.

• Forward. Begins with no variables in the model. At each step, SYSTAT adds the variable with the smallest Enter value.

• Automatic. For Backward, at each step SYSTAT automatically removes a variable from your model. For Forward, SYSTAT automatically adds a variable to the model at each step.

• Interactive. At each step in the model building, you select the variable to enter or remove from the model.

You can also control the criteria used to enter and remove variables from the model:

• Enter. Enters a variable into the model if its alpha value is less than the specified value. Enter a value between 0 and 1.

• Remove. Removes a variable from the model if its alpha value is greater than the specified value. Enter a value between 0 and 1.

• Force. Forces the first n variables listed in your model to remain in the equation.

• FEnter. F-to-enter limit. Variables with F greater than the specified value are entered into the model if Tolerance permits.

• FRemove. F-to-remove limit. Variables with F less than the specified value are removed from the model.

• Max step. Maximum number of steps.


Using Commands

First, specify your data with USE filename. Continue with:

    REGRESS
    MODEL var = CONSTANT + var1 + var2 + … / N=n
    SAVE filename / COEF MODEL RESID DATA PARTIAL
    ESTIMATE / TOL=n
    (use START instead of ESTIMATE for stepwise model building)
    START / FORWARD BACKWARD TOL=n ENTER=p REMOVE=p ,
            FENTER=n FREMOVE=n FORCE=n
    STEP / AUTO ENTER=p REMOVE=p FENTER=n FREMOVE=n
    STOP

For hypothesis testing commands, see Chapter 16.

Usage Considerations

Types of data. Input can be the usual cases-by-variables data file or a covariance, correlation, or sum of squares and cross-products matrix. Using matrix input requires specification of the sample size which generated the matrix.

Print options. Using PRINT = MEDIUM, the output includes eigenvalues of X’X, condition indices, and variance proportions. PRINT = LONG adds the correlation matrix of the regression coefficients to this output.

Quick Graphs. SYSTAT plots the residuals against the predicted values.

Saving files. You can save the results of the analysis (predicted values, residuals, and diagnostics that identify unusual cases) for further use in examining assumptions.

BY groups. Separate regressions result for each level of any BY variables.

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. REGRESS uses the FREQ variable to duplicate cases. This inflates the degrees of freedom to be the sum of the number of frequencies.

Case weights. REGRESS weights cases using the WEIGHT variable for rectangular data. You can perform cross-validation if the weight variable is binary and coded 0 or 1. SYSTAT computes predicted values for cases with zero weight even though they are not used to estimate the regression parameters.



Examples

Example 1 Simple Linear Regression

In this example, we explore the relation between gross domestic product per capita (GDP_CAP) and spending on the military (MIL) for 57 countries that report this information to the United Nations—we want to determine whether a measure of the financial well-being of a country is useful for predicting its military expenditures. Our model is:

    mil = β0 + β1 gdp_cap + ε

Initially, we plot the dependent variable against the independent variable. Such a plot may reveal outlying cases or suggest a transformation before applying linear regression. The input is:

    USE ourworld
    PLOT MIL*GDP_CAP / SMOOTH=LOWESS TENSION=0.500 ,
         YLABEL='Military Spending' , SYMBOL=4 SIZE=1.500 ,
         LABEL=NAME$ , CSIZE=2.000

The scatterplot follows:



To obtain the scatterplot, we created a new variable, NAME$, that had missing values for all countries except Libya and Iraq. We then used the new variable to label plot points.

Iraq and Libya stand apart from the other countries—they spend considerably more for the military than countries with similar GDP_CAP values. The smoother indicates that the relationship between the two variables is fairly linear. Distressing, however, is the fact that many points clump in the lower left corner. Many data analysts would want to study the data after log-transforming both variables. We do this in another example, but now we estimate the coefficients for the data as recorded.

To fit a simple linear regression model to the data, the input is:

    REGRESS
    USE ourworld
    MODEL mil = CONSTANT + gdp_cap
    ESTIMATE

The output is:

    1 case(s) deleted due to missing data.

    Dep Var: MIL   N: 56   Multiple R: 0.646   Squared multiple R: 0.417
    Adjusted squared multiple R: 0.407   Standard error of estimate: 136.154

    Effect      Coefficient   Std Error   Std Coef   Tolerance        t    P(2 Tail)
    CONSTANT        41.857       24.838     0.0          .          1.685    0.098
    GDP_CAP          0.019        0.003     0.646       1.000       6.220    0.000

    Effect      Coefficient    Lower < 95%> Upper
    CONSTANT        41.857     -7.940      91.654
    GDP_CAP          0.019      0.013       0.025

    Analysis of Variance
    Source        Sum-of-Squares   df    Mean-Square   F-ratio      P
    Regression       717100.891     1     717100.891    38.683    0.000
    Residual        1001045.288    54      18537.876

    *** WARNING ***
    Case 22 is an outlier (Studentized Residual = 6.956)
    Case 30 is an outlier (Studentized Residual = 4.348)

    Durbin-Watson D Statistic     2.046
    First Order Autocorrelation  -0.032


SYSTAT reports that data are missing for one case. In the next line, it reports that 56 cases are used (N = 56). In the regression calculations, SYSTAT uses only the cases that have complete data for the variables in the model. However, when only the dependent variable is missing, SYSTAT computes a predicted value, its standard error, and a leverage diagnostic for the case. In this sample, Afghanistan did not report military spending.

When there is only one independent variable, Multiple R (0.646) is the simple correlation between MIL and GDP_CAP. Squared multiple R (0.417) is the square of this value, and it is the proportion of the total variation in the military expenditures accounted for by GDP_CAP (GDP_CAP explains 41.7% of the variability of MIL). Use Sum-of-Squares in the analysis of variance table to compute it:

717100.891 / (717100.891 + 1001045.288)

Adjusted squared multiple R is of interest for models with more than one independent variable. Standard error of estimate (136.154) is the square root of the residual mean square (18537.876) in the ANOVA table.
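Using only the numbers printed in the ANOVA table, these summary statistics are easy to check. A quick sketch (Python, illustrative; the adjusted R² formula is the one given in the multiple regression example later in this chapter):

    # Reproduce the summary statistics from the ANOVA table above
    ss_reg, ss_res = 717100.891, 1001045.288
    n, p = 56, 2                      # cases; parameters including the constant

    r_squared = ss_reg / (ss_reg + ss_res)
    adj_r_squared = r_squared - (p - 1) / (n - p) * (1 - r_squared)
    std_error = (ss_res / (n - p)) ** 0.5

    print(round(r_squared, 3))        # 0.417
    print(round(adj_r_squared, 3))    # 0.407
    print(round(std_error, 3))        # 136.154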

The estimates of the regression coefficients are 41.857 and 0.019, so the equation is:

mil = 41.857 + 0.019 * gdp_cap

The standard errors (Std Error) of the estimated coefficients are in the next column and the standardized coefficients (Std Coef) follow. The latter are called beta weights by some social scientists. Tolerance is not relevant when there is only one predictor.


Next are t statistics (t)—the first (1.685) tests the significance of the difference of the constant from 0 and the second (6.220) tests the significance of the slope, which is equivalent to testing the significance of the correlation between military spending and GDP_CAP.

F-ratio in the analysis of variance table is used to test the hypothesis that the slope is 0 (or, for multiple regression, that all slopes are 0). The F is large when the independent variable(s) helps to explain the variation in the dependent variable. Here, there is a significant linear relation between military spending and GDP_CAP. Thus, we reject the hypothesis that the slope of the regression line is zero (F-ratio = 38.683, p value (P) < 0.0005).

It appears from the results above that GDP_CAP is useful for predicting spending on the military—that is, countries that are financially sound tend to spend more on the military than poorer nations. These numbers, however, do not provide the complete picture. Notice that SYSTAT warns us that two countries (Iraq and Libya) with unusual values could be distorting the results. We recommend that you consider transforming the data and that you save the residuals and other diagnostic statistics.

Example 2 Transformations

The data in the scatterplot in the simple linear regression example are not well suited for linear regression, as the heavy concentration of points in the lower left corner of the graph shows. Here are the same data plotted in log units:

    REGRESS
    USE ourworld
    PLOT MIL*GDP_CAP / SMOOTH=LOWESS TENSION=0.500 ,
         XLABEL='GDP per capita' , XLOG=10 ,
         YLABEL='Military Spending' , YLOG=10 ,
         SYMBOL=4,2,3 , SIZE=1.250 , LABEL=COUNTRY$ , CSIZE=1.450


The scatterplot is:

Except possibly for Iraq and Libya, the configuration of these points is better for linear modeling than that for the untransformed data.

We now transform both the y and x variables and refit the model. The input is:

    REGRESS
    USE ourworld
    LET log_mil = L10(mil)
    LET log_gdp = L10(gdp_cap)
    MODEL log_mil = CONSTANT + log_gdp
    ESTIMATE

The output follows:

    1 case(s) deleted due to missing data.

    Dep Var: LOG_MIL   N: 56   Multiple R: 0.857   Squared multiple R: 0.734
    Adjusted squared multiple R: 0.729   Standard error of estimate: 0.346

    Effect      Coefficient   Std Error   Std Coef   Tolerance        t    P(2 Tail)
    CONSTANT        -1.308       0.257      0.0          .         -5.091    0.000
    LOG_GDP          0.909       0.075      0.857       1.000      12.201    0.000

    Effect      Coefficient    Lower < 95%> Upper
    CONSTANT        -1.308     -1.822      -0.793
    LOG_GDP          0.909      0.760       1.058

    Analysis of Variance
    Source        Sum-of-Squares   df   Mean-Square   F-ratio      P
    Regression        17.868        1      17.868      148.876   0.000
    Residual           6.481       54       0.120

    *** WARNING ***
    Case 22 is an outlier (Studentized Residual = 4.004)

    Durbin-Watson D Statistic    1.810
    First Order Autocorrelation  0.070


The Squared multiple R for the variables in log units is 0.734 (versus 0.417 for the untransformed values). That is, we have gone from explaining 41.7% of the variability of military spending to 73.4% by using the log transformations. The F-ratio is now 148.876—it was 38.683. Notice that we now have only one outlier (Iraq).

The Calculator

But what is the estimated model now?

    log_mil = –1.308 + 0.909 * log_gdp

However, many people don’t think in “log units.” Let’s transform this equation by exponentiating each side:

    10^log_mil = 10^(–1.308 + 0.909 * log_gdp)
    mil = 10^(–1.308) * 10^(0.909 * log_gdp)
    mil = 0.049 * gdp_cap^0.909



We used the calculator to compute 0.049. Type:

CALC 10^-1.308

and SYSTAT returns 0.049.

Example 3 Residuals and Diagnostics for Simple Linear Regression

In this example, we continue with the transformations example and save the residuals and diagnostics along with the data. Using the saved statistics, we create stem-and-leaf plots of the residuals and Studentized residuals. In addition, let’s plot the Studentized residuals (to identify outliers in the y space) against leverage (to identify outliers in the x space) and use Cook’s distance measure to scale the size of each plot symbol. In a second plot, we display the corresponding country names. The input is:

    REGRESS
    USE ourworld
    LET log_mil = L10(mil)
    LET log_gdp = L10(gdp_cap)
    MODEL log_mil = CONSTANT + log_gdp
    SAVE myresult / DATA RESID
    ESTIMATE
    USE myresult
    STATS
    STEM residual student
    PLOT STUDENT*LEVERAGE / SYMBOL=4,2,3 SIZE=cook
    PLOT student*leverage / LABEL=country$ SYMBOL=4,2,3



The output is:

    Stem and Leaf Plot of variable: RESIDUAL, N = 56
    Minimum:      -0.644
    Lower hinge:  -0.246
    Median:       -0.031
    Upper hinge:   0.203
    Maximum:       1.216

      -6   42
      -5   6
      -4   42
      -3   554000
      -2 H 65531
      -1   9876433
      -0 M 98433200
       0   222379
       1   1558
       2 H 009
       3   0113369
       4   27
       5   1
       6   7
       7
     * * * Outside Values * * *
      12   1
    1 cases with missing values excluded from plot.

    Stem and Leaf Plot of variable: STUDENT, N = 56
    Minimum:      -1.923
    Lower hinge:  -0.719
    Median:       -0.091
    Upper hinge:   0.591
    Maximum:       4.004

      -1   986
      -1   32000
      -0 H 88877766555
      -0 M 443322111000
       0   000022344
       0 H 555889999
       1   0223
       1   5
       2   3
     * * * Outside Values * * *
       4   0
    1 cases with missing values excluded from plot.

In the stem-and-leaf plots, Iraq’s residual is 1.216 and is identified as an Outside Value. The value of its Studentized residual is 4.004, which is very extreme for the t distribution.

The case with the most influence on the estimates of the regression coefficients stands out at the top left (that is, it has the largest plot symbol). From the second plot, we identify this country as Iraq. Its value of Cook’s distance measure is large because its Studentized residual is extreme. On the other hand, Ethiopia (furthest to the right), the case with the next most influence, has a large value of Cook’s distance because its value of leverage is large. Gambia has the third largest Cook value, and Libya, the fourth.

Deleting an Outlier

Residual plots identify Iraq as the case with the greatest influence on the estimated coefficients. Let’s remove this case from the analysis and check SYSTAT’s warnings. The input is:

    REGRESS
    USE ourworld
    LET log_mil = L10(mil)
    LET log_gdp = L10(gdp_cap)
    SELECT mil < 700
    MODEL log_mil = CONSTANT + log_gdp
    ESTIMATE
    SELECT

The output follows:

    Dep Var: LOG_MIL   N: 55   Multiple R: 0.886   Squared multiple R: 0.785
    Adjusted squared multiple R: 0.781   Standard error of estimate: 0.306

    Effect      Coefficient   Std Error   Std Coef   Tolerance        t    P(2 Tail)
    CONSTANT        -1.353       0.227      0.0          .         -5.949    0.000
    LOG_GDP          0.916       0.066      0.886       1.000      13.896    0.000

    Analysis of Variance
    Source        Sum-of-Squares   df   Mean-Square   F-ratio      P
    Regression        18.129        1      18.129      193.107   0.000
    Residual           4.976       53       0.094

    Durbin-Watson D Statistic    1.763
    First Order Autocorrelation  0.086

Now there are no warnings about outliers.


Printing Residuals and Diagnostics

Let’s look at some of the values in the MYRESULT file. We use the country name as the ID variable for the listing. The input is:

    USE myresult
    IDVAR = country$
    FORMAT 10 3
    LIST cook leverage student mil gdp_cap

The output is:

    * Case ID *      COOK   LEVERAGE    STUDENT        MIL    GDP_CAP
    Ireland         0.013      0.032     -0.891     95.833   8970.885
    Austria         0.023      0.043     -1.011    127.237  13500.299
    Belgium         0.000      0.044     -0.001    283.939  13724.502
    Denmark         0.000      0.045     -0.119    269.608  14363.064
    (etc.)
    Libya           0.056      0.022      2.348    640.513   4738.055
    Somalia         0.009      0.072      0.473      8.846    201.798
    Afghanistan       .        0.075       .           .      189.128
    (etc.)

The value of MIL for Afghanistan is missing, so Cook’s distance measure and Studentized residuals are not available (periods are inserted for these values in the listing).

Example 4 Multiple Linear Regression

In this example, we build a multiple regression model to predict total employment using values of six independent variables. The data were originally used by Longley (1967) to test the robustness of least-squares packages to multicollinearity and other sources of ill-conditioning. SYSTAT can print the estimates of the regression coefficients with more “correct” digits than the solution provided by Longley himself if you adjust the number of decimal places. By default, the first three digits after the decimal are displayed. After the output is displayed, you can use General Linear Model to test hypotheses involving linear combinations of regression coefficients.



The input is:

    REGRESS
    USE longley
    PRINT = LONG
    MODEL total = CONSTANT + deflator + gnp + unemploy +,
          armforce + populatn + time
    ESTIMATE

The output follows:

    Eigenvalues of unit scaled X'X

           1        2        3        4        5        6        7
       6.861    0.082    0.046    0.011    0.000    0.000    0.000

    Condition indices

           1        2        3        4        5          6          7
       1.000    9.142   12.256   25.337  230.424   1048.080  43275.046

    Variance proportions

                  1        2        3        4        5        6        7
    CONSTANT  0.000    0.000    0.000    0.000    0.000    0.000    1.000
    DEFLATOR  0.000    0.000    0.000    0.000    0.457    0.505    0.038
    GNP       0.000    0.000    0.000    0.001    0.016    0.328    0.655
    UNEMPLOY  0.000    0.014    0.001    0.065    0.006    0.225    0.689
    ARMFORCE  0.000    0.092    0.064    0.427    0.115    0.000    0.302
    POPULATN  0.000    0.000    0.000    0.000    0.010    0.831    0.160
    TIME      0.000    0.000    0.000    0.000    0.000    0.000    1.000

    Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995
    Adjusted squared multiple R: 0.992   Standard error of estimate: 304.854

    Effect       Coefficient     Std Error   Std Coef   Tolerance       t    P(2 Tail)
    CONSTANT   -3482258.635    890420.384      0.0         .        -3.911     0.004
    DEFLATOR         15.062        84.915      0.046      0.007      0.177     0.863
    GNP              -0.036         0.033     -1.014      0.001     -1.070     0.313
    UNEMPLOY         -2.020         0.488     -0.538      0.030     -4.136     0.003
    ARMFORCE         -1.033         0.214     -0.205      0.279     -4.822     0.001
    POPULATN         -0.051         0.226     -0.101      0.003     -0.226     0.826
    TIME           1829.151       455.478      2.480      0.001      4.016     0.003

    Effect       Coefficient          Lower < 95%> Upper
    CONSTANT   -3482258.635   -5496529.488   -1467987.781
    DEFLATOR         15.062       -177.029        207.153
    GNP              -0.036         -0.112          0.040
    UNEMPLOY         -2.020         -3.125         -0.915
    ARMFORCE         -1.033         -1.518         -0.549
    POPULATN         -0.051         -0.563          0.460
    TIME           1829.151        798.788       2859.515

    Correlation matrix of regression coefficients

               CONSTANT  DEFLATOR      GNP  UNEMPLOY  ARMFORCE  POPULATN     TIME
    CONSTANT      1.000
    DEFLATOR     -0.205     1.000
    GNP           0.816    -0.649    1.000
    UNEMPLOY      0.836    -0.555    0.946     1.000
    ARMFORCE      0.550    -0.349    0.469     0.619     1.000
    POPULATN     -0.411     0.659   -0.833    -0.758    -0.189     1.000
    TIME         -1.000     0.186   -0.802    -0.824    -0.549     0.388    1.000

    Analysis of Variance
    Source        Sum-of-Squares   df    Mean-Square   F-ratio      P
    Regression      1.84172E+08     6    3.06954E+07   330.285    0.000
    Residual         836424.056     9      92936.006

    Durbin-Watson D Statistic     2.559
    First Order Autocorrelation  -0.348


SYSTAT computes the eigenvalues by scaling the columns of the X matrix so that the diagonal elements of X’X are 1’s and then factoring the X’X matrix. In this example, most of the eigenvalues of X’X are nearly 0, showing that the predictor variables comprise a relatively redundant set.

Condition indices are the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity (Belsley, Kuh, and Welsch, 1980). The condition indices in the Longley example show a tremendous collinearity problem.
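The scaling-and-factoring step can be sketched as follows (Python with numpy, illustrative only; SYSTAT's internal computation may differ in detail, and the collinear predictors here are made-up numbers):

    import numpy as np

    def condition_indices(X):
        """Eigenvalues and condition indices of the unit-scaled X'X matrix."""
        # Scale each column so that the diagonal of X'X is 1
        Z = X / np.sqrt((X ** 2).sum(axis=0))
        eigvals = np.linalg.eigvalsh(Z.T @ Z)[::-1]   # descending order
        indices = np.sqrt(eigvals[0] / eigvals)
        return eigvals, indices

    # Example with a deliberately collinear design (made-up data)
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=20)
    x2 = x1 + rng.normal(scale=0.01, size=20)         # nearly a copy of x1
    X = np.column_stack([np.ones(20), x1, x2])
    eigvals, indices = condition_indices(X)
    print(indices)   # the last index is huge, flagging the collinearity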

Variance proportions are the proportions of the variance of the estimates accounted for by each principal component associated with each of the above eigenvalues. You should begin to worry about collinearity when a component associated with a high condition index contributes substantially to the variance of two or more variables. This is certainly the case with the last component of the Longley data. TIME, GNP, and UNEMPLOY load highly on this component. See Belsley, Kuh, and Welsch (1980) for more information about these diagnostics.



Adjusted squared multiple R is 0.992. The formula for this statistic is:

    adjusted R² = R² – [(p – 1)/(n – p)] * (1 – R²)

where n is the number of cases and p is the number of predictors, including the constant.
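As a quick check (Python, illustrative), plugging in the Longley values reproduces the printed statistic:

    # n = 16 cases, p = 7 parameters (six predictors plus the constant)
    n, p, r2 = 16, 7, 0.995
    adj = r2 - (p - 1) / (n - p) * (1 - r2)
    print(round(adj, 3))   # 0.992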

Notice the extremely small tolerances in the output. Tolerance is 1 minus the squared multiple correlation between a predictor and the remaining predictors in the model. These tolerances signal that the predictor variables are highly intercorrelated—a worrisome situation. This multicollinearity can inflate the standard errors of the coefficients, thereby attenuating the associated F statistics, and can threaten computational accuracy.

Finally, SYSTAT produces the Correlation matrix of regression coefficients. In the Longley data, these estimates are highly correlated, further indicating that there are too many correlated predictors in the equation to provide stable estimates.

Scatterplot Matrix

Examining a scatterplot matrix of the variables in the model is often a beneficial first step in any multiple regression analysis. Nonlinear relationships and correlated predictors, both of which cause problems for multiple linear regression, can be uncovered before fitting the model. The input is:

    USE longley
    SPLOM DEFLATOR GNP UNEMPLOY ARMFORCE POPULATN TIME TOTAL / HALF DENSITY=HIST



The plot follows:

[Scatterplot matrix (SPLOM) of DEFLATOR, GNP, UNEMPLOY, ARMFORCE, POPULATN, TIME, and TOTAL, with histograms on the diagonal.]

Notice the severely nonlinear distributions of ARMFORCE with the other variables, as well as the near perfect correlations among several of the predictors. There is also a sharp discontinuity between post-war and 1950’s behavior on ARMFORCE.

Example 5 Automatic Stepwise Regression

Following is an example of forward automatic stepping using the LONGLEY data. The input is:

    REGRESS
    USE longley
    MODEL total = CONSTANT + deflator + gnp + unemploy +,
          armforce + populatn + time
    START / FORWARD
    STEP / AUTO
    STOP



The output is:

    Step # 0  R = 0.000  R-Square = 0.000

    Effect           Coefficient  Std Error  Std Coef      Tol.  df        F     'P'
    In
      1 Constant
    Out            Part. Corr.
      2 DEFLATOR         0.971          .          .   1.00000   1  230.089   0.000
      3 GNP              0.984          .          .   1.00000   1  415.103   0.000
      4 UNEMPLOY         0.502          .          .   1.00000   1    4.729   0.047
      5 ARMFORCE         0.457          .          .   1.00000   1    3.702   0.075
      6 POPULATN         0.960          .          .   1.00000   1  166.296   0.000
      7 TIME             0.971          .          .   1.00000   1  233.704   0.000

    Step # 1  R = 0.984  R-Square = 0.967
    Term entered: GNP

    Effect           Coefficient  Std Error  Std Coef      Tol.  df        F     'P'
    In
      1 Constant
      3 GNP              0.035       0.002      0.984   1.00000   1  415.103   0.000
    Out            Part. Corr.
      2 DEFLATOR        -0.187           .          .   0.01675   1    0.473   0.504
      4 UNEMPLOY        -0.638           .          .   0.63487   1    8.925   0.010
      5 ARMFORCE         0.113           .          .   0.80069   1    0.167   0.689
      6 POPULATN        -0.598           .          .   0.01774   1    7.254   0.018
      7 TIME            -0.432           .          .   0.00943   1    2.979   0.108

    Step # 2  R = 0.990  R-Square = 0.981
    Term entered: UNEMPLOY

    Effect           Coefficient  Std Error  Std Coef      Tol.  df        F     'P'
    In
      1 Constant
      3 GNP              0.038       0.002      1.071   0.63487   1  489.314   0.000
      4 UNEMPLOY        -0.544       0.182     -0.145   0.63487   1    8.925   0.010
    Out            Part. Corr.
      2 DEFLATOR        -0.073           .          .   0.01603   1    0.064   0.805
      5 ARMFORCE        -0.479           .          .   0.48571   1    3.580   0.083
      6 POPULATN        -0.164           .          .   0.00563   1    0.334   0.574
      7 TIME             0.308           .          .   0.00239   1    1.259   0.284


    Step # 3  R = 0.993  R-Square = 0.985
    Term entered: ARMFORCE

    Effect           Coefficient  Std Error  Std Coef      Tol.  df        F     'P'
    In
      1 Constant
      3 GNP              0.041       0.002      1.154   0.31838   1  341.684   0.000
      4 UNEMPLOY        -0.797       0.213     -0.212   0.38512   1   13.942   0.003
      5 ARMFORCE        -0.483       0.255     -0.096   0.48571   1    3.580   0.083
    Out            Part. Corr.
      2 DEFLATOR         0.163           .          .   0.01318   1    0.299   0.596
      6 POPULATN        -0.376           .          .   0.00509   1    1.813   0.205
      7 TIME             0.830           .          .   0.00157   1   24.314   0.000

    Step # 4  R = 0.998  R-Square = 0.995
    Term entered: TIME

    Effect           Coefficient  Std Error  Std Coef      Tol.  df        F     'P'
    In
      1 Constant
      3 GNP             -0.040       0.016     -1.137   0.00194   1    5.953   0.033
      4 UNEMPLOY        -2.088       0.290     -0.556   0.07088   1   51.870   0.000
      5 ARMFORCE        -1.015       0.184     -0.201   0.31831   1   30.496   0.000
      7 TIME          1887.410     382.766      2.559   0.00157   1   24.314   0.000
    Out            Part. Corr.
      2 DEFLATOR         0.143           .          .   0.01305   1    0.208   0.658
      6 POPULATN        -0.150           .          .   0.00443   1    0.230   0.642

    Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995
    Adjusted squared multiple R: 0.994   Standard error of estimate: 279.396

    Effect       Coefficient     Std Error   Std Coef   Tolerance       t    P(2 Tail)
    CONSTANT   -3598729.374    740632.644      0.0         .        -4.859     0.001
    GNP              -0.040         0.016     -1.137      0.002     -2.440     0.033
    UNEMPLOY         -2.088         0.290     -0.556      0.071     -7.202     0.000
    ARMFORCE         -1.015         0.184     -0.201      0.318     -5.522     0.000
    TIME           1887.410       382.766      2.559      0.002      4.931     0.000

    Analysis of Variance
    Source        Sum-of-Squares   df    Mean-Square   F-ratio      P
    Regression      1.84150E+08     4    4.60375E+07   589.757    0.000
    Residual         858680.406    11      78061.855


The steps proceed as follows:

• At step 0, no variables are in the model. GNP has the largest simple correlation and F, so SYSTAT enters it at step 1. Note at this step that the partial correlation, Part. Corr., is the simple correlation of each predictor with TOTAL.

• With GNP in the equation, UNEMPLOY is now the best candidate.

• The F for ARMFORCE is 3.58 when GNP and UNEMPLOY are included in the model.

• SYSTAT finishes by entering TIME.

In four steps, SYSTAT entered four predictors. None was removed, resulting in a final equation with a constant and four predictors. For this final model, SYSTAT uses all cases with complete data for GNP, UNEMPLOY, ARMFORCE, and TIME. Thus, when some values in the sample are missing, the sample size may be larger here than for the last step in the stepwise process (there, cases are omitted if any value is missing among the six candidate variables). If you don’t want to stop here, you could move more variables in (or out) using interactive stepping.

Example 6 Interactive Stepwise Regression

Interactive stepping helps you to explore model building in more detail. With data that are as highly intercorrelated as the LONGLEY data, interactive stepping reveals the dangers of thinking that the automated result is the only acceptable subset model. In this example, we use interactive stepping to explore the LONGLEY data further. That is, after specifying a model that includes all of the candidate variables available, we request backward stepping by selecting Stepwise, Backward, and Interactive in the Regression Options dialog box. After reviewing the results at each step, we use Step to move a variable in (or out) of the model. When finished, we select Stop for the final model. To begin interactive stepping, the input is:

REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
      armforce + populatn + time
START / BACK


The output is:

Dependent Variable TOTAL
Minimum tolerance for entry into model = 0.000000
Backward stepwise with Alpha-to-Enter = 0.150 and Alpha-to-Remove = 0.150

Step # 0  R = 0.998  R-Square = 0.995

In
   Effect      Coefficient  Std Error  Std Coef  Tol.     df  F        'P'
 1 Constant
 2 DEFLATOR    15.062       84.915     0.046     0.00738  1   0.031    0.863
 3 GNP         -0.036       0.033      -1.014    0.00056  1   1.144    0.313
 4 UNEMPLOY    -2.020       0.488      -0.538    0.02975  1   17.110   0.003
 5 ARMFORCE    -1.033       0.214      -0.205    0.27863  1   23.252   0.001
 6 POPULATN    -0.051       0.226      -0.101    0.00251  1   0.051    0.826
 7 TIME        1829.151     455.478    2.480     0.00132  1   16.127   0.003

Out            Part. Corr.
 none
-------------------------------------------------------------------------------

We begin with all variables in the model. We remove DEFLATOR because it has an unusually low tolerance and F value.

Type:

STEP deflator

The output is:

Step # 1  R = 0.998  R-Square = 0.995
Term removed: DEFLATOR

In
   Effect      Coefficient  Std Error  Std Coef  Tol.     df  F        'P'
 1 Constant
 3 GNP         -0.032       0.024      -0.905    0.00097  1   1.744    0.216
 4 UNEMPLOY    -1.972       0.386      -0.525    0.04299  1   26.090   0.000
 5 ARMFORCE    -1.020       0.191      -0.202    0.31723  1   28.564   0.000
 6 POPULATN    -0.078       0.162      -0.154    0.00443  1   0.230    0.642
 7 TIME        1814.101     425.283    2.459     0.00136  1   18.196   0.002

Out            Part. Corr.
 2 DEFLATOR    0.059        .          .         0.00738  1   0.031    0.863
-------------------------------------------------------------------------------


POPULATN has the lowest F statistic and, again, a low tolerance.

Type:

STEP populatn

The output is:

Step # 2  R = 0.998  R-Square = 0.995
Term removed: POPULATN

In
   Effect      Coefficient  Std Error  Std Coef  Tol.     df  F        'P'
 1 Constant
 3 GNP         -0.040       0.016      -1.137    0.00194  1   5.953    0.033
 4 UNEMPLOY    -2.088       0.290      -0.556    0.07088  1   51.870   0.000
 5 ARMFORCE    -1.015       0.184      -0.201    0.31831  1   30.496   0.000
 7 TIME        1887.410     382.766    2.559     0.00157  1   24.314   0.000

Out            Part. Corr.
 2 DEFLATOR    0.143        .          .         0.01305  1   0.208    0.658
 6 POPULATN    -0.150       .          .         0.00443  1   0.230    0.642
-------------------------------------------------------------------------------

GNP and TIME both have low tolerance values. They could be highly correlated with one another, so we will take each out and examine the behavior of the other when we do.

Type:

STEP time
STEP time
STEP gnp

The output is:

Step # 3  R = 0.993  R-Square = 0.985
Term removed: TIME

In
   Effect      Coefficient  Std Error  Std Coef  Tol.     df  F        'P'
 1 Constant
 3 GNP         0.041        0.002      1.154     0.31838  1   341.684  0.000
 4 UNEMPLOY    -0.797       0.213      -0.212    0.38512  1   13.942   0.003
 5 ARMFORCE    -0.483       0.255      -0.096    0.48571  1   3.580    0.083

Out            Part. Corr.
 2 DEFLATOR    0.163        .          .         0.01318  1   0.299    0.596
 6 POPULATN    -0.376       .          .         0.00509  1   1.813    0.205
 7 TIME        0.830        .          .         0.00157  1   24.314   0.000
-------------------------------------------------------------------------------


Step # 4  R = 0.998  R-Square = 0.995
Term entered: TIME

In
   Effect      Coefficient  Std Error  Std Coef  Tol.     df  F        'P'
 1 Constant
 3 GNP         -0.040       0.016      -1.137    0.00194  1   5.953    0.033
 4 UNEMPLOY    -2.088       0.290      -0.556    0.07088  1   51.870   0.000
 5 ARMFORCE    -1.015       0.184      -0.201    0.31831  1   30.496   0.000
 7 TIME        1887.410     382.766    2.559     0.00157  1   24.314   0.000

Out            Part. Corr.
 2 DEFLATOR    0.143        .          .         0.01305  1   0.208    0.658
 6 POPULATN    -0.150       .          .         0.00443  1   0.230    0.642
-------------------------------------------------------------------------------

Step # 5  R = 0.996  R-Square = 0.993
Term removed: GNP

In
   Effect      Coefficient  Std Error  Std Coef  Tol.     df  F        'P'
 1 Constant
 4 UNEMPLOY    -1.470       0.167      -0.391    0.30139  1   77.320   0.000
 5 ARMFORCE    -0.772       0.184      -0.153    0.44978  1   17.671   0.001
 7 TIME        956.380      35.525     1.297     0.25701  1   724.765  0.000

Out            Part. Corr.
 2 DEFLATOR    -0.031       .          .         0.01385  1   0.011    0.920
 3 GNP         -0.593       .          .         0.00194  1   5.953    0.033
 6 POPULATN    -0.505       .          .         0.00889  1   3.768    0.078
-------------------------------------------------------------------------------

We are comfortable with the tolerance values in both models with three variables. With TIME in the model, the smallest F is 17.671, and with GNP in the model, the smallest F is 3.580. Furthermore, with TIME, the squared multiple correlation is 0.993, and with GNP, it is 0.985. Let's stop the stepping and view more information about the last model.

Type:

STOP

The output is:

Dep Var: TOTAL  N: 16  Multiple R: 0.996  Squared multiple R: 0.993
Adjusted squared multiple R: 0.991  Standard error of estimate: 332.084

Effect      Coefficient    Std Error   Std Coef  Tolerance  t        P(2 Tail)
CONSTANT    -1797221.112   68641.553   0.0       .          -26.183  0.000
UNEMPLOY    -1.470         0.167       -0.391    0.301      -8.793   0.000
ARMFORCE    -0.772         0.184       -0.153    0.450      -4.204   0.001
TIME        956.380        35.525      1.297     0.257      26.921   0.000


Effect      Coefficient    Lower < 95%> Upper
CONSTANT    -1797221.112   -1946778.208   -1647664.016
UNEMPLOY    -1.470         -1.834         -1.106
ARMFORCE    -0.772         -1.173         -0.372
TIME        956.380        878.978        1033.782

Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio  P
Regression  1.83685E+08     3   6.12285E+07  555.209  0.000
Residual    1323360.743     12  110280.062
-------------------------------------------------------------------------------

Our final model includes only UNEMPLOY, ARMFORCE, and TIME. Notice that its multiple correlation (0.996) is only slightly smaller than that for the automated stepping (0.998). Following are the commands we used:

REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
      armforce + populatn + time
START / BACK
STEP deflator
STEP populatn
STEP time
STEP time
STEP gnp
STOP

Example 7 Testing whether a Single Coefficient Equals Zero

Most regression programs print tests of significance for each coefficient in an equation. SYSTAT has a powerful additional feature—post hoc tests of regression coefficients. To demonstrate these tests, we use the LONGLEY data and examine whether the DEFLATOR coefficient differs significantly from 0. The input is:

REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
      armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator
TEST


The output is:

Dep Var: TOTAL  N: 16  Multiple R: 0.998  Squared multiple R: 0.995
Adjusted squared multiple R: 0.992  Standard error of estimate: 304.854

Effect      Coefficient    Std Error   Std Coef  Tolerance  t       P(2 Tail)
CONSTANT    -3482258.635   890420.384  0.0       .          -3.911  0.004
DEFLATOR    15.062         84.915      0.046     0.007      0.177   0.863
GNP         -0.036         0.033       -1.014    0.001      -1.070  0.313
UNEMPLOY    -2.020         0.488       -0.538    0.030      -4.136  0.003
ARMFORCE    -1.033         0.214       -0.205    0.279      -4.822  0.001
POPULATN    -0.051         0.226       -0.101    0.003      -0.226  0.826
TIME        1829.151       455.478     2.480     0.001      4.016   0.003

Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio  P
Regression  1.84172E+08     6   3.06954E+07  330.285  0.000
Residual    836424.056      9   92936.006
-------------------------------------------------------------------------------

Test for effect called: DEFLATOR

Test of Hypothesis
Source      SS           df  MS          F      P
Hypothesis  2923.976     1   2923.976    0.031  0.863
Error       836424.056   9   92936.006
-------------------------------------------------------------------------------

Notice that the error sum of squares (836424.056) is the same as the residual sum of squares at the bottom of the ANOVA table, and the probability level (0.863) is the same as well. This probability level (> 0.05) indicates that the regression coefficient for DEFLATOR does not differ significantly from 0.

You can test all of the coefficients in the equation this way, individually, or choose All to generate a separate hypothesis test for each predictor. Or type:

HYPOTHESIS
ALL
TEST
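Because this hypothesis has a single degree of freedom, its F statistic is simply the square of the coefficient's t statistic from the regression output; as a quick check against the printed values:

$F = \dfrac{2923.976}{92936.006} \approx 0.031 \approx (0.177)^2 = t^2$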


Example 8 Testing whether Multiple Coefficients Equal Zero

You may wonder why you need to bother with these hypothesis tests when the regression output already gives a test result for each coefficient. Try the following hypothesis test:

REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
      armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator & gnp
TEST

The hypothesis output is:

Test for effect called: DEFLATOR and GNP

A Matrix
         1       2       3       4       5
 1     0.0   1.000     0.0     0.0     0.0
 2     0.0     0.0   1.000     0.0     0.0

         6       7
 1     0.0     0.0
 2     0.0     0.0

Test of Hypothesis
Source      SS           df  MS          F      P
Hypothesis  149295.592   2   74647.796   0.803  0.478
Error       836424.056   9   92936.006
-------------------------------------------------------------------------------

Here, the error sum of squares is the same as that for the full model, but the hypothesis sum of squares is different. We just tested the hypothesis that the DEFLATOR and GNP coefficients are simultaneously 0.

The A matrix printed above the test specifies the hypothesis that we tested. The test has two degrees of freedom (see the F statistic) because the A matrix has two rows—one for each coefficient. If you know some matrix algebra, you can see that the matrix product AB, using this A matrix and B as a column matrix of regression coefficients, picks up only two coefficients: DEFLATOR and GNP. Our hypothesis is the matrix equation AB = 0, where 0 is a null matrix.

If you don't know matrix algebra, don't worry; the ampersand method is equivalent, and you can ignore the A matrix in the output.
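To make the matrix product concrete: row 1 of A has a 1 in the DEFLATOR position and row 2 has a 1 in the GNP position (the constant occupies the first column), so with B the column of seven coefficients,

$AB = \begin{pmatrix} \beta_{\mathrm{DEFLATOR}} \\ \beta_{\mathrm{GNP}} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$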


Two Coefficients with an A Matrix

If you are experienced with matrix algebra, however, you can specify your own matrix by using AMATRIX. When typing the matrix, be sure to separate cells with spaces and press Enter between rows. The following simultaneously tests that DEFLATOR = 0 and GNP = 0:

HYPOTHESIS
AMATRIX [0 1 0 0 0 0 0;
         0 0 1 0 0 0 0]
TEST

You get the same output as above.

Why bother with AMATRIX when you can use EFFECT? Because in the A matrix, you can use any numbers, not just 0's and 1's. Here is a bizarre matrix:

1.0 3.0 0.5 64.3 3.0 2.0 0.0

You may not want to test this kind of hypothesis on the LONGLEY data, but there are important applications in the analysis of variance where you might.

Example 9 Testing Nonzero Null Hypotheses

You can test nonzero null hypotheses with a D matrix, often in combination with CONTRAST or AMATRIX. Here, we test whether the DEFLATOR coefficient differs significantly from 30:

REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
      armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
AMATRIX [0 1 0 0 0 0 0]
DMATRIX [30]
TEST


The output is:

Hypothesis.

A Matrix
         1       2       3       4       5
       0.0   1.000     0.0     0.0     0.0

         6       7
       0.0     0.0

Null hypothesis value for D
    30.000

Test of Hypothesis
Source      SS           df  MS          F      P
Hypothesis  2876.128     1   2876.128    0.031  0.864
Error       836424.056   9   92936.006
-------------------------------------------------------------------------------

The commands that test whether DEFLATOR differs from 30 can be performed more efficiently using SPECIFY:

HYPOTHESIS
SPECIFY deflator=30
TEST
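As a consistency check on the output above, this one-degree-of-freedom F equals the squared t statistic for testing the coefficient against 30 rather than 0, using the estimate (15.062) and standard error (84.915) from the full-model output:

$F = \left(\dfrac{15.062 - 30}{84.915}\right)^2 \approx (-0.176)^2 \approx 0.031$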

Example 10 Regression with Ecological or Grouped Data

If you have aggregated data, weight the regression by a count variable. This variable should contain the count of observations (n) contributing to each case. If n is not an integer, SYSTAT truncates it to an integer before using it as a weight. The regression results are identical to those you would get by entering each observation as a separate case.

We use, for this example, an ecological or grouped data file, PLANTS. The input is:

REGRESS
USE plants
FREQ=count
MODEL co2 = CONSTANT + species
ESTIMATE

The output is:

Dep Var: CO2  N: 76  Multiple R: 0.757  Squared multiple R: 0.573
Adjusted squared multiple R: 0.567  Standard error of estimate: 0.729

Effect      Coefficient  Std Error  Std Coef  Tolerance  t       P(2 Tail)
CONSTANT    13.738       0.204      0.0       .          67.273  0.000
SPECIES     -0.466       0.047      -0.757    1.000      -9.961  0.000

Effect      Coefficient  Lower < 95%> Upper
CONSTANT    13.738       13.331       14.144
SPECIES     -0.466       -0.559       -0.372

Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio  P
Regression  52.660          1   52.660       99.223   0.000
Residual    39.274          74  0.531
-------------------------------------------------------------------------------
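With a single predictor, the regression F-ratio is just the square of the coefficient's t statistic, which you can verify from the output above:

$F = (-9.961)^2 \approx 99.22$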


Example 11 Regression without the Constant

To regress without the constant (intercept) term, or through the origin, remove the constant from the list of independent variables. REGRESS adjusts accordingly. The input is:

REGRESS
MODEL dependent = var1 + var2
ESTIMATE

Some users are puzzled when they see a model without a constant having a higher multiple correlation than a model that includes a constant. How can a regression with fewer parameters predict "better" than another? It doesn't. The total sum of squares must be redefined for a regression model with zero intercept; it is no longer centered about the mean of the dependent variable. Other definitions of sums of squares can lead to strange results, such as negative multiple correlations. If your constant is actually near 0, then including or excluding the constant makes little difference in the output. Kvålseth (1985) discusses the issues involved in summary statistics for zero-intercept regression models. The definition of R² used in SYSTAT is Kvålseth's formula 7. This definition was chosen because it retains its PRE (percentage reduction of error) interpretation and is guaranteed to be in the (0,1) interval.

How, then, do you test the significance of a constant in a regression model? Include a constant in the model as usual and look at its test of significance.

If you have a zero-intercept model where it is appropriate to compute a coefficient of determination and other summary statistics about the centered data, use General Linear Model and select Mixture model. This option provides Kvålseth's formula 1 for R² and uses the centered total sum of squares for other summary statistics.
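For reference, a hedged sketch of the two kinds of R² being contrasted: the familiar centered definition, and a common zero-intercept variant that replaces the centered total sum of squares with the uncentered one (that these correspond exactly to Kvålseth's formulas 1 and 7 is an assumption here, not something this excerpt confirms):

$R^2_{\text{centered}} = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad R^2_{\text{zero-intercept}} = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i y_i^2}$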


References

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley & Sons, Inc.

Flack, V. F. and Chang, P. C. (1987). Frequency of selecting noise variables in subset regression analysis: A simulation study. The American Statistician, 41, 84–86.

Freedman, D. A. (1983). A note on screening regression equations. The American Statistician, 37, 152–155.

Hocking, R. R. (1983). Developments in linear regression methodology: 1959–82. Technometrics, 25, 219–230.

Kvålseth, T. O. (1985). Cautionary note about R². The American Statistician, 39, 279–285.

Lovell, M. C. (1983). Data mining. The Review of Economics and Statistics, 65, 1–12.

Rencher, A. C. and Pun, F. C. (1980). Inflation of R-squared in best subset regression. Technometrics, 22, 49–54.

Velleman, P. F. and Welsch, R. E. (1981). Efficient computing of regression diagnostics. The American Statistician, 35, 234–242.

Weisberg, S. (1985). Applied linear regression. New York: John Wiley & Sons, Inc.

Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin, 86, 168–174.

Wilkinson, L. and Dallal, G. E. (1982). Tests of significance in forward selection regression with an F-to-enter stopping rule. Technometrics, 24, 25–28.


Chapter 15
Linear Models II: Analysis of Variance

Leland Wilkinson and Mark Coward

SYSTAT handles a wide variety of balanced and unbalanced analysis of variance designs. The Analysis of Variance (ANOVA) procedure includes all interactions in the model and tests them automatically; it also provides analysis of covariance, and repeated measures designs. After you have estimated your ANOVA model, it is easy to test post hoc pairwise differences in means or to test any contrast across cell means, including simple effects.

For models with fixed and random effects, you can define error terms for specific hypotheses. You can also do stepwise ANOVA (that is, Type I sums of squares). Categorical variables are entered or deleted in blocks, and you can examine interactively or automatically all combinations of interactions and main effects.

The General Linear Model (GLM) procedure is used for randomized block designs, incomplete block designs, fractional factorials, Latin square designs, and analysis of covariance with one or more covariates. GLM also includes repeated measures, split plot, and crossover designs. It includes both univariate and multivariate approaches to repeated measures designs.

Moreover, GLM also features the means model for missing cells designs. Widely favored for this purpose by statisticians (Searle, 1987; Hocking, 1985; Milliken and Johnson, 1984), the means model allows tests of hypotheses in missing cells designs (using what are often called Type IV sums of squares). Furthermore, the means model allows direct tests of simple hypotheses (for example, within levels of other factors). Finally, the means model allows easier use of population weights to reflect differences in subclass sizes.

For both ANOVA and GLM, group sizes can be unequal for combinations of grouping factors; but for repeated measures designs, each subject must have complete data. You can use numeric or character values to code grouping variables.

You can store results of the analysis (predicted values and residuals) for further study and graphical display. In ANCOVA, you can save adjusted cell means.


Analysis of Variance in SYSTAT

ANOVA: Estimate Model

To obtain an analysis of variance, from the menus choose:

StatisticsAnalysis of Variance (ANOVA)

Estimate Model…

Dependent. The variable(s) you want to examine. The dependent variable(s) should be continuous and numeric (for example, INCOME). For MANOVA (multivariate analysis of variance), select two or more dependent variables.

Factor. One or more categorical variables (grouping variables) that split your cases into two or more groups.

Missing values. Includes a separate category for cases with a missing value for the variable(s) identified with Factor.

Covariates. A covariate is a quantitative independent variable that adds unwanted variability to the dependent variable. An analysis of covariance (ANCOVA) adjusts or removes the variability in the dependent variable due to the covariate (for example, variability in cholesterol level might be removed by using AGE as a covariate).


Post hoc Tests. Post hoc tests determine which pairs of means differ significantly. The following alternatives are available:

• Bonferroni. Multiple comparison test based on Student's t statistic. Adjusts the observed significance level for the fact that multiple comparisons are made.

• Tukey. Uses the Studentized range statistic to make all pairwise comparisons between groups and sets the experimentwise error rate to the error rate for the collection of all pairwise comparisons. When testing a large number of pairs of means, Tukey is more powerful than Bonferroni. For a small number of pairs, Bonferroni is more powerful.

• LSD. Least significant difference pairwise multiple comparison test. Equivalent to multiple t tests between all pairs of groups. The disadvantage of this test is that no attempt is made to adjust the observed significance level for multiple comparisons.

• Scheffé. The significance level of Scheffé's test is designed to allow all possible linear combinations of group means to be tested, not just the pairwise comparisons available in this feature. As a result, Scheffé's test is more conservative than other tests, meaning that a larger difference between means is required for significance.

Save file. You can save residuals and other data to a new data file. The following alternatives are available:

• Residuals. Saves predicted values, residuals, Studentized residuals, leverages, Cook's D, and the standard error of predicted values. Only the predicted values and residuals are appropriate for ANOVA.

• Residuals/Data. Saves the statistics given by Residuals plus all of the variables in the working data file, including any transformed data values.

• Adjusted. Saves adjusted cell means from analysis of covariance.

• Adjusted/Data. Saves adjusted cell means plus all of the variables in the working data file, including any transformed data values.

• Model. Saves statistics given in Residuals and the variables used in the model.

• Coefficients. Saves estimates of the regression coefficients.


ANOVA: Hypothesis Test

Contrasts are used to test relationships among cell means. The Post hoc Tests on the ANOVA dialog box are the simplest form because they compare two means at a time. Use Specify or Contrast to define contrasts involving two or more means—for example, contrast the average responses for two treatment groups against that for a control group, or test whether average income increases linearly across cells ordered by education (dropouts, high school graduates, college graduates). The coefficients for the first contrast might be (1, 1, –2), for a contrast of 1 * Treatment A plus 1 * Treatment B minus 2 * Control. The coefficients for the second contrast would be (–1, 0, 1), as in the sketch below.
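A hedged sketch of the first contrast in command syntax (the form used in the examples later in this chapter); the factor name TREAT$ and its three-level ordering (Treatment A, Treatment B, Control) are hypothetical:

HYPOTHESIS
EFFECT = treat$
CONTRAST [1 1 -2]
TEST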

To define contrasts among the cell means, from the menus choose:

StatisticsAnalysis of Variance (ANOVA)

Hypothesis Test…

An ANOVA model must be estimated before any hypothesis tests can be performed.

Contrasts can be defined across the categories of a grouping factor or across the levels of a repeated measure.

• Effects. Specify the factor (that is, grouping variable) to which the contrast applies. Selecting All yields a separate test of the effect of each factor in the ANOVA model, as well as tests of all interactions between those factors.

• Within. Use when specifying a contrast across the levels of a repeated measures factor. Enter the name assigned to the set of repeated measures.


Specify

To specify hypothesis test coefficients, click Specify in the ANOVA Hypothesis Test dialog box.

To specify coefficients for a hypothesis test, use cell identifiers. Common hypothesis tests include contrasts across marginal means or tests of simple effects. For a two-way factorial ANOVA design with DRUG (four categories) and DISEASE (three categories), you could contrast the marginal mean for the first level of drug against the third level by specifying:

DRUG[1] = DRUG[3]

Note that square brackets enclose the value of the category (for example, for GENDER$, specify GENDER$[male]). For the simple contrast of the first and third levels of DRUG for the second disease only:

DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]

The syntax also allows statements like:

-3*DRUG[1] - 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4]

You have two error term options for hypothesis tests:

• Pooled. Uses the error term from the current model.

• Separate. Generates a separate variances error term.


Contrast

To specify contrasts, click Contrast in the ANOVA Hypothesis Test dialog box.

Contrast generates a contrast for a grouping factor or a repeated measures factor. SYSTAT offers six types of contrasts.

• Custom. Enter your own custom coefficients. If your factor has, say, four ordered categories (or levels), you can specify your own coefficients, such as –3 –1 1 3, by typing these values in the Custom text box (see the sketch after this list).

• Difference. Compare each level with its adjacent level.

• Polynomial. Generate orthogonal polynomial contrasts (to test linear, quadratic, cubic trends across ordered categories or levels).

• Order. Enter 1 for linear, 2 for quadratic, and so on.

• Metric. Use Metric when the ordered categories are not evenly spaced. For example, when repeated measures are collected at weeks 2, 4, and 8, enter 2,4,8 as the metric.

• Sum. In a repeated measures ANOVA, total the values for each subject.
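A hedged sketch of the custom coefficients above in command syntax, assuming a four-level ordered grouping factor named DOSE (a hypothetical name):

HYPOTHESIS
EFFECT = dose
CONTRAST [-3 -1 1 3]
TEST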

Repeated Measures

In a repeated measures design, the same variable is measured several times for each subject (case). A paired-comparison t test is the simplest form of a repeated measures design (for example, each subject has a before and an after measure).


SYSTAT derives values from your repeated measures and uses them in analysis of variance computations to test changes across the repeated measures (within subjects) as well as differences between groups of subjects (between subjects). Tests of the within-subjects values are called polynomial tests of order 1, 2, ..., k, where k is one less than the number of repeated measures. The first polynomial tests linear changes: do the repeated responses increase (or decrease) along a line with a significant slope? The second polynomial tests whether the responses fall along a quadratic curve, and so on.

To obtain a repeated measures analysis of variance, from the menus choose:

StatisticsAnalysis of Variance (ANOVA)

Estimate Model…

and click Repeated.

The following options are available:

Perform repeated measures analysis. Treats the dependent variables as a set of repeated measures.

Optionally, you can assign a name for each set of repeated measures, specify the number of levels, and specify the metric for unevenly spaced repeated measures.

• Name. Name that identifies each set of repeated measures.

• Levels. Number of repeated measures in the set. For example, if you have three dependent variables that represent measurements at different times, the number of levels is 3.


• Metric. Metric that indicates the spacing between unevenly spaced measurements. For example, if measurements were taken at the third, fifth, and ninth weeks, the metric would be 3, 5, 9.

Using Commands

The basic ANOVA command sequence is:

ANOVA
    USE filename
    CATEGORY / MISS
    DEPEND / REPEAT  NAMES
             BONF or TUKEY or LSD or SCHEFFE
    SAVE filename / ADJUST, MODEL, RESID, DATA
    ESTIMATE

To use ANOVA for analysis of covariance, insert COVARIATE before ESTIMATE.

After estimating a model, use HYPOTHESIS to test its parameters. Begin each test with HYPOTHESIS and end with TEST:

HYPOTHESIS
    EFFECT or WITHIN
    ERROR
    POST / LSD or TUKEY or BONF or SCHEFFE
           POOLED or SEPARATE
    CONTRAST / DIFFERENCE or POLYNOMIAL or SUM or ORDER or METRIC
    SPECIFY / POOLED or SEPARATE
    AMATRIX
    CMATRIX
    TEST
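As a hedged illustration of the COVARIATE placement, a minimal analysis-of-covariance sequence; the file, factor, covariate, and dependent variable names here are hypothetical:

ANOVA
    USE mydata
    CATEGORY group
    DEPEND response
    COVARIATE age
    ESTIMATE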

Usage Considerations

Types of data. ANOVA requires a rectangular data file.

Print options. If PRINT=SHORT, output includes an ANOVA table. The MEDIUM length adds least-squares means to the output. LONG adds estimates of the coefficients.

Quick Graphs. ANOVA plots the group means against the groups.

Saving files. ANOVA can save predicted values, residuals, Studentized residuals, leverages, Cook’s D, standard error of predicted values, adjusted cell means, and estimates of the coefficients.

BY groups. ANOVA performs separate analyses for each level of any BY variables.



Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. You can use a FREQUENCY variable to duplicate cases.

Case weights. ANOVA uses a WEIGHT variable, if present, to duplicate cases.

Examples

Example 1 One-Way ANOVA

How does equipment influence typing performance? This example uses a one-way design to compare average typing speed for three groups of typists. Fourteen beginning typists were randomly assigned to three types of machines and given speed tests. Following are their typing speeds in words per minute:

Electric   Plain old   Word processor
  52          52            67
  47          43            73
  51          47            70
  49          44            75
  53                        64

The data are stored in the SYSTAT data file named TYPING. The average speeds for the typists in the three groups are 50.4, 46.5, and 69.8 words per minute, respectively. To test the hypothesis that the three samples have the same population average speed, the input is:

USE typing
ANOVA
CATEGORY equipmnt$
DEPEND speed
ESTIMATE

The output follows:

Dep Var: SPEED  N: 14  Multiple R: 0.95  Squared multiple R: 0.91

Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio  P
EQUIPMNT$   1469.36         2   734.68       53.52    0.00
Error       151.00          11  13.73


For the dependent variable SPEED, SYSTAT reads 14 cases. The multiple correlation (Multiple R) for SPEED with the two design variables for EQUIPMNT$ is 0.952. The square of this correlation (Squared multiple R) is 0.907. The grouping structure explains 90.7% of the variability of SPEED.

The layout of the ANOVA table is standard in elementary texts; you will find formulas and definitions there. F-ratio is the Mean-Square for EQUIPMNT$ divided by the Mean-Square for Error. The distribution of the F ratio is sensitive to the assumption of equal population group variances. The p value is the probability of exceeding the F ratio when the group means are equal. The p value printed here is 0.000, so it is less than 0.0005. If the population means are equal, it would be very unusual to find sample means that differ as much as these—you could expect such a large F ratio fewer than five times out of 10,000.
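As a check on the table, the F-ratio is simply the ratio of the two mean squares:

$F = \dfrac{734.68}{13.73} \approx 53.52$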

The Quick Graph illustrates this finding. Although the typists using electric and plain old typewriters have similar average speeds (50.4 and 46.5, respectively), the word processor group has a much higher average speed.

Pairwise Mean Comparisons

An analysis of variance indicates whether (at least) one of the groups differs from the others. However, you cannot determine which group(s) differ based on ANOVA results. To examine specific group differences, use post hoc tests.


In this example, we use the Bonferroni method for the typing speed data used in the one-way ANOVA example. As an aid in interpretation, we order the equipment categories from least to most advanced. The input is:

USE typing
ORDER equipmnt$ / SORT='plain old' 'electric',
      'word process'
ANOVA
CATEGORY equipmnt$
DEPEND speed / BONF
ESTIMATE

SYSTAT assigns a number to each of the three groups and uses those numbers in the output panels that follow:

COL/ROW  EQUIPMNT$
1        plain old
2        electric
3        word process

Using least squares means.
Post Hoc test of SPEED
------------------------------------------------------------------------
Using model MSE of 13.727 with 11 df.

Matrix of pairwise mean differences:
          1        2        3
 1      0.0
 2      3.90     0.0
 3     23.30    19.40     0.0

Bonferroni Adjustment.
Matrix of pairwise comparison probabilities:
          1        2        3
 1      1.00
 2      0.43     1.00
 3      0.00     0.00     1.00

In the first column, you can read differences in average typing speed for the group using plain old typewriters. In the second row, you see that they average 3.9 words per minute fewer than those using electric typewriters; in the third row, you see that they average 23.3 words per minute fewer than the group using word processors. To see whether these differences are significant, look at the probabilities in the corresponding locations at the bottom of the table.

The probability associated with 3.9 is 0.43, so you are unable to detect a difference in performance between the electric and plain old groups. The probabilities in the third row are both 0.00, indicating that the word processor group averages significantly more words per minute than the electric and plain old groups.
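For reference, the Bonferroni adjustment multiplies each raw pairwise p value by the number of comparisons, capping the result at 1. With k = 3 groups:

$p_{\text{adj}} = \min(m \cdot p_{\text{raw}},\, 1), \qquad m = \dfrac{k(k-1)}{2} = 3$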


Example 2 ANOVA Assumptions and Contrasts

An important assumption in analysis of variance is that the population variances are equal—that is, that the groups have approximately the same spread. When variances differ markedly, a transformation may remedy the problem. For example, sometimes it helps to take the square root of each value of the outcome variable (or log transform each value) and use the transformed value in the analysis.

In this example, we use a subset of the cases from the SURVEY2 data file to address the question, “For males, does average income vary by education?” We focus on those who:

• Did not graduate from high school (HS dropout)

• Graduated from high school (HS grad)

• Attended some college (Some college)

• Graduated from college (College grad)

• Have an M.A. or Ph.D. (Degree +)

For each male subject (case) in the SURVEY2 data file, we use the variables INCOME and EDUC$. The means, standard deviations, and sample sizes for the five groups are shown below:

        HS dropout   HS grad   Some college   College grad   Degree +
mean    $13,389      $21,231   $29,294        $30,937        $38,214
sd       10,639       13,176    16,465         16,894         18,230
n            18           39        17             16             14

Visually, as you move across the groups, you see that average income increases. But considering the variability within each group, you might wonder if the differences are significant. Also, there is a relationship between the means and standard deviations—as the means increase, so do the standard deviations. They should be independent. If you take the square root of each income value, there is less variability among the standard deviations, and the relation between the means and standard deviations is weaker:

        HS dropout   HS grad   Some college   College grad   Degree +
mean    3.371        4.423     5.190          5.305          6.007
sd      1.465        1.310     1.583          1.725          1.516


A bar chart of the data shows the effect of the transformation. The input is:

USE survey2
SELECT sex$ = 'Male'
LABEL educatn / 1,2='HS dropout' 3='HS grad',
      4='Some college' 5='College grad',
      6,7='Degree +'
CATEGORY educatn
BEGIN
BAR income * educatn / SERROR FILL=.5 LOC=-3IN,0IN
BAR income * educatn / SERROR FILL=.35 YPOW=.5 LOC=3IN,0IN
END

The charts follow:

[Bar charts of INCOME by EDUCATN with standard error bars: raw scale on the left, square-root scale on the right]

In the chart on the left, you can see a relation between the height of the bars (means) and the length of the error bars (standard errors). The smaller means have shorter error bars than the larger means. After transformation, there is less difference in length among the error bars. The transformation helps eliminate the dependency between the group mean and the standard deviation.

To test for differences among the means:

ANOVA
LET sqrt_inc = SQR(income)
DEPEND sqrt_inc
ESTIMATE



The output is:

Dep Var: SQRT_INC  N: 104  Multiple R: 0.49  Squared multiple R: 0.24

Analysis of Variance
Source     Sum-of-Squares  df   Mean-Square  F-ratio  P
EDUCATN    68.62           4    17.16        7.85     0.00
Error      216.26          99   2.18

The ANOVA table using the transformed income as the dependent variable suggests a significant difference among the five means (p < 0.0005).

Tukey Pairwise Mean Comparisons

Which means differ? This example uses the Tukey method to identify significant differences in pairs of means. Usually, you reach the same conclusions using either the Tukey or Bonferroni method. However, when the number of comparisons is very large, the Tukey procedure may be more sensitive in detecting differences; when the number of comparisons is small, Bonferroni may be more sensitive. The input is:

ANOVA
LET sqrt_inc = SQR(income)
DEPEND sqrt_inc / TUKEY
ESTIMATE

The output follows:

COL/ROW  EDUCATN
1        HS dropout
2        HS grad
3        Some college
4        College grad
5        Degree +

Using least squares means.
Post Hoc test of SQRT_INC
-------------------------------------------------------------------------------
Using model MSE of 2.184 with 99 df.

Matrix of pairwise mean differences:
          1        2        3        4        5
 1     0.0
 2     1.052    0.0
 3     1.819    0.767    0.0
 4     1.935    0.883    0.116    0.0
 5     2.636    1.584    0.817    0.701    0.0

Tukey HSD Multiple Comparisons.
Matrix of pairwise comparison probabilities:
          1        2        3        4        5
 1     1.000
 2     0.100    1.000
 3     0.004    0.387    1.000
 4     0.002    0.268    0.999    1.000
 5     0.000    0.007    0.545    0.694    1.000


The layout of the output panels for the Tukey method is the same as that for the Bonferroni method. Look first at the probabilities at the bottom of the table. Four of the probabilities indicate significant differences (they are less than 0.05). In the first column, row 3, the average income for high school dropouts differs from those with some college (p = 0.004), from college graduates (p = 0.002), and also from those with advanced degrees (p < 0.0005). The fifth row shows that the difference between those with advanced degrees and the high school graduates is also significant (p = 0.007).

Contrasts

In this example, the five groups are ordered by their level of education, so you use these coefficients to test linear and quadratic contrasts:

             HS dropout  HS grad  Some college  College grad  Degree +
Linear           -2        -1          0             1            2
Quadratic         2        -1         -2            -1            2

Then you ask, "Is there a linear increase in average income across the five ordered levels of education?" "A quadratic change?" The input follows:

HYPOTHESIS
NOTE 'Test of linear contrast',
     'across ordered group means'
EFFECT = educatn
CONTRAST [-2 -1 0 1 2]
TEST

HYPOTHESIS
NOTE 'Test of quadratic contrast',
     'across ordered group means'
EFFECT = educatn
CONTRAST [2 -1 -2 -1 2]
TEST

SELECT


The resulting output is:

Test of linear contrast across ordered group means

Test for effect called: EDUCATN

A Matrix
         1        2        3        4        5
       0.0    -4.00    -3.00    -2.00    -1.00

Test of Hypothesis
Source      SS       df   MS      F       P
Hypothesis  63.54    1    63.54   29.09   0.00
Error       216.26   99   2.18
-------------------------------------------------------------------------------

Test of quadratic contrast across ordered group means

Test for effect called: EDUCATN

A Matrix
         1        2        3        4        5
       0.0      0.0    -3.00    -4.00    -3.00

Test of Hypothesis
Source      SS       df   MS      F      P
Hypothesis  2.20     1    2.20    1.01   0.32
Error       216.26   99   2.18

The F statistic for testing the linear contrast is 29.089 (p value < 0.0005); for testing the quadratic contrast, it is 1.008 (p value = 0.32). Thus, you can report that there is a highly significant linear increase in average income across the five levels of education and that you have not found a quadratic component in this increase.
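The printed A matrix rows differ from the coefficients you typed because of the effects coding: the fifth effect is determined by the others. Substituting $\alpha_5 = -(\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4)$ into the linear contrast reproduces the row SYSTAT prints:

$-2\alpha_1 - \alpha_2 + 0\alpha_3 + \alpha_4 + 2\alpha_5 = -4\alpha_1 - 3\alpha_2 - 2\alpha_3 - \alpha_4$

which matches (0  -4.00  -3.00  -2.00  -1.00); the quadratic row (0  0  -3.00  -4.00  -3.00) follows the same way. (The same coding mechanism is explained for the AFIFI data in Example 3.)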


Example 3 Two-Way ANOVA

Consider the following two-way analysis of variance design from Afifi and Azen (1972), cited in Kutner (1974), and reprinted in BMDP manuals. The dependent variable, SYSINCR, is the change in systolic blood pressure after administering one of four different drugs to patients with one of three different diseases. Patients were assigned randomly to one of the possible drugs. The data are stored in the SYSTAT file AFIFI.

To obtain a least-squares two-way analysis of variance:

USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE

Because this is a factorial design, ANOVA automatically generates an interaction term (DRUG * DISEASE). The output follows:

Dep Var: SYSINCR  N: 58  Multiple R: 0.675  Squared multiple R: 0.456

Analysis of Variance
Source         Sum-of-Squares  df  Mean-Square  F-ratio  P
DRUG           2997.472        3   999.157      9.046    0.000
DISEASE        415.873         2   207.937      1.883    0.164
DRUG*DISEASE   707.266         6   117.878      1.067    0.396
Error          5080.817        46  110.453


In two-way ANOVA, begin by examining the interaction. If the interaction is significant, you must condition your conclusions about a given factor’s effects on the level of the other factor. The DRUG * DISEASE interaction is not significant (p = 0.396), so shift your focus to the main effects.

The DRUG effect is significant (p < 0.0005), but the DISEASE effect is not (p = 0.164). Thus, at least one of the drugs differs from the others with respect to blood pressure change, but blood pressure change does not vary significantly across diseases.

For each factor, SYSTAT produces a plot of the average value of the dependent variable at each level of the factor. For the DRUG plot, drugs 1 and 2 yield similar average blood pressure changes. However, the average blood pressure change for drugs 3 and 4 is much lower. ANOVA tests the significance of the differences illustrated in this plot.

[Quick Graphs: least-squares means of SYSINCR plotted against DRUG and against DISEASE, plus interaction plots of SYSINCR by DRUG at each level of DISEASE]

For the DISEASE plot, we see a gradual decrease in blood pressure change across the three diseases. However, this effect is not significant; there is not enough variation among these means to overcome the variation due to individual differences.

In addition to the plot for each factor, SYSTAT also produces plots of the average blood pressure change at each level of DRUG for each level of disease. Use these plots to illustrate interaction effects. Although the interaction effect is not significant in this example, we can still examine these plots.

In general, we see a decline in blood pressure change across drugs. (Keep in mind that the drugs are only artificially ordered. We could reorder the drugs, and although the ANOVA results wouldn’t change, the plots would differ.) The similarity of the plots illustrates the nonsignificant interaction.

A close correspondence exists between the factor plots and the interaction plots. The means plotted in the factor plot for DISEASE correspond to the weighted average of the four points in each of the interaction plots. Similarly, each mean plotted in the DRUG factor plot corresponds to the weighted average of the three corresponding points across interaction plots. Consequently, the significant DRUG effect can be seen in the differing means in each interaction plot. Can you see the nonsignificant DISEASE effect in the interaction plots?

Least-Squares ANOVA

If you have an orthogonal design (equal number of cases in every cell), you will find that the ANOVA table is the same one you get with any standard program. SYSTAT can handle non-orthogonal designs, however (as in the present example). To understand the sources for sums of squares, you must know something about least-squares ANOVA.

As with one-way ANOVA, your specifying factor levels causes SYSTAT to create dummy variables out of the classifying input variable. SYSTAT creates one fewer dummy variables than categories specified.

Coding of the dummy variables is the classic analysis of variance parameterization, in which the sum of effects estimated for a classifying variable is 0 (Scheffé, 1959).

Page 470: Statistics I

I-450

Chapter 15

In our example, DRUG has four categories; therefore, SYSTAT creates three dummy variables with the following scores for subjects at each level:

 1  0  0   for DRUG = 1 subjects
 0  1  0   for DRUG = 2 subjects
 0  0  1   for DRUG = 3 subjects
-1 -1 -1   for DRUG = 4 subjects

Because DISEASE has three categories, SYSTAT creates two dummy variables coded as follows:

 1  0   for DISEASE = 1 subjects
 0  1   for DISEASE = 2 subjects
-1 -1   for DISEASE = 3 subjects

Now, because there are no continuous predictors in the model (unlike the analysis of covariance), you have a complete design matrix of dummy variables as follows (DRUG is labeled with an a, DISEASE with a b, and the grand mean with an m):

Treatment      Mean   DRUG          DISEASE   Interaction
 A   B          m     a1   a2   a3   b1   b2   a1b1 a1b2 a2b1 a2b2 a3b1 a3b2
 1   1          1      1    0    0    1    0    1    0    0    0    0    0
 1   2          1      1    0    0    0    1    0    1    0    0    0    0
 1   3          1      1    0    0   -1   -1   -1   -1    0    0    0    0
 2   1          1      0    1    0    1    0    0    0    1    0    0    0
 2   2          1      0    1    0    0    1    0    0    0    1    0    0
 2   3          1      0    1    0   -1   -1    0    0   -1   -1    0    0
 3   1          1      0    0    1    1    0    0    0    0    0    1    0
 3   2          1      0    0    1    0    1    0    0    0    0    0    1
 3   3          1      0    0    1   -1   -1    0    0    0    0   -1   -1
 4   1          1     -1   -1   -1    1    0   -1    0   -1    0   -1    0
 4   2          1     -1   -1   -1    0    1    0   -1    0   -1    0   -1
 4   3          1     -1   -1   -1   -1   -1    1    1    1    1    1    1

This example is used to explain how SYSTAT gets an error term for the ANOVA table. Because it is a least-squares program, the error term is taken from the residual sum of squares in the regression onto the above dummy variables. For non-orthogonal designs, this choice is identical to that produced by BMDP2V and SPSS GLM with Type III sums of squares. These, in general, will be the hypotheses you want to test on unbalanced experimental data.

Page 471: Statistics I

I-451

Linear Models I I : Analysis of Variance

You can construct other types of sums of squares by using an A matrix or by running your ANOVA model using the Stepwise options in GLM. Consult the references if you do not already know what these sums of squares mean.

Post Hoc Tests

It is evident that only the main effect for DRUG is significant; therefore, you might want to test some contrasts on the DRUG effects. A simple way would be to use the Bonferroni method to test all pairwise comparisons of marginal drug means. However, to compare three or more means, you must specify the particular contrast of interest. Here, we compare the first and third drugs, the first and fourth drugs, and the first two drugs with the last two drugs. The input is:

HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 -1 0]
TEST

HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 0 -1]
TEST

HYPOTHESIS
EFFECT = drug
CONTRAST [1 1 -1 -1]
TEST

You need four numbers in each contrast because DRUG has four levels. You cannot use CONTRAST to specify coefficients for interaction terms; it creates an A matrix only for main effects. Following are the results of the above hypothesis tests:

Test for effect called: DRUG

A Matrix
         1        2        3        4        5
       0.0    1.000      0.0   -1.000      0.0

         6        7        8        9       10
       0.0      0.0      0.0      0.0      0.0

        11       12
       0.0      0.0

Test of Hypothesis
Source      SS         df  MS         F        P
Hypothesis  1697.545   1   1697.545   15.369   0.000
Error       5080.817   46  110.453

-------------------------------------------------------------------------------

Page 472: Statistics I

I-452

Chapter 15

Test for effect called: DRUG

A Matrix
         1        2        3        4        5
       0.0    2.000    1.000    1.000      0.0

         6        7        8        9       10
       0.0      0.0      0.0      0.0      0.0

        11       12
       0.0      0.0

Test of Hypothesis
Source      SS         df  MS         F        P
Hypothesis  1178.892   1   1178.892   10.673   0.002
Error       5080.817   46  110.453

-------------------------------------------------------------------------------

Test for effect called: DRUG

A Matrix
         1        2        3        4        5
       0.0    2.000    2.000      0.0      0.0

         6        7        8        9       10
       0.0      0.0      0.0      0.0      0.0

        11       12
       0.0      0.0

Test of Hypothesis
Source      SS         df  MS         F        P
Hypothesis  2982.934   1   2982.934   27.006   0.000
Error       5080.817   46  110.453

Notice the A matrices in the output. SYSTAT automatically takes into account the degree of freedom lost in the design coding. Also, notice that you do not need to normalize contrasts or rows of the A matrix to unit vector length, as in some ANOVA programs. If you use (2 0 -2 0) or (0.707 0 -0.707 0) instead of (1 0 -1 0), you get the same sum of squares.

For the comparison of the first and third drugs, the F statistic is 15.369 (p value < 0.0005), indicating that these two drugs differ. Looking at the Quick Graphs produced earlier, we see that the change in blood pressure was much smaller for the third drug.

Notice that in the A matrix created by the contrast of the first and fourth drugs, you get (2 1 1) in place of the three design variables corresponding to the appropriate columns of the A matrix. Because you selected the reduced form for coding of design variables in which sums of effects are 0, you have the following restriction for the DRUG effects:

$\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 0$

Page 473: Statistics I

I-453

Linear Models I I : Analysis of Variance

where each $\alpha$ is the effect for that level of DRUG. This means that

$\alpha_4 = -(\alpha_1 + \alpha_2 + \alpha_3)$

and the contrast DRUG(1) - DRUG(4) is equivalent to

$\alpha_1 - [-(\alpha_1 + \alpha_2 + \alpha_3)] = 0$

which is

$2\alpha_1 + \alpha_2 + \alpha_3$

For the final contrast, SYSTAT transforms the (1 1 -1 -1) specification into contrast coefficients of (2 2 0) for the dummy coded variables. The p value (< 0.0005) indicates that the first two drugs differ from the last two drugs.

Simple Effects

You can do simple contrasts between drugs within levels of disease (although the lack of a significant DRUG * DISEASE interaction does not justify it). To show how it is done, consider a contrast between the first and third levels of DRUG for the first DISEASE only. You must specify the contrast in terms of the cell means. Use the terminology:

MEAN(DRUG index, DISEASE index) = M{i,j}

You want to contrast cell means M{1,1} and M{3,1}. These are composed of:

$M\{1,1\} = \mu + \alpha_1 + \beta_1 + \alpha\beta_{11}$
$M\{3,1\} = \mu + \alpha_3 + \beta_1 + \alpha\beta_{31}$

Therefore the difference between the two means is:

$M\{1,1\} - M\{3,1\} = \alpha_1 - \alpha_3 + \alpha\beta_{11} - \alpha\beta_{31}$

Now, if you consider the coding of the variables, you can construct an A matrix that picks up the appropriate columns of the design matrix. Here are the column labels of

Page 474: Statistics I

I-454

Chapter 15

the design matrix (a means DRUG and b means DISEASE), serving as a column ruler over the A matrix specified in the hypothesis:

m   a1   a2   a3   b1   b2   a1b1 a1b2 a2b1 a2b2 a3b1 a3b2
0    1    0   -1    0    0    1    0    0    0   -1    0

The corresponding input is:

HYPOTHESIS
AMATRIX [0 1 0 -1 0 0 1 0 0 0 -1 0]
TEST

The output follows:

Hypothesis.

A Matrix
         1        2        3        4        5
       0.0    1.000      0.0   -1.000      0.0

         6        7        8        9       10
       0.0    1.000      0.0      0.0      0.0

        11       12
    -1.000      0.0

Test of Hypothesis
Source      SS         df  MS        F       P
Hypothesis  338.000    1   338.000   3.060   0.087
Error       5080.817   46  110.453

After you understand how SYSTAT codes design variables and how the model sentence orders them, you can take any standard ANOVA text like Winer (1971) or Scheffé (1959) and construct an A matrix for any linear contrast.

Contrasting Marginal and Cell Means

Now look at how to contrast cell means directly without being concerned about how they are coded. Test the first level of DRUG against the third (contrasting the marginal means) with the following input:

HYPOTHESIS
SPECIFY drug[1] = drug[3]
TEST

Page 475: Statistics I

I-455

Linear Models I I : Analysis of Variance

To contrast the first level against the fourth:

HYPOTHESIS
SPECIFY drug[1] = drug[4]
TEST

Finally, here is the simple contrast of the first and third levels of DRUG for the first DISEASE only:

HYPOTHESIS
SPECIFY drug[1] disease[1] = drug[3] disease[1]
TEST

Screening Results

Let's examine the AFIFI data in more detail. To analyze the residuals and examine the ANOVA assumptions, first plot the residuals against estimated values (cell means) to check for homogeneity of variance. Use the Studentized residuals to reference them against a t distribution. In addition, stem-and-leaf plots of the residuals and boxplots of the dependent variable aid in identifying outliers. The input is:

USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE
DENSITY sysincr * drug / BOX
USE myresids
PLOT student*estimate / SYM=1 FILL=1
STATISTICS
STEM student

Page 476: Statistics I

I-456

Chapter 15

Stem and Leaf Plot of variable: STUDENT, N = 58

Minimum:      -2.647
Lower hinge:  -0.761
Median:        0.101
Upper hinge:   0.698
Maximum:       1.552

        -2   6
        -2
        -1   987666
        -1   410
        -0 H 9877765
        -0   4322220000
         0 M 001222333444
         0 H 55666888
         1   011133444
         1   55

The plots suggest the presence of an outlier. The smallest value in the stem-and-leaf plot seems to be out of line. A t statistic value of -2.647 corresponds to p < 0.01, and you would not expect a value this small to show up in a sample of only 58 independent values. In the scatterplot, the point corresponding to this value appears at the bottom and badly skews the data in its cell (which happens to be DRUG 1, DISEASE 3). The outlier in the first group clearly stands out in the boxplot, too.

To see the effect of this outlier, delete the observation with the outlying Studentized residual and run the analysis again. Following is the ANOVA output for the revised data:

Dep Var: SYSINCR  N: 57  Multiple R: 0.710  Squared multiple R: 0.503

Analysis of Variance
Source         Sum-of-Squares  df  Mean-Square  F-ratio  P
DRUG           3344.064        3   1114.688     11.410   0.000
DISEASE        232.826         2   116.413      1.192    0.313
DRUG*DISEASE   676.865         6   112.811      1.155    0.347
Error          4396.367        45  97.697

The differences are not substantial. Nevertheless, notice that the DISEASE effect is substantially attenuated when only one case out of 58 is deleted. Daniel (1960) gives an example in which one outlying case alters the fundamental conclusions of a designed experiment. The F test is robust to certain violations of assumptions, but factorial ANOVA is not robust against outliers. You should routinely do these plots for ANOVA.

Page 477: Statistics I

I-457

Linear Models I I : Analysis of Variance

Example 4 Single-Degree-of-Freedom Designs

The data in the REACT file involve yields of a chemical reaction under various combinations of four binary factors (A, B, C, and D). Two reactions were observed under each combination of experimental factors, so the number of cases per cell is two. To analyze these data in a four-way ANOVA:

USE react
ANOVA
CATEGORY a, b, c, d
DEPEND yield
ESTIMATE

You can see the advantage of ANOVA over GLM when you have several factors: you have to select only the main effects. With GLM, you would have to specify all of the interactions and identify which variables are categorical (that is, A, B, C, and D). The following is the full model using GLM:

MODEL yield = CONSTANT + a + b + c + d +,
      a*b + a*c + a*d + b*c + b*d + c*d +,
      a*b*c + a*b*d + a*c*d + b*c*d +,
      a*b*c*d

The ANOVA output follows:

Dep Var: YIELD  N: 32  Multiple R: 0.755  Squared multiple R: 0.570

Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio  P
A           369800.000      1   369800.000   4.651    0.047
B           1458.000        1   1458.000     0.018    0.894
C           5565.125        1   5565.125     0.070    0.795
D           172578.125      1   172578.125   2.170    0.160
A*B         87153.125       1   87153.125    1.096    0.311
A*C         137288.000      1   137288.000   1.727    0.207
A*D         328860.500      1   328860.500   4.136    0.059
B*C         61952.000       1   61952.000    0.779    0.390
B*D         3200.000        1   3200.000     0.040    0.844
C*D         3160.125        1   3160.125     0.040    0.844
A*B*C       81810.125       1   81810.125    1.029    0.326
A*B*D       4753.125        1   4753.125     0.060    0.810
A*C*D       415872.000      1   415872.000   5.230    0.036
B*C*D       4.500           1   4.500        0.000    0.994
A*B*C*D     15051.125       1   15051.125    0.189    0.669
Error       1272247.000     16  79515.437

The output shows a significant main effect for the first factor (A) plus one significant interaction (A*C*D).

Page 478: Statistics I

I-458

Chapter 15

Assessing Normality

Let’s look at the study more closely. Because this is a single-degree-of-freedom study (a 2n factorial), each effect estimate is normally distributed if the usual assumptions for the experiment are valid. All of the effects estimates, except the constant, have zero mean and common variance (because dummy variables were used in their computation). Thus, you can compare them to a normal distribution. SYSTAT remembers your last selections, so the input is:

SAVE effects / COEF
ESTIMATE

This reestimates the model and saves the regression coefficients (effects). The file has one case with 16 variables (CONSTANT plus 15 effects). The effects are labeled X(1), X(2), and so on because they are related to the dummy variables, not the original variables A, B, C, and D. Let's transpose this file into a new file containing only the 15 effects and create a probability plot of the effects. The input is:

USE effects
DROP constant
TRANSPOSE
SELECT case > 1
PPLOT col(1) / FILL=1 SYMBOL=1,
      XLABEL="Estimates of Effects"

The resulting plot is:

[Normal probability plot of the 15 effect estimates]

These effects are indistinguishable from a random normal variable. They plot almost on a straight line. What does it mean for the study and for the significant F tests?

Page 479: Statistics I

I-459

Linear Models I I : Analysis of Variance

It’s time to reveal that the data were produced by a random number generator.

� If you are doing a factorial analysis of variance, the p values you see on the output are not adjusted for the number of factors. If you do a three-way design, look at seven tests (excluding the constant). For a four-way design, examine 15 tests. Out of 15 F tests on random data, expect to find at least one test approaching significance. You have two significant and one almost significant, which is not far out of line. The probabilities for each separate F test need to be corrected for the experimentwise error rate. Some authors devote entire chapters to fine distinctions between multiple comparison procedures and then illustrate them within a multifactorial design not corrected for the experimentwise error rate just demonstrated. Remember that a factorial design is a multiple comparison. If you have a single-degree-of-freedom study, use the procedure you used to draw a probability plot of the effects. Any effect that is really significant will become obvious.

� If you have a factorial study with more degrees of freedom on some factors, use the Bonferroni critical value for deciding which effects are significant. It guarantees that the Type I error rate for the study will be no greater than the level you choose. In the above example, this value is 0.05 / 15 (that is, 0.003); the arithmetic is worked out after this list.

� Multiple F tests based on a common denominator (mean-square error in this example) are correlated. This complicates the problem further. In general, the greater the discrepancy between numerator and denominator degrees of freedom and the smaller the denominator degrees of freedom, the greater the dependence of the tests. The Bonferroni tests are best in this situation, although Feingold and Korsog (1986) offer some useful alternatives.
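The arithmetic behind the first two bullets, assuming for the moment that the 15 tests were independent (the last bullet explains that they are not, so the second figure is only a rough guide):

$$\alpha^{*} = \frac{0.05}{15} \approx 0.0033, \qquad P(\text{at least one } p < 0.05 \mid H_0) = 1 - 0.95^{15} \approx 0.54.$$

Against the Bonferroni cutoff of 0.0033, even the smallest p value in the ANOVA table above (0.036 for A*C*D) does not survive, which is consistent with the data being random.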

Example 5 Mixed Models

Mixed models involve combinations of fixed and random factors in an ANOVA. Fixed factors are assumed to be composed of an exhaustive set of categories (for example, males and females), while random factors have category levels that are assumed to have been randomly sampled from a larger population of categories (for example, classrooms or word stems). Because of the mixing of fixed and random components, expected mean squares for certain effects are different from those for fully fixed or fully random designs. SYSTAT can handle mixed models because you can specify error terms for specific hypotheses.
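Why an interaction can serve as an error term: under the standard mixed-model expected mean squares (a textbook sketch, not SYSTAT's printed output), for a design with fixed factor A (a levels), random factor B (b levels), and n observations per cell,

$$E(MS_A) = \sigma^2 + n\,\sigma^2_{AB} + \frac{nb}{a-1}\sum_i \alpha_i^2, \qquad E(MS_{AB}) = \sigma^2 + n\,\sigma^2_{AB},$$

so F = MS_A / MS_AB isolates the fixed effect of A. This is exactly the DRUG versus DRUG * DISEASE test formed in the example below.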

Page 480: Statistics I

I-460

Chapter 15

For example, let's analyze the AFIFI data with a mixed model instead of a fully fixed factorial. Here, you are interested in the four drugs as broad-spectrum treatments. Because each drug is now thought to be effective against diseases in general, you have sampled three random diseases to assess the drugs. This implies that DISEASE is a random factor and DRUG remains a fixed factor. In this case, the error term for DRUG is the DRUG * DISEASE interaction. To begin, run the same analysis performed in the two-way example to get the ANOVA table. To test for the DRUG effect, specify drug * disease as the error term in a hypothesis test. The input is:

The output is:

Notice that the SS, df, and MS for the error term in the hypothesis test correspond to the values for the interaction in the ANOVA table.

USE afifi
ANOVA
CATEGORY drug, disease
DEPEND sysincr
ESTIMATE

HYPOTHESIS
EFFECT = drug
ERROR = drug*disease
TEST

Dep Var: SYSINCR   N: 58   Multiple R: 0.675   Squared multiple R: 0.456

Analysis of Variance
Source         Sum-of-Squares   df   Mean-Square   F-ratio      P
DRUG                 2997.472    3       999.157     9.046  0.000
DISEASE               415.873    2       207.937     1.883  0.164
DRUG*DISEASE          707.266    6       117.878     1.067  0.396
Error                5080.817   46       110.453

Test for effect called: DRUG

Test of Hypothesis
Source       SS         df   MS        F       P
Hypothesis   2997.472    3   999.157   8.476   0.014
Error         707.266    6   117.878


Example 6 Separate Variance Hypothesis Tests

The data in the MJ20 data file are from Milliken and Johnson (1984). They are the results of a paired-associate learning task. GROUP describes the type of drug administered; LEARNING is the amount of material learned during testing. First we perform Levene’s test (Levene, 1960) to determine if the variances are equal across cells. The input is:

Following is the ANOVA table of the absolute residuals:

Notice that the F is significant, indicating that the separate variances test is advisable. Let’s do several single-degree-of-freedom tests, following Milliken and Johnson. The first is for comparing all drugs against the control; the second tests the hypothesis that groups 2 and 3 together are not significantly different from group 4. The input is:

USE mj20
ANOVA
SAVE mjresids / RESID DATA
DEPEND learning
CATEGORY group
ESTIMATE

USE mjresids
LET residual = ABS(residual)
CATEGORY group
DEPEND residual
ESTIMATE

Dep Var: RESIDUAL   N: 29   Multiple R: 0.675   Squared multiple R: 0.455

Analysis of Variance
Source   Sum-of-Squares   df   Mean-Square   F-ratio      P
GROUP            30.603    3        10.201     6.966  0.001
Error            36.608   25         1.464

USE mj20
ANOVA
CATEGORY group
DEPEND learning
ESTIMATE

HYPOTHESIS
SPECIFY 3*group[1] = group[2] + group[3] + group[4] / SEPARATE
TEST

HYPOTHESIS
SPECIFY 2*group[4] = group[2] + group[3] / SEPARATE
TEST


Following is the output. The ANOVA table has been omitted because it is not valid when variances are unequal.

Using separate variances estimate for error term.

Hypothesis.
A Matrix
        1        2        3        4
      0.0   -4.000      0.0      0.0

Null hypothesis value for D      0.0

Test of Hypothesis
Source     SS        df       MS        F        P
Hypoth     242.720    1       242.720   18.115   0.004
Error       95.085    7.096    13.399
-------------------------------------------------------------------------------
Using separate variances estimate for error term.

Hypothesis.
A Matrix
        1        2        3        4
      0.0    2.000    3.000    3.000

Null hypothesis value for D      0.0

Test of Hypothesis
Source     SS       df        MS       F        P
Hypoth     65.634    1        65.634   17.819   0.001
Error      61.852   16.792     3.683
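The fractional error degrees of freedom (7.096 and 16.792) are the signature of a separate-variances test. The manual does not spell out SYSTAT's exact formula here, but fractional degrees of freedom of this kind typically come from a Welch-Satterthwaite-style approximation; as a sketch, for a contrast with weights c_i, group variances s_i², and group sizes n_i,

$$\nu \approx \frac{\left(\sum_i c_i^2 s_i^2/n_i\right)^2}{\sum_i \left(c_i^2 s_i^2/n_i\right)^2 / (n_i - 1)}.$$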

Example 7 Analysis of Covariance

Winer (1971) uses the COVAR data file for an analysis of covariance in which X is the covariate and TREAT is the treatment. Cases do not need to be ordered by the grouping variable TREAT.

Before analyzing the data with an analysis of covariance model, be sure there is no significant interaction between the covariate and the treatment. The assumption of no interaction is often called the homogeneity of slopes assumption because it is tantamount to saying that the slope of the regression line of the dependent variable onto the covariate should be the same in all cells of the design.



Parallelism is easy to test with a preliminary model. Use GLM to estimate this model with the interaction between treatment (TREAT) and covariate (X) in the model. The input is:

The output follows:

The probability value for the treatment by covariate interaction is 0.605, so the assumption of homogeneity of slopes is plausible.

Now, fit the usual analysis of covariance model by specifying:

For incomplete factorials and similar designs, you still must specify a model (using GLM) to do analysis of covariance.

The output follows:

USE covar
GLM
CATEGORY treat
MODEL y = CONSTANT + treat + x + treat*x
ESTIMATE

Dep Var: Y   N: 21   Multiple R: 0.921   Squared multiple R: 0.849

Analysis of Variance
Source    Sum-of-Squares   df   Mean-Square   F-ratio      P
TREAT              6.693    2         3.346     5.210  0.019
X                 15.672    1        15.672    24.399  0.000
TREAT*X            0.667    2         0.334     0.519  0.605
Error              9.635   15         0.642

USE covar
ANOVA
PRINT=MEDIUM
CATEGORY treat
DEPEND y
COVARIATE x
ESTIMATE

Dep Var: Y   N: 21   Multiple R: 0.916   Squared multiple R: 0.839

Analysis of Variance
Source   Sum-of-Squares   df   Mean-Square   F-ratio      P
TREAT            16.932    2         8.466    13.970  0.000
X                16.555    1        16.555    27.319  0.000
Error            10.302   17         0.606

Adjusted least squares means
             Adj. LS Mean      SE   N
TREAT = 1           4.888   0.307   7
TREAT = 2           7.076   0.309   7
TREAT = 3           6.750   0.294   7

-------------------------------------------------------------------------------


The treatment adjusted for the covariate is significant. There is a significant difference among the three treatment groups. Also, notice that the coefficient for the covariate is significant (F = 27.319, p < 0.0005). If it were not, the analysis of covariance could be taking away a degree of freedom without reducing mean-square error enough to help you.

SYSTAT computes the adjusted cell means the same way it computes estimates when saving residuals. The grand mean (CONSTANT) is included, and each model term that does not contain a categorical variable (that is, each covariate term) contributes the product of its coefficient and the mean of that term.
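In symbols, for a one-way design with a single covariate (a restatement of the rule above, assuming the model y = b₀ + αⱼ + b_x·x + e), the adjusted mean for cell j is

$$\bar{y}_j^{\,\mathrm{adj}} = b_0 + \hat{\alpha}_j + b_x \bar{x},$$

where x̄ is the grand mean of the covariate.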

Example 8 One-Way Repeated Measures

In this example, six rats were weighed at the end of each of five weeks. A plot of each rat’s weight over the duration of the experiment follows:

ANOVA is the simplest way to analyze this one-way model. Because we have no categorical variable(s), SYSTAT generates only the constant (grand mean) in the model.


[Line plot showing WEIGHT(1) through WEIGHT(5) for each rat; x axis: Trial, y axis: Measure, scaled 0 to 12.]


To obtain individual single-degree-of-freedom orthogonal polynomials, the input is:

The output follows:

USE rats
ANOVA
DEPEND weight(1 .. 5) / REPEAT NAME="Time"
PRINT MEDIUM
ESTIMATE

Number of cases processed: 6

Dependent variable means
WEIGHT(1)   WEIGHT(2)   WEIGHT(3)   WEIGHT(4)   WEIGHT(5)
    2.500       5.833       7.167       8.000       8.333
-------------------------------------------------------------------------------
Univariate and Multivariate Repeated Measures Analysis

Within Subjects
---------------
Source   SS        df   MS       F        P       G-G     H-F
Time     134.467    4   33.617   16.033   0.000   0.004   0.002
Error     41.933   20    2.097

Greenhouse-Geisser Epsilon: 0.3420
Huynh-Feldt Epsilon       : 0.4273
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)
Source   SS        df   MS        F        P
Time     114.817    1   114.817   38.572   0.002
Error     14.883    5     2.977

Polynomial Test of Order 2 (Quadratic)
Source   SS       df   MS       F       P
Time     18.107    1   18.107   7.061   0.045
Error    12.821    5    2.564

Polynomial Test of Order 3 (Cubic)
Source   SS      df   MS      F       P
Time     1.350    1   1.350   0.678   0.448
Error    9.950    5   1.990


The Huynh-Feldt p value (0.002) differs little from the unadjusted p value for the F statistic. Compound symmetry appears to be satisfied, and weight changes significantly over the five trials.

The polynomial tests indicate that most of the trials effect can be accounted for by a linear trend across time. In fact, the sum of squares for TIME is 134.467, and the sum of squares for the linear trend is almost as large (114.817). Thus, the linear polynomial accounts for roughly 85% of the change across the repeated measures.
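The proportion follows directly from the sums of squares above:

$$\frac{SS_{\mathrm{linear}}}{SS_{\mathrm{Time}}} = \frac{114.817}{134.467} \approx 0.854.$$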

Unevenly Spaced Polynomials

Sometimes the underlying metric of the profiles is not evenly spaced. Let’s assume that the fifth weight was measured after the tenth week instead of the fifth. In that case, the default polynomials have to be adjusted for the uneven spacing. These adjustments do not affect the overall repeated measures tests of each effect (univariate or multivariate), but they partition the sums of squares differently for the single-degree-of-freedom tests. The input is:

Alternatively, you could request a hypothesis test, specifying the metric for the polynomials:

Polynomial Test of Order 4
Source   SS      df   MS      F       P
Time     0.193    1   0.193   0.225   0.655
Error    4.279    5   0.856
-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Time            Hypoth. df   Error df   F        P
Wilks' Lambda =  0.011        4          2       43.007   0.023
Pillai Trace  =  0.989        4          2       43.007   0.023
H-L Trace     = 86.014        4          2       43.007   0.023

USE rats
ANOVA
DEPEND weight(1 .. 5) / REPEAT=5(1 2 3 4 10) NAME="Time"
PRINT MEDIUM
ESTIMATE

HYPOTHESIS
WITHIN='Time'
CONTRAST / POLYNOMIAL METRIC=1,2,3,4,10
TEST


The last point has been spread out further to the right. The output follows:

The significance tests for the linear and quadratic trends differ from those for the evenly spaced polynomials. Before, the linear trend was strongest; now, the quadratic polynomial has the most significant results (F = 107.9, p < 0.0005).

You may have noticed that although the univariate F tests for the polynomials are different, the multivariate test is unchanged. The latter measures variation across all components. The ANOVA table for the combined components is not affected by the metric of the polynomials.

Univariate and Multivariate Repeated Measures Analysis

Within Subjects
---------------
Source   SS        df   MS       F        P       G-G     H-F
Time     134.467    4   33.617   16.033   0.000   0.004   0.002
Error     41.933   20    2.097

Greenhouse-Geisser Epsilon: 0.3420
Huynh-Feldt Epsilon       : 0.4273
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)
Source   SS       df   MS       F        P
Time     67.213    1   67.213   23.959   0.004
Error    14.027    5    2.805

Polynomial Test of Order 2 (Quadratic)
Source   SS       df   MS       F         P
Time     62.283    1   62.283   107.867   0.000
Error     2.887    5    0.577

(We omit the cubic and quartic polynomial output.)
-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Time            Hypoth. df   Error df   F        P
Wilks' Lambda =  0.011        4          2       43.007   0.023
Pillai Trace  =  0.989        4          2       43.007   0.023
H-L Trace     = 86.014        4          2       43.007   0.023


Difference Contrasts

If you do not want to use polynomials, you can specify a C matrix that contrasts adjacent weeks. After estimating the model, input the following:

The output is:

Notice the C matrix that this command generates. Each univariate F test assesses the difference between the adjacent weeks indexed by the corresponding row of the C matrix. For example, F = 17.241 shows that the first and second weeks differ significantly. The third and fourth weeks do not differ (F = 0.566). Unlike polynomials, these contrasts are not orthogonal.

HYPOTHESIS
WITHIN='Time'
CONTRAST / DIFFERENCE
TEST

Hypothesis.
C Matrix
         1        2        3        4        5
 1   1.000   -1.000      0.0      0.0      0.0
 2     0.0    1.000   -1.000      0.0      0.0
 3     0.0      0.0    1.000   -1.000      0.0
 4     0.0      0.0      0.0    1.000   -1.000

Univariate F Tests
Effect   SS       df   MS       F        P
1        66.667    1   66.667   17.241   0.009
Error    19.333    5    3.867
2        10.667    1   10.667   40.000   0.001
Error     1.333    5    0.267
3         4.167    1    4.167    0.566   0.486
Error    36.833    5    7.367
4         0.667    1    0.667    2.500   0.175
Error     1.333    5    0.267

Multivariate Test Statistics
Wilks' Lambda          =  0.011   F-Statistic = 43.007   df = 4, 2   Prob = 0.023
Pillai Trace           =  0.989   F-Statistic = 43.007   df = 4, 2   Prob = 0.023
Hotelling-Lawley Trace = 86.014   F-Statistic = 43.007   df = 4, 2   Prob = 0.023


Summing Effects

To sum across weeks:

The output is:

In this example, you are testing whether the overall weight (across weeks) significantly differs from 0. Naturally, the F value is significant. Notice the C matrix that is generated. It is simply a set of 1's that, in the equation BC' = 0, sums all the coefficients in B. In a group-by-trials design, this C matrix is useful for pooling trials and analyzing group effects.
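Concretely, with C = (1, 1, 1, 1, 1), the hypothesis BC' = 0 states that the five coefficients (here, the weekly means) sum to 0:

$$\mathbf{B}\mathbf{C}' = \sum_{j=1}^{5} b_j = 0.$$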

Custom Contrasts

To test arbitrary contrasts between dependent variables, you can use a C matrix, which has the same form as an A matrix (but without a column for the CONSTANT). The following commands test a linear trend across the five trials:

The output is:

HYPOTHESIS
WITHIN='Time'
CONTRAST / SUM
TEST

Hypothesis.
C Matrix
       1       2       3       4       5
   1.000   1.000   1.000   1.000   1.000

Test of Hypothesis
Source       SS         df   MS         F         P
Hypothesis   6080.167    1   6080.167   295.632   0.000
Error         102.833    5     20.567

HYPOTHESIS
CMATRIX [-2 -1 0 1 2]
TEST

Hypothesis.
C Matrix
        1        2       3       4       5
   -2.000   -1.000     0.0   1.000   2.000

Test of Hypothesis
Source       SS         df   MS         F        P
Hypothesis   1148.167    1   1148.167   38.572   0.002
Error         148.833    5     29.767


Example 9 Repeated Measures ANOVA for One Grouping Factor and One Within Factor with Ordered Levels

The following example uses estimates of population for 1983, 1986, and 1990 and projections for 2020 for 57 countries from the OURWORLD data file. The data are log transformed before analysis. Here you compare trends in population growth for European and Islamic countries. The variable GROUP$ contains codes for these groups plus a third code for New World countries (we exclude these countries from this analysis). To create a bar chart of the data after using YLOG to log transform them:

To perform a repeated measures analysis:

USE ourworld
SELECT group$ <> 'NewWorld'
BAR pop_1983 .. pop_2020 / REPEAT OVERLAY YLOG,
    GROUP=group$ SERROR FILL=.35, .8

USE ourworld
ANOVA
SELECT group$ <> 'NewWorld'
CATEGORY group$
LET (pop_1983, pop_1986, pop_1990, pop_2020) = L10(@)
DEPEND pop_1983 pop_1986 pop_1990 pop_2020 / REPEAT=4 NAME='Time'
ESTIMATE

[Bar chart of POP_1983 through POP_2020 on a log scale (1.0 to 100.0), grouped by GROUP$ (Europe, Islamic); x axis: Trial, y axis: Measure.]


The output follows:

The within-subjects results indicate highly significant linear, quadratic, and cubic changes across time. The pattern of change across time for the two groups also differs significantly (that is, the TIME * GROUP$ interactions are highly significant for all three tests).

Notice that there is a larger gap in time between 1990 and 2020 than between the other values. Let’s incorporate “real time” in the analysis with the following specification:

Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)
Source        SS      df   MS      F         P
Time          0.675    1   0.675   370.761   0.000
Time*GROUP$   0.583    1   0.583   320.488   0.000
Error         0.062   34   0.002

Polynomial Test of Order 2 (Quadratic)
Source        SS      df   MS      F        P
Time          0.132    1   0.132   92.246   0.000
Time*GROUP$   0.128    1   0.128   89.095   0.000
Error         0.049   34   0.001

Polynomial Test of Order 3 (Cubic)
Source        SS      df   MS      F        P
Time          0.028    1   0.028   96.008   0.000
Time*GROUP$   0.027    1   0.027   94.828   0.000
Error         0.010   34   0.000
-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Time            Hypoth. df   Error df   F         P
Wilks' Lambda =  0.063        3          32      157.665   0.000
Pillai Trace  =  0.937        3          32      157.665   0.000
H-L Trace     = 14.781        3          32      157.665   0.000

Test of: Time*GROUP$     Hypoth. df   Error df   F         P
Wilks' Lambda =  0.076        3          32      130.336   0.000
Pillai Trace  =  0.924        3          32      130.336   0.000
H-L Trace     = 12.219        3          32      130.336   0.000

DEPEND pop_1983 pop_1986 pop_1990 pop_2020 / REPEAT=4(83,86,90,120),
    NAME='TIME'

ESTIMATE


The results for the orthogonal polynomials are shown below:

When the values for POP_2020 are positioned on a real time line, the tests for quadratic and cubic polynomials are no longer significant. The test for the linear TIME * GROUP$ interaction, however, remains highly significant, indicating that the slope across time for the Islamic group is significantly steeper than that for the European countries.

Example 10 Repeated Measures ANOVA for Two Grouping Factors and One Within Factor

Repeated measures enables you to handle grouping factors automatically. The following example is from Winer (1971). There are two grouping factors (ANXIETY and TENSION) and one trials factor in the file REPEAT1. Following is a dot display of the average responses across trials for each of the four combinations of ANXIETY and TENSION.

Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)
Source        SS      df   MS      F         P
TIME          0.831    1   0.831   317.273   0.000
TIME*GROUP$   0.737    1   0.737   281.304   0.000
Error         0.089   34   0.003

Polynomial Test of Order 2 (Quadratic)
Source        SS      df   MS      F       P
TIME          0.003    1   0.003   4.402   0.043
TIME*GROUP$   0.001    1   0.001   1.562   0.220
Error         0.025   34   0.001

Polynomial Test of Order 3 (Cubic)
Source        SS      df   MS      F       P
TIME          0.000    1   0.000   1.653   0.207
TIME*GROUP$   0.000    1   0.000   1.733   0.197
Error         0.006   34   0.000


The input is:

The model also includes an interaction between the grouping factors (ANXIETY * TENSION). The output follows:

USE repeat1
ANOVA
DOT trial(1 .. 4) / GROUP=anxiety,tension,
    LINE, REPEAT, SERROR
CATEGORY anxiety tension
DEPEND trial(1 .. 4) / REPEAT NAME='Trial'
PRINT MEDIUM
ESTIMATE

Univariate and Multivariate Repeated Measures Analysis

Between Subjects
----------------
Source            SS       df   MS       F       P
ANXIETY           10.083    1   10.083   0.978   0.352
TENSION            8.333    1    8.333   0.808   0.395
ANXIETY*TENSION   80.083    1   80.083   7.766   0.024
Error             82.500    8   10.313

[Dot display of mean TRIAL(1) through TRIAL(4), one panel for each ANXIETY,TENSION combination (1,1; 1,2; 2,1; 2,2); x axis: Trial, y axis: Measure, scaled 0 to 20.]


Within Subjects
---------------
Source                  SS        df   MS        F         P       G-G     H-F
Trial                   991.500    3   330.500   152.051   0.000   0.000   0.000
Trial*ANXIETY             8.417    3     2.806     1.291   0.300   0.300   0.301
Trial*TENSION            12.167    3     4.056     1.866   0.162   0.197   0.169
Trial*ANXIETY*TENSION    12.750    3     4.250     1.955   0.148   0.185   0.155
Error                    52.167   24     2.174

Greenhouse-Geisser Epsilon: 0.5361
Huynh-Feldt Epsilon       : 0.9023
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)
Source                  SS        df   MS        F         P
Trial                   984.150    1   984.150   247.845   0.000
Trial*ANXIETY             1.667    1     1.667     0.420   0.535
Trial*TENSION            10.417    1    10.417     2.623   0.144
Trial*ANXIETY*TENSION     9.600    1     9.600     2.418   0.159
Error                    31.767    8     3.971

Polynomial Test of Order 2 (Quadratic)
Source                  SS       df   MS      F       P
Trial                    6.750    1   6.750   3.411   0.102
Trial*ANXIETY            3.000    1   3.000   1.516   0.253
Trial*TENSION            0.083    1   0.083   0.042   0.843
Trial*ANXIETY*TENSION    0.333    1   0.333   0.168   0.692
Error                   15.833    8   1.979

Polynomial Test of Order 3 (Cubic)
Source                  SS      df   MS      F       P
Trial                   0.600    1   0.600   1.051   0.335
Trial*ANXIETY           3.750    1   3.750   6.569   0.033
Trial*TENSION           1.667    1   1.667   2.920   0.126
Trial*ANXIETY*TENSION   2.817    1   2.817   4.934   0.057
Error                   4.567    8   0.571
-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Trial                   Hypoth. df   Error df   F         P
Wilks' Lambda =  0.015                3          6       127.686   0.000
Pillai Trace  =  0.985                3          6       127.686   0.000
H-L Trace     = 63.843                3          6       127.686   0.000

Test of: Trial*ANXIETY           Hypoth. df   Error df   F       P
Wilks' Lambda = 0.244                 3          6       6.183   0.029
Pillai Trace  = 0.756                 3          6       6.183   0.029
H-L Trace     = 3.091                 3          6       6.183   0.029

Test of: Trial*TENSION           Hypoth. df   Error df   F       P
Wilks' Lambda = 0.361                 3          6       3.546   0.088
Pillai Trace  = 0.639                 3          6       3.546   0.088
H-L Trace     = 1.773                 3          6       3.546   0.088

Test of: Trial*ANXIETY*TENSION   Hypoth. df   Error df   F       P
Wilks' Lambda = 0.328                 3          6       4.099   0.067
Pillai Trace  = 0.672                 3          6       4.099   0.067
H-L Trace     = 2.050                 3          6       4.099   0.067


In the within-subjects table, you see that the trial effect is highly significant (F = 152.1, p < 0.0005). Below that table, we see that the linear trend across trials (Polynomial Order 1) is highly significant (F = 247.8, p < 0.0005). The hypothesis sums of squares for the linear, quadratic, and cubic polynomials sum to the total hypothesis sum of squares for trials (that is, 984.15 + 6.75 + 0.60 = 991.5). Notice that the total sum of squares is 991.5, while that for the linear trend is 984.15. This means that the linear trend accounts for more than 99% of the variability across the four trials. The assumption of compound symmetry is not required for the test of linear trend—so you can report that there is a highly significant linear decrease across the four trials (F = 247.8, p < 0.0005).
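The "more than 99%" figure is just the ratio of the two sums of squares:

$$\frac{SS_{\mathrm{linear}}}{SS_{\mathrm{Trial}}} = \frac{984.150}{991.500} \approx 0.993.$$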

Example 11 Repeated Measures ANOVA for Two Trial Factors

Repeated measures enables you to handle several trials factors, so we include an example with two trials factors. It is an experiment from Winer (1971), which has one grouping factor (NOISE) and two trials factors (PERIODS and DIALS). The trials factors must be arranged as a set of dependent variables, one for each pairing of levels of the two factors. It is useful to label the levels with a convenient mnemonic. The file is set up with variables P1D1 through P3D3; variable P1D2, for example, holds a score in the PERIODS = 1, DIALS = 2 cell. The data are in the file REPEAT2.



The input is:

Notice that REPEAT specifies that the two trials factors have three levels each. ANOVA assumes that the subscript of the first factor varies slowest in the ordering of the dependent variables. If you have two repeated factors (DAY with four levels and AMPM with two levels), you should select eight dependent variables and type Repeat=4,2. The repeated measures are selected in the following order:

From this indexing, it generates the proper main effects and interactions. When more than one trial factor is present, ANOVA lists each dependent variable and the associated level on each factor. The output follows:

USE repeat2
ANOVA
CATEGORY noise
DEPEND p1d1 .. p3d3 / REPEAT=3,3 NAMES='period','dial'
PRINT MEDIUM
ESTIMATE

DAY1_AM DAY1_PM DAY2_AM DAY2_PM DAY3_AM DAY3_PM DAY4_AM DAY4_PM
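A hedged sketch of how the DAY-by-AMPM design just described might be declared (the data file and grouping variable are assumptions; the REPEAT syntax follows the example above):

USE daydata
ANOVA
CATEGORY group
DEPEND day1_am .. day4_pm / REPEAT=4,2 NAMES='day','ampm'
ESTIMATE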

Dependent variable means
     P1D1     P1D2     P1D3     P2D1     P2D2
   48.000   52.000   63.000   37.167   42.167
     P2D3     P3D1     P3D2     P3D3
   54.167   27.000   32.500   42.500
-------------------------------------------------------------------------------
Univariate and Multivariate Repeated Measures Analysis

Between Subjects
----------------
Source   SS         df   MS        F       P
NOISE     468.167    1   468.167   0.752   0.435
Error    2491.111    4   622.778

Within Subjects
---------------
Source         SS         df   MS         F        P       G-G     H-F
period         3722.333    2   1861.167   63.389   0.000   0.000   0.000
period*NOISE    333.000    2    166.500    5.671   0.029   0.057   0.029
Error           234.889    8     29.361

Greenhouse-Geisser Epsilon: 0.6476
Huynh-Feldt Epsilon       : 1.0000

dial           2370.333    2   1185.167   89.823   0.000   0.000   0.000
dial*NOISE       50.333    2     25.167    1.907   0.210   0.215   0.210
Error           105.556    8     13.194


Greenhouse-Geisser Epsilon: 0.9171
Huynh-Feldt Epsilon       : 1.0000

period*dial          10.667    4     2.667    0.336   0.850   0.729   0.850
period*dial*NOISE    11.333    4     2.833    0.357   0.836   0.716   0.836
Error               127.111   16     7.944

Greenhouse-Geisser Epsilon: 0.5134
Huynh-Feldt Epsilon       : 1.0000
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)
Source              SS         df   MS         F         P
period              3721.000    1   3721.000    73.441   0.001
period*NOISE         225.000    1    225.000     4.441   0.103
Error                202.667    4     50.667
dial                2256.250    1   2256.250   241.741   0.000
dial*NOISE             6.250    1      6.250     0.670   0.459
Error                 37.333    4      9.333
period*dial            0.375    1      0.375     0.045   0.842
period*dial*NOISE      1.042    1      1.042     0.125   0.742
Error                 33.333    4      8.333

Polynomial Test of Order 2 (Quadratic)
Source              SS        df   MS        F        P
period                1.333    1     1.333    0.166   0.705
period*NOISE        108.000    1   108.000   13.407   0.022
Error                32.222    4     8.056
dial                114.083    1   114.083    6.689   0.061
dial*NOISE           44.083    1    44.083    2.585   0.183
Error                68.222    4    17.056
period*dial           3.125    1     3.125    0.815   0.418
period*dial*NOISE     0.125    1     0.125    0.033   0.865
Error                15.333    4     3.833

Polynomial Test of Order 3 (Cubic)
Source              SS       df   MS      F       P
period*dial          6.125    1   6.125   0.750   0.435
period*dial*NOISE    3.125    1   3.125   0.383   0.570
Error               32.667    4   8.167

Polynomial Test of Order 4
Source              SS       df   MS      F       P
period*dial          1.042    1   1.042   0.091   0.778
period*dial*NOISE    7.042    1   7.042   0.615   0.477
Error               45.778    4  11.444
-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: period                Hypoth. df   Error df   F         P
Wilks' Lambda =  0.051              2          3        28.145   0.011
Pillai Trace  =  0.949              2          3        28.145   0.011
H-L Trace     = 18.764              2          3        28.145   0.011

Test of: period*NOISE          Hypoth. df   Error df   F        P
Wilks' Lambda = 0.156               2          3        8.111   0.062
Pillai Trace  = 0.844               2          3        8.111   0.062
H-L Trace     = 5.407               2          3        8.111   0.062

Test of: dial                  Hypoth. df   Error df   F         P
Wilks' Lambda =  0.016              2          3        91.456   0.002
Pillai Trace  =  0.984              2          3        91.456   0.002
H-L Trace     = 60.971              2          3        91.456   0.002

Test of: dial*NOISE            Hypoth. df   Error df   F        P
Wilks' Lambda = 0.565               2          3        1.155   0.425
Pillai Trace  = 0.435               2          3        1.155   0.425
H-L Trace     = 0.770               2          3        1.155   0.425

Test of: period*dial           Hypoth. df   Error df   F         P
Wilks' Lambda =    0.001            4          1        331.445  0.041
Pillai Trace  =    0.999            4          1        331.445  0.041
H-L Trace     = 1325.780            4          1        331.445  0.041

Test of: period*dial*NOISE     Hypoth. df   Error df   F         P
Wilks' Lambda =    0.000            4          1        581.875  0.031
Pillai Trace  =    1.000            4          1        581.875  0.031
H-L Trace     = 2327.500            4          1        581.875  0.031


Using GLM, the input is:

GLM
USE repeat2
CATEGORY noise
MODEL p1d1 .. p3d3 = CONSTANT + noise / REPEAT=3,3,
      NAMES='period','dial'
PRINT MEDIUM
ESTIMATE

Example 12 Repeated Measures Analysis of Covariance

To do repeated measures analysis of covariance, where the covariate varies within subjects, you would have to set up your model like a split plot with a different record for each measurement.

This example is from Winer (1971). This design has two trials (DAY1 and DAY2), one covariate (AGE), and one grouping factor (SEX). The data are in the file WINER.




The input follows:

The output is:

The F statistics for the covariate and its interactions, namely AGE (13.587) and DAY * AGE (0.102), are not ordinarily published; however, they help you understand the adjustment made by the covariate.

This analysis did not test the homogeneity of slopes assumption. If you want to test the homogeneity of slopes assumption, run the following model in GLM first:

Then check to see if the SEX * AGE interaction is significant.

USE winer
ANOVA
CATEGORY sex
DEPEND day(1 .. 2) / REPEAT NAME='day'
COVARIATE age
ESTIMATE

Dependent variable means
   DAY(1)   DAY(2)
   16.500   11.875
-------------------------------------------------------------------------------
Univariate Repeated Measures Analysis

Between Subjects
----------------
Source   SS        df   MS        F        P
SEX       44.492    1    44.492    3.629   0.115
AGE      166.577    1   166.577   13.587   0.014
Error     61.298    5    12.260

Within Subjects
---------------
Source    SS       df   MS       F        P       G-G   H-F
day       22.366    1   22.366   17.899   0.008    .     .
day*SEX    0.494    1    0.494    0.395   0.557    .     .
day*AGE    0.127    1    0.127    0.102   0.763    .     .
Error      6.248    5    1.250

Greenhouse-Geisser Epsilon: .
Huynh-Feldt Epsilon       : .

MODEL day(1 .. 2) = CONSTANT + sex + age + sex*age / REPEAT


To use GLM:

GLM
USE winer
CATEGORY sex
MODEL day(1 .. 2) = CONSTANT + sex + age / REPEAT NAME='day'
ESTIMATE

Example 13 Multivariate Analysis of Variance

The data in the file MANOVA comprise a hypothetical experiment on rats assigned randomly to one of three drugs. Weight loss in grams was observed for the first and second weeks of the experiment. The data were analyzed in Morrison (1976) with a two-way multivariate analysis of variance (a two-way MANOVA).

You can use ANOVA to set up the MANOVA model for complete factorials:

Notice that the only difference between an ANOVA and MANOVA model is that the latter has more than one dependent variable. The output includes:


USE manova
ANOVA
CATEGORY sex, drug
DEPEND week(1 .. 2)
ESTIMATE

Dependent variable means
   WEEK(1)   WEEK(2)
     9.750     8.667

Estimates of effects B = (X'X)⁻¹X'Y

                    WEEK(1)   WEEK(2)
CONSTANT              9.750     8.667
SEX    1              0.167     0.167
DRUG   1             -2.750    -1.417
DRUG   2             -2.250    -0.167
SEX    1  DRUG 1     -0.667    -1.167
SEX    1  DRUG 2     -0.417    -0.417


Notice that each column of the B matrix is now assigned to a separate dependent variable. It is as if we had done two runs of an ANOVA. The numbers in the matrix are the analysis of variance effects estimates.

You can also use GLM to set up the MANOVA model. With this approach, the design does not have to be a complete factorial. With commands:

GLM
USE manova
CATEGORY sex, drug
MODEL week(1 .. 2) = CONSTANT + sex + drug + sex*drug
ESTIMATE

Testing Hypotheses

With more than one dependent variable, you do not get a single ANOVA table; instead, each hypothesis is tested separately. Here are three hypotheses; extended output for the second is used to illustrate the detailed output.

HYPOTHESIS
EFFECT = sex
TEST

PRINT = LONG
HYPOTHESIS
EFFECT = drug
TEST

PRINT = SHORT
HYPOTHESIS
EFFECT = sex*drug
TEST

Following are the collected results:



Test for effect called: SEX

Univariate F Tests
Effect    SS        df   MS      F       P
WEEK(1)    0.667     1   0.667   0.127   0.726
Error     94.500    18   5.250
WEEK(2)    0.667     1   0.667   0.105   0.749
Error    114.000    18   6.333

Multivariate Test Statistics
Wilks' Lambda          = 0.993   F-Statistic = 0.064   df = 2, 17   Prob = 0.938
Pillai Trace           = 0.007   F-Statistic = 0.064   df = 2, 17   Prob = 0.938
Hotelling-Lawley Trace = 0.008   F-Statistic = 0.064   df = 2, 17   Prob = 0.938
-------------------------------------------------------------------------------


Test for effect called: DRUG

Null hypothesis contrast AB
      WEEK(1)   WEEK(2)
 1     -2.750    -1.417
 2     -2.250    -0.167

Inverse contrast A(X'X)⁻¹A'
            1        2
 1      0.083
 2     -0.042    0.083

Hypothesis sum of product matrix H = B'A'(A(X'X)⁻¹A')⁻¹AB
          WEEK(1)   WEEK(2)
WEEK(1)   301.000
WEEK(2)    97.500    36.333

Error sum of product matrix G = E'E
          WEEK(1)   WEEK(2)
WEEK(1)    94.500
WEEK(2)    76.500   114.000

Univariate F Tests
Effect    SS        df   MS        F        P
WEEK(1)   301.000    2   150.500   28.667   0.000
Error      94.500   18     5.250
WEEK(2)    36.333    2    18.167    2.868   0.083
Error     114.000   18     6.333

Multivariate Test Statistics
Wilks' Lambda          = 0.169   F-Statistic = 12.199   df = 4, 34   Prob = 0.000
Pillai Trace           = 0.880   F-Statistic =  7.077   df = 4, 36   Prob = 0.000
Hotelling-Lawley Trace = 4.640   F-Statistic = 18.558   df = 4, 32   Prob = 0.000

THETA = 0.821   S = 2, M = -0.5, N = 7.5   Prob = 0.000

Test of Residual Roots
Roots 1 through 2   Chi-Square Statistic = 36.491   df = 4
Roots 2 through 2   Chi-Square Statistic =  1.262   df = 1


The matrix formulas (which are somewhat long) make explicit the hypothesis being tested. For MANOVA, hypotheses are tested with sums-of-squares and cross-products matrices. Before printing the multivariate tests, however, SYSTAT prints the univariate tests. Each of these F statistics is constructed in the same way as in the ANOVA model: the sums of squares for hypothesis and error are taken from the diagonals of the respective sum of product matrices. The univariate F test for the WEEK(1) DRUG effect, for example, is computed as (301.000 / 2) / (94.500 / 18), or hypothesis mean square divided by error mean square.

The next statistics printed are for the multivariate hypothesis. Wilks' lambda (the likelihood-ratio criterion) varies between 0 and 1. Schatzoff (1966) has tables for its percentage points. The following F statistic is Rao's approximate (sometimes exact) F statistic corresponding to the likelihood-ratio criterion (see Rao, 1973). Pillai's trace and its F approximation are taken from Pillai (1960). The Hotelling-Lawley trace and its F approximation are documented in Morrison (1976). The last statistic is the largest root criterion for Roy's union-intersection test (see Morrison, 1976). Charts of the percentage points of this statistic, found in Morrison and other multivariate texts, are taken from Heck (1960).
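In terms of the H and G matrices printed above, Wilks' lambda is the usual determinant ratio (the textbook definition, not quoted from SYSTAT's documentation):

$$\Lambda = \frac{\det(\mathbf{G})}{\det(\mathbf{H} + \mathbf{G})}.$$

For the DRUG hypothesis, det(G) = 94.500 × 114.000 − 76.500² ≈ 4920.75 and det(H + G) ≈ 29180.7, giving Λ ≈ 0.169, which matches the printed value.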

Canonical Correlations
       1       2
   0.906   0.244

Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations
                1        2
WEEK(1)     1.437   -0.352
WEEK(2)    -0.821    1.231

Canonical loadings (correlations between conditional
dependent variables and dependent canonical factors)
                1        2
WEEK(1)     0.832    0.555
WEEK(2)     0.238    0.971
-------------------------------------------------------------------------------
Test for effect called: SEX*DRUG

Univariate F Tests
Effect    SS        df   MS       F       P
WEEK(1)    14.333    2    7.167   1.365   0.281
Error      94.500   18    5.250
WEEK(2)    32.333    2   16.167   2.553   0.106
Error     114.000   18    6.333

Multivariate Test Statistics
Wilks' Lambda          = 0.774   F-Statistic = 1.159   df = 4, 34   Prob = 0.346
Pillai Trace           = 0.227   F-Statistic = 1.152   df = 4, 36   Prob = 0.348
Hotelling-Lawley Trace = 0.290   F-Statistic = 1.159   df = 4, 32   Prob = 0.347

THETA = 0.221   S = 2, M = -0.5, N = 7.5   Prob = 0.295




The probability value printed for THETA is not an approximation; it is what you find in the charts. In the first hypothesis, all the multivariate statistics have the same value for the F approximation because the approximation is exact when there are only two groups (see Hotelling's T² in Morrison, 1976). In these cases, THETA is not printed because it has the same probability value as the F statistic.

Because we requested extended output for the second hypothesis, we get additional material.

Bartlett’s Residual Root (Eigenvalue) Test

The chi-square statistics follow Bartlett (1947). The probability value for the first chi-square statistic should correspond to that for the approximate multivariate F statistic in large samples. In small samples, they might be discrepant, in which case you should generally trust the F statistic more. The subsequent chi-square statistics are recomputed, leaving out the first and later roots until the last root is tested. These are sequential tests and should be treated with caution, but they can be used to decide how many dimensions (roots and canonical correlations) are significant. The number of significant roots corresponds to the number of significant p values in this ordered list.

Canonical Coefficients

Dimensions with insignificant chi-square statistics in the prior tests should be ignored in general. Corresponding to each canonical correlation is a canonical variate, whose coefficients have been standardized by the within-groups standard deviations (the default). Standardization by the sample standard deviation is generally used for canonical correlation analysis or multivariate regression when groups are not present to introduce covariation among variates. You can standardize these variates by the total (sample) standard deviations with:

STANDARDIZE = TOTAL

inserted prior to TEST. Continue with the other test specifications described earlier.

Finally, the canonical loadings are printed. These are correlations and, thus, provide information different from the canonical coefficients. In particular, you can identify



suppressor variables in the multivariate system by looking for differences in sign between the coefficients and the loadings (which is the case with these data). See Bock (1975) and Wilkinson (1975, 1977) for an interpretation of these variates.

Computation

Algorithms

Centered sums of squares and cross-products are accumulated using provisional algorithms. Linear systems, including those involved in hypothesis testing, are solved by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved with Householder tridiagonalization and implicit QL iterations. For further information, see Wilkinson and Reinsch (1971) or Chambers (1977).

References

Afifi, A. A. and Azen, S. P. (1972). Statistical analysis: A computer-oriented approach. New York: Academic Press.

Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.: Lifetime Learning Publications.

Bartlett, M. S. (1947). Multivariate analysis. Journal of the Royal Statistical Society, Series B, 9, 176–197.

Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.

Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John Wiley & Sons, Inc.

Daniel, C. (1960). Locating outliers in factorial experiments. Technometrics, 2, 149–156.

Feingold, M. and Korsog, P. E. (1986). The correlation and dependence between two F statistics with the same denominator. The American Statistician, 40, 218–220.

Heck, D. L. (1960). Charts of some upper percentage points of the distribution of the largest characteristic root. Annals of Mathematical Statistics, 31, 625–642.

Hocking, R. R. (1985). The analysis of linear models. Monterey, Calif.: Brooks/Cole.

John, P. W. M. (1971). Statistical design and analysis of experiments. New York: Macmillan, Inc.

Kutner, M. H. (1974). Hypothesis testing in linear models (Eisenhart Model I). The American Statistician, 28, 98–100.

Levene, H. (1960). Robust tests for equality of variance. In Olkin, I., ed., Contributions to Probability and Statistics. Palo Alto, Calif.: Stanford University Press, 278–292.

Miller, R. (1985). Multiple comparisons. In Kotz, S. and Johnson, N. L., eds., Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley & Sons, Inc., 679–689.

Milliken, G. A. and Johnson, D. E. (1984). Analysis of messy data, vol. 1: Designed experiments. New York: Van Nostrand Reinhold Company.

Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.

Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed. Homewood, Ill.: Richard D. Irwin, Inc.

Pillai, K. C. S. (1960). Statistical tables for tests of multivariate hypotheses. Manila: The Statistical Center, University of the Philippines.

Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John Wiley & Sons, Inc.

Schatzoff, M. (1966). Exact distributions of Wilks' likelihood ratio criterion. Biometrika, 53, 347–358.

Scheffé, H. (1959). The analysis of variance. New York: John Wiley & Sons, Inc.

Searle, S. R. (1971). Linear models. New York: John Wiley & Sons, Inc.

Searle, S. R. (1987). Linear models for unbalanced data. New York: John Wiley & Sons, Inc.

Speed, F. M. and Hocking, R. R. (1976). The use of the R( )-notation with unbalanced data. The American Statistician, 30, 30–33.

Speed, F. M., Hocking, R. R., and Hackney, O. P. (1978). Methods of analysis of linear models with unbalanced data. Journal of the American Statistical Association, 73, 105–112.

Wilkinson, L. (1975). Response variable hypotheses in the multivariate analysis of variance. Psychological Bulletin, 82, 408–412.

Wilkinson, L. (1977). Confirmatory rotation of MANOVA canonical variates. Multivariate Behavioral Research, 12, 487–494.

Winer, B. J. (1971). Statistical principles in experimental design, 2nd ed. New York: McGraw-Hill.

Chapter 16
Linear Models III: General Linear Models

Leland Wilkinson and Mark Coward

General Linear Model (GLM) can estimate and test any univariate or multivariate general linear model, including those for multiple regression, analysis of variance or covariance, and other procedures such as discriminant analysis and principal components. With the general linear model, you can explore randomized block designs, incomplete block designs, fractional factorial designs, Latin square designs, split plot designs, crossover designs, nesting, and more. The model is:

Y = XB + e

where Y is a vector or matrix of dependent variables, X is a vector or matrix of independent variables, B is a vector or matrix of regression coefficients, and e is a vector or matrix of random errors. See Searle (1971), Winer (1971), Neter, Wasserman, and Kutner (1985), or Cohen and Cohen (1983) for details.

In multivariate models, Y is a matrix of continuous measures. The X matrix can be either continuous or categorical dummy variables, according to the type of model. For discriminant analysis, X is a matrix of dummy variables, as in analysis of variance. For principal components analysis, X is a constant (a single column of 1’s). For canonical correlation, X is usually a matrix of continuous right-hand variables (and Y is the matrix of left-hand variables).

For some multivariate models, it may be easier to use ANOVA, which can handle models with multiple dependent variables and zero, one, or more categorical independent variables (with zero, only the constant is present in the model). ANOVA automatically generates interaction terms for the design factors.

After the parameters of a model have been estimated, they can be tested by any general linear hypothesis of the following form:


ABC' = D

where A is a matrix of linear weights on coefficients across the independent variables (the rows of B), C is a matrix of linear weights on the coefficients across dependent variables (the columns of B), B is the matrix of regression coefficients or effects, and D is a null hypothesis matrix (usually a null matrix).

For the multivariate models described in this chapter, the C matrix is an identity matrix, and the D matrix is null. The A matrix can have several different forms, but these are all submatrices of an identity matrix and are easily formed.

General Linear Models in SYSTAT

Model Estimation (in GLM)

To specify a general linear model using GLM, from the menus choose:

Statistics
General Linear Model (GLM)
Estimate Model…

You can specify any multivariate linear model with General Linear Model. You must select the variables to include in the desired model.


Dependent(s). The variable(s) you want to examine. The dependent variable(s) should be continuous numeric variables (for example, income).

Independent(s). Select one or more continuous or categorical variables (grouping variables). Independent variables that are not denoted as categorical are considered covariates. Unlike ANOVA, GLM does not automatically include and test all interactions. With GLM, you have to build your model: if you want interactions or nested variables, you need to specify these components yourself (see the sketch after this list of options).

Model. The following model options allow you to include a constant in your model, do a means model, specify the sample size, and weight cell means:

� Include constant. The constant is an optional parameter. Deselect Include constant to obtain a model through the origin. When in doubt, include the constant.

� Means. Specifies a fully factorial design using means coding.

� Cases. When your data file is a symmetric matrix, specify the sample size that generated the matrix.

� Weight. Weights cell means by the cell counts before averaging.

In addition, you can save residuals and other data to a new data file. The following alternatives are available:

� Residuals. Saves predicted values, residuals, Studentized residuals, and the standard error of predicted values.

� Residuals/Data. Saves the statistics given by Residuals, plus all the variables in the working data file, including any transformed data values.

� Adjusted. Saves adjusted cell means from analysis of covariance.

� Adjusted/Data. Saves adjusted cell means plus all the variables in the working data file, including any transformed data values.

� Partial. Saves partial residuals.

� Partial/Data. Saves partial residuals plus all the variables in the working data file, including any transformed data values.

� Model. Saves statistics given in Residuals and the variables used in the model.

� Coefficients. Saves the estimates of the regression coefficients.
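For instance, a two-factor model with its interaction and a covariate might be built as follows (the data file and variable names are placeholders; the syntax follows the examples of Chapter 15):

GLM
USE mydata
CATEGORY a, b
MODEL y = CONSTANT + a + b + a*b + x
ESTIMATE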


Categorical Variables

You can specify numeric or character-valued categorical (grouping) variables that define cells. You want to categorize an independent variable when it has several categories such as education levels, which could be divided into the following categories: less than high school, some high school, finished high school, some college, finished bachelor’s degree, finished master’s degree, and finished doctorate. On the other hand, a variable such as age in years would not be categorical unless age were broken up into categories such as under 21, 21–65, and over 65.

To specify categorical variables, click the Categories button in the General Linear Model dialog box.

Types of Categories. You can elect to use one of two different coding methods:

� Effect. Produces parameter estimates that are differences from group means.

� Dummy. Produces dummy codes for the design variables instead of effect codes. Effect coding is the classic analysis of variance parameterization, in which the effects estimated for a classifying variable sum to 0; dummy coding instead compares each category against a reference category. In either case, if your categorical variable has k categories, k − 1 design variables are created.
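A standard illustration of the two schemes for k = 3 categories (the design columns SYSTAT actually generates may order the levels differently):

$$\text{Effect: } \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{pmatrix} \qquad \text{Dummy: } \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$$

Each row is one category; each column is one design variable.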


Repeated Measures

In a repeated measures design, the same variable is measured several times for each subject (case). A paired-comparison t test is the simplest form of a repeated measures design (for example, each subject has a before and after measure).

SYSTAT derives values from your repeated measures and uses them in general linear model computations to test changes across the repeated measures (within subjects) as well as differences between groups of subjects (between subjects). Tests of the within-subjects values are called Polynomial Test Of Order 1, 2,..., up to k, where k is one less than the number of repeated measures. The first polynomial is used to test linear changes: Do the repeated responses increase (or decrease) around a line with a significant slope? The second polynomial tests if the responses fall along a quadratic curve, etc.

To open the Repeated Measures dialog box, click Repeated in the General Linear Model dialog box.

If you select Perform repeated measures analysis, SYSTAT treats the dependent variables as a set of repeated measures. Optionally, you can assign a name for each set of repeated measures, specify the number of levels, and specify the metric for unevenly spaced repeated measures.

Name. Name that identifies each set of repeated measures.

Levels. Number of repeated measures in the set. For example, if you have three dependent variables that represent measurements at different times, the number of levels is 3.

Metric. Metric that indicates the spacing between unevenly spaced measurements. For example, if measurements were taken at the third, fifth, and ninth weeks, the metric would be 3, 5, 9.
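In command syntax, the weeks 3, 5, and 9 example might be declared like this (the variable names are assumptions; the metric syntax follows the rat-weights example in Chapter 15):

DEPEND y(1 .. 3) / REPEAT=3(3,5,9) NAME='week'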


General Linear Model Options

General Linear Model Options allows you to specify a tolerance level, select complete or stepwise entry, and specify entry and removal criteria.

To open the Options dialog box, click Options in the General Linear Model dialog box.

The following options can be specified:

Tolerance. Prevents the entry of a variable that is highly correlated with the independent variables already included in the model. Enter a value between 0 and 1. Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the correlation required to exclude a variable.

Estimation. Controls the method used to enter and remove variables from the equation.

� Complete. All independent variables are entered in a single step.

� Mixture model. Constrains the independent variables to sum to a constant.

� Stepwise. Variables are entered into or removed from the model, one at a time.

Stepwise Options. The following alternatives are available for stepwise entry and removal:

� Backward. Begins with all candidate variables in the model. At each step, SYSTAT removes the variable with the largest Remove value.


� Forward. Begins with no variables in the model. At each step, SYSTAT adds the variable with the smallest Enter value.

� Automatic. For Backward, at each step, SYSTAT automatically removes a variable from your model. For Forward, at each step, SYSTAT automatically adds a variable to the model.

� Interactive. At each step in the model building, you select the variable to enter into or remove from the model.

You can also control the criteria used to enter and remove variables from the model:

� Enter. Enters a variable into the model if its alpha value is less than the specified value. Enter a value between 0 and 1 (for example, 0.025).

� Remove. Removes a variable from the model if its alpha value is greater than the specified value. Enter a value between 0 and 1 (for example, 0.025).

� Force. Forces the first n variables listed in your model to remain in the equation.

� FEnter. F-to-enter limit. Variables with F greater than the specified value are entered into the model if Tolerance permits.

� FRemove. F-to-remove limit. Variables with F less than the specified value are removed from the model.

� Max step. Maximum number of steps.

Pairwise Comparisons

Once you determine that your groups are different, you may want to compare pairs of groups to determine which pairs differ.

To open the Pairwise Comparisons dialog box, from the menus choose:

Statistics
General Linear Model (GLM)
Pairwise Comparisons…


Groups. You must specify the variable that defines the groups.

Test. General Linear Model provides several post hoc tests to compare levels of this variable.

� Bonferroni. Multiple comparison test based on Student’s t statistic. Adjusts the observed significance level for the fact that multiple comparisons are made.

� Tukey. Uses the Studentized range statistic to make all pairwise comparisons between groups and sets the experimentwise error rate to the error rate for the collection of all pairwise comparisons. When testing a large number of pairs of means, Tukey is more powerful than Bonferroni. For a small number of pairs, Bonferroni is more powerful.

� Dunnett. The Dunnett test is available only with one-way designs. Dunnett compares a set of treatments against a single control mean that you specify. You can choose a two-sided or one-sided test. To test that the mean at any level (except the control category) of the experimental groups is not equal to that of the control category, select 2-sided. To test if the mean at any level of the experimental groups is smaller (or larger) than that of the control category, select 1-sided.

� Fisher’s LSD. Least significant difference pairwise multiple comparison test. Equivalent to multiple t tests between all pairs of groups. The disadvantage of this test is that no attempt is made to adjust the observed significance level for multiple comparisons.

� Scheffé. The significance level of Scheffé’s test is designed to allow all possible linear combinations of group means to be tested, not just pairwise comparisons available in this feature. The result is that Scheffé’s test is more conservative than other tests, meaning that a larger difference between means is required for significance.


Error Term. You can either use the mean square error specified by the model or you can enter the mean square error.

� Model MSE. Uses the mean square error from the general linear model that you ran.

� MSE and df. You can specify your own mean square error term and degrees of freedom for mixed models with random factors, split-plot designs, and crossover designs with carry-over effects.

Hypothesis Tests

Contrasts are used to test relationships among cell means. The post hoc tests in GLM Pairwise Comparisons are the simplest form because they compare two means at a time; general contrasts can involve any number of means in the analysis.

To test hypotheses, from the menus choose:

Statistics
General Linear Model (GLM)
Hypothesis Test…

Contrasts can be defined across the categories of a grouping factor or across the levels of a repeated measure.


Effects. Specify the factor (grouping variable) to which the contrast applies. For principal components, specify the grouping variable for within-groups components (if any). For canonical correlation, select All to test all of the effects in the model.

Within. Use when specifying a contrast across the levels of a repeated measures factor. Enter the name assigned to the set of repeated measures in the Repeated Measures subdialog box.

Error Term. You can specify which error term to use for the hypothesis tests.

� Model MSE. Uses the mean square error from the general linear model that you ran.

� MSE and df. You can specify your own mean square error and degrees of freedom if you know them from a previous model.

� Between Subject(s) Effect(s). Select this option to use main effect error terms or interaction error terms in all tests. Specify interactions using an asterisk between variables.

Priors. Prior probabilities for discriminant analysis. Type a value for each group, separated by spaces. These probabilities should add to 1. For example, if you have three groups, priors might be 0.5, 0.3, and 0.2.

Standardize. You can standardize canonical coefficients using the total sample or a within-groups covariance matrix.

� Within groups is usually used in discriminant analysis to make comparisons easier when measures are on different scales.

� Sample is used in canonical correlation.

Rotate. Specify the number of components to rotate.

Factor. In a factor analysis with grouping variables, factor the Hypothesis (between-groups) matrix or the Error (within-groups) matrix. This allows you to compute principal components on the hypothesis or error matrix separately, offering a direct way to compute principal components on residuals of any linear model you wish to fit. You can specify the matrix type as Correlations, SSCP, or Covariance.

Save scores and results. You can save the results to a SYSTAT data file. Exactly what is saved depends on the analysis. When you save scores and results, extended output is automatically produced, enabling you to see more detailed output when computing these statistics.


Specify

To specify contrasts for between-subjects effects, click Specify in the Hypothesis Test dialog box.

You can use GLM’s cell means “language” to define contrasts across the levels of a grouping variable in a multivariate model. For example, for a two-way factorial ANOVA design with DISEASE (four categories) and DRUG (three categories), you could contrast the marginal mean for the first level of drug against the third level by specifying:

DRUG[1] = DRUG[3]

Note that square brackets enclose the value of the category (for example, for GENDER$, specify GENDER$[male]). For the simple contrast of the first and third levels of DRUG for the second disease only, specify:

DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]

The syntax also allows statements like:

–3*DRUG[1] – 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4]

In addition, you can specify the error term to use for the contrasts.

Pooled. Uses the error term from the current model.

Separate. Generates a separate variances error term.


Contrast

Contrast generates a contrast for a grouping factor or a repeated measures factor. To open the Contrast dialog box, click Contrast in the Hypothesis Test dialog box.

SYSTAT offers several types of contrasts:

Custom. Enter your own custom coefficients. For example, if your factor has four ordered categories (or levels), you can specify your own coefficients, such as –3 –1 1 3, by typing these values in the Custom text box.

Difference. Compares each level with its adjacent level.

Polynomial. Generates orthogonal polynomial contrasts (to test linear, quadratic, or cubic trends across ordered categories or levels).

• Order. Enter 1 for linear, 2 for quadratic, etc.

• Metric. Use Metric when the ordered categories are not evenly spaced. For example, when repeated measures are collected at weeks 2, 4, and 8, enter 2,4,8 as the metric (see the sketch after this list).

Sum. In a repeated measures ANOVA, totals the values for each subject.
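As a sketch of the polynomial options in command form (the repeated measures factor name ’time’ is hypothetical), a quadratic trend over measures collected at weeks 2, 4, and 8 might be requested as:

HYPOTHESIS
  WITHIN ’time’
  CONTRAST / POLYNOMIAL ORDER=2 METRIC=2,4,8
TEST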


A Matrix, C Matrix, and D Matrix

A matrix, C matrix, and D matrix are available for hypothesis testing in multivariate models. You can test parameters of the multivariate model estimated or factor the quadratic form of your model into orthogonal components. Linear hypotheses have the form:

ABC’ = D

These matrices (A, C, and D) may be specified in several alternative ways; if they are not specified, they have default values. To specify an A matrix, click A matrix in the Hypothesis Test dialog box.

A is a matrix of linear weights contrasting the coefficient estimates (the rows of B). The A matrix has as many columns as there are regression coefficients (including the constant) in your model. The number of rows in A determines how many degrees of freedom your hypothesis involves. The A matrix can have several different forms, but these are all submatrices of an identity matrix and are easily formed using Hypothesis Test.

To specify a C matrix, click C matrix in the Hypothesis Test dialog box.


Post hoc Tests for Repeated Measures

After performing an analysis of variance, we have only an F statistic, which tells us that the means are not all equal; we still do not know exactly which means differ significantly from which others. Post hoc tests should be used only when the "omnibus" ANOVA finds a significant effect. If the F value for a factor turns out non-significant, you cannot go further with the analysis. This protects the post hoc tests from being used too liberally.

The main problem that designers of post hoc tests try to deal with is alpha inflation. This refers to the fact that the more tests you conduct at alpha = 0.05, the more likely you are to claim a significant result when you should not. The overall chance of a Type I error in a particular experiment is referred to as the "experimentwise" (or familywise) error rate.

Bonferroni Correction. If you want to keep the experimentwise error rate to a specified level (alpha = 0.05), a simple way of doing this is to divide the acceptable alpha level by the number of comparisons you intend to make. That is, for any one comparison to be considered significant, the obtained p value must be less than alpha divided by the number of comparisons. Select this option if you would like to perform a Bonferroni correction.

Sidak Correction. The same experimentwise error rate is kept in control by use of the formula sidak_alpha = 1 - (1 - alpha)**(1/c), where c is the number of paired comparisons. Select this option if you would like to perform a Sidak correction.
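For example, with alpha = 0.05 and a hypothetical c = 6 pairwise comparisons, the two criteria are nearly identical:

Bonferroni:  0.05 / 6 = 0.00833
Sidak:       1 - (1 - 0.05)**(1/6) = 0.00851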

Factor Name. This is the name given to the set of repeated measures in GLM.


The C matrix is used to test hypotheses for repeated measures analysis of variance designs and models with multiple dependent variables. C has as many columns as there are dependent variables. For most multivariate models, C is an identity matrix.

To specify a D matrix, click D matrix in the Hypothesis Test dialog box.

D is a null hypothesis matrix (usually a null matrix). The D matrix, if you use it, must have the same number of rows as A. For univariate multiple regression, D has only one column. For multivariate models (multiple dependent variables), the D matrix has one column for each dependent variable.

A matrix and D matrix are often used to test hypotheses in regression. Linear hypotheses in regression have the form Aβ = D, where A is the matrix of linear weights on coefficients across the independent variables (the rows of β), β is the matrix of regression coefficients, and D is a null hypothesis matrix (usually a null matrix). The A and D matrices can be specified in several alternative ways, and if they are not specified, they have default values.
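As an illustrative sketch only (the coefficient positions and the target values 2 and 5 are hypothetical, and the D-matrix row syntax is assumed to parallel that of CMATRIX), a test that two slope coefficients in a univariate regression equal 2 and 5 might look like:

HYPOTHESIS
  AMATRIX [0 1 0; 0 0 1]
  DMATRIX [2; 5]
TEST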


Using Commands

Select the data with USE filename and continue with:

GLM
  MODEL varlist1 = CONSTANT + varlist2 + var1*var2 + ,
        var3(var4) / REPEAT=m,n,… REPEAT=m(x1,x2,…),n(y1,y2,…) ,
        NAMES=’name1’,’name2’,… MEANS WEIGHT N=n
  CATEGORY grpvarlist / MISS EFFECT or DUMMY
  SAVE filename / COEF MODEL RESID DATA PARTIAL ADJUSTED
  ESTIMATE / MIX TOL=n

For stepwise model building, use START in place of ESTIMATE:

START / FORWARD or BACKWARD TOL=n ENTER=p REMOVE=p ,
        FENTER=n FREMOVE=n FORCE=n MAXSTEP=n
STEP no argument or var or index / AUTO ENTER=p REMOVE=p FENTER=n FREMOVE=n
STOP

To perform hypothesis tests:

HYPOTHESIS
  EFFECT varlist, var1&var2,…
  WITHIN ’name’
  CONTRAST [matrix] / DIFFERENCE or POLYNOMIAL or SUM ORDER=n METRIC=m,n,…
  SPECIFY hypothesis lang / POOLED or SEPARATE
  AMATRIX [matrix]
  CMATRIX [matrix]
  DMATRIX [matrix]
  ALL
  POST varlist / LSD or TUKEY or BONF=n or SCHEFFE or,
       DUNNETT ONE or TWO CONTROL=’levelname’,
       POOLED or SEPARATE
  ROTATE=n
  TYPE=CORR or COVAR or SSCP
  STAND = TOTAL or WITHIN
  FACTOR = HYPOTHESIS or ERROR
  ERROR varlist or var1&var2 or value(df) or matrix
  PRIORS m n p …
  TEST

Usage Considerations

Types of data. Normally, you analyze raw cases-by-variables data with General Linear Model. You can, however, use a symmetric matrix data file (for example, a covariance matrix saved in a file from Correlations) as input. If you use a matrix as input, you must specify a value for Cases when estimating the model (under Group in the General Linear Model dialog box) to specify the sample size of the data file that generated the matrix. The number you specify must be an integer greater than 2.



Be sure to include the dependent as well as independent variables in your matrix. SYSTAT picks out the dependent variable you name in your model.

SYSTAT uses the sample size to calculate degrees of freedom in hypothesis tests. SYSTAT also determines the type of matrix (SSCP, covariance, and so on) and adjusts appropriately. With a correlation matrix, the raw and standardized coefficients are the same. You cannot include a constant when using SSCP, covariance, or correlation matrices; because these matrices are centered, the constant term has already been removed.

The triangular matrix input facility is useful for “meta-analysis” of published data and missing value computations; however, you should heed the following warnings: First, if you input correlation matrices from textbooks or articles, you may not get the same regression coefficients as those printed in the source. Because of round-off error, printed and raw data can lead to different results. Second, if you use pairwise deletion with Correlations, the degrees of freedom for hypotheses will not be appropriate. You may not even be able to estimate the regression coefficients because of singularities.

In general, correlation matrices containing missing data produce coefficient estimates and hypothesis tests that are optimistic. You can correct for this by specifying a sample size smaller than the number of actual observations (preferably set it equal to the smallest number of cases used for any pair of variables), but this is a guess that you can refine only by doing Monte Carlo simulations. There is no simple solution. Beware, especially, of multivariate regressions (MANOVA and others) with missing data on the dependent variables. You can usually compute coefficients, but hypothesis testing produces results that are suspect.

Print options. General Linear Model produces extended output if you set the output length to LONG or if you select Save scores and results in the Hypothesis Test dialog box.

For model estimation, extended output adds the following: total sum of product matrix, residual (or pooled within groups) sum of product matrix, residual (or pooled within groups) covariance matrix, and the residual (or pooled within groups) correlation matrix.

For hypothesis testing, extended output adds A, C, and D matrices, the matrix of contrasts, and the inverse of the cross products of contrasts, hypothesis and error sum of product matrices, tests of residual roots, canonical correlations, coefficients, and loadings.


Quick Graphs. If no variables are categorical, GLM produces Quick Graphs of residuals versus predicted values. For categorical predictors, GLM produces graphs of the least squares means for the levels of the categorical variable(s).

Saving files. Several sets of output can be saved to a file. The actual contents of the saved file depend on the analysis. Files may include estimated regression coefficients, model variables, residuals, predicted values, diagnostic statistics, canonical variable scores, and posterior probabilities (among other statistics).

BY groups. Each level of any BY variables yields a separate analysis.

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. GLM uses the FREQUENCY variable, if present, to duplicate cases.

Case weights. GLM uses the values of any WEIGHT variables to weight each case.

Examples

Example 1 One-Way ANOVA

The following data, KENTON, are from Neter, Wasserman, and Kutner (1985). The data comprise unit sales of a cereal product under different types of package designs. Ten stores were selected as experimental units. Each store was randomly assigned to sell one of the package designs (each design was sold at two or three stores).

PACKAGE  SALES
      1     12
      1     18
      2     14
      2     12
      2     13
      3     19
      3     17
      3     21
      4     24
      4     30


Numbers are used to code the four types of package designs; alternatively, you could have used words. Neter, Wasserman, and Kutner report that cartoons are part of designs 1 and 3 but not designs 2 and 4; designs 1 and 2 have three colors; and designs 3 and 4 have five colors. Thus, string codes for PACKAGE$ might have been ‘Cart 3,’ ‘NoCart 3,’ ‘Cart 5,’ and ‘NoCart 5.’ Notice that the data need not be ordered by PACKAGE as shown here. The input for a one-way analysis of variance is:

USE kenton
GLM
  CATEGORY package
  MODEL sales = CONSTANT + package
  GRAPH NONE
  ESTIMATE

The output follows:

Categorical values encountered during processing are:
PACKAGE (4 levels)
  1, 2, 3, 4

Dep Var: SALES   N: 10   Multiple R: 0.921   Squared multiple R: 0.849

Analysis of Variance
Source    Sum-of-Squares  df  Mean-Square  F-ratio      P
PACKAGE          258.000   3       86.000   11.217  0.007
Error             46.000   6        7.667

This is the standard analysis of variance table. The F ratio (11.217) appears significant, so you could conclude that the package designs differ significantly in their effects on sales, provided the assumptions are valid.

Pairwise Multiple Comparisons

SYSTAT offers five methods for comparing pairs of means: Bonferroni, Tukey-Kramer HSD, Scheffé, Fisher’s LSD, and Dunnett’s test.

The Dunnett test is available only with one-way designs. Dunnett requires the value of a control group against which comparisons are made. By default, two-sided tests are computed. One-sided Dunnett tests are also available. Incidentally, for Dunnett’s tests on experimental data, you should use the one-sided option if you can predict from theory whether your experimental groups will have higher or lower means than the control.
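For instance, a one-sided Dunnett test for this example could be sketched as follows (treating package design 1 as the control is hypothetical and is not part of the original example):

HYPOTHESIS
  POST package / DUNNETT ONE CONTROL=’1’
TEST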


Comparisons for the pairwise methods are made across all pairs of least-squares group means for the design term that is specified. For a multiway design, marginal cell means are computed for the effects specified before the comparisons are made.

To determine significant differences, simply look for pairs with probabilities below your critical value (for example, 0.05 or 0.01). All multiple comparison methods handle unbalanced designs correctly.

After you estimate your ANOVA model, it is easy to do post hoc tests. To do a Tukey HSD test, first estimate the model, then specify these commands:

HYPOTHESIS
  POST package / TUKEY
TEST

The output follows:

COL/ROW  PACKAGE
  1        1
  2        2
  3        3
  4        4

Using least squares means.
Post Hoc test of SALES
Using model MSE of 7.667 with 6 df.

Matrix of pairwise mean differences:
          1        2        3        4
1       0.0
2    -2.000      0.0
3     4.000    6.000      0.0
4    12.000   14.000    8.000      0.0

Tukey HSD Multiple Comparisons.
Matrix of pairwise comparison probabilities:
          1        2        3        4
1     1.000
2     0.856    1.000
3     0.452    0.130    1.000
4     0.019    0.006    0.071    1.000

Results show that sales for the fourth package design (five colors and no cartoons) are significantly larger than those for packages 1 and 2. None of the other pairs differ significantly.


Contrasts

This example uses two contrasts:

• We compare the first and third packages using coefficients of (1, 0, –1, 0).

• We compare the average performance of the first three packages with the last, using coefficients of (1, 1, 1, –3).

The input is:

HYPOTHESIS
  EFFECT = package
  CONTRAST [1 0 -1 0]
TEST
HYPOTHESIS
  EFFECT = package
  CONTRAST [1 1 1 -3]
TEST

For each hypothesis, we specify one contrast, so the test has one degree of freedom; therefore, the contrast matrix has one row of numbers. These numbers are the same ones you see in ANOVA textbooks, with one advantage: you do not have to standardize them so that their sum of squares is 1. The output follows:

Test for effect called: PACKAGE

A Matrix
        1        2        3        4
      0.0    1.000      0.0   -1.000

Test of Hypothesis
Source          SS  df      MS      F      P
Hypothesis  19.200   1  19.200  2.504  0.165
Error       46.000   6   7.667
-------------------------------------------------------------------------------
Test for effect called: PACKAGE

A Matrix
        1        2        3        4
      0.0    4.000    4.000    4.000

Test of Hypothesis
Source           SS  df       MS       F      P
Hypothesis  204.000   1  204.000  26.609  0.002
Error        46.000   6    7.667

For the first contrast, the F statistic (2.504) is not significant, so you cannot conclude that the impact of the first and third package designs on sales is significantly different.


Incidentally, the A matrix contains the contrast. The first column (0) corresponds to the constant in the model, and the remaining three columns (1 0 –1) correspond to the dummy variables for PACKAGE.

The last package design is significantly different from the other three taken as a group. Notice that the A matrix looks much different this time. Because the effects sum to 0, the last effect is minus the sum of the other three; that is, letting αi denote the effect for level i of package,

α1 + α2 + α3 + α4 = 0

so

α4 = –(α1 + α2 + α3)

and the contrast is

α1 + α2 + α3 – 3α4

which is

α1 + α2 + α3 – 3(–α1 – α2 – α3)

which simplifies to

4*α1 + 4*α2 + 4*α3

Remember, SYSTAT does all this work automatically.

Orthogonal Polynomials

Constructing orthogonal polynomials for between-group factors is useful when the levels of a factor are ordered. To construct orthogonal polynomials for your between-groups factors:

HYPOTHESIS
  EFFECT = package
  CONTRAST / POLYNOMIAL ORDER=2
TEST


The output is:

Test for effect called: PACKAGE

A Matrix
        1        2        3        4
      0.0      0.0   -1.000   -1.000

Test of Hypothesis
Source          SS  df      MS      F      P
Hypothesis  60.000   1  60.000  7.826  0.031
Error       46.000   6   7.667

Make sure that the levels of the factor (after they are sorted by the procedure numerically or alphabetically) are ordered meaningfully on a latent dimension. If you need a specific order, use LABEL or ORDER; otherwise, the results will not make sense. In the example, the significant quadratic effect is the result of the fourth package having a much larger sales volume than the other three.

Effect and Dummy Coding

The effects in a least-squares analysis of variance are associated with a set of dummy variables that SYSTAT generates automatically. Ordinarily, you do not have to concern yourself with these dummy variables; however, if you want to see them, you can save them into a SYSTAT file. The input is:

USE kenton
GLM
  CATEGORY package
  MODEL sales = CONSTANT + package
  GRAPH NONE
  SAVE mycodes / MODEL
  ESTIMATE
USE mycodes
FORMAT 12,0
LIST SALES x(1..3)


The listing of the dummy variables follows:

Case Number  SALES  X(1)  X(2)  X(3)
          1     12     1     0     0
          2     18     1     0     0
          3     14     0     1     0
          4     12     0     1     0
          5     13     0     1     0
          6     19     0     0     1
          7     17     0     0     1
          8     21     0     0     1
          9     24    -1    -1    -1
         10     30    -1    -1    -1

The variables X(1), X(2), and X(3) are the effects coding dummy variables generated by the procedure. All cases in the first cell are associated with dummy values 1 0 0; those in the second cell with 0 1 0; the third, 0 0 1; and the fourth, -1 -1 -1. Other least-squares programs use different methods to code dummy variables. The coding used by SYSTAT is most widely used and guarantees that the effects sum to 0.

If you had used dummy coding, these dummy variables would be saved:

SALES  X(1)  X(2)  X(3)
   12     1     0     0
   18     1     0     0
   14     0     1     0
   12     0     1     0
   13     0     1     0
   19     0     0     1
   17     0     0     1
   21     0     0     1
   24     0     0     0
   30     0     0     0

This coding yields parameter estimates that are the differences between the mean for each group and the mean of the last group.
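A sketch of how this dummy coding could be requested, using the DUMMY option of CATEGORY from the command summary (pairing it with SAVE as before is illustrative):

USE kenton
GLM
  CATEGORY package / DUMMY
  MODEL sales = CONSTANT + package
  SAVE mycodes / MODEL
  ESTIMATE

As a check, the KENTON group means are 15, 13, 19, and 27, so dummy coding would estimate the PACKAGE effects as 15 - 27 = -12, 13 - 27 = -14, and 19 - 27 = -8.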


Example 2 Randomized Block Designs

A randomized block design is like a factorial design without an interaction term. The following example is from Neter, Wasserman, and Kutner (1985). Five blocks of judges were given the task of analyzing three treatments. Judges are stratified within blocks, so the interaction of blocks and treatments cannot be analyzed. These data are in the file BLOCK. The input is:

USE block
GLM
  CATEGORY block, treat
  MODEL judgment = CONSTANT + block + treat
  ESTIMATE

You must use GLM instead of ANOVA because you do not want the BLOCK*TREAT interaction in the model. The output is:

Dep Var: JUDGMENT   N: 15   Multiple R: 0.970   Squared multiple R: 0.940

Analysis of Variance
Source  Sum-of-Squares  df  Mean-Square  F-ratio      P
BLOCK          171.333   4       42.833   14.358  0.001
TREAT          202.800   2      101.400   33.989  0.000
Error           23.867   8        2.983

Example 3 Incomplete Block Designs

Randomized blocks can be used in factorial designs. Here is an example from John (1971). The data (in the file JOHN) involve an experiment with three treatment factors (A, B, and C) plus a blocking variable with eight levels. Notice that data were collected on 32 of the possible 64 experimental situations.


BLOCK  A  B  C    Y    BLOCK  A  B  C    Y
    1  1  1  1  101        5  1  1  1   87
    1  2  1  2  373        5  2  1  2  324
    1  1  2  2  398        5  1  2  1  279
    1  2  2  1  291        5  2  2  2  471
    2  1  1  2  312        6  1  1  2  323
    2  2  1  1  106        6  2  1  1  128
    2  1  2  1  265        6  1  2  2  423
    2  2  2  2  450        6  2  2  1  334
    3  1  1  1  106        7  1  1  1  131
    3  2  2  1  306        7  2  1  1  103
    3  1  1  2  324        7  1  2  2  445
    3  2  2  2  449        7  2  2  2  437
    4  1  2  1  272        8  1  1  2  324
    4  2  1  1   89        8  2  1  2  361
    4  1  2  2  407        8  1  2  1  302
    4  2  1  2  338        8  2  2  1  272

The input is:

USE john
GLM
  CATEGORY block, a, b, c
  MODEL y = CONSTANT + block + a + b + c +,
        a*b + a*c + b*c + a*b*c
  ESTIMATE

The output follows:

Dep Var: Y   N: 32   Multiple R: 0.994   Squared multiple R: 0.988

Analysis of Variance
Source  Sum-of-Squares  df  Mean-Square  F-ratio      P
BLOCK         2638.469   7      376.924    1.182  0.364
A             3465.281   1     3465.281   10.862  0.004
B           161170.031   1   161170.031  505.209  0.000
C           278817.781   1   278817.781  873.992  0.000
A*B             28.167   1       28.167    0.088  0.770
A*C           1802.667   1     1802.667    5.651  0.029
B*C          11528.167   1    11528.167   36.137  0.000
A*B*C           45.375   1       45.375    0.142  0.711
Error         5423.281  17      319.017


Example 4 Fractional Factorial Designs

Sometimes a factorial design involves so many combinations of treatments that certain cells must be left empty to save experimental resources. At other times, a complete randomized factorial study is designed, but loss of subjects leaves one or more cells completely missing. These models are similar to incomplete block designs because not all effects in the full model can be estimated. Usually, certain interactions must be left out of the model.

The following example uses some experimental data that contain values in only 8 out of 16 possible cells. Each cell contains two cases. The pattern of nonmissing cells makes it possible to estimate only the main effects plus three two-way interactions. The data are in the file FRACTION.

A  B  C  D    Y
1  1  1  1    7
1  1  1  1    3
2  2  1  1    1
2  2  1  1    2
2  1  2  1   12
2  1  2  1   13
1  2  2  1   14
1  2  2  1   15
2  1  1  2    8
2  1  1  2    6
1  2  1  2   12
1  2  1  2   10
1  1  2  2    6
1  1  2  2    4
2  2  2  2    6
2  2  2  2    7

The input follows:

USE fraction
GLM
  CATEGORY a, b, c, d
  MODEL y = CONSTANT + a + b + c + d + a*b + a*c + b*c
  ESTIMATE


We must use GLM instead of ANOVA to omit the higher-way interactions that ANOVA automatically generates. The output is:

Dep Var: Y   N: 16   Multiple R: 0.972   Squared multiple R: 0.944

Analysis of Variance
Source  Sum-of-Squares  df  Mean-Square  F-ratio      P
A               16.000   1       16.000    8.000  0.022
B                4.000   1        4.000    2.000  0.195
C               49.000   1       49.000   24.500  0.001
D                4.000   1        4.000    2.000  0.195
A*B            182.250   1      182.250   91.125  0.000
A*C             12.250   1       12.250    6.125  0.038
B*C              2.250   1        2.250    1.125  0.320
Error           16.000   8        2.000

When missing cells turn up by chance rather than by design, you may not know which interactions to eliminate. When you attempt to fit the full model, SYSTAT informs you that the design is singular. In that case, you may need to try several models before finding an estimable one. It is usually best to begin by leaving out the highest-order interaction (A*B*C*D in this example). Continue with subset models until you get an ANOVA table.

Looking for an estimable model is not the same as analyzing the data with stepwise regression because you are not looking at p values. After you find an estimable model, stop and settle with the statistics printed in the ANOVA table.
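For this example’s four factors, the first submodel to try would drop only the four-way term, as in the following sketch (which model is finally estimable depends on the pattern of missing cells):

MODEL y = CONSTANT + a + b + c + d +,
      a*b + a*c + a*d + b*c + b*d + c*d +,
      a*b*c + a*b*d + a*c*d + b*c*d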

Example 5 Nested Designs

Nested designs resemble factorial designs with certain cells missing (incomplete factorials). This is because one factor is nested under another, so that not all combinations of the two factors are observed. For example, in an educational study, classrooms are usually nested under schools because it is impossible to have the same classroom existing at two different schools (except as antimatter). The following



example (in which teachers are nested within schools) is from Neter, Wasserman, and Kutner (1985). The data (learning scores) look like this:

            TEACHER1   TEACHER2
SCHOOL1       25, 29     14, 11
SCHOOL2       11, 6      22, 18
SCHOOL3       17, 20      5, 2

In the study, there are actually six teachers, not just two; thus, the design really looks like this:

            TEACHER1  TEACHER2  TEACHER3  TEACHER4  TEACHER5  TEACHER6
SCHOOL1       25, 29    14, 11
SCHOOL2                           11, 6     22, 18
SCHOOL3                                               17, 20      5, 2

The data are set up in the file SCHOOLS.

TEACHER  SCHOOL  LEARNING
      1       1        25
      1       1        29
      2       1        14
      2       1        11
      3       2        11
      3       2         6
      4       2        22
      4       2        18
      5       3        17
      5       3        20
      6       3         5
      6       3         2


The input is:

USE schools
GLM
  CATEGORY teacher, school
  MODEL learning = CONSTANT + school + teacher(school)
  ESTIMATE

The output follows:

Dep Var: LEARNING   N: 12   Multiple R: 0.972   Squared multiple R: 0.945

Analysis of Variance
Source           Sum-of-Squares  df  Mean-Square  F-ratio      P
SCHOOL                  156.500   2       78.250   11.179  0.009
TEACHER(SCHOOL)         567.500   3      189.167   27.024  0.001
Error                    42.000   6        7.000

Your data can use any codes for TEACHER, including a separate code for every teacher in the study, as long as each different teacher within a given school has a different code. GLM will use the nesting specified in the MODEL statement to determine the pattern of nesting. You can, for example, allow teachers in different schools to share codes.
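For instance, the SCHOOLS data could equivalently be coded with the teachers numbered 1 and 2 within every school (a hypothetical recoding; the MODEL statement is unchanged):

TEACHER  SCHOOL  LEARNING
      1       1        25
      1       1        29
      2       1        14
      2       1        11
      1       2        11
      1       2         6
      2       2        22
      2       2        18
      1       3        17
      1       3        20
      2       3         5
      2       3         2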

This example is a balanced nested design. Unbalanced designs (unequal number of cases per cell) are handled automatically in SYSTAT because the estimation method is least squares.

Example 6 Split Plot Designs

The split plot design is closely related to the nested design. In the split plot, however, plots are often considered a random factor; therefore, you have to construct different error terms to test different effects. The following example involves two treatments: A (between plots) and B (within plots). The numbers in the cells are the YIELD of the crop within plots.


              A1                A2
       PLOT1    PLOT2     PLOT3    PLOT4
B1         0        3         4        5
B2         0        1         2        4
B3         5        5         7        6
B4         3        4         8        6


Here are the data from the PLOTS data file in the form needed by SYSTAT:

PLOT  A  B  YIELD
   1  1  1      0
   1  1  2      0
   1  1  3      5
   1  1  4      3
   2  1  1      3
   2  1  2      1
   2  1  3      5
   2  1  4      4
   3  2  1      4
   3  2  2      2
   3  2  3      7
   3  2  4      8
   4  2  1      5
   4  2  2      4
   4  2  3      6
   4  2  4      6

To analyze this design, you need two different error terms. For the between-plots effects (A), you need “plots within A.” For the within-plots effects (B and A*B), you need “B by plots within A.”

First, fit the saturated model with all the effects and then specify different error terms as needed. The input is:

USE plots
GLM
  CATEGORY plot, a, b
  MODEL yield = CONSTANT + a + b + a*b + plot(a) + b*plot(a)
  ESTIMATE

The output follows:

Dep Var: YIELD   N: 16   Multiple R: 1.000   Squared multiple R: 1.000

Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio  P
A                   27.563   1       27.563        .  .
B                   42.688   3       14.229        .  .
A*B                  2.188   3        0.729        .  .
PLOT(A)              3.125   2        1.562        .  .
B*PLOT(A)            7.375   6        1.229        .  .
Error                  0.0   0            .


You do not get a full ANOVA table because the model is perfectly fit. The coefficient of determination (Squared multiple R) is 1. Now you have to use some of the effects as error terms.

Between-Plots Effects

Let’s test for between-plots effects, namely A. The input is:

HYPOTHESIS
  EFFECT = a
  ERROR = plot(a)
TEST

The output is:

Test for effect called: A

Test of Hypothesis
Source          SS  df      MS       F      P
Hypothesis  27.563   1  27.563  17.640  0.052
Error        3.125   2   1.562

The between-plots effect is not significant (p = 0.052).

Within-Plots Effects

To do the within-plots effects (B and A*B), the input is:

HYPOTHESIS
  EFFECT = b
  ERROR = b*plot(a)
TEST
HYPOTHESIS
  EFFECT = a*b
  ERROR = b*plot(a)
TEST

The output follows:

Test for effect called: B

Test of Hypothesis
Source          SS  df      MS       F      P
Hypothesis  42.687   3  14.229  11.576  0.007
Error        7.375   6   1.229
-------------------------------------------------------------------------------


Test for effect called: A*B

Test of Hypothesis
Source          SS  df     MS      F      P
Hypothesis   2.188   3  0.729  0.593  0.642
Error        7.375   6  1.229

Here, we find a significant effect due to factor B (p = 0.007), but the interaction is not significant (p = 0.642).

This analysis is the same as that for a repeated measures design with subjects as PLOT, groups as A, and trials as B. Because this method becomes unwieldy for a large number of plots (subjects), SYSTAT offers a more compact method for repeated measures analysis as an alternative.

Example 7 Latin Square Designs

A Latin square design imposes a pattern on treatments in a factorial design to save experimental effort or reduce within-cell error. As in the nested design, not all combinations of the square and other treatments are measured, so the model lacks certain interaction terms between squares and treatments. GLM can analyze these designs easily if an extra variable denoting the square is included in the file. The following fixed effects example is from Neter, Wasserman, and Kutner (1985). The SQUARE variable is represented in the cells of the design. For simplicity, the dependent variable, RESPONSE, has been left out.

         day1   day2   day3   day4   day5
week1      D      C      A      B      E
week2      C      B      E      A      D
week3      A      D      B      E      C
week4      E      A      C      D      B
week5      B      E      D      C      A


You would set up the data as shown below (the LATIN file).

DAY  WEEK  SQUARE  RESPONSE
  1     1       D        18
  1     2       C        17
  1     3       A        14
  1     4       E        21
  1     5       B        17
  2     1       C        13
  2     2       B        34
  2     3       D        21
  2     4       A        16
  2     5       E        15
  3     1       A         7
  3     2       E        29
  3     3       B        32
  3     4       C        27
  3     5       D        13
  4     1       B        17
  4     2       A        13
  4     3       E        24
  4     4       D        31
  4     5       C        25
  5     1       E        21
  5     2       D        26
  5     3       C        26
  5     4       B        31
  5     5       A         7

To do the analysis, the input is:

USE latin
GLM
  CATEGORY day, week, square
  MODEL response = CONSTANT + day + week + square
  ESTIMATE


The output follows:

Dep Var: RESPONSE   N: 25   Multiple R: 0.931   Squared multiple R: 0.867

Analysis of Variance
Source  Sum-of-Squares  df  Mean-Square  F-ratio      P
DAY             82.000   4       20.500    1.306  0.323
WEEK           477.200   4      119.300    7.599  0.003
SQUARE         664.400   4      166.100   10.580  0.001
Error          188.400  12       15.700

Example 8 Crossover and Changeover Designs

In crossover designs, an experiment is divided into periods, and the treatment of a subject changes from one period to the next. Changeover studies often use designs similar to a Latin square. A problem with these designs is that there may be a residual or carry-over effect of a treatment into the following period. This can be minimized by extending the interval between experimental periods; however, this is not always feasible. Fortunately, there are methods to assess the magnitude of any carry-over effects that may be present.

Two-period crossover designs can be analyzed as repeated-measures designs. More complicated crossover designs can also be analyzed by SYSTAT, and carry-over effects can be assessed. Cochran and Cox (1957) present a study of milk production by cows under three different feed schedules: A (roughage), B (limited grain), and C (full grain). The design of the study has the form of two 3 × 3 Latin squares:

                      COW
          Latin square 1     Latin square 2
Period     I    II    III     IV    V     VI
  1        A    B     C       A    B     C
  2        B    C     A       C    A     B
  3        C    A     B       B    C     A


The data are set up in the WILLIAMS data file as follows:

COW  SQUARE  PERIOD  FEED  CARRY  RESIDUAL  MILK
  1       1       1     1      1         0    38
  1       1       2     2      1         1    25
  1       1       3     3      2         2    15
  2       1       1     2      1         0   109
  2       1       2     3      2         2    86
  2       1       3     1      2         3    39
  3       1       1     3      1         0   124
  3       1       2     1      2         3    72
  3       1       3     2      1         1    27
  4       2       1     1      1         0    86
  4       2       2     3      1         1    76
  4       2       3     2      2         3    46
  5       2       1     2      1         0    75
  5       2       2     1      2         2    35
  5       2       3     3      1         1    34
  6       2       1     3      1         0   101
  6       2       2     2      2         3    63
  6       2       3     1      2         2     1

PERIOD is nested within each Latin square (the periods for cows in one square are unrelated to the periods in the other). The variable RESIDUAL indicates the treatment of the preceding period. For the first period for each cow, there is no preceding period. The input is:

USE williams
GLM
  CATEGORY cow, period, square, residual, carry, feed
  MODEL milk = CONSTANT + cow + feed +,
        period(square) + residual(carry)
  ESTIMATE


The output follows:

Dep Var: MILK   N: 18   Multiple R: 0.995   Squared multiple R: 0.990

Analysis of Variance
Source           Sum-of-Squares  df  Mean-Square  F-ratio      P
COW                    3835.950   5      767.190   15.402  0.010
FEED                   2854.550   2     1427.275   28.653  0.004
PERIOD(SQUARE)         3873.950   4      968.488   19.443  0.007
RESIDUAL(CARRY)         616.194   2      308.097    6.185  0.060
Error                   199.250   4       49.813

There is a significant effect of feed on milk production and an insignificant residual or carry-over effect in this instance.

Type I Sums-of-Squares Analysis

To replicate the Cochran and Cox Type I sums-of-squares analysis, you must fit a new model to get their sums of squares. The following commands test the COW effect. Notice that the Error specification uses the mean square error (MSE) from the previous analysis. It also contains the error degrees of freedom (4) from the previous model.

USE williams
GLM
  CATEGORY cow
  MODEL milk = CONSTANT + cow
  ESTIMATE
HYPOTHESIS
  EFFECT = cow
  ERROR = 49.813(4)
TEST

The output follows:

Dep Var: MILK   N: 18   Multiple R: 0.533   Squared multiple R: 0.284

Analysis of Variance
Source  Sum-of-Squares  df  Mean-Square  F-ratio      P
COW           5781.111   5     1156.222    0.952  0.484
Error        14581.333  12     1215.111
-------------------------------------------------------------------------------
Test for effect called: COW

Test of Hypothesis
Source            SS  df        MS       F      P
Hypothesis  5781.111   5  1156.222  23.211  0.005
Error        199.252   4    49.813


The remaining term, PERIOD, requires a different model. PERIOD is nested within SQUARE.

USE williams
GLM
  CATEGORY period square
  MODEL milk = CONSTANT + period(square)
  ESTIMATE
HYPOTHESIS
  EFFECT = period(square)
  ERROR = 49.813(4)
TEST

The resulting output is:

Dep Var: MILK   N: 18   Multiple R: 0.751   Squared multiple R: 0.564

Analysis of Variance
Source          Sum-of-Squares  df  Mean-Square  F-ratio      P
PERIOD(SQUARE)       11489.111   4     2872.278    4.208  0.021
Error                 8873.333  13      682.564
-------------------------------------------------------------------------------
Test for effect called: PERIOD(SQUARE)

Test of Hypothesis
Source             SS  df        MS       F      P
Hypothesis  11489.111   4  2872.278  57.661  0.001
Error         199.252   4    49.813

Example 9 Missing Cells Designs (the Means Model)

When cells are completely missing in a factorial design, parameterizing a model can be difficult. The full model cannot be estimated. GLM offers a means model parameterization so that missing cell parameters can be dropped automatically from the model, and hypotheses for main effects and interactions can be tested by specifying cells directly. Examine Searle (1987), Hocking (1985), or Milliken and Johnson (1984) for more information in this area.


Widely favored for this purpose by statisticians (Searle, 1987; Hocking, 1985; Milliken and Johnson, 1984), the means model allows:

• Tests of hypotheses in missing cells designs (using Type IV sums of squares)

• Tests of simple hypotheses (for example, within levels of other factors)

• The use of population weights to reflect differences in subclass sizes

Effects coding is the default for GLM. Alternatively, means models code predictors as cell means rather than effects, which differ from a grand mean. The constant is omitted, and the predictors are 1 for a case belonging to a given cell and 0 for all others. When cells are missing, GLM automatically excludes null columns and estimates the submodel.

The categorical variables are specified in the MODEL statement differently for a means model than for an effects model. Here are some examples:

MODEL y = a*b / MEANS
MODEL y = group*age*school$ / MEANS

The first two models generate fully factorial designs (A by B, and GROUP by AGE by SCHOOL$). Notice that they omit the constant and main effects parameters because the means model does not include effects or a grand mean. Nevertheless, the number of parameters is the same in the two models. The following are the effects model and the means model, respectively, for a 2 × 3 design (two levels of A and three levels of B):

MODEL y = CONSTANT + A + B + A*B

A  B     m   a1   b1   b2   a1b1  a1b2
1  1     1    1    1    0     1     0
1  2     1    1    0    1     0     1
1  3     1    1   -1   -1    -1    -1
2  1     1   -1    1    0    -1     0
2  2     1   -1    0    1     0    -1
2  3     1   -1   -1   -1     1     1


MODEL y = A*B / MEANS

A  B    a1b1  a1b2  a1b3  a2b1  a2b2  a2b3
1  1      1     0     0     0     0     0
1  2      0     1     0     0     0     0
1  3      0     0     1     0     0     0
2  1      0     0     0     1     0     0
2  2      0     0     0     0     1     0
2  3      0     0     0     0     0     1

Means and effects models can be blended for incomplete factorials and other designs. All crossed terms (for example, A*B) will be coded with means design variables (provided the MEANS option is present), and the remaining terms will be coded as effects. The constant must be omitted, even in these cases, because it is collinear with the means design variables. All covariates and effects that are coded factors must precede the crossed factors in the MODEL statement.

Here is an example, assuming A has four levels, B has two, and C has three. In this design, there are 24 possible cells, but only 12 are nonmissing. The treatment combinations are partially balanced across the levels of B and C.

MODEL y = A + B*C / MEANS

A  B  C    a1   a2   a3   b1c1  b1c2  b1c3  b2c1  b2c2  b2c3
1  1  1     1    0    0     1     0     0     0     0     0
3  1  1     0    0    1     1     0     0     0     0     0
2  1  2     0    1    0     0     1     0     0     0     0
4  1  2    -1   -1   -1     0     1     0     0     0     0
1  1  3     1    0    0     0     0     1     0     0     0
4  1  3    -1   -1   -1     0     0     1     0     0     0
2  2  1     0    1    0     0     0     0     1     0     0
3  2  1     0    0    1     0     0     0     1     0     0
2  2  2     0    1    0     0     0     0     0     1     0
4  2  2    -1   -1   -1     0     0     0     0     1     0
1  2  3     1    0    0     0     0     0     0     0     1
3  2  3     0    0    1     0     0     0     0     0     1


Nutritional Knowledge Survey

The following example, which uses the data file MJ202, is from Milliken and Johnson (1984). The data are from a home economics survey experiment. DIFF is the change in test scores between pre-test and post-test on a nutritional knowledge questionnaire. GROUP classifies whether or not a subject received food stamps. AGE designates four age groups, and RACE$ was their term for designating Whites, Blacks, and Hispanics.

Empty cells denote age/race combinations for which no data were collected. Numbers within cells refer to cell designations in the Fisher LSD pairwise mean comparisons at the end of this example.

           Group 0                Group 1
       1    2    3    4       1    2    3    4
W      1    3    6            9   10   13   15
H                 5                     12
B           2    4    7       8   11        14

First, fit the model. The input is:

USE mj202
GLM
  CATEGORY group age race$
  MODEL diff = group*age*race$ / MEANS
  ESTIMATE

The output follows:

Means Model

Dep Var: DIFF   N: 107   Multiple R: 0.538   Squared multiple R: 0.289

***WARNING***
Missing cells encountered. Tests of factors will not appear.
Ho: All means equal.

Unweighted Means Model
Analysis of Variance
Source  Sum-of-Squares  df  Mean-Square  F-ratio      P
Model         1068.546  14       76.325    2.672  0.003
Error         2627.472  92       28.559


We need to test the GROUP main effect. The following notation is equivalent to Milliken and Johnson’s. Because of the missing cells, the GROUP effect must be computed over means that are balanced across the other factors.

In the drawing at the beginning of this example, notice that this specification contrasts all the numbered cells in group 0 (except 2) with all the numbered cells in group 1 (except 8 and 15). The input is:

HYPOTHESIS
  NOTE ’GROUP MAIN EFFECT’
  SPECIFY ,
    group[0] age[1] race$[W] + group[0] age[2] race$[W] +,
    group[0] age[3] race$[B] + group[0] age[3] race$[H] +,
    group[0] age[3] race$[W] + group[0] age[4] race$[B] =,
    group[1] age[1] race$[W] + group[1] age[2] race$[W] +,
    group[1] age[3] race$[B] + group[1] age[3] race$[H] +,
    group[1] age[3] race$[W] + group[1] age[4] race$[B]
TEST

The output is:

Hypothesis.

A Matrix
         1        2        3        4        5
    -1.000      0.0   -1.000   -1.000   -1.000
         6        7        8        9       10
    -1.000   -1.000      0.0    1.000    1.000
        11       12       13       14       15
     1.000    1.000    1.000    1.000      0.0

Null hypothesis value for D
0.0

Test of Hypothesis
Source            SS  df      MS      F      P
Hypothesis    75.738   1  75.738  2.652  0.107
Error       2627.472  92  28.559


The computations for the AGE main effect are similar to those for the GROUP main effect:

HYPOTHESIS
  NOTE ’AGE MAIN EFFECT’
  SPECIFY ,
    GROUP[1] AGE[1] RACE$[B] + GROUP[1] AGE[1] RACE$[W] =,
    GROUP[1] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
    GROUP[0] AGE[2] RACE$[B] + GROUP[1] AGE[2] RACE$[W] =,
    GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
    GROUP[0] AGE[3] RACE$[B] + GROUP[1] AGE[3] RACE$[B] +,
    GROUP[1] AGE[3] RACE$[W] =,
    GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[B] +,
    GROUP[1] AGE[4] RACE$[W]
TEST

The output follows:

Hypothesis.

A Matrix
           1        2        3        4        5
1        0.0      0.0      0.0      0.0      0.0
2        0.0   -1.000      0.0      0.0      0.0
3        0.0      0.0      0.0   -1.000      0.0
           6        7        8        9       10
1        0.0      0.0   -1.000   -1.000      0.0
2        0.0    1.000      0.0      0.0   -1.000
3        0.0    1.000      0.0      0.0      0.0
          11       12       13       14       15
1        0.0      0.0      0.0    1.000    1.000
2        0.0      0.0      0.0      0.0    1.000
3     -1.000      0.0   -1.000    1.000    1.000

D Matrix
1      0.0
2      0.0
3      0.0

Test of Hypothesis
Source            SS  df      MS      F      P
Hypothesis    41.526   3  13.842  0.485  0.694
Error       2627.472  92  28.559

The GROUP by AGE interaction requires more complex balancing than the main effects. It is derived from a subset of the means in the following specified combination. Again, check Milliken and Johnson to see the correspondence.


The input is:

HYPOTHESIS
  NOTE ’GROUP BY AGE INTERACTION’
  SPECIFY ,
    group[0] age[1] race$[W] - group[0] age[3] race$[W] -,
    group[1] age[1] race$[W] + group[1] age[3] race$[W] +,
    group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
    group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0;,
    group[0] age[2] race$[W] - group[0] age[3] race$[W] -,
    group[1] age[2] race$[W] + group[1] age[3] race$[W] +,
    group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
    group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0;,
    group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
    group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0
TEST

The output is:

Hypothesis.

A Matrix
           1        2        3        4        5
1     -1.000      0.0      0.0   -1.000      0.0
2        0.0      0.0   -1.000   -1.000      0.0
3        0.0      0.0      0.0   -1.000      0.0
           6        7        8        9       10
1      1.000    1.000      0.0    1.000      0.0
2      1.000    1.000      0.0      0.0    1.000
3        0.0    1.000      0.0      0.0      0.0
          11       12       13       14       15
1      1.000      0.0   -1.000   -1.000      0.0
2      1.000      0.0   -1.000   -1.000      0.0
3      1.000      0.0      0.0   -1.000      0.0

D Matrix
1      0.0
2      0.0
3      0.0

Test of Hypothesis
Source            SS  df      MS      F      P
Hypothesis    91.576   3  30.525  1.069  0.366
Error       2627.472  92  28.559


The following commands are needed to produce the rest of Milliken and Johnson’s results. The remaining output is not listed.

HYPOTHESIS
  NOTE ’RACE$ MAIN EFFECT’
  SPECIFY ,
    group[0] age[2] race$[B] + group[0] age[3] race$[B] +,
    group[1] age[1] race$[B] + group[1] age[3] race$[B] +,
    group[1] age[4] race$[B] =,
    group[0] age[2] race$[W] + group[0] age[3] race$[W] +,
    group[1] age[1] race$[W] + group[1] age[3] race$[W] +,
    group[1] age[4] race$[W];,
    group[0] age[3] race$[H] + group[1] age[3] race$[H] =,
    group[0] age[3] race$[W] + group[1] age[3] race$[W]
TEST
HYPOTHESIS
  NOTE ’GROUP*RACE$’
  SPECIFY ,
    group[0] age[3] race$[B] - group[0] age[3] race$[W] -,
    group[1] age[3] race$[B] + group[1] age[3] race$[W]=0.0;,
    group[0] age[3] race$[H] - group[0] age[3] race$[W] -,
    group[1] age[3] race$[H] + group[1] age[3] race$[W]=0.0
TEST
HYPOTHESIS
  NOTE ’AGE*RACE$’
  SPECIFY ,
    group[1] age[1] race$[B] - group[1] age[1] race$[W] -,
    group[1] age[4] race$[B] + group[1] age[4] race$[W]=0.0;,
    group[0] age[2] race$[B] - group[0] age[2] race$[W] -,
    group[0] age[3] race$[B] + group[0] age[3] race$[W]=0.0;,
    group[1] age[3] race$[B] - group[1] age[3] race$[W] -,
    group[1] age[4] race$[B] + group[1] age[4] race$[W]=0.0
TEST

Finally, Milliken and Johnson do pairwise comparisons:

HYPOTHESIS
  POST group*age*race$ / LSD
TEST


The following is the matrix of comparisons printed by GLM. The matrix of mean differences has been omitted.

COL/ROW  GROUP  AGE  RACE$
   1       0     1     W
   2       0     2     B
   3       0     2     W
   4       0     3     B
   5       0     3     H
   6       0     3     W
   7       0     4     B
   8       1     1     B
   9       1     1     W
  10       1     2     W
  11       1     3     B
  12       1     3     H
  13       1     3     W
  14       1     4     B
  15       1     4     W

Using unweighted means.
Post Hoc test of DIFF
Using model MSE of 28.559 with 92 df.
Fisher’s Least-Significant-Difference Test.

Matrix of pairwise comparison probabilities:
          1        2        3        4        5
 1    1.000
 2    0.662    1.000
 3    0.638    0.974    1.000
 4    0.725    0.323    0.295    1.000
 5    0.324    0.455    0.461    0.161    1.000
 6    0.521    0.827    0.850    0.167    0.497
 7    0.706    0.901    0.912    0.527    0.703
 8    0.197    0.274    0.277    0.082    0.780
 9    0.563    0.778    0.791    0.342    0.709
10    0.049    0.046    0.042    0.004    0.575
11    0.018    0.016    0.015    0.002    0.283
12    0.706    0.901    0.912    0.527    0.703
13    0.018    0.007    0.005    0.000    0.456
14    0.914    0.690    0.676    0.908    0.403
15    0.090    0.096    0.090    0.008    0.783
          6        7        8        9       10
 6    1.000
 7    0.971    1.000
 8    0.292    0.543    1.000
 9    0.860    0.939    0.514    1.000
10    0.026    0.392    0.836    0.303    1.000
11    0.010    0.213    0.451    0.134    0.425
12    0.971    1.000    0.543    0.939    0.392
13    0.000    0.321    0.717    0.210    0.798
14    0.610    0.692    0.288    0.594    0.168
15    0.059    0.516    0.930    0.447    0.619
         11       12       13       14       15
11    1.000
12    0.213    1.000
13    0.466    0.321    1.000
14    0.082    0.692    0.124    1.000
15    0.219    0.516    0.344    0.238    1.000

Within group 0 (cells 1–7), there are no significant pairwise differences in average test score changes. The same is true within group 1 (cells 8–15).


Example 10 Covariance Alternatives to Repeated Measures

Analysis of covariance offers an alternative to repeated measures in a pre-post design. You can use the pre-test as a covariate in predicting the post-test. This example shows how to do a two-group, pre-post design:

GLM
  USE filename
  CATEGORY group
  MODEL post = CONSTANT + group + pre
  ESTIMATE

When using this design, be sure to check the homogeneity of slopes assumption. Use the following commands to check that the interaction term, GROUP*PRE, is not significant:

GLM
  USE filename
  CATEGORY group
  MODEL post = CONSTANT + group + pre + group*pre
  ESTIMATE

Example 11 Weighting Means

Sometimes you want to weight the cell means when you test hypotheses in ANOVA. Suppose you have an experiment in which a few rats died before its completion. You do not want the hypotheses tested to depend upon the differences in cell sizes (which are presumably random). Here is an example from Morrison (1976). The data (MOTHERS) are hypothetical profiles on three scales of mothers in each of four socioeconomic classes.

Morrison analyzes these data with the multivariate profile model for repeated measures. Because the hypothesis of parallel profiles across classes is not rejected, you can test whether the profiles are level. That is, do the scales differ when we pool the classes together?

Pooling unequal classes can be done by weighting each according to sample size or averaging the means of the subclasses. First, let’s look at the model and test the hypothesis of equality of scale parameters without weighting the cell means.


The input is:

USE mothers
GLM
  CATEGORY class
  MODEL scale(1 .. 3) = CONSTANT + class
  ESTIMATE
HYPOTHESIS
  EFFECT = CONSTANT
  CMATRIX [1 -1 0; 0 1 -1]
TEST

The output is:

Dependent variable means
  SCALE(1)  SCALE(2)  SCALE(3)
    14.524    15.619    15.857

Estimates of effects B = (X’X)^-1 X’Y

            SCALE(1)  SCALE(2)  SCALE(3)
CONSTANT      13.700    14.550    14.988
CLASS    1     4.300     5.450     4.763
CLASS    2     0.100     0.650    -0.787
CLASS    3    -0.700    -0.550     0.012

Test for effect called: CONSTANT

C Matrix
         1        2        3
1    1.000   -1.000      0.0
2      0.0    1.000   -1.000

Univariate F Tests
Effect      SS  df      MS      F      P
1       14.012   1  14.012  4.652  0.046
Error   51.200  17   3.012
2        3.712   1   3.712  1.026  0.325
Error   61.500  17   3.618

Multivariate Test Statistics
Wilks’ Lambda =          0.564  F-Statistic = 6.191  df = 2, 16  Prob = 0.010
Pillai Trace =           0.436  F-Statistic = 6.191  df = 2, 16  Prob = 0.010
Hotelling-Lawley Trace = 0.774  F-Statistic = 6.191  df = 2, 16  Prob = 0.010


Notice that the dependent variable means differ from the CONSTANT. The CONSTANT in this case is a mean of the cell means rather than the mean of all the cases.

Weighting by the Sample Size

If you believe (as Morrison does) that the differences in cell sizes reflect population subclass proportions, then you need to weight the cell means to get a grand mean; for example:

8*(µ1) + 5*(µ2) + 4*(µ3) + 4*(µ4)

Expressed in terms of our analysis of variance parameterization, this is:

8*(µ + α1) + 5*(µ + α2) + 4*(µ + α3) + 4*(µ + α4)

Because the sum of effects is 0 for a classification and because you do not have an independent estimate of CLASS4, this expression is equivalent to

8*(µ + α1) + 5*(µ + α2) + 4*(µ + α3) + 4*(µ - α1 - α2 - α3)

which works out to

21*µ + 4*(α1) + 1*(α2) + 0*(α3)

Use AMATRIX to test this hypothesis:

HYPOTHESIS
  AMATRIX [21 4 1 0]
  CMATRIX [1 -1 0; 0 1 -1]
TEST

The output is:

Hypothesis.

A Matrix
         1        2        3        4
    21.000    4.000    1.000      0.0

C Matrix
         1        2        3
1    1.000   -1.000      0.0
2      0.0    1.000   -1.000

Univariate F Tests
Effect      SS  df      MS      F      P
1       25.190   1  25.190  8.364  0.010
Error   51.200  17   3.012
2        1.190   1   1.190  0.329  0.574
Error   61.500  17   3.618

Multivariate Test Statistics
Wilks’ Lambda =          0.501  F-Statistic = 7.959  df = 2, 16  Prob = 0.004
Pillai Trace =           0.499  F-Statistic = 7.959  df = 2, 16  Prob = 0.004
Hotelling-Lawley Trace = 0.995  F-Statistic = 7.959  df = 2, 16  Prob = 0.004


This is the multivariate F statistic that Morrison gets. For these data, we prefer the weighted means analysis because these differences in cell frequencies probably reflect population base rates. They are not random.

Example 12 Hotelling’s T-Square

You can use General Linear Model to calculate Hotelling’s T-square statistic.

One-Sample Test

For example, to get a one-sample test for the variables X and Y, select both X and Y as dependent variables:

GLM
  USE filename
  MODEL x, y = CONSTANT
  ESTIMATE

The F test for CONSTANT is the statistic you want. It is the same as the Hotelling’s T-square for the hypothesis that the population means for X and Y are 0.

You can also test against the hypothesis that the means of X and Y have particular nonzero values (for example, 10 and 15) by using:

HYPOTHESIS
  DMATRIX [10 15]
TEST


Two-Sample Test

For a two-sample test, you must provide a categorical independent variable that represents the two groups. The input is:

GLM
  CATEGORY group
  MODEL x, y = CONSTANT + group
  ESTIMATE

Example 13 Discriminant Analysis

This example uses the IRIS data file. Fisher used these data to illustrate his discriminant function. To define the model:

USE iris
GLM
  CATEGORY species
  MODEL sepallen sepalwid petallen petalwid = CONSTANT +,
        species
  ESTIMATE
HYPOTHESIS
  EFFECT = species
  SAVE canon
TEST

SYSTAT saves the canonical scores associated with the hypothesis. The scores are stored in subscripted variables named FACTOR. Because the effects involve a categorical variable, the Mahalanobis distances (named DISTANCE) and posterior probabilities (named PROB) are saved in the same file. These distances are computed in the discriminant space itself. The closer a case is to a particular group’s location in that space, the more likely it is that it belongs to that group. The probability of group membership is computed from these distances. A variable named PREDICT that contains the predicted group membership is also added to the file.

The output follows:

Dependent variable means
  SEPALLEN  SEPALWID  PETALLEN  PETALWID
     5.843     3.057     3.758     1.199


Estimates of effects B = (X’X)^-1 X’Y

             SEPALLEN  SEPALWID  PETALLEN  PETALWID
CONSTANT        5.843     3.057     3.758     1.199
SPECIES   1    -0.837     0.371    -2.296    -0.953
SPECIES   2     0.093    -0.287     0.502     0.127
-------------------------------------------------------------------------------
Test for effect called: SPECIES

Null hypothesis contrast AB
     SEPALLEN  SEPALWID  PETALLEN  PETALWID
1      -0.837     0.371    -2.296    -0.953
2       0.093    -0.287     0.502     0.127

Inverse contrast A(X’X)^-1 A’
         1        2
1    0.013
2   -0.007    0.013

Hypothesis sum of product matrix H = B’A’(A(X’X)^-1 A’)^-1 AB
           SEPALLEN  SEPALWID  PETALLEN  PETALWID
SEPALLEN     63.212
SEPALWID    -19.953    11.345
PETALLEN    165.248   -57.240   437.103
PETALWID     71.279   -22.933   186.774    80.413

Error sum of product matrix G = E’E
           SEPALLEN  SEPALWID  PETALLEN  PETALWID
SEPALLEN     38.956
SEPALWID     13.630    16.962
PETALLEN     24.625     8.121    27.223
PETALWID      5.645     4.808     6.272     6.157

Univariate F Tests
Effect         SS   df       MS         F      P
SEPALLEN   63.212    2   31.606   119.265  0.000
Error      38.956  147    0.265
SEPALWID   11.345    2    5.672    49.160  0.000
Error      16.962  147    0.115
PETALLEN  437.103    2  218.551  1180.161  0.000
Error      27.223  147    0.185
PETALWID   80.413    2   40.207   960.007  0.000
Error       6.157  147    0.042


Multivariate Test Statistics
Wilks’ Lambda =           0.023  F-Statistic = 199.145  df = 8, 288  Prob = 0.000
Pillai Trace =            1.192  F-Statistic =  53.466  df = 8, 290  Prob = 0.000
Hotelling-Lawley Trace = 32.477  F-Statistic = 580.532  df = 8, 286  Prob = 0.000

THETA = 0.970   S = 2, M = 0.5, N = 71.0   Prob = 0.0

Test of Residual Roots
Roots 1 through 2
  Chi-Square Statistic = 546.115  df = 8
Roots 2 through 2
  Chi-Square Statistic =  36.530  df = 3

Canonical Correlations
        1        2
    0.985    0.471

Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations
               1        2
SEPALLEN   0.427    0.012
SEPALWID   0.521    0.735
PETALLEN  -0.947   -0.401
PETALWID  -0.575    0.581

Canonical loadings (correlations between conditional
dependent variables and dependent canonical factors)
               1        2
SEPALLEN  -0.223    0.311
SEPALWID   0.119    0.864
PETALLEN  -0.706    0.168
PETALWID  -0.633    0.737

Group classification function coefficients
                1         2         3
SEPALLEN    23.544    15.698    12.446
SEPALWID    23.588     7.073     3.685
PETALLEN   -16.431     5.211    12.767
PETALWID   -17.398     6.434    21.079

Group classification constants
        1         2         3
  -86.308   -72.853  -104.368

Canonical scores have been saved.

The multivariate tests are all significant. The dependent variable canonical coefficients are used to produce discriminant scores. These coefficients are standardized by the within-groups standard deviations so you can compare their magnitude across variables with different scales. Because they are not raw coefficients, there is no need for a constant. The scores produced by these coefficients have an overall zero mean and a unit standard deviation within groups.


The group classification coefficients and constants comprise the Fisher discriminant functions for classifying the raw data. You can apply these coefficients to new data and assign each case to the group with the largest function value for that case.
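For example, for a hypothetical flower with SEPALLEN = 5.0, SEPALWID = 3.5, PETALLEN = 1.4, and PETALWID = 0.2, the three classification function values work out to:

Group 1:  23.544(5.0) + 23.588(3.5) - 16.431(1.4) - 17.398(0.2) -  86.308 =  87.49
Group 2:  15.698(5.0) +  7.073(3.5) +  5.211(1.4) +  6.434(0.2) -  72.853 =  38.97
Group 3:  12.446(5.0) +  3.685(3.5) + 12.767(1.4) + 21.079(0.2) - 104.368 =  -7.15

The largest value belongs to group 1, so the case would be assigned there.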

Studying Saved Results

The CANON file that was just saved contains the canonical variable scores (FACTOR(1) and FACTOR(2)), the Mahalanobis distances to each group centroid (DISTANCE(1), DISTANCE(2), and DISTANCE(3)), the posterior probability for each case being assigned to each group (PROB(1), PROB(2), and PROB(3)), the predicted group membership (PREDICT), and the original group assignment (GROUP).

To produce a classification table of the group assignment against the predicted group membership and a plot of the second canonical variable against the first, the input is:

USE canon
XTAB
  PRINT NONE / FREQ CHISQ
  TABULATE GROUP * PREDICT
PLOT FACTOR(2)*FACTOR(1) / OVERLAY GROUP=GROUP COLOR=2,1,3 ,
     FILL=1,1,1 SYMBOL=4,8,5

The output follows:

Frequencies
GROUP (rows) by PREDICT (columns)
              1      2      3   Total
       +--------------------+
     1 |     50      0      0 |    50
     2 |      0     48      2 |    50
     3 |      0      1     49 |    50
       +--------------------+
Total        50     49     51     150

Test statistic           Value       df     Prob
Pearson Chi-square     282.593    4.000    0.000

[A scatterplot of FACTOR(2) against FACTOR(1), with the three groups marked by separate symbols, follows.]


However, it is much easier to use the Discriminant Analysis procedure.

Prior Probabilities

In this example, there were equal numbers of flowers in each group. Sometimes the probability of finding a case in each group is not the same across groups. To adjust the prior probabilities for this example, specify 0.5, 0.3, and 0.2 as the priors:

General Linear Model uses the probabilities you specify to compute the posterior probabilities that are saved in the file under the variable PROB. Be sure to specify a probability for each level of the grouping variable. The probabilities should add up to 1.

Example 14 Principal Components Analysis (Within Groups)

General Linear Model allows you to partial out effects based on grouping variables and to factor residual correlations. If between-group variation is significant, the within-group structure can differ substantially from the total structure (ignoring the grouping variable). However, if you are just computing principal components on a single sample (no grouping variable), you can obtain more detailed output using the Factor Analysis procedure.




The following data (USSTATES) comprise death rates by cause from nine census divisions of the country. The divisions are in the column labeled DIV, and the U.S. Post Office two-letter state abbreviations follow DIV. Other variables include ACCIDENT, CARDIO, CANCER, PULMONAR, PNEU_FLU, DIABETES, LIVER, STATE$, FSTROKE, MSTROKE.

The variation in death rates between divisions in these data is substantial. Here is a grouped box plot of the second variable, CARDIO, by division. The other variables show similar regional differences.

If you analyze these data ignoring DIVISION$, the correlations among death rates would be due substantially to between-division differences. You might want to examine the pooled within-region correlations to see if the structure is different when divisional differences are statistically controlled. Accordingly, you will factor the residual correlation matrix after regressing medical variables onto an index variable denoting the census regions. The input is:

USE usstates
GLM
CATEGORY division
MODEL accident cardio cancer pulmonar pneu_flu,
      diabetes liver fstroke mstroke = CONSTANT + division
ESTIMATE
HYPOTHESIS
EFFECT = division
FACTOR = ERROR
TYPE = CORR
ROTATE = 2
TEST

[Grouped box plot of CARDIO by DIVISION$ for the nine census divisions: E N Central, E S Central, Mid Atlantic, Mountain, New England, Pacific, S Atlantic, W N Central, and W S Central]


The hypothesis commands compute the principal components on the error (residual) correlation matrix and rotate the first two components to a varimax criterion. For other rotations, use the Factor Analysis procedure.

The FACTOR options can be used with any hypothesis. Ordinarily, when you test a hypothesis, the matrix product INV(G)*H is factored and the latent roots of this matrix are used to construct the multivariate test statistic. However, you can indicate which matrix—the hypothesis (H) matrix or the error (G) matrix—is to be factored. By computing principal components on the hypothesis or error matrix separately, FACTOR offers a direct way to compute principal components on residuals of any linear model you wish to fit. You can use any A, C, and/or D matrices in the hypothesis you are factoring, or you can use any of the other commands that create these matrices.
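To fix ideas, the within-group components that FACTOR = ERROR produces can be sketched outside SYSTAT. The following Python fragment (NumPy assumed; data and group are hypothetical arrays) removes group means, correlates the residuals, and extracts latent roots and loadings:

import numpy as np

def within_group_pca(data, group):
    # Remove each group's mean so only within-group variation remains.
    resid = np.asarray(data, dtype=float).copy()
    for g in np.unique(group):
        mask = group == g
        resid[mask] -= resid[mask].mean(axis=0)
    corr = np.corrcoef(resid, rowvar=False)      # residual (within-group) correlations
    roots, vectors = np.linalg.eigh(corr)        # latent roots and vectors
    order = np.argsort(roots)[::-1]              # largest root first
    roots, vectors = roots[order], vectors[:, order]
    loadings = vectors * np.sqrt(roots)          # principal component loadings
    return roots, loadings

This is only a sketch of the idea; SYSTAT's sweep-based computation and varimax rotation are not reproduced here.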

The hypothesis output follows:

Factoring Error Matrix

              1        2        3        4        5
 1        1.000
 2        0.280    1.000
 3        0.188    0.844    1.000
 4        0.307    0.676    0.711    1.000
 5        0.113    0.448    0.297    0.396    1.000
 6        0.297    0.419    0.526    0.296   -0.123
 7       -0.005    0.251    0.389    0.252   -0.138
 8        0.402   -0.202   -0.379   -0.190   -0.110
 9        0.495   -0.119   -0.246   -0.127   -0.071

              6        7        8        9
 6        1.000
 7       -0.025    1.000
 8       -0.151   -0.225    1.000
 9       -0.076   -0.203    0.947    1.000

Latent roots
              1        2        3        4        5
          3.341    2.245    1.204    0.999    0.475
              6        7        8        9
          0.364    0.222    0.119    0.033


Loadings
              1        2        3        4        5
 1        0.191    0.798    0.128   -0.018   -0.536
 2        0.870    0.259   -0.097    0.019    0.219
 3        0.934    0.097    0.112    0.028    0.183
 4        0.802    0.247   -0.135    0.120   -0.071
 5        0.417    0.146   -0.842   -0.010   -0.042
 6        0.512    0.218    0.528   -0.580    0.068
 7        0.391   -0.175    0.400    0.777   -0.044
 8       -0.518    0.795    0.003    0.155    0.226
 9       -0.418    0.860    0.025    0.138    0.204

              6        7        8        9
 1        0.106   -0.100   -0.019   -0.015
 2        0.145   -0.254    0.177    0.028
 3        0.039   -0.066   -0.251   -0.058
 4       -0.499    0.085    0.044    0.015
 5        0.216    0.220   -0.005   -0.002
 6        0.093    0.241    0.063    0.010
 7        0.154    0.159    0.046    0.009
 8       -0.041    0.056    0.081   -0.119
 9        0.005    0.035   -0.101    0.117

Rotated loadings on first 2 principal components
              1        2
 1        0.457    0.682
 2        0.906   -0.060
 3        0.909   -0.234
 4        0.838   -0.047
 5        0.441   -0.008
 6        0.556    0.027
 7        0.305   -0.300
 8       -0.209    0.925
 9       -0.093    0.951

Sorted rotated loadings on first 2 principal components
(loadings less than .25 made 0.)
              1        2
 1        0.909    0.0
 2        0.906    0.0
 3        0.838    0.0
 4        0.556    0.0
 5        0.0      0.951
 6        0.0      0.925
 7        0.457    0.682
 8        0.305   -0.300
 9        0.441    0.0

Notice the sorted, rotated loadings. When interpreting these values, do not relate the row numbers (1 through 9) to the variables. Instead, find the corresponding loading in the Rotated Loadings table. The ordering of the rotated loadings corresponds to the order of the model variables.

The first component rotates to a dimension defined by CANCER, CARDIO, PULMONAR, and DIABETES; the second, by a dimension defined by MSTROKE and FSTROKE (male and female stroke rates). ACCIDENT also loads on the second factor but is not independent of the first. LIVER does not load highly on either factor.


Example 15 Canonical Correlation Analysis

Suppose you have 10 dependent variables, MMPI(1) to MMPI(10), and 3 independent variables, RATER(1) to RATER(3). Enter the following commands to obtain the canonical correlations and dependent canonical coefficients:

USE datafile
GLM
MODEL mmpi(1 .. 10) = CONSTANT + rater(1) + rater(2) + rater(3)
ESTIMATE
PRINT=LONG
HYPOTHESIS
STANDARDIZE
EFFECT = rater(1) & rater(2) & rater(3)
TEST

The canonical correlations are displayed; if you want, you can rotate the dependent canonical coefficients by using the Rotate option.

To obtain the coefficients for the independent variables, run GLM again with the model reversed:

MODEL rater(1 .. 3) = CONSTANT + mmpi(1) + mmpi(2),
                      + mmpi(3) + mmpi(4) + mmpi(5),
                      + mmpi(6) + mmpi(7) + mmpi(8),
                      + mmpi(9) + mmpi(10)
ESTIMATE
HYPOTHESIS
STANDARDIZE = TOTAL
EFFECT = mmpi(1) & mmpi(2) & mmpi(3) & mmpi(4) &,
         mmpi(5) & mmpi(6) & mmpi(7) & mmpi(8) &,
         mmpi(9) & mmpi(10)
TEST

Example 16 Mixture Models

Mixture models decompose the effects of mixtures of variables on a dependent variable. They differ from ordinary regression models because the independent variables sum to a constant value. The regression model, therefore, does not include a constant, and the regression and error sums of squares have one less degree of freedom. Marquardt and Snee (1974) and Diamond (1981) discuss these models and their estimation.
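The defining feature is the constraint on the predictors. In a mixture of $q$ components, the proportions satisfy

$$ \sum_{j=1}^{q} x_j = 1, \qquad x_j \ge 0 $$

so the predictor columns already contain an implicit constant; adding an explicit intercept would make the design singular, which is why the model omits the constant and the regression sum of squares gives up one degree of freedom.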



Here is an example using the PUNCH data file from Cornell (1985). The study involved effects of various mixtures of watermelon, pineapple, and orange juice on taste ratings by judges of a fruit punch. The input is:

USE punch
GLM
MODEL taste = watrmeln + pineappl + orange +,
              watrmeln*pineappl + watrmeln*orange +,
              pineappl*orange
ESTIMATE / MIX

The output follows:

Dep Var: TASTE   N: 18   Multiple R: 0.969   Squared multiple R: 0.939

Adjusted squared multiple R: 0.913   Standard error of estimate: 0.232

Effect               Coefficient   Std Error   Std Coef   Tolerance        t   P(2 Tail)
WATRMELN                   4.600       0.134      3.001       0.667   34.322       0.000
PINEAPPL                   6.333       0.134      4.131       0.667   47.255       0.000
ORANGE                     7.100       0.134      4.631       0.667   52.975       0.000
WATRMELN*PINEAPPL          2.400       0.657      0.320       0.667    3.655       0.003
WATRMELN*ORANGE            1.267       0.657      0.169       0.667    1.929       0.078
PINEAPPL*ORANGE           -2.200       0.657     -0.293       0.667   -3.351       0.006

Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression             9.929    5         1.986    36.852   0.000
Residual               0.647   12         0.054

Not using a mixture model produces a much larger R2 (0.999) and an F value of 2083.371, both of which are inappropriate for these data. Notice that the Regression Sum-of-Squares has five degrees of freedom instead of six as in the usual zero-intercept regression model. We have lost one degree of freedom because the predictors sum to 1.

Example 17 Partial Correlations

Partial correlations are easy to compute with General Linear Model. The partial correlation of two variables (a and b) controlling for the effects of a third (c) is the correlation between the residuals of each (a and b) after each has been regressed on the third (c). You can therefore use General Linear Model to compute an entire matrix of partial correlations.
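A minimal illustration of this residual definition in Python (NumPy assumed; a, b, and c are hypothetical one-dimensional arrays):

import numpy as np

def partial_corr(a, b, c):
    # Regress a and b on c (with an intercept) and correlate the residuals.
    C = np.column_stack([np.ones(len(c)), np.asarray(c, dtype=float)])
    ra = a - C @ np.linalg.lstsq(C, np.asarray(a, dtype=float), rcond=None)[0]
    rb = b - C @ np.linalg.lstsq(C, np.asarray(b, dtype=float), rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

The GLM approach below does the same thing for a whole matrix of variables at once.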



For example, to compute the matrix of partial correlations for Y1, Y2, Y3, Y4, and Y5, controlling for the effects of X, select Y1 through Y5 as dependent variables and X as the independent variable. The input follows:

GLM
MODEL y(1 .. 5) = CONSTANT + x
PRINT=LONG
ESTIMATE

Look for the Residual Correlation Matrix in the output; it is the matrix of partial correlations among the y’s given x. If you want to compute partial correlations for several x’s, just select them (also) as independent variables.

Computation

Algorithms

Centered sums of squares and cross products are accumulated using provisional algorithms. Linear systems, including those involved in hypothesis testing, are solved by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved with Householder tridiagonalization and implicit QL iterations. For further information, see Wilkinson and Reinsch (1971) or Chambers (1977).
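As a sketch of what a provisional (single-pass, updating) accumulation looks like, here is a Welford-style update in Python (NumPy assumed; this is an illustration of the technique, not SYSTAT's actual code):

import numpy as np

def provisional_sscp(rows):
    # Accumulate the mean vector and centered sums of cross products
    # case by case, without a second pass over the data.
    n, mean, sscp = 0, None, None
    for x in rows:
        x = np.asarray(x, dtype=float)
        if mean is None:
            mean = np.zeros_like(x)
            sscp = np.zeros((x.size, x.size))
        n += 1
        delta = x - mean                      # deviation from the old mean
        mean += delta / n                     # provisional mean update
        sscp += np.outer(delta, x - mean)     # uses old and new deviations
    return n, mean, sscp

Updating provisionally avoids the numerical cancellation that plagues the naive sum-of-squares formula.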

References

Chambers, J. M. (1977). Computational methods for data analysis. New York: John Wiley & Sons, Inc.

Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John Wiley & Sons, Inc.

Cohen, J. and Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences, 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum.

Cornell, J. A. (1985). Mixture experiments. In Kotz, S. and Johnson, N. L. (Eds.), Encyclopedia of statistical sciences, vol. 5, 569-579. New York: John Wiley & Sons, Inc.

Dempster, A. P. (1969). Elements of continuous multivariate analysis. San Francisco: Addison-Wesley.

Diamond, W. J. (1981). Practical experiment designs for engineers and scientists. Belmont, CA: Lifetime Learning Publications.

Hocking, R. R. (1985). The analysis of linear models. Monterey, Calif.: Brooks/Cole.

John, P. W. M. (1971). Statistical design and analysis of experiments. New York: MacMillan, Inc.

Linn, R. L., Centra, J. A., and Tucker, L. (1975). Between, within, and total group factor analyses of student ratings of instruction. Multivariate Behavioral Research, 10, 277-288.

Marquardt, D. W. and Snee, R. D. (1974). Test statistics for mixture models. Technometrics, 16, 533-537.

Milliken, G. A. and Johnson, D. E. (1984). Analysis of messy data, Vol. 1: Designed experiments. New York: Van Nostrand Reinhold Company.

Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.

Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed. Homewood, Illinois: Richard D. Irwin, Inc.

Searle, S. R. (1971). Linear models. New York: John Wiley & Sons, Inc.

Searle, S. R. (1987). Linear models for unbalanced data. New York: John Wiley & Sons, Inc.

Wilkinson, J. H. and Reinsch, C. (Eds.). (1971). Linear algebra, Vol. 2, Handbook for automatic computation. New York: Springer-Verlag.

Winer, B. J. (1971). Statistical principles in experimental design, 2nd ed. New York: McGraw-Hill.


Chapter 17

Logistic Regression

Dan Steinberg and Phillip Colla

LOGIT performs multiple logistic regression, conditional logistic regression, the econometric discrete choice model, and general linear (Wald) hypothesis testing. It also provides score tests; odds ratios and confidence intervals; forward, backward, and interactive stepwise regression; Pregibon regression diagnostics; prediction success and classification tables; independent variable derivatives and elasticities; model-based simulation of response curves; deciles of risk tables; options to specify start values and to separate data into learning and test samples; quasi-maximum likelihood standard errors; control of significance levels for confidence interval calculations; zero/one dependent variable coding; choice of reference group in automatic dummy variable generation; and integrated plotting tools.

Many of the results generated by modeling, testing, or diagnostic procedures can be saved to SYSTAT data files for subsequent graphing and display with the graphics routines. LOGIT and PROBIT are aliases to the categorical multivariate general modeling module called CMGLH, just as ANOVA, GLM, and REGRESSION are aliases to the multivariate general linear module called MGLH.

Statistical Background

The LOGIT module is SYSTAT’s comprehensive program for logistic regression analysis and provides tools for model building, model evaluation, prediction, simulation, hypothesis testing, and regression diagnostics. The program is designed to be easy for the novice and can produce the results most analysts need with just three simple commands. In addition, many advanced features are also included for sophisticated research projects. Beginners can skip over any unfamiliar concepts and gradually increase their mastery of logistic regression by working through the tools incorporated here.


LOGIT will estimate binary (Cox, 1970), multinomial (Anderson, 1972), conditional logistic regression models (Breslow and Day, 1980), and the discrete choice model (Luce, 1959; McFadden, 1973). The LOGIT framework is designed for analyzing the determinants of a categorical dependent variable. Typically, the dependent variable is binary and coded as 0 or 1; however, it may be multinomial and coded as an integer ranging from 1 to k or from 0 to k - 1.

Studies you can conduct with LOGIT include bioassay, epidemiology of disease (cohort or case-control), clinical trials, market research, transportation research (mode of travel), psychometric studies, and voter-choice analysis. The LOGIT module can also be used to analyze ranked choice information once the data have been suitably transformed (Beggs, Cardell, and Hausman, 1981).

This chapter contains a brief introduction to logistic regression and a description of the commands and features of the module. If you are unfamiliar with logistic regression, the textbook by Hosmer and Lemeshow (1989) is an excellent place to begin; Breslow and Day (1980) provide an introduction in the context of case-control studies; Train (1986) and Ben-Akiva and Lerman (1985) introduce the discrete-choice model for econometrics; Wrigley (1985) discusses the model for geographers; and Hoffman and Duncan (1988) review discrete choice in a demographic-sociological context. Valuable surveys appear in Amemiya (1981), McFadden (1984, 1982, 1976), and Maddala (1983).

Binary Logit

Although logistic regression may be applied to any categorical dependent variable, it is most frequently seen in the analysis of binary data, in which the dependent variable takes on only two values. Examples include survival beyond five years in a clinical trial, presence or absence of disease, responding to a specified dose of a toxin, voting for a political candidate, and participating in the labor force. The figure below compares the ordinary least-squares linear model to the basic binary logit model on the same data. Notice some features of the linear model in the upper panel of the figure:

• The linear model predicts values of y from minus to plus infinity. If the prediction is intended to be for probabilities, this model is clearly inappropriate.

• The linear model does not pass through the means of x for either value of the response. More generally, it does not appear to approach the data values very well. We shouldn't blame the linear model for this; it is doing its job as a regression estimator by shrinking back toward the mean of y for all x values (0.5). The linear model is simply not designed to come "near" the data.



The lower panel illustrates a logistic model. By contrast, it is designed to fit binary data—either when y is assumed to represent a probability distribution or when it is taken simply as a binary measure we are attempting to predict.

Despite the difference in their graphical appearance, the linear and logit models are only slight variants of one another. Assuming the possibility of more than one predictor (x) variable, the linear model is:

$$ y = Xb + e $$

where y is a vector of observations, X is a matrix of predictor scores, and e is a vector of errors.

The logit model is:

$$ y = \frac{\exp(Xb + e)}{1 + \exp(Xb + e)} $$

where the exponential function is applied to the vector argument. Rearranging terms, we have:

$$ \frac{y}{1 - y} = \exp(Xb + e) $$

and logging both sides of the equation, we have:

$$ \log\left[\frac{y}{1 - y}\right] = Xb + e = b_0 + \sum_j b_j X_{ij} + e_i \quad \text{for all } i = 1, \ldots, n $$

This last expression is one source of the term "logit." The model is "linear in the logs."
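The logit-to-probability mapping is easy to verify numerically; a small Python check (standard library only):

import math

def logistic(xb):
    # Probability implied by a logit value.
    return math.exp(xb) / (1.0 + math.exp(xb))

def logit(p):
    # Logit (log odds) implied by a probability.
    return math.log(p / (1.0 - p))

print(logistic(0.0))   # 0.5: a logit of zero is an even chance
print(logit(0.75))     # about 1.099: odds of 3 to 1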


Multinomial Logit

Multinomial logit is a logistic regression model having a dependent variable with more than two levels (Agresti, 1990; Santer and Duffy, 1989; Nerlove and Press, 1973). Examples of such dependent variables include political preference (Democrat, Republican, Independent), health status (healthy, moderately impaired, seriously impaired), smoking status (current smoker, former smoker, never smoked), and job classification (executive, manager, technical staff, clerical, other). Outside of the difference in the number of levels of the dependent variable, the multinomial logit is very similar to the binary logit, and most of the standard tools of interpretation, analysis, and model selection can be applied. In fact, the polytomous unordered logit we discuss here is essentially a combination of several binary logits estimated simultaneously (Begg and Gray, 1984). We use the term polytomous to differentiate this model from the conditional logistic regression and discrete choice models discussed below.

There are important differences between binary and multinomial models. Chiefly, the multinomial output is more complicated than that of the binary model, and care must be taken in the interpretation of the results. Fortunately, LOGIT provides some new tools that make the task of interpretation much easier. There is also a difference in dependent variable coding. The binary logit dependent variable is normally coded 0 or 1, whereas the multinomial dependent can be coded 1, 2, ..., k (that is, it starts at 1 rather than 0) or 0, 1, 2, ..., k - 1.

Conditional Logit

The conditional logistic regression model has become a major analytical tool in epidemiology since the work of Prentice and Breslow (1978), Breslow et al. (1978), Prentice and Pyke (1979), and the extended treatment of case-control studies in Breslow and Day (1980). A mathematically similar model with the same name was introduced independently and from a rather different perspective by McFadden (1973) in econometrics. The models have since seen widespread use in the considerably different contexts of biomedical research and social science, with parallel literatures on sampling, estimation techniques, and statistical results. In epidemiology, conditional logit is used to estimate relative risks in matched sample case-control studies (Breslow, 1982), whereas in econometrics a similar likelihood function is used to model consumer choices as a function of the attributes of alternatives. We begin this section with a treatment of the biomedical use of the conditional logistic model.



A separate section on the discrete choice model covers the econometric version and contains certain fine points that may be of interest to all readers. A discussion of parallels in the two literatures appears in Steinberg (1991).

In the traditional conditional logistic regression model, you are trying to measure the risk of disease corresponding to different levels of exposure to risk factors. The data have been collected in the form of matched sets of cases and controls, where the cases have the disease, the controls do not, and the sets are matched on background variables such as age, sex, marital status, education, residential location, and possibly other health indicators. The matching variables combine to form strata over which relative risks are to be estimated; thus, for example, a small group of persons of a given age, marital status, and health history will form a single stratum. The matching variables can also be thought of as proxies for a larger set of unobserved background variables that are assumed to be constant within strata. The logit for the jth individual in the ith stratum can be written as:

$$ \operatorname{logit}(p_{ij}) = a_i + b X_{ij} $$

where X_ij is the vector of exposure variables and a_i is a parameter dedicated to the stratum. Since case-control studies will frequently have a large number of small matched sets, the a_i are nuisance parameters that can cause problems in estimation (Cox and Hinkley, 1974). In the example discussed below, there are 63 matched sets, each consisting of one case and four controls, with information on seven exposure variables for every subject.

The problem with estimating an unconditional model for these data is that we would need to include 63 - 1 = 62 dummy variables for the strata. This would leave us with possibly 70 parameters being estimated for a data set with only 315 observations. Furthermore, increasing the sample size will not help because an additional stratum parameter would have to be estimated for each additional matched set in the study sample. By working with the appropriate conditional likelihood, however, the nuisance parameters can be eliminated, simplifying estimation and protecting against potential biases that may arise in the unconditional model (Cox, 1975; Chamberlain, 1980). The conditional model requires estimation only of the relative risk parameters of interest.
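For orientation, the usual form of that conditional likelihood is easy to state: for a matched set in which subject 1 is the case and subjects 2 through m + 1 are its controls, the set contributes

$$ \frac{\exp(b\,X_{i1})}{\sum_{j=1}^{m+1} \exp(b\,X_{ij})} $$

which involves only the relative-risk parameters b; the stratum parameters a_i cancel out of the ratio.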

LOGIT allows the estimation of models for matched sample case-control studies with one case and any number of controls per set. Thus, matched pair studies, as well as studies with varying numbers of controls per case, are easily handled. However, not all commands discussed so far are available for conditional logistic regression.



Discrete Choice Logit

Econometricians and psychometricians have developed a version of logit frequently called the discrete choice model, or McFadden’s conditional logit model (McFadden, 1973, 1976, 1982, 1984; Hensher and Johnson, 1981; Ben-Akiva and Lerman, 1985; Train, 1986; Luce, 1959). This multinomial model differs from the standard polytomous logit in the interpretation of the coefficients, the number of parameters estimated, the syntax of the model sentence, and options for data layout.

The discrete choice framework is designed specifically to model an individual’s choices in response to the characteristics of the choices. Characteristics of choices are attributes such as price, travel time, horsepower, or calories; they are features of the alternatives that an individual might choose from. By contrast, characteristics of the chooser, such as age, education, income, and marital status, are attributes of a person.

The classic application of the discrete choice model has been to the choice of travel mode to work (Domencich and McFadden, 1975). Suppose a person has three alternatives: private auto, car pool, and commuter train. The individual is assumed to have a utility function representing the desirability of each option, with the utility of an alternative depending solely on its own characteristics. With travel time and travel cost as key characteristics determining mode choice, the utility of each option could be written as:

$$ U_i = b_1 T_i + b_2 C_i + e_i $$

where i = 1, 2, 3 represents private auto, car pool, and train, respectively. In this random utility model, the utility U_i of the ith alternative is determined by the travel time T_i, the cost C_i of that alternative, and a random error term, e_i. Utility of an alternative is assumed not to be influenced by the travel times or costs of other alternatives available, although choice will be determined by the attributes of all available alternatives. In addition to the alternative characteristics, utility is sometimes also determined by an alternative specific constant.

The choice model specifies that an individual will choose the alternative with the highest utility as determined by the equation above. Because of the random component, we are reduced to making statements concerning the probability that a given choice is made. If the error terms are distributed as i.i.d. extreme value, it can be shown that the probability of the ith alternative being chosen is given by the familiar logit formula:

$$ \Pr(U_i > U_j \text{ for all } j \neq i) = \frac{\exp(X_i b)}{\sum_j \exp(X_j b)} $$
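A quick numerical illustration of the formula in Python (NumPy assumed; the utility weights b1 and b2 are made-up values, not estimates from this chapter), using the times and costs faced by the third subject in the data shown below:

import numpy as np

b1, b2 = -0.05, -0.6                   # hypothetical weights on time and cost
times = np.array([15.0, 30.0, 60.0])   # auto, car pool, train
costs = np.array([1.00, 0.50, 1.00])

v = b1 * times + b2 * costs            # systematic part of each utility
p = np.exp(v) / np.exp(v).sum()        # logit choice probabilities
print(p.round(3))                      # roughly [0.574, 0.366, 0.060]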



Suppose that for the first few cases our data are as follows:

Subject  Choice  Auto(1)  Auto(2)  Pool(1)  Pool(2)  Train(1)  Train(2)  Sex     Age
      1       1       20     3.50       35     2.00        65      1.10  Male     27
      2       3       45     6.00       65     3.00        65      1.00  Female   35
      3       1       15     1.00       30     0.50        60      1.00  Male     22
      4       2       60     5.50       70     2.00        90      2.00  Male     45
      5       3       30     4.25       40     1.75        55      1.50  Male     52

The third record has a person who chooses to go to work by private auto (choice = 1); when he drives, it takes 15 minutes to get to work and costs one dollar. Had he carpooled instead, it would have taken 30 minutes to get to work and cost 50 cents. The train would have taken an hour and cost one dollar. For this case, the utility of each option is given by

U(private auto) = b1*15 + b2*1.00 + error13
U(car pool)     = b1*30 + b2*0.50 + error23
U(train)        = b1*60 + b2*1.00 + error33

The error term has two subscripts, one pertaining to the alternative and the other pertaining to the individual. The error is individual-specific and is assumed to be independent of any other error or variable in the data set. The parameters b1 and b2 are common utility weights applicable to all individuals in the sample. In this example, these are the only parameters, and their number does not depend on the number of alternatives individuals can choose from. If a person also had the option of walking to work, we would expand the model to include this alternative with

U(walking)      = b1*70 + b2*0.00 + error43

and we would still be dealing with only the two regression coefficients b1 and b2. This highlights a major difference between the discrete choice and standard polytomous logit models.

In polytomous logit, the number of parameters grows with the number of alternatives; if the value of NCAT is increased from 3 to 4, a whole new vector of parameters is estimated. By contrast, in the discrete choice model without a constant, increasing the number of alternatives does not increase the number of discrete choice parameters estimated.



Finally, we need to look at the optional constant. Optional is emphasized because it is perfectly legitimate to estimate without a constant, and, in certain circumstances, it is even necessary to do so. If we were to add a constant to the travel mode model, we would obtain the following utility equations:

$$ U_i = b_{0i} + b_1 T_i + b_2 C_i + e_i $$

where i = 1, 2, 3 represents private auto, car pool, and train, respectively. The constant here, b_0i, is alternative-specific, with a separate one estimated for each alternative: b_01 corresponds to private auto; b_02, to car pooling; and b_03, to train. Like polytomous logit, the constant pertaining to the reference group is normalized to 0 and is not estimated.

An alternative specific CONSTANT is entered into a discrete choice model to capture unmeasured desirability of an alternative. Thus, the first constant could reflect the convenience and comfort of having your own car (or in some cities the inconvenience of having to find a parking space), and the second might reflect the inflexibility of schedule associated with shared vehicles. With NCAT=3, the third constant will be normalized to 0.

Stepwise Logit

Automatic model selection can be extremely useful for analyzing data with a large number of covariates for which there is little or no guidance from previous research. For these situations, LOGIT supports stepwise regression, allowing forward, backward, mixed, and interactive covariate selection, with full control over forcing, selection criteria, and candidate variables (including interactions). The procedure is based on Peduzzi, Holford, and Hardy (1980).

Stepwise regression results in a model that cannot be readily evaluated using conventional significance criteria in hypothesis tests, but the model may prove useful for prediction. We strongly suggest that you separate the sample into learning and test sets for assessment of predictive accuracy before fitting a model to the full data set. See the cautionary discussion and references in Chapter 14.



Logistic Regression in SYSTAT

Estimate Model Main Dialog Box

Logistic regression analysis provides tools for model building, model evaluation, prediction, simulation, hypothesis testing, and regression diagnostics.

Many of the results generated by modeling, testing, or diagnostic procedures can be saved to SYSTAT data files for subsequent graphing and display. New data handling features for the discrete choice model allow tremendous savings in disk space when choice attributes are constant, and in some models, performance is greatly improved.

The Logit Estimate Model dialog box is shown below.

• Dependent. Select the variable you want to examine. The dependent variable should be a categorical numeric variable.

• Independent(s). Select one or more continuous or categorical variables. To add an interaction to your model, use the Cross button. For example, to add the term SEX*EDUCATION, add SEX to the Independent list and then add EDUCATION by clicking Cross.

• Conditional(s). Select conditional variables. To add interactive conditional variables to your model, use the Cross button. For example, to add the term SEX*EDUCATION, add SEX to the Conditional list and then add EDUCATION by clicking Cross.


• Include constant. The constant is an optional parameter. Deselect Include constant to obtain a model through the origin. When in doubt, include the constant.

• Prediction table. Produces a prediction-of-success table, which summarizes the classificatory power of the model.

• Quasi maximum likelihood. Specifies that the covariance matrix will be quasi-maximum likelihood adjusted after the first iteration. If this matrix is calculated, it will be used during subsequent hypothesis testing and will affect t ratios for estimated parameters.

• Save file. Saves specified statistics in filename.SYD.

Click the Options button to go to the Categories, Discrete Choice, and Estimation Options dialog boxes.

Categories

You must specify numeric or string grouping variables that define cells. Specify all categorical variables for which logistic regression analysis should generate design variables.

Categorical Variable(s). Categorize an independent variable when it has several categories; for example, education levels, which could be divided into the following categories: less than high school, some high school, finished high school, some college, finished bachelor's degree, finished master's degree, and finished doctorate. On the other hand, a variable such as age in years would not be categorical unless age were broken up into categories such as under 21, 21-65, and over 65.

Effect. Produces parameter estimates that are differences from group means.


Dummy. Produces dummy codes for the design variables instead of effect codes. Effect coding is the classic analysis of variance parameterization, in which the sum of effects estimated for a classifying variable is 0. If your categorical variable has k categories, k - 1 dummy variables are created.

Discrete Choice

The discrete choice framework is designed specifically to model an individual’s choices in response to the characteristics of the choices. Characteristics of choices are attributes such as price, travel time, horsepower, or calories; they are features of the alternatives that an individual might choose from. You can define set names for groups of variables, and create, edit, or delete variables.

Set Name. Specifies conditional variables. Enter a set name and then you can add and cross variables. To create a new set, click New. Repeat this process until you have defined all of your sets. You can edit existing sets by highlighting the name of the set in the Set Name drop-down list. To delete a set, select the set in the drop-down list and click Delete. When you click Continue, SYSTAT will check that each set name has a definition. If a set name exists but no variables were assigned to it, the set is discarded and the set name will not be in the drop-down list when you return to this dialog box.

Alternatives for discrete choice. Specify an alternative for discrete choice. Characteristics of choice are features of the alternatives that an individual might choose between. It is needed only when the number of alternatives in a choice model varies per subject.

Number of categories. Specify the number of categories or alternatives the variable has. This is needed only for the by-choice data layout where the values of the dependent variable are not explicitly coded. This is only enabled when the Alternatives for discrete choice field is not empty.




Options

The Logit Options dialog box allows you to specify convergence and a tolerance level, select complete or stepwise entry, and specify entry and removal criteria.

Converge. Specifies the largest relative change in any coordinate before iterations terminate.

Tolerance. Prevents the entry of a variable that is highly correlated with the independent variables already included in the model. Enter a value between 0 and 1. Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the correlation required to exclude a variable.

Estimation. Controls the method used to enter and remove variables from the equation.

• Complete. All independent variables are entered in a single step.

• Stepwise. Allows forward, backward, mixed, and interactive covariate selection, with full control over forcing, selection criteria, and candidates, including interactions. It results in a model that can be useful for prediction.

Stepwise Options. The following alternatives are available for stepwise entry and removal:


• Backward. Begins with all candidate variables in the model. At each step, SYSTAT removes the variable with the largest Remove value.

• Forward. Begins with no variables in the model. At each step, SYSTAT adds the variable with the smallest Enter value.

• Automatic. For Backward, SYSTAT automatically removes a variable from your model at each step. For Forward, SYSTAT automatically adds a variable to the model at each step.

• Interactive. Allows you to use your own judgment in selecting variables for addition or deletion.

Probability. You can also control the criteria used to enter variables into and remove variables from the model:

• Enter. Enters a variable into the model if its alpha value is less than the specified value. Enter a value between 0 and 1 (for example, 0.025).

• Remove. Removes a variable from the model if its alpha value is greater than the specified value. Enter a value between 0 and 1 (for example, 0.025).

Force. Forces the first n variables listed in your model to remain in the equation.

Max step. Specifies the maximum number of steps.

Deciles of Risk

After you successfully estimate your model using logistic regression, you can calculate deciles of risk. This will help you make sure that your model fits the data and that the results are not unduly influenced by a handful of unusual observations. In using the deciles of risk table, please note that the goodness-of-fit statistics will depend on the grouping rule specified.
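For orientation, the Hosmer-Lemeshow statistic reported with the table has the textbook form (Hosmer and Lemeshow, 1989): with g groups, n_k cases in group k, O_k observed responses, and average fitted probability p-bar_k,

$$ \hat{C} = \sum_{k=1}^{g} \frac{(O_k - n_k \bar{p}_k)^2}{n_k \bar{p}_k (1 - \bar{p}_k)} $$

referred to a chi-square distribution. This is the standard definition rather than a statement of SYSTAT's exact computation, which is why the grouping rule matters.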

Two grouping rules are available:

• Based on probability values. Probability is reallocated across the possible values of the dependent variable as the independent variable changes. It provides a global view of covariate effects that is not easily seen when considering each binary submodel separately. In fact, the overall effect of a covariate on the probability of an outcome can be of the opposite sign of its coefficient estimate in the corresponding submodel. This is because the submodel concerns only two of the outcomes, whereas the derivative table considers all outcomes at once.



• Based on equal counts per bin. Allocates approximately equal numbers of observations to each cell. Enter the number of cells or bins in the Number of bins text box.

Quantiles

After estimating your model, you can calculate quantiles for any single predictor in the model. Quantiles of unadjusted data can be useful in assessing the suitability of a functional form when you are interested in the unconditional distribution of the failure times.

Covariate(s). The Covariate(s) list contains all of the variables specified in the Independent list in the main Logit dialog box. You can set any of the covariates to a fixed value by selecting the variable in the Covariates list and entering a value in the Value text box. This constraint appears as variable name = value in the Fixed Value Settings list after you click Add. The quantiles for the desired variable correspond to a model in which the covariates are fixed at these values. Any covariates not fixed to a value are assigned the value of 0.

Quantile Value Variable. By default, the first variable in the Independent variable list in the main dialog box is shown in this field. You can change this to any variable from the list. This variable name is then issued as the argument for the QNTL command.


Simulation

SYSTAT allows you to generate and save predicted probabilities and odds ratios, using the last model estimated to evaluate a set of logits. The logits are calculated from a combination of fixed covariate values that you specify in this dialog box.

Covariate(s). The Covariate(s) list contains all of the variables specified in the Independent list on the main Logit dialog box. Select a covariate, enter a fixed value for the covariate in the Value text box, and click Add.

Value. Enter the value over which the parameters of the simulation are to vary.

Fixed value settings. This box lists the fixed values on the covariates from which the logits are calculated.

When you click OK, SYSTAT prompts you to specify a file to which the simulation results will be saved.

Hypothesis

After you successfully estimate your model using logistic regression, you can perform post hoc analyses.


Enter the hypotheses that you would like to test. All the hypotheses that you list will be tested jointly in a single test. To test each restriction individually, you will have to revisit this dialog box each time. To reference dummies generated from categorical covariates, use square brackets, as in:

RACE[1] = 0

You can reproduce the Wald version of the t ratio by testing whether a coefficient is 0:

AGE = 0

If you don't specify a sub-vector, the first is assumed; thus, the constraint above is equivalent to:

AGE{1} = 0

Using Commands

After selecting a file with USE filename, continue with:

LOGIT
  CATEGORY grpvarlist / MISS EFFECT DUMMY
  NCAT=n
  ALT var
  SET parameter=condvarlist
  MODEL depvar = CONSTANT + indvarexp
  MODEL depvar = condvarlist ; polyvarlist
  ESTIMATE / PREDICT TOLERANCE=d CONVERGE=d QML MEANS CLASS,
             DERIVATIVE=INDIVIDUAL or AVERAGE
  START / BACKWARD FORWARD ENTER=d REMOVE=d FORCE=n MAXSTEP=n
  STEP var or + or - / AUTO
  (sequence of STEPs)
  STOP
  SAVE
  DC / SMART=n P=p1,p2,…
  QNTL var / covar=d covar=d
  SIMULATE var1=d1, var2=d2, … / DO var1=d1,d2,d3, var2=d1,d2,d3
  HYPOTHESIS
  CONSTRAIN argument
  TEST



Usage Considerations

Types of data. LOGIT uses rectangular data only. The dependent variable is automatically taken to be categorical. To change the order of the categories, use the ORDER statement. For example,

ORDER CLASS / SORT=DESCENDING

LOGIT can also handle categorical predictor variables. Use the CATEGORY statement to create them, and use the EFFECTS or DUMMY options of CATEGORY to determine the coding method. Use the ORDER command to change the order of the categories.

Print options. For PRINT=SHORT, the output gives N, the type of association, parameter estimates, and associated tests. PRINT=LONG gives, in addition to the above results, a correlation matrix of the parameter estimates.

Quick Graphs. LOGIT produces no Quick Graphs. Use the saved files from ESTIMATE or DC to produce diagnostic plots and fitted curves. See the examples.

Saving files. LOGIT saves simulation results, quantiles, or residuals and estimated values.

BY groups. LOGIT analyzes data by groups.

Bootstrapping. Bootstrapping is not available in this procedure.

Case frequencies. LOGIT uses the FREQ variable, if present, to weight cases. This inflates the total degrees of freedom to be the sum of the number of frequencies. Using a FREQ variable does not require more memory, however. Cases whose value on the FREQ variable is less than or equal to 0 are deleted from the analysis. The FREQ variable may take non-integer values. When the FREQ command is in effect, separate unweighted and weighted case counts are printed.

Weighting can be used to compensate for sampling schemes that stratify on the covariates, giving results that more accurately reflect the population. Weighting is also useful for market share predictions from samples stratified on the outcome variable in discrete choice models. Such samples are known as choice-based in the econometric literature (Manski and Lerman, 1977; Manski and McFadden, 1980; Coslett, 1980) and are common in matched-sample case-control studies where the cases are usually over-sampled, and in market research studies where persons who choose rare alternatives are sampled separately.

Case weights. LOGIT does not allow case weighting.



Examples

The following examples begin with the simple binary logit model and proceed to more complex multinomial and discrete choice logit models. Along the way, we will examine diagnostics and other options used for applications in various fields.

Example 1 Binary Logit

To illustrate the use of binary logistic regression, we take this example from Hosmer and Lemeshow’s book Applied Logistic Regression, referred to below as H&L. Hosmer and Lemeshow consider data on low infant birth weight (LOW) as a function of several risk factors. These include the mother’s age (AGE), mother’s weight during last menstrual period (LWT), race (RACE = 1: white, RACE = 2: black, RACE = 3: other), smoking status during pregnancy (SMOKE), history of premature labor (PTL), hypertension (HT), uterine irritability (UI), and number of physician visits during first trimester (FTV). The dependent variable is coded 1 for birth weights less than 2500 grams and coded 0 otherwise. These variables have previously been identified as associated with low birth weight in the obstetrical literature.

The first model considered is the simple regression of LOW on a constant and LWD, a dummy variable coded 1 if LWT is less than 110 pounds and coded 0 otherwise. (See H&L, Table 3.17.) LWD and LWT are similar variable names. Be sure to note which is being used in the models that follow.

The input is:

USE HOSLEM
LOGIT
MODEL LOW = CONSTANT + LWD
ESTIMATE

The output begins with a listing of the dependent variable and the sample split between 0 (reference) and 1 (response) for the dependent variable. A brief iteration history follows, showing the progress of the procedure to convergence. Finally, the parameter estimates, standard errors, standardized coefficients (popularly called t ratios), p values, and the log-likelihood are presented.

Variables in the SYSTAT Rectangular file are:
ID        LOW       AGE       LWT       RACE      SMOKE     PTL
HT        UI        FTV       BWT       RACE1     CASEID    PTD
LWD

Categorical values encountered during processing are:
LOW (2 levels)
     0, 1

Binary LOGIT Analysis.
Dependent variable: LOW
Input records:            189
Records for analysis:     189

Sample split
Category choices
REF             59
RESP           130
Total :        189

L-L at iteration 1 is     -131.005
L-L at iteration 2 is     -113.231
L-L at iteration 3 is     -113.121
L-L at iteration 4 is     -113.121

Log Likelihood: -113.121

Parameter      Estimate      S.E.    t-ratio    p-value
1 CONSTANT       -1.054     0.188     -5.594      0.000
2 LWD             1.054     0.362      2.914      0.004

                            95.0 % bounds
Parameter    Odds Ratio     Upper     Lower
2 LWD             2.868     5.826     1.412

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 8.431 with 1 df Chi-sq p-value = 0.004
McFadden's Rho-Squared = 0.036


Coefficients

We can evaluate these results much like a linear regression. The coefficient on LWD is large relative to its standard error (t ratio = 2.91) and so appears to be an important predictor of low birth weight. The interpretation of the coefficient is quite different from ordinary regression, however. The logit coefficient tells how much the logit increases for a unit increase in the independent variable, but the probability of a 0 or 1 outcome is a nonlinear function of the logit.

Odds Ratio

The odds-ratio table provides a more intuitively meaningful quantity for each coefficient. The odds of the response are given by p/(1 - p), where p is the probability of response, and the odds ratio is the multiplicative factor by which the odds change when the independent variable increases by one unit. In the first model, being a low-weight mother increases the odds of a low birth weight baby by a multiplicative factor of 2.87, with lower and upper confidence bounds of 1.41 and 5.83, respectively. Since the lower bound is greater than 1, the variable appears to represent a genuine risk factor. See Kleinbaum, Kupper, and Chambliss (1982) for a discussion.
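The odds ratio is simply the exponential of the logit coefficient. Using the LWD estimate and standard error from the output above,

$$ e^{1.054} = 2.87, \qquad e^{1.054 \pm 1.96 \times 0.362} = (1.41,\ 5.83) $$

which reproduces, up to rounding, the printed odds ratio and its 95% bounds.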




Example 2 Binary Logit with Multiple Predictors

The binary logit example contains only a constant and a single dummy variable. We consider the addition of the continuous variable AGE to the model.

The input is:

USE HOSLEM
LOGIT
MODEL LOW = CONSTANT + LWD + AGE
ESTIMATE / MEANS

The output follows:


Variables in the SYSTAT Rectangular file are:
ID        LOW       AGE       LWT       RACE      SMOKE     PTL
HT        UI        FTV       BWT       RACE1     CASEID    PTD
LWD

Categorical values encountered during processing are:
LOW (2 levels)
     0, 1

Binary LOGIT Analysis.
Dependent variable: LOW
Input records:            189
Records for analysis:     189

Sample split
Category choices
REF             59
RESP           130
Total :        189

Independent variable MEANS

PARAMETER             0          1    OVERALL
1 CONSTANT        1.000      1.000      1.000
2 LWD             0.356      0.162      0.222
3 AGE            22.305     23.662     23.238

L-L at iteration 1 is     -131.005
L-L at iteration 2 is     -112.322
L-L at iteration 3 is     -112.144
L-L at iteration 4 is     -112.143
L-L at iteration 5 is     -112.143

Log Likelihood: -112.143

Parameter      Estimate      S.E.    t-ratio    p-value
1 CONSTANT       -0.027     0.762     -0.035      0.972
2 LWD             1.010     0.364      2.773      0.006
3 AGE            -0.044     0.032     -1.373      0.170

                            95.0 % bounds
Parameter    Odds Ratio     Upper     Lower
2 LWD             2.746     5.607     1.345
3 AGE             0.957     1.019     0.898

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 10.385 with 2 df Chi-sq p-value = 0.006
McFadden's Rho-Squared = 0.044


We see the means of the independent variables overall and by value of the dependent variable. In this sample, there is a substantial difference between the mean LWD across birth weight groups but an apparently small AGE difference.

AGE is clearly not significant by conventional standards if we look at the coefficient/standard-error ratio. The confidence interval for the odds ratio (0.898, 1.019) includes 1.00, indicating no effect in relative risk, when adjusting for LWD. Before concluding that AGE does not belong in the model, H&L consider the interaction of AGE and LWD.

Example 3 Binary Logit with Interactions

In this example, we fit a model consisting of a constant, a dummy variable, a continuous variable, and an interaction. Note that it is not necessary to create a new interaction variable; this is done for us automatically by writing the interaction on the MODEL statement. Let’s also add a prediction table for this model.

Following is the input:


USE HOSLEM
LOGIT
MODEL LOW = CONSTANT + LWD + AGE + LWD*AGE
ESTIMATE / PREDICTION
SAVE SIM319 / SINGLE, "SAVE ODDS RATIOS FOR H&L TABLE 3.19"
SIMULATE CONSTANT=0, AGE=0, LWD=1 / DO LWD*AGE = 15,45,5
USE SIM319
LIST


The output follows:

Variables in the SYSTAT Rectangular file are:
ID        LOW       AGE       LWT       RACE      SMOKE     PTL
HT        UI        FTV       BWT       RACE1     CASEID    PTD
LWD

Categorical values encountered during processing are:
LOW (2 levels)
     0, 1

Binary LOGIT Analysis.
Dependent variable: LOW
Input records:            189
Records for analysis:     189

Sample split
Category choices
REF             59
RESP           130
Total :        189

L-L at iteration 1 is     -131.005
L-L at iteration 2 is     -110.937
L-L at iteration 3 is     -110.573
L-L at iteration 4 is     -110.570
L-L at iteration 5 is     -110.570

Log Likelihood: -110.570

Parameter      Estimate      S.E.    t-ratio    p-value
1 CONSTANT        0.774     0.910      0.851      0.395
2 LWD            -1.944     1.725     -1.127      0.260
3 AGE            -0.080     0.040     -2.008      0.045
4 AGE*LWD         0.132     0.076      1.746      0.081

                            95.0 % bounds
Parameter    Odds Ratio     Upper     Lower
2 LWD             0.143     4.206     0.005
3 AGE             0.924     0.998     0.854
4 AGE*LWD         1.141     1.324     0.984

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 13.532 with 3 df Chi-sq p-value = 0.004
McFadden's Rho-Squared = 0.058

Model Prediction Success Table

   Actual           Predicted Choice              Actual
   Choice        Response     Reference            Total
   Response        21.280        37.720           59.000
   Reference       37.720        92.280          130.000
   Pred. Tot.      59.000       130.000          189.000

   Correct          0.361         0.710
   Success Ind.     0.049         0.022
   Tot. Correct     0.601

   Sensitivity:     0.361     Specificity:        0.710
   False Reference: 0.639     False Response:     0.290

Simulation Vector
     Parameter      Fixed Value
1    CONSTANT             0.0
2    LWD                1.000
3    AGE                  0.0

Loop Parameter        Minimum    Maximum   Increment
4    AGE*LWD           15.000     45.000       5.000

SYSTAT save file created.
7 records written to SYSTAT save file.


Likelihood-Ratio Statistic

At this point, it would be useful to assess the model as a whole. One method of model evaluation is to consider the likelihood-ratio statistic. This statistic tests the hypothesis that all coefficients except the constant are 0, much like the F test reported below linear regressions. The likelihood-ratio statistic (LR for short) of 13.532 is chi-squared with three degrees of freedom and a p value of 0.004. The degrees of freedom are equal to the number of covariates in the model, not including the constant. McFadden’s rho-squared is a transformation of the LR statistic intended to mimic an R-squared. It is always between 0 and 1, and a higher rho-squared corresponds to more significant results. Rho-squared tends to be much lower than R-squared though, and a low number does not necessarily imply a poor fit. Values between 0.20 and 0.40 are considered very satisfactory (Hensher and Johnson, 1981).
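The rho-squared printed here is consistent with the usual definition,

$$ \rho^2 = 1 - \frac{LL(N)}{LL(0)} = 1 - \frac{-110.570}{-117.336} = 0.058 $$

matching the value in the output above.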

Models can also be assessed relative to one another. A likelihood-ratio test is formally conducted by computing twice the difference in log-likelihoods for any pair of nested models. Commonly called the G statistic, it has degrees of freedom equal to the difference in the number of parameters estimated in the two models. Comparing the current model with the model without the interaction, we have

G = 2 * (112.14338 - 110.56997) = 3.14684

with one degree of freedom, which has a p value of 0.076. This result corresponds to the bottom row of H&L’s Table 3.17. The conclusion of the test is that the interaction approaches significance.





Prediction Success Table

The output also includes a prediction success table, which summarizes the classificatory power of the model. The rows of the table show how observations from each level of the dependent variable are allocated to predicted outcomes. Reading across the first (Response) row we see that of the 59 cases of low birth weight, 21.28 are correctly predicted and 37.72 are incorrectly predicted. The second row shows that of the 130 not-LOW cases, 37.72 are incorrectly predicted and 92.28 are correctly predicted.

By default, the prediction success table sums predicted probabilities into each cell; thus, each observation contributes a fractional amount to both the Response and Reference cells in the appropriate row. Column sums give predicted totals for each outcome, and row sums give observed totals. These sums will always be equal for models with a constant.

The table also includes additional analytic results. The Correct row is the proportion successfully predicted, defined as the diagonal table entry divided by the column total, and Tot. Correct is the ratio of the sum of the diagonal elements in the table to the total number of observations. In the Response column, 21.28 are correctly predicted out of a column total of 59, giving a correct rate of 0.3607. Overall, 21.28 + 92.28 = 113.56 out of a total of 189 are correct, giving a total correct rate of 0.6009.

Success Ind. is the gain that this model shows over a purely random model that assigned the same probability of LOW to every observation in the data. The model produces a gain of 0.0485 over the random model for responses and 0.0220 for reference cases. Based on these results, we would not think too highly of this model.
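The arithmetic behind these gains is easy to verify from the table: the random model assigns each outcome its marginal proportion, so

$$ 0.361 - \tfrac{59}{189} = 0.049, \qquad 0.710 - \tfrac{130}{189} = 0.022 $$

which are the Success Ind. entries printed above.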

In the biostatistical literature, another terminology is used for these quantities. The Correct quantity is also known as sensitivity for the Response group and specificity for the Reference group. The False Reference rate is the fraction of those predicted to respond that actually did not respond, while the False Response rate is the fraction of those predicted to not respond that actually responded.

We prefer the prediction success terminology because it is applicable to the multinomial case as well.

Simulation

To understand the implications of the interaction, we need to explore how the relative risk of low birth weight varies over the typical child-bearing years. This changing relative risk is evaluated by computing the logit difference for base and comparison groups. The logit for the base group, mothers with LWD = 0, is written as L(0); the logit for the comparison group, mothers with LWD = 1, is L(1). Thus,

L(0) = CONSTANT + B2*AGE
L(1) = CONSTANT + B1*LWD + B2*AGE + B3*LWD*AGE
     = CONSTANT + B1 + B2*AGE + B3*AGE




since, for L(1), LWD = 1. The logit difference is

L(1) - L(0) = B1 + B3*LWD*AGE

which is the coefficient on LWD plus the interaction multiplied by its coefficient. The difference L(1) - L(0) evaluated for a mother of a given age is a measure of the log relative risk due to LWD being 1. This can be calculated simply for several ages, and converted to odds ratios with upper and lower confidence bounds, using the SIMULATE command.

SIMULATE calculates the predicted logit, predicted probability, odds ratio, upper and lower bounds, and the standard error of the logit for any specified values of the covariates. In the above command, the constant and age are set to 0, because these coefficients do not appear in the logit difference. LWD is set to 1, and the interaction is allowed to vary from 15 to 45 in increments of five years. The only printed output produced by this command is a summary report.
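As a check on the listing that follows, evaluating the logit difference at AGE = 30 with the estimates above gives

$$ L(1) - L(0) = -1.944 + 0.132 \times 30 = 2.02, \qquad e^{2.02} \approx 7.5 $$

which matches the LOGIT and ODDS columns of the saved file (up to rounding of the printed coefficients).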

SIMULATE does not print results when a DO LOOP is specified because of the potentially large volume of output it can generate. To view the results, use the commands:

USE SIM319
LIST

Case number   LOGIT  SELOGIT   PROB  PLOWER  PUPPER    ODDS   ODDSL    ODDSU  LOOP(1)
          1    0.04     0.66   0.51    0.22    0.79    1.04    0.28     3.79    15.00
          2    0.70     0.40   0.67    0.48    0.82    2.01    0.91     4.44    20.00
          3    1.36     0.42   0.80    0.63    0.90    3.90    1.71     8.88    25.00
          4    2.02     0.69   0.88    0.66    0.97    7.55    1.95    29.19    30.00
          5    2.68     1.03   0.94    0.66    0.99   14.63    1.94   110.26    35.00
          6    3.34     1.39   0.97    0.65    1.00   28.33    1.85   432.77    40.00
          7    4.00     1.76   0.98    0.64    1.00   54.86    1.75  1724.15    45.00

The results give the effect of low maternal weight (LWD) on low birth weight as a function of age, where LOOP(1) is the value of AGE * LWD (which is just AGE) and ODDSU and ODDSL are upper and lower bounds of the odds ratio. We see that the effect of LWD goes up dramatically with age, although the confidence interval becomes quite large beyond age 30. The results presented here are calculated internally within LOGIT and thus differ slightly from those reported in H&L, who use printed output with fewer decimal places of precision to obtain their results.



Example 4 Deciles of Risk and Model Diagnostics

Before turning to more detailed model diagnostics, we fit H&L's final model. As a result of experimenting with more variables and a large number of interactions, H&L arrive at the model used here. The input is:

USE HOSLEM
LOGIT
CATEGORY RACE / DUMMY
MODEL LOW = CONSTANT + AGE + RACE + SMOKE + HT + UI + LWD + PTD +,
            AGE*LWD + SMOKE*LWD
ESTIMATE
SAVE RESID
DC / P=0.06850, 0.09360, 0.15320, 0.20630, 0.27810, 0.33140,
     0.42300, 0.49124, 0.61146
USE RESID
PPLOT PEARSON / SIZE=VARIANCE
PLOT DELPSTAT*PROB / SIZE=DELBETA(1)

The categorical variable RACE is specified to have three levels. By default LOGIT uses the highest category as the reference group, although this can be changed. The model includes all of the main variables except FTV, with LWT and PTL transformed into dummy variable variants LWD and PTD, and two interactions. To reproduce the results of Table 5.1 of H&L, we specify a particular set of cut points for the deciles of risk table. Some of the results are:

USE HOSLEMLOGITCATEGORY RACE / DUMMYMODEL LOW=CONSTANT+AGE+RACE+SMOKE+HT+UI+LWD+PTD+ , AGE*LWD+SMOKE*LWDESTIMATESAVE RESIDDC / P=0.06850,0.09360,0.15320,0.20630,0.27810,0.33140, 0.42300,0.49124,0.61146

USE RESIDPPLOT PEARSON / SIZE=VARIANCEPLOT DELPSTAT*PROB/SIZE=DELBETA(1)

Variables in the SYSTAT Rectangular file are:
 ID LOW AGE LWT RACE SMOKE PTL HT
 UI FTV BWT RACE1 CASEID PTD LWD

Categorical values encountered during processing are:
 RACE (3 levels)
  1, 2, 3
 LOW (2 levels)
  0, 1

Binary LOGIT Analysis.
Dependent variable: LOW
Input records: 189
Records for analysis: 189

Sample split
 Category choices
 REF       59
 RESP     130
 Total :  189

L-L at iteration 1 is -131.005
L-L at iteration 2 is -98.066
L-L at iteration 3 is -96.096
L-L at iteration 4 is -96.006
L-L at iteration 5 is -96.006
L-L at iteration 6 is -96.006

Log Likelihood: -96.006


 Parameter        Estimate     S.E.   t-ratio   p-value
 1 CONSTANT          0.248    1.068     0.232     0.816
 2 AGE              -0.084    0.046    -1.843     0.065
 3 RACE_1           -0.760    0.464    -1.637     0.102
 4 RACE_2            0.323    0.532     0.608     0.543
 5 SMOKE             1.153    0.458     2.515     0.012
 6 HT                1.359    0.661     2.055     0.040
 7 UI                0.728    0.479     1.519     0.129
 8 LWD              -1.730    1.868    -0.926     0.354
 9 PTD               1.232    0.471     2.613     0.009
10 AGE*LWD           0.147    0.083     1.779     0.075
11 SMOKE*LWD        -1.407    0.819    -1.719     0.086

                              95.0 % bounds
 Parameter      Odds Ratio     Upper     Lower
 2 AGE               0.919     1.005     0.841
 3 RACE_1            0.468     1.162     0.188
 4 RACE_2            1.382     3.920     0.487
 5 SMOKE             3.168     7.781     1.290
 6 HT                3.893    14.235     1.065
 7 UI                2.071     5.301     0.809
 8 LWD               0.177     6.902     0.005
 9 PTD               3.427     8.632     1.360
10 AGE*LWD           1.159     1.363     0.985
11 SMOKE*LWD         0.245     1.218     0.049

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 42.660 with 10 df Chi-sq p-value = 0.000
McFadden’s Rho-Squared = 0.182

Deciles of Risk
Records processed: 189
Sum of weights = 189.000

                  Statistic   p-value        df
 Hosmer-Lemeshow*     5.231     0.733     8.000
 Pearson            183.443     0.374   178.000
 Deviance           192.012     0.224   178.000
 * Large influence of one or more deciles may affect statistic.

 Category   0.069  0.094  0.153  0.206  0.278  0.331  0.423  0.491  0.611  1.000
 Resp Obs     0.0  1.000  4.000  2.000  6.000  6.000  6.000 10.000  9.000 15.000
      Exp   0.854  1.641  2.252  3.646  5.017  5.566  6.816  8.570 10.517 14.122
 Ref  Obs  18.000 19.000 14.000 18.000 14.000 12.000 12.000  9.000 10.000  4.000
      Exp  17.146 18.359 15.748 16.354 14.983 12.434 11.184 10.430  8.483  4.878
 Avg Prob   0.047  0.082  0.125  0.182  0.251  0.309  0.379  0.451  0.554  0.743

SYSTAT save file created.
189 records written to %1 save file.


Deciles of Risk

How well does a model fit the data? Are the results unduly influenced by a handful of unusual observations? These are some of the questions we try to answer with our model assessment tools. Besides the prediction success table and likelihood-ratio tests (see the “Binary Logit with Interactions” example), the model assessment methods in LOGIT include the Pearson chi-square, deviance and Hosmer-Lemeshow statistics, the deciles of risk table, and a collection of residual, leverage, and influence quantities. Most of these are produced by the DC command, which is invoked after estimating a model.

[Probability plot of the PEARSON residuals against the Expected Value for Normal Distribution, with plotting symbols sized by VARIANCE (0.0 to 0.3).]


The table in this example is generated by partitioning the sample into 10 groups based on the predicted probability of the observations. The row labeled Category gives the end points of the cells defining a group. Thus, the first group consists of all observations with predicted probability between 0 and 0.069, the second group covers the interval 0.069 to 0.094, and the last group contains observations with predicted probability greater than 0.611.

The cell end points can be specified explicitly as we did or generated automatically by LOGIT. Cells will be equally spaced if the DC command is given without any arguments, and LOGIT will allocate approximately equal numbers of observations to each cell when the SMART option is given, as:

DC / SMART = 10

which requests 10 cells. Within each cell, we are given a breakdown of the observed and expected 0’s (Ref) and 1’s (Resp), calculated as in the prediction success table. Expected 1’s are just the sum of the predicted probabilities of 1 in the cell. In the table, it is apparent that observed totals are close to expected totals everywhere, indicating a fairly good fit. This conclusion is borne out by the Hosmer-Lemeshow statistic of 5.23, which is approximately chi-squared with eight degrees of freedom. H&L discuss the degrees of freedom calculation.
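The Hosmer-Lemeshow statistic can be reproduced from the printed table, since it is simply the Pearson chi-square computed over the 20 observed and expected cells. A short Python check using the decile counts from the output above:

# Observed and expected counts per decile, read from the deciles of risk table
resp_obs = [0.0, 1, 4, 2, 6, 6, 6, 10, 9, 15]
resp_exp = [0.854, 1.641, 2.252, 3.646, 5.017, 5.566, 6.816, 8.570, 10.517, 14.122]
ref_obs  = [18, 19, 14, 18, 14, 12, 12, 9, 10, 4]
ref_exp  = [17.146, 18.359, 15.748, 16.354, 14.983, 12.434, 11.184, 10.430, 8.483, 4.878]

hl = sum((o - e) ** 2 / e
         for o, e in zip(resp_obs + ref_obs, resp_exp + ref_exp))
print(round(hl, 3))   # 5.231, matching the printed statistic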

In using the deciles of risk table, it should be noted that the goodness-of-fit statistics will depend on the grouping rule specified and that not all statistics programs will apply the same rules. For example, some programs assign all tied probabilities to the same cell, which can result in very unequal cell counts. LOGIT gives the user a high degree of control over the grouping, allowing you to choose among several methods. The table also provides the Pearson chi-square and the sum of squared deviance residuals, assuming that each observation has a unique covariate pattern.

Regression Diagnostics

If the DC command is preceded by a SAVE command, a SYSTAT data file containing regression diagnostics will be created (Pregibon, 1981; Cook and Weisberg, 1984). The SAVE file contains these variables:

ACTUAL       Value of dependent variable
PREDICT      Class assignment (1 or 0)
PROB         Predicted probability
LEVERAGE(1)  Diagonal element of Pregibon “hat” matrix
LEVERAGE(2)  Component of LEVERAGE(1)
PEARSON      Pearson residual for observation
VARIANCE     Variance of Pearson residual
STANDARD     Standardized Pearson residual
DEVIANCE     Deviance residual
DELDSTAT     Change in deviance chi-square
DELPSTAT     Change in Pearson chi-square
DELBETA(1)   Standardized change in beta
DELBETA(2)   Standardized change in beta
DELBETA(3)   Standardized change in beta



LEVERAGE(1) is a measure of the influence of an observation on the model fit and is H&L’s h. DELBETA(1) is a measure of the change in the coefficient vector due to the observation and is their δβ, DELPSTAT is based on the squared residual and is their δχ², and DELDSTAT is the change in deviance and is their δD. As in linear regression, the diagnostics are intended to identify outliers and influential observations. Plots of PEARSON, DEVIANCE, LEVERAGE(1), DELDSTAT, and DELPSTAT against CASE will highlight unusual data points. H&L suggest plotting δχ², δD, and δβ against PROB and against h.

There is an important difference between our calculation of these measures and those produced by H&L. In LOGIT, the above quantities are computed separately for each observation, with no account taken of covariate grouping; whereas, in H&L, grouping is taken into account. To obtain the grouped variants of these statistics, several SYSTAT programming steps are involved. For further discussion and interpretation of diagnostic graphs, see H&L’s Chapter 5. We include the probability plot of the residuals from our model, with the variance of the residuals used to size the plotting characters.
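The ungrouped quantities are related by simple closed forms. The Python sketch below follows H&L’s definitions and is only an illustration of those formulas, not LOGIT’s internal code; it computes the delta statistics for one observation from its outcome y, fitted probability p, and leverage h:

import math

def diagnostics(y, p, h):
    # Pearson and deviance residuals for a single binary observation
    r = (y - p) / math.sqrt(p * (1 - p))
    d = math.copysign(
        math.sqrt(-2 * (y * math.log(p) + (1 - y) * math.log(1 - p))), y - p)
    standard = r / math.sqrt(1 - h)          # standardized Pearson residual
    delpstat = r * r / (1 - h)               # delta chi-square
    deldstat = d * d + r * r * h / (1 - h)   # delta deviance
    delbeta = r * r * h / (1 - h) ** 2       # standardized change in beta
    return standard, delpstat, deldstat, delbeta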

We also display an example of the graph on the cover of H&L. The original cover was plotted using SYSTAT Version 5 for the Macintosh. There are slight differences between the two plots because of the scales and number of iterations in the model fitting, but the examples are basically the same. H&L is an extremely valuable resource for learning about graphical aids to diagnosing logistic models.



Example 5 Quantiles

In bioassay, it is common to estimate the dosage required to kill 50% of a target population. For example, a toxicity experiment might establish the concentration of nicotine sulphate required to kill 50% of a group of common fruit flies (Hubert, 1984). More generally, the goal is to identify the level of a stimulus required to induce a 50% response rate, where the response is any binary outcome variable and the stimulus is a continuous covariate. In bioassay, stimuli include drugs, toxins, hormones, and insecticides; the responses include death, weight gain, bacterial growth, and color change, but the concepts are equally applicable to other sciences.

To obtain the LD50 in LOGIT, simply issue the QNTL command. However, don’t make the mistake of spelling “quantile” as QU, which means QUIT in SYSTAT. QNTL will produce not only the LD50 but a number of other quantiles as well, with upper and lower bounds when they exist. Consider the following data from Williams (1986):

         RESPONSE   LDOSE   COUNT
CASE 1          1      -2       1
CASE 2          0      -2       4
CASE 3          1      -1       3
CASE 4          0      -1       2
CASE 5          1       0       2
CASE 6          0       0       3
CASE 7          1       1       4
CASE 8          0       1       1
CASE 9          1       2       5

Here, RESPONSE is the dependent variable, LDOSE is the logarithm of the dose (stimulus), and COUNT is the number of subjects with that response. The model estimated is:

USE WILL
FREQ=COUNT
LOGIT
MODEL RESPONSE=CONSTANT+LDOSE
ESTIMATE
QNTL


Following is the output:

Variables in the SYSTAT Rectangular file are:
 RESPONSE LDOSE COUNT

Case frequencies determined by value of variable COUNT.

Categorical values encountered during processing are:
 RESPONSE (2 levels)
  0, 1

Binary LOGIT Analysis.
Dependent variable: RESPONSE
Analysis is weighted by COUNT
Sum of weights = 25.000
Input records: 9
Records for analysis: 9

Sample split
                    Weighted
 Category   Count      Count
 REF            5     15.000
 RESP           4     10.000
 Total :        9     25.000

L-L at iteration 1 is -17.329
L-L at iteration 2 is -13.277
L-L at iteration 3 is -13.114
L-L at iteration 4 is -13.112
L-L at iteration 5 is -13.112

Log Likelihood: -13.112

 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT      0.564    0.496     1.138     0.255
 2 LDOSE         0.919    0.394     2.334     0.020

                          95.0 % bounds
 Parameter   Odds Ratio    Upper    Lower
 2 LDOSE          2.507    5.425    1.159

Log Likelihood of constants only model = LL(0) = -16.825
2*[LL(N)-LL(0)] = 7.427 with 1 df Chi-sq p-value = 0.006
McFadden’s Rho-Squared = 0.221

Evaluation Vector
 1 CONSTANT   1.000
 2 LDOSE      VALUE

Quantile Table

 Probability     LOGIT     LDOSE     Upper     Lower
   0.999         6.907     6.900    44.788     3.518
   0.995         5.293     5.145    33.873     2.536
   0.990         4.595     4.385    29.157     2.105
   0.975         3.664     3.372    22.875     1.519
   0.950         2.944     2.590    18.042     1.050
   0.900         2.197     1.777    13.053     0.530
   0.750         1.099     0.582     5.928    -0.445
   0.667         0.695     0.142     3.551    -1.047
   0.500         0.0      -0.613     0.746    -3.364
   0.333        -0.695    -1.369    -0.347    -7.392
   0.250        -1.099    -1.809    -0.731    -9.987
   0.100        -2.197    -3.004    -1.552   -17.266
   0.050        -2.944    -3.817    -2.046   -22.281
   0.025        -3.664    -4.599    -2.503   -27.126
   0.010        -4.595    -5.612    -3.081   -33.416
   0.005        -5.293    -6.372    -3.508   -38.136
   0.001        -6.907    -8.127    -4.486   -49.055


This table includes LD (probability) values between 0.001 and 0.999. The median lethal LDOSE (log-dose) is –0.613 with upper and lower bounds of 0.746 and –3.364 for the default 95% confidence interval, corresponding to a dose of 0.542 with limits 2.11 and 0.0346.
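The quantile table can be reproduced from the coefficients: for a target probability p, QNTL solves CONSTANT + B*LDOSE = logit(p) for LDOSE. A minimal Python check with the rounded estimates above (small discrepancies against the printed table reflect the rounding):

import math

b0, b1 = 0.564, 0.919                  # CONSTANT and LDOSE estimates

def ld(p):
    # Log-dose at which the response probability equals p
    return (math.log(p / (1 - p)) - b0) / b1

print(round(ld(0.50), 3))              # about -0.614 (printed: -0.613)
print(round(math.exp(ld(0.50)), 3))    # about 0.541  (printed dose: 0.542)
print(round(ld(0.90), 3))              # about 1.777, matching the table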

Indeterminate Confidence Intervals

Quantile confidence intervals are calculated using Fieller bounds (Finney, 1978), which can easily include positive or negative infinity for steep dose-response relationships. In the output, these are represented by the SYSTAT missing value. If this happens, an alternative suggested by Williams (1986) is to calculate confidence bounds using likelihood-ratio (LR) tests. See Cox and Oakes (1984) for a likelihood profile example. Williams observes that the LR bounds seem to be invariably smaller than the Fieller bounds even for well-behaved large-sample problems.

With SYSTAT BASIC, the search for the LR bounds can be conducted easily. However, if you are not familiar with LR testing of this type, please refer to Cox and Oakes (1984) and Williams (1986) for further explanation, because our account here is necessarily brief.

We first estimate the model of RESPONSE on LDOSE reported above, which will be the unrestricted model in the series of tests. The key statistic is the final log-likelihood of –13.112. We then need to search for restricted models that force the LD50 to other values and that yield log-likelihoods no worse than –13.112 – 1.92 = –15.032. A difference in log-likelihoods of 1.92 marks a 95% confidence interval because 2 * 1.92 = 3.84 is the 0.95 cutoff of the chi-squared distribution with one degree of freedom.

A restricted model is estimated by using a new independent variable and fitting a model without a constant. The new independent variable is equal to the original minus the value of the hypothesized LD50 bound. Values of the bounds will be selected by trial and error. Thus, to test an LD50 value of 0.4895, we could type:

LOGIT
LET LDOSEB=LDOSE-.4895
MODEL RESPONSE=LDOSEB
ESTIMATE
LET LDOSEB=LDOSE+2.634
MODEL RESPONSE=LDOSEB
ESTIMATE

SYSTAT BASIC is used to create the new variable LDOSEB “on the fly,” and the new model is then estimated without a constant. The only important part of the results from a restricted model is the final log-likelihood. It should be close to –15.032 if we have found the boundary of the confidence interval. We won’t show the results of these estimations except to say that the lower bound was found to be –2.634 and is tested using the second LET statement. Note that the value of the bound is subtracted from the original independent variable, resulting in the subtraction of a negative number. While the process of looking for a bound that will yield a log-likelihood of –15.032 for these data is one of trial and error, it should not take long with the interactive program. Several other examples are provided in Williams (1986). We were able to reproduce most of his confidence interval results, but for several models his reported LD50 values seem to be incorrect.
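The trial-and-error search can also be automated outside SYSTAT. The Python sketch below refits the restricted model directly with scipy and finds where the profile log-likelihood crosses the target of –15.032; it is a sketch under the assumption that the profile is well behaved between the chosen brackets:

import numpy as np
from scipy.optimize import minimize_scalar, brentq

# Williams (1986) data as (response, ldose, count) triples
data = [(1, -2, 1), (0, -2, 4), (1, -1, 3), (0, -1, 2), (1, 0, 2),
        (0, 0, 3), (1, 1, 4), (0, 1, 1), (1, 2, 5)]
y, x, w = (np.array(col, dtype=float) for col in zip(*data))

def profile_ll(c):
    # Maximized log-likelihood of RESPONSE on LDOSE - c, no constant
    def negll(b):
        eta = b * (x - c)
        return -np.sum(w * (y * eta - np.logaddexp(0.0, eta)))
    return -minimize_scalar(negll, bounds=(-10, 10), method="bounded").fun

target = -13.112 - 1.92      # unrestricted log-likelihood minus 3.84/2
upper = brentq(lambda c: profile_ll(c) - target, -0.613, 5.0)
lower = brentq(lambda c: profile_ll(c) - target, -10.0, -0.613)
print(round(lower, 3), round(upper, 3))   # near -2.634 and 0.4895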

Quantiles and Logistic Regression

The calculation of LD values has traditionally been conducted in the context of simple regressions containing a single predictor variable. LOGIT extends the notion to multiple regression by allowing you to select one variable for LD calculations while holding the values of the other variables constant at prespecified values. Thus,

USE HOSLEM
CATEGORY RACE
MODEL LOW = CONSTANT + AGE + RACE + SMOKE + HT +,
      UI + LWD + PTD
ESTIMATE
QNTL AGE / CONSTANT=1, RACE[1]=1, SMOKE=1, PTD=1, LWD=1, HT=1, UI=1

will produce the quantiles for AGE with the other variables set as specified. The Fieller bounds are calculated, adjusting for all other parameters estimated.

Example 6 Multinomial Logit

We will illustrate multinomial modeling with an example, emphasizing what is new in this context. If you have not already read the example on binary logit, this is a good time to do so. The data used here have been extracted from the National Longitudinal Survey of Young Men, 1979. Information on 200 individuals is supplied on school enrollment status (NOTENR = 1 if not enrolled, 0 otherwise), log10 of wage (LW), age, highest completed grade (EDUC), mother’s education (MED), father’s education (FED), an index of reading material available in the home (CULTURE = 1 for least, 3 for most), mean income of persons in father’s occupation in 1960 (FOMY), an IQ measure, a race dummy (BLACK = 0 for white), a region dummy (SOUTH = 0 for non-South), and the number of siblings (NSIBS).



We estimate a model to analyze the CULTURE variable, predicting its value with several demographic characteristics. In this example, we ignore the fact that the dependent variable is ordinal and treat it as a nominal variable. (See Agresti, 1990, for a discussion of the distinction.)

USE NLS
FORMAT=4
PRINT=LONG
LOGIT
MODEL CULTURE=CONSTANT+MED+FOMY
ESTIMATE / MEANS,PREDICT,CLASS,DERIVATIVE=INDIVIDUAL
PRINT

These commands look just like our binary logit analyses with the exception of the DERIVATIVE and CLASS options, which we will discuss below. The resulting output is:

Categorical values encountered during processing are:
 CULTURE (3 levels)
  1, 2, 3

Multinomial LOGIT Analysis.
Dependent variable: CULTURE
Input records: 200
Records for analysis: 200

Sample split
 Category choices
 1        12
 2        49
 3       139
 Total : 200

Independent variable MEANS
 PARAMETER           1           2           3     OVERALL
 1 CONSTANT     1.0000      1.0000      1.0000      1.0000
 2 MED          8.7500     10.1837     11.4460     10.9750
 3 FOMY      4551.5000   5368.8571   6116.1367   5839.1750

L-L at iteration 1 is -219.7225
L-L at iteration 2 is -145.2936
L-L at iteration 3 is -138.9952
L-L at iteration 4 is -137.8612
L-L at iteration 5 is -137.7851
L-L at iteration 6 is -137.7846
L-L at iteration 7 is -137.7846

Log Likelihood: -137.7846

 Parameter       Estimate      S.E.    t-ratio   p-value
Choice Group: 1
 1 CONSTANT        5.0638    1.6964     2.9850    0.0028
 2 MED            -0.4228    0.1423    -2.9711    0.0030
 3 FOMY           -0.0006    0.0002    -2.6034    0.0092


Choice Group: 2
 1 CONSTANT        2.5435    0.9834     2.5864    0.0097
 2 MED            -0.1917    0.0768    -2.4956    0.0126
 3 FOMY           -0.0003    0.0001    -2.1884    0.0286

                             95.0 % bounds
 Parameter      Odds Ratio      Upper      Lower
Choice Group: 1
 2 MED              0.6552     0.8660     0.4958
 3 FOMY             0.9994     0.9998     0.9989
Choice Group: 2
 2 MED              0.8255     0.9597     0.7101
 3 FOMY             0.9997     1.0000     0.9995

Log Likelihood of constants only model = LL(0) = -153.2535
2*[LL(N)-LL(0)] = 30.9379 with 4 df Chi-sq p-value = 0.0000
McFadden’s Rho-Squared = 0.1009

Wald tests on effects across all choices
                Wald Chi-Sq
 Effect           Statistic    Signif        df
 1 CONSTANT         12.0028    0.0025    2.0000
 2 MED              12.1407    0.0023    2.0000
 3 FOMY              9.4575    0.0088    2.0000

Covariance Matrix
           1         2         3         4         5         6
 1    2.8777
 2   -0.1746    0.0202
 3   -0.0002   -0.0000    0.0000
 4    0.5097   -0.0282   -0.0000    0.9670
 5   -0.0274    0.0027   -0.0000   -0.0541    0.0059
 6   -0.0000   -0.0000    0.0000   -0.0001   -0.0000    0.0000

Correlation Matrix
           1         2         3         4         5         6
 1    1.0000   -0.7234   -0.6151    0.3055   -0.2100   -0.1659
 2   -0.7234    1.0000   -0.0633   -0.2017    0.2462   -0.0149
 3   -0.6151   -0.0633    1.0000   -0.1515   -0.0148    0.2284
 4    0.3055   -0.2017   -0.1515    1.0000   -0.7164   -0.5544
 5   -0.2100    0.2462   -0.0148   -0.7164    1.0000   -0.1570
 6   -0.1659   -0.0149    0.2284   -0.5544   -0.1570    1.0000

Individual variable derivatives averaged over all observations
 PARAMETER            1           2           3
 1 CONSTANT      0.2033      0.3441     -0.5474
 2 MED          -0.0174     -0.0251      0.0425
 3 FOMY         -0.0000     -0.0000      0.0001


The output begins with a report on the number of records read and retained for analysis. This is followed by a frequency table of the dependent variable; both weighted and unweighted counts would be provided if the FREQ option had been used. The means table provides means of the independent variables by value of the dependent variable. We observe that the highest educational and income values are associated with the most reading material in the home. Next, an abbreviated history of the optimization process lists the log-likelihood at each iteration, and finally, the estimation results are printed.

Note that the regression results consist of two sets of estimates, labeled Choice Group 1 and Choice Group 2. It is this multiplicity of parameter estimates that differentiates multinomial from binary logit. If there had been five categories in the dependent variable, there would have been four sets of estimates, and so on. This volume of output provides the challenge to understanding the results.

The results are a little more intelligible when you realize that we have really estimated a series of binary logits simultaneously. The first submodel consists of the two dependent variable categories 1 and 3, and the second consists of categories 2 and 3. These submodels always include the highest level of the dependent variable as the reference class and one other level as the response class. If NCAT had been set to 25, the 24 submodels would be categories 1 and 25, categories 2 and 25, through categories 24 and 25. We then obtain the odds ratios for the two submodels separately, comparing dependent variable levels 1 against 3 and 2 against 3. The odds ratio table shows that levels 1 and 2 are less likely as MED and FOMY increase, as the odds ratios are less than 1.

Model Prediction Success Table

                   Predicted Choice
 Actual                                         Actual
 Choice            1         2         3        Total
 1              1.8761    4.0901    6.0338    12.0000
 2              3.6373   13.8826   31.4801    49.0000
 3              6.4865   31.0273  101.4862   139.0000
 Pred. Tot.    12.0000   49.0000  139.0000   200.0000
 Correct        0.1563    0.2833    0.7301
 Success Ind.   0.0963    0.0383    0.0351
 Tot. Correct   0.5862

Model Classification Table

                   Predicted Choice
 Actual                                         Actual
 Choice            1         2         3        Total
 1              1.0000    3.0000    8.0000    12.0000
 2              0.0       4.0000   45.0000    49.0000
 3              1.0000    5.0000  133.0000   139.0000
 Pred. Tot.     2.0000   12.0000  186.0000   200.0000
 Correct        0.0833    0.0816    0.9568
 Success Ind.   0.0233   -0.1634    0.2618
 Tot. Correct   0.6900


Wald Test Table

The coefficient/standard-error ratios (t ratios) reported next to each coefficient are a guide to the significance of an individual parameter. But when the number of categories is greater than two, each variable corresponds to more than one parameter. The Wald test table automatically conducts the hypothesis test of dropping all parameters associated with a variable, and the degrees of freedom indicates how many parameters were involved. Because each variable in this example generates two coefficients, the Wald tests have two degrees of freedom each. Given the high individual t ratios, it is not surprising that every variable is also significant overall. The PRINT = LONG option also produces the parameter covariance and correlation matrices.

Derivative Tables

In a multinomial context, we will want to know how the probabilities of each of the outcomes will change in response to a change in the covariate values. This information is provided in the derivative table, which tells us, for example, that when MED increases by one unit, the probability of category 3 goes up by 0.042, and categories 1 and 2 go down by 0.017 and 0.025, respectively. To assess properly the effect of father’s income, the variable should be rescaled to hundreds or thousands of dollars (or the FORMAT increased) because the effect of an increase of one dollar is very small. The sum of the entries in each row is always 0 because an increase in probability in one category must come about by a compensating decrease in other categories. There is no useful interpretation of the CONSTANT row.

In general, the table shows how probability is reallocated across the possible values of the dependent variable as the independent variable changes. It thus provides a global view of covariate effects that is not easily seen when considering each binary submodel separately. In fact, the overall effect of a covariate on the probability of an outcome can be of the opposite sign of its coefficient estimate in the corresponding submodel. This is because the submodel concerns only two of the outcomes, whereas the derivative table considers all outcomes at once.

This table was generated by evaluating the derivatives separately for each individual observation in the data set and then computing the mean; this is the theoretically correct way to obtain the results.

A quick alternative is to evaluate the derivatives once at the sample average of the covariates. This method saves time (but at the possible cost of accuracy) and is requested with the option DERIVATIVE=AVERAGE.
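For the multinomial logit, these derivatives have the closed form dP(j)/dx = P(j)[B(j) - sum over k of P(k)B(k)], with the reference group’s coefficients equal to 0. The Python sketch below evaluates the MED derivative once at the overall means, which corresponds to the quick DERIVATIVE=AVERAGE method, so its numbers differ somewhat from the individual-averaged table above:

import numpy as np

# Coefficients by choice group (rows; group 3 is the reference, all zeros)
# and by parameter (columns: CONSTANT, MED, FOMY), from the output above.
B = np.array([[5.0638, -0.4228, -0.0006],
              [2.5435, -0.1917, -0.0003],
              [0.0,     0.0,     0.0   ]])
x = np.array([1.0, 10.975, 5839.175])   # OVERALL means from the MEANS table

p = np.exp(B @ x)
p /= p.sum()                            # outcome probabilities at x

m = 1                                   # column index of MED
dP = p * (B[:, m] - p @ B[:, m])        # dP(j)/dMED for j = 1, 2, 3
print(dP, dP.sum())                     # roughly -0.013, -0.028, 0.041; sums to 0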

Prediction Success

The PREDICT option instructs LOGIT to produce the prediction success table, which we have already seen in the binary logit. (See Hensher and Johnson, 1981; McFadden, 1979.) The table will break down the distribution of predicted outcomes by actual choice, with diagonals representing correct predictions and off-diagonals representing incorrect predictions. For the multinomial model, the table will have dimensions NCAT by NCAT with additional marginal results. For our example model, the core table is 3 by 3.

Each row of the table takes all cases having a specific value of the dependent variable and shows how the model allocates those cases across the possible outcomes. Thus in row 1, the 12 cases that actually had CULTURE = 1 were distributed by the predictive model as 1.88 to CULTURE = 1, 4.09 to CULTURE = 2, and 6.03 to CULTURE = 3. These numbers are obtained by summing the predicted probability of being in each category across all of the cases with CULTURE actually equal to 1. A similar allocation is provided for every value of the dependent variable.

The prediction success table is also bordered by additional information—row totals are observed sums, and column totals are predicted sums and will be equal for any model containing a constant. The Correct row gives the ratio of the number correctly predicted in a column to the column total. Thus, among cases for which CULTURE = 1, the fraction correct is 1.8761/12 = 0.1563; for CULTURE = 3, the ratio is 101.4862/139 = 0.7301. The total correct gives the fraction correctly predicted overall and is computed as the sum of Correct in each column divided by the table total. This is (1.8761 + 13.8826 + 101.4862)/200 = 0.5862.

The success index measures the gain that the model exhibits in number correctly predicted in each column over a purely random model (a model with just a constant). A purely random model would assign the same probabilities of the three outcomes to each case, as illustrated below:

Random Probability Model
                              Predicted Sample Fraction   Success Index = Correct - Random Predicted
PROB (CULTURE=1) =  12/200  = 0.0600                      0.1563 - 0.0600 = 0.0963
PROB (CULTURE=2) =  49/200  = 0.2450                      0.2833 - 0.2450 = 0.0383
PROB (CULTURE=3) = 139/200  = 0.6950                      0.7301 - 0.6950 = 0.0351
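A small Python check of these bordering rows, computed from the core table:

import numpy as np

# Core prediction success table (rows: actual choice, columns: predicted)
T = np.array([[1.8761,  4.0901,   6.0338],
              [3.6373, 13.8826,  31.4801],
              [6.4865, 31.0273, 101.4862]])

correct = np.diag(T) / T.sum(axis=0)    # 0.1563, 0.2833, 0.7301
random  = T.sum(axis=1) / T.sum()       # observed shares: 0.0600, 0.2450, 0.6950
print(correct - random)                 # success indices: 0.0963, 0.0383, 0.0351
print(np.trace(T) / T.sum())            # total correct: 0.5862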


Thus, the smaller the success index in each column, the poorer the performance of the model; in fact, the index can even be negative.

Normally, one prediction success table is produced for each model estimated. However, if the data have been separated into learning and test subsamples with BY, a separate prediction success table will be produced for each portion of the data. This can provide a clear picture of the strengths and weaknesses of the model when applied to fresh data.

Classification Tables

Classification tables are similar to prediction success tables except that predicted choices instead of predicted probabilities are added into the table. Predicted choice is the choice with the highest probability. Mathematically, the classification table is a prediction success table with the predicted probabilities changed, setting the highest probability of each case to 1 and the other probabilities to 0.

In the absence of fractional case weighting, each cell of the main table will contain an integer instead of a real number. All other quantities are computed as they would be for the prediction success table. In our judgment, the classification table is not as good a diagnostic tool as the prediction success table. The option is included primarily for the binary logit to provide comparability with results reported in the literature.
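The hardening step is easy to express directly. A Python sketch with a hypothetical probability matrix P (one row per case):

import numpy as np

P = np.array([[0.2, 0.3, 0.5],          # hypothetical predicted probabilities
              [0.6, 0.1, 0.3]])

hard = np.zeros_like(P)
hard[np.arange(len(P)), P.argmax(axis=1)] = 1.0   # 1 for the modal category

# Summing rows of P into cells by actual choice yields the prediction success
# table; summing rows of `hard` the same way yields the classification table.
print(hard)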

Example 7 Conditional Logistic Regression

Data must be organized in a specific way for the conditional logistic model; fortunately, this organization is natural for matched sample case-control studies. First, matched samples must be grouped together; all subjects from a given stratum must be contiguous. It is thus advisable to provide each set with a unique stratum number to facilitate the sorting and tracking of records. Second, the dependent variable gives the relative position of the case within a matched set. Thus, the dependent variable will be an integer between 1 and NCAT, and if the case is first in each stratum, then the dependent variable will be equal to 1 for every record in the data set.
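Each matched set then contributes a single term to the conditional likelihood: the probability that the case’s record attains the largest logit within its set, that is, exp(x’b) for the case divided by the sum of exp(x’b) over all records in the set. A minimal Python sketch for one stratum, using hypothetical covariate values:

import numpy as np

# One matched set: the case's record first, then four controls
# (columns GALL, EST, GALL*EST; the values here are hypothetical).
X = np.array([[1, 1, 1],
              [0, 1, 0],
              [0, 0, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
b = np.array([2.8943, 2.7001, -2.0527])   # coefficients of the size reported below

eta = X @ b
p_case = np.exp(eta[0]) / np.exp(eta).sum()   # this set's likelihood contribution
print(p_case)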

To illustrate how to set up conditional logit models, we use data discussed at length by Breslow and Day (1980) on cases of endometrial cancer in a retirement community near Los Angeles. The data are reproduced in their Appendix III and are identified in SYSTAT as MACK.SYD.


The data set includes the dependent variable CANCER, the exposure variables AGE, GALL (gall bladder disease), HYP (hypertension), OBESE, ESTROGEN, DOSE, DUR (duration of conjugated estrogen exposure), NON (other drugs), some transformations of these variables, and a set identification number. The data are organized by sets, with the case coming first, followed by four controls, and so on, for a total of 63 * (4 + 1) = 315 observations.

To estimate a model of the relative risks of gall bladder disease, estrogen use, and their interaction, you may proceed as follows:

USE MACK
PRINT LONG
LOGIT
MODEL DEPVAR=GALL+EST+GALL*EST ;
ALT=SETSIZE
NCAT=5
ESTIMATE

There are three key points to notice about this sequence of commands. First, the NCAT command is required to let LOGIT know how many subjects there are in a matched set. Unlike the unconditional binary LOGIT, a unit of information in matched samples will typically span more than one line of data, and NCAT will establish the minimum size of each matched set. If each set contains the same number of subjects, the NCAT command completely describes the data organization. If there were a varying number of controls per set, the size of each set would be signaled with the ALT command, as in

ALT = SETSIZE

Here, SETSIZE is a variable containing the total number of subjects (number of controls plus 1) per set. Each set could have its own value.

The second point is that the matched set conditional logit never contains a constant; the constant is eliminated along with all other variables that do not vary among members of a matched set. The third point is the appearance of the semicolon at the end of the model. This is required to distinguish the conditional from the unconditional model.



After you specify the commands, the output produced includes:

Variables in the SYSTAT Rectangular file are:
 CANCER GALL HYP OBESE EST DOS DURATION
 NON REC DEPVAR GROUP OB DOSGRP DUR
 DURGRP CEST SETSIZ

Conditional LOGIT, data organized by matched set.

Categorical values encountered during processing are:
 DEPVAR (1 levels)
  1

Conditional LOGIT Analysis.
Dependent variable: DEPVAR
Number of alternatives: SETSIZ
Input records: 315
Matched sets for analysis: 63

L-L at iteration 1 is -101.3946
L-L at iteration 2 is -79.0552
L-L at iteration 3 is -76.8868
L-L at iteration 4 is -76.7326
L-L at iteration 5 is -76.7306
L-L at iteration 6 is -76.7306

Log Likelihood: -76.7306

 Parameter     Estimate      S.E.   t-ratio   p-value
 1 GALL          2.8943    0.8831    3.2777    0.0010
 2 EST           2.7001    0.6118    4.4137    0.0000
 3 GALL*EST     -2.0527    0.9950   -2.0631    0.0391

                            95.0 % bounds
 Parameter    Odds Ratio      Upper     Lower
 1 GALL         18.0717   102.0127    3.2014
 2 EST          14.8818    49.3621    4.4866
 3 GALL*EST      0.1284     0.9025    0.0183

Log Likelihood of constants only model = LL(0) = 0.0000
McFadden’s Rho-Squared = 4.56944E+15

Covariance Matrix
           1         2         3
 1    0.7798
 2    0.3398    0.3743
 3   -0.7836   -0.3667    0.9900

Correlation Matrix
           1         2         3
 1    1.0000    0.6290   -0.8918
 2    0.6290    1.0000   -0.6024
 3   -0.8918   -0.6024    1.0000


The output begins with a report on the number of SYSTAT records read and the number of matched sets kept for analysis. The remaining output parallels the results produced by the unconditional logit model. The parameters estimated are coefficients of a linear logit, the relative risks are derived by exponentiation, and the interpretation of the model is unchanged. Model selection will proceed as it would in linear regression; you might experiment with logarithmic transformations of the data, explore quadratic and higher-order polynomials in the risk factors, and look for interactions. Examples of such explorations appear in Breslow and Day (1980).

Example 8 Discrete Choice Models

The CHOICE data set contains hypothetical data motivated by McFadden (1979). The CHOICE variable represents which of the three transportation alternatives (AUTO, POOL, TRAIN) each subject prefers. The first subscripted variable in each choice category represents TIME and the second, COST. Finally, SEX$ represents the gender of the chooser, and AGE, the age.

A basic discrete choice model is estimated with:

USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST
ESTIMATE


There are two new features of this program. First, the word TIME is not a SYSTAT variable name; rather, it is a label we chose to remind us of time spent commuting. The group of names in the SET statement consists of valid SYSTAT variables corresponding, in order, to the three modes of transportation. Although there are three variable names in the SET variable, only one attribute is being measured.

Following is the output:

Categorical values encountered during processing are:
 CHOICE (3 levels)
  1, 2, 3
Categorical variables are effects coded with the highest value as reference.

Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records: 29
Records for analysis: 29

Sample split
 Category choices
 1       15
 2        6
 3        8
 Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -31.142
L-L at iteration 3 is -31.141
L-L at iteration 4 is -31.141

Log Likelihood: -31.141

 Parameter   Estimate     S.E.   t-ratio   p-value
 1 TIME        -0.020    0.017    -1.169     0.243
 2 COST        -0.088    0.145    -0.611     0.541

                         95.0 % bounds
 Parameter  Odds Ratio    Upper    Lower
 1 TIME          0.980    1.014    0.947
 2 COST          0.915    1.216    0.689

Log Likelihood of constants only model = LL(0) = -29.645
McFadden’s Rho-Squared = -0.050

Covariance Matrix
          1        2
 1    0.000
 2    0.001    0.021

Correlation Matrix
          1        2
 1    1.000    0.384
 2    0.384    1.000


The output begins with a frequency distribution of the dependent variable and a brief iteration history and prints standard regression results for the parameters estimated.

A key difference between a conditional variable clause and a standard SYSTAT polytomous variable is that each clause corresponds to only one estimated parameter regardless of the value of NCAT, while each free-standing polytomous variable generates NCAT – 1 parameters. The difference is best seen in a model that mixes both types of variables (see Hoffman and Duncan, 1988, or Steinberg, 1987, for further discussion).

Mixed Parameters

The following is an example of mixing polytomous and conditional variables:

USE CHOICE
LOGIT
CATEGORY SEX$
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST+SEX$+AGE
ESTIMATE

The hybrid model generates a single coefficient each for TIME and COST and two sets of parameters for the polytomous variables. The resulting output is:

Categorical values encountered during processing are:
 SEX$ (2 levels)
  Female, Male
 CHOICE (3 levels)
  1, 2, 3

Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records: 29
Records for analysis: 29

Sample split
 Category choices
 1       15
 2        6
 3        8
 Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -28.495
L-L at iteration 3 is -28.477
L-L at iteration 4 is -28.477
L-L at iteration 5 is -28.477

Log Likelihood: -28.477

 Parameter         Estimate     S.E.   t-ratio   p-value
 1 TIME              -0.018    0.020    -0.887     0.375
 2 COST              -0.351    0.217    -1.615     0.106
Choice Group: 1
 3 SEX$_Female        0.328    0.509     0.645     0.519
 4 AGE                0.026    0.014     1.850     0.064
Choice Group: 2
 3 SEX$_Female        0.024    0.598     0.040     0.968
 4 AGE               -0.008    0.016    -0.500     0.617

                               95.0 % bounds
 Parameter        Odds Ratio    Upper    Lower
 1 TIME                0.982    1.022    0.945
 2 COST                0.704    1.078    0.460
Choice Group: 1
 4 AGE                 1.026    1.054    0.998
Choice Group: 2
 4 AGE                 0.992    1.024    0.961

Log Likelihood of constants only model = LL(0) = -29.645
2*[LL(N)-LL(0)] = 2.335 with 4 df Chi-sq p-value = 0.674
McFadden’s Rho-Squared = 0.039

Wald tests on effects across all choices
                 Wald Chi-Sq
 Effect            Statistic   Signif       df
 3 SEX$_Female         0.551    0.759    2.000
 4 AGE                 4.475    0.107    2.000

Covariance Matrix
          1        2        3        4        5        6
 1    0.000
 2    0.001    0.047
 3    0.002    0.009    0.259
 4   -0.000   -0.001    0.002    0.000
 5    0.002   -0.018    0.165    0.002    0.358
 6   -0.000    0.001    0.002    0.000    0.003    0.000

Correlation Matrix
          1        2        3        4        5        6
 1    1.000    0.180    0.150   -0.076    0.146   -0.266
 2    0.180    1.000    0.084   -0.499   -0.140    0.310
 3    0.150    0.084    1.000    0.230    0.543    0.193
 4   -0.076   -0.499    0.230    1.000    0.281    0.265
 5    0.146   -0.140    0.543    0.281    1.000    0.323
 6   -0.266    0.310    0.193    0.265    0.323    1.000


Varying Alternatives

For some discrete choice problems, the number of alternatives available varies across choosers. For example, health researchers studying hospital choice pooled data from several cities in which each city had a different number of hospitals in the choice set (Luft et al., 1988). Transportation research may pool data from locations having train service with locations without trains. Carson, Hanemann, and Steinberg (1990) pool responses from two contingent valuation survey questions having differing numbers of alternatives. To let LOGIT know about this, there are two ways of proceeding. The most flexible is to organize the data by choice. With the standard data layout, use the ALT command, as in

ALT=NCHOICES



where NCHOICES is a SYSTAT variable containing the number of alternatives available to the chooser. If the value of the ALT variable is less than NCAT for an observation, LOGIT will use only the first NCHOICES variables in each conditional variable clause in the analysis.

With the standard data layout, the ALT command is useful only if the choices not available to some cases all appear at the end of the choice list. Organizing data by choice is much more manageable. One final note on varying numbers of alternatives: if the ALT command is used in the standard data layout, the model may not contain a constant or any polytomous variables; the model must be composed only of conditional variable clauses. We will not show an example here because by now you must have figured that we believe the by-choice layout is more suitable if you have data with varying choice alternatives.

Interactions

A common practice in discrete choice models is to enter characteristics of choosers as interactions with attributes of the alternatives in conditional variable clauses. When dealing with large sets of alternatives, such as automobile purchase choices or hospital choices, where the model may contain up to 60 different alternatives, adding polytomous variables can quickly produce unmanageable estimation problems, even for mainframes. In the transportation literature, it has become commonplace to introduce demographic variables as interactions with, or other functions of, the discrete choice variables. Thus, instead of, or in addition to, the COST group of variables, AUTO(2), POOL(2), TRAIN(2), you might see the ratio of cost to income. These ratios would be created with LET transformations and then added in another SET list for use as a conditional variable in the MODEL statement. Interactions can also be introduced this way. By confining demographic variables to appear only as interactions with choice variables, the number of parameters estimated can be kept quite small.

Thus, an investigator might prefer

USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET TIMEAGE=AUTO(1)*AGE,POOL(1)*AGE,TRAIN(1)*AGE
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+TIMEAGE+COST
ESTIMATE


as a way of entering demographics. The advantage to using only conditional clauses is clear when dealing with a large value of NCAT as the number of additional parameters estimated is minimized. The model above yields:

Categorical values encountered during processing are:
 CHOICE (3 levels)
  1, 2, 3

Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records: 29
Records for analysis: 29

Sample split
 Category choices
 1       15
 2        6
 3        8
 Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -28.021
L-L at iteration 3 is -27.866
L-L at iteration 4 is -27.864
L-L at iteration 5 is -27.864

Log Likelihood: -27.864

 Parameter    Estimate     S.E.   t-ratio   p-value
 1 TIME         -0.148    0.062    -2.382     0.017
 2 TIMEAGE       0.003    0.001     2.193     0.028
 3 COST          0.007    0.155     0.043     0.966

                          95.0 % bounds
 Parameter   Odds Ratio    Upper    Lower
 1 TIME           0.863    0.974    0.764
 2 TIMEAGE        1.003    1.006    1.000
 3 COST           1.007    1.365    0.742

Log Likelihood of constants only model = LL(0) = -29.645
2*[LL(N)-LL(0)] = 3.561 with 1 df Chi-sq p-value = 0.059
McFadden’s Rho-Squared = 0.060

Covariance Matrix
          1        2        3
 1    0.004
 2   -0.000    0.000
 3   -0.001    0.000    0.024

Correlation Matrix
          1        2        3
 1    1.000   -0.936   -0.110
 2   -0.936    1.000    0.273
 3   -0.110    0.273    1.000


Constants

The models estimated here deliberately did not include a constant because the constant is treated as a polytomous variable in LOGIT. To obtain an alternative specific constant, enter the following model statement:

USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=CONSTANT+TIME+COST
ESTIMATE

Two CONSTANT parameters would be estimated. For the discrete choice model with the type of data layout of this example, there is no need to specify the NCAT value because LOGIT determines this automatically by the number of variables between the brackets. If the model statement is inconsistent in the number of variables within brackets across conditional variable clauses, an error message will be generated.

Following is the output:

Categorical values encountered during processing are:
 CHOICE (3 levels)
  1, 2, 3

Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records: 29
Records for analysis: 29

Sample split
 Category choices
 1       15
 2        6
 3        8
 Total : 29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -25.808
L-L at iteration 3 is -25.779
L-L at iteration 4 is -25.779
L-L at iteration 5 is -25.779

Log Likelihood: -25.779

 Parameter     Estimate     S.E.   t-ratio   p-value
 1 TIME          -0.012    0.020    -0.575     0.565
 2 COST          -0.567    0.222    -2.550     0.011
 3 CONSTANT       1.510    0.608     2.482     0.013
 3 CONSTANT      -0.865    0.675    -1.282     0.200

                           95.0 % bounds
 Parameter    Odds Ratio    Upper    Lower
 1 TIME            0.988    1.029    0.950
 2 COST            0.567    0.877    0.367

Log Likelihood of constants only model = LL(0) = -29.645
2*[LL(N)-LL(0)] = 7.732 with 2 df Chi-sq p-value = 0.021
McFadden’s Rho-Squared = 0.130

Wald tests on effects across all choices
                Wald Chi-Sq
 Effect           Statistic   Signif       df
 3 CONSTANT           8.630    0.013    2.000

Covariance Matrix
          1        2        3        4
 1    0.000
 2    0.001    0.049
 3   -0.001   -0.082    0.370
 4   -0.005    0.056    0.046    0.455

Correlation Matrix
          1        2        3        4
 1    1.000    0.130   -0.053   -0.350
 2    0.130    1.000   -0.606    0.372
 3   -0.053   -0.606    1.000    0.113
 4   -0.350    0.372    0.113    1.000


Example 9 By-Choice Data Format

In the standard data layout, there is one data record per case that contains information on every alternative open to a chooser. With a large number of alternatives, this can quickly lead to an excessive number of variables. A convenient alternative is to organize data by choice; with this data layout, there is one record per alternative and as many as NCAT records per case. The data set CHOICE2 organizes the CHOICE data of the Discrete Choice Models example in this way. If you analyze the differences between the two data sets, you will see that they are similar to those between the split-plot and multivariate layout for the repeated measures design (see Analysis of Variance). To set up the same problem in a by-choice layout, input the following:

USE CHOICE2
LOGIT
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ;
ESTIMATE

The by-choice format requires that the dependent variable appear with the same value on each record pertaining to the case. An ALT variable (here NCHOICES) indicating the number of records for this case must also appear on each record. The by-choice organization results in fewer variables on the data set, with the savings increasing with the number of alternatives. However, there is some redundancy in that certain data values are repeated on each record. The best reason for using a by-choice format is to handle varying numbers of alternatives per case.



In this situation, there is no need to shuffle data values or to be concerned with choice order.

With the by-choice data format, the NCAT statement is required; it is the only way for LOGIT to know the number of alternatives to expect per case. For varying numbers of alternatives per case, the ALT statement is also required, although we use it here with the same number of alternatives.

In by-choice form, the mixed model of the earlier example is set up as:

USE CHOICE2
LOGIT
CATEGORY SEX$
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ; AGE+SEX$
ESTIMATE

Because the number of alternatives (ALT) is the same for each case in this example, the output is the same as in the “Mixed Parameters” example.

Weighting Choice-Based Samples

For estimation of the slope coefficients of the discrete choice model, weighting is not required even in choice-based samples. For predictive purposes, however, weighting is necessary to forecast aggregate shares, and it is also necessary for consistent estimation of the alternative specific dummies (Manski and Lerman, 1977).

The appropriate weighting procedure for choice-based sample logit estimation requires that the sum of the weights equal the actual number of observations retained in the estimation sample. For choice-based samples, the weight for any observation choosing the jth option is Wj = Sj / sj, where Sj is the population share choosing the jth option and sj is the choice-based sample share choosing the jth option.

As an example, suppose theatergoers make up 10% of the population and we have a choice-based sample consisting of 100 theatergoers (Y = 1) and 100 non-theatergoers (Y = 0). Although theatergoers make up only 10% of the population, they are heavily oversampled and make up 50% of the study sample. Using the above formula, the correct weights would be

W0 = 0.9 / 0.5 = 1.8
W1 = 0.1 / 0.5 = 0.2


and the sum of the weights would be 100 * 1.8 + 100 * 0.2 = 200, as required. To handle such samples, LOGIT permits non-integer weights and does not truncate them to integers.
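The rule is simple enough to state in a few lines of Python:

# Choice-based sampling weights: W(j) = S(j) / s(j)
S = {0: 0.9, 1: 0.1}            # population shares (non-theatergoers, theatergoers)
n = {0: 100, 1: 100}            # sampled counts per choice
total = sum(n.values())

s = {j: n[j] / total for j in n}       # sample shares: 0.5 and 0.5
W = {j: S[j] / s[j] for j in S}        # weights: 1.8 and 0.2
print(W, sum(n[j] * W[j] for j in n))  # the weights sum to 200, as required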

Example 10 Stepwise Regression

LOGIT offers forward and backward stepwise logistic regression with single stepping as an option. The simplest way to initiate stepwise regression is to substitute START for ESTIMATE following a MODEL statement and then proceed with stepping with the STEP command, just as in GLM or Regression.

An upward step consists of three components. First, the current model is estimated to convergence. The procedure is exactly the same as regular estimation. Second, score statistics for each additional effect are computed, adjusted for variables already in the model. The joint significance of all additional effects together is also computed. Finally, the effect with the smallest significance level for its score statistic is identified. If this significance level is below the ENTER criterion (0.05 by default), the effect is added to the model.

A downward step also consists of three computational segments. First, the model is estimated to convergence. Then Wald statistics are computed for each effect in the model. Finally, the effect with the largest p value for its Wald test statistic is identified. If this significance level is above the REMOVE criterion (by default 0.10), the effect is removed from the model.

If you require certain effects to remain in the model regardless of the outcome of the Wald test, force them into the model by listing them first on the MODEL statement and using the FORCE option of START. It is important to set the ENTER and REMOVE criteria carefully because it is possible to have a variable cycle in and out of a model repeatedly. The defaults are:

START / ENTER=.05, REMOVE=.10

although Hosmer and Lemeshow use

START / ENTER=.15, REMOVE=.20

in the example we reproduce below.


Hosmer and Lemeshow use stepwise regression in their search for a model of low birth weight discussed in the “Binary Logit” section. We conduct a similar analysis with:

USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+PTL+LWT+HT+RACE+SMOKE+UI+AGE+FTV
START / ENTER=.15,REMOVE=.20
STEP / AUTO

Following is the output:

Variables in the SYSTAT Rectangular file are:
 ID LOW AGE LWT RACE SMOKE PTL HT
 UI FTV BWT RACE1 CASEID PTD LWD

Stepping parameters:
 Significance to include = 0.150
 Significance to remove = 0.200
 Number of effects to force = 1
 Maximum number of steps = 10
 Direction : Up and Down

Categorical values encountered during processing are:
 RACE (3 levels)
  1, 2, 3
 LOW (2 levels)
  0, 1
Categorical variables are effects coded with the highest value as reference.

Binary Stepwise LOGIT Analysis.
Dependent variable: LOW
Input records: 189
Records for analysis: 189

Sample split
 Category choices
 REF       59
 RESP     130
 Total :  189

Step 0
Log Likelihood: -117.336
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT     -0.790    0.157    -5.033     0.000

Score tests on effects not in model
               Score Chi-Sq
 Effect           Statistic   Signif       df
 2 PTL                7.267    0.007    1.000
 3 LWT                5.438    0.020    1.000
 4 HT                 4.388    0.036    1.000
 5 RACE               5.005    0.082    2.000
 6 SMOKE              4.924    0.026    1.000
 7 UI                 5.401    0.020    1.000
 8 AGE                2.674    0.102    1.000
 9 FTV                0.749    0.387    1.000
 Joint Score         30.959    0.000    9.000

Step 1
Log Likelihood: -113.946
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT     -0.964    0.175    -5.511     0.000
 2 PTL           0.802    0.317     2.528     0.011


Score tests on effects not in model
               Score Chi-Sq
 Effect           Statistic   Signif       df
 3 LWT                4.113    0.043    1.000
 4 HT                 4.722    0.030    1.000
 5 RACE               5.359    0.069    2.000
 6 SMOKE              3.164    0.075    1.000
 7 UI                 3.161    0.075    1.000
 8 AGE                3.478    0.062    1.000
 9 FTV                0.577    0.448    1.000
 Joint Score         24.772    0.002    8.000

Step 2
Log Likelihood: -111.792
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT     -1.062    0.184    -5.764     0.000
 2 PTL           0.823    0.318     2.585     0.010
 3 HT            1.272    0.616     2.066     0.039

Score tests on effects not in model
               Score Chi-Sq
 Effect           Statistic   Signif       df
 4 LWT                6.900    0.009    1.000
 5 RACE               4.882    0.087    2.000
 6 SMOKE              3.117    0.078    1.000
 7 UI                 4.225    0.040    1.000
 8 AGE                3.448    0.063    1.000
 9 FTV                0.370    0.543    1.000
 Joint Score         20.658    0.004    7.000

Step 3
Log Likelihood: -107.982
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT      1.093    0.841     1.299     0.194
 2 PTL           0.726    0.328     2.213     0.027
 3 HT            1.856    0.705     2.633     0.008
 4 LWT          -0.017    0.007    -2.560     0.010

Score tests on effects not in model
               Score Chi-Sq
 Effect           Statistic   Signif       df
 5 RACE               5.266    0.072    2.000
 6 SMOKE              2.857    0.091    1.000
 7 UI                 3.081    0.079    1.000
 8 AGE                1.895    0.169    1.000
 9 FTV                0.118    0.732    1.000
 Joint Score         14.395    0.026    6.000

Step 4
Log Likelihood: -105.425
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT      1.405    0.900     1.560     0.119
 2 PTL           0.746    0.328     2.278     0.023
 3 HT            1.805    0.714     2.530     0.011
 4 LWT          -0.018    0.007    -2.607     0.009
 5 RACE_1       -0.518    0.237    -2.190     0.029
 6 RACE_2        0.569    0.318     1.787     0.074

Score tests on effects not in model
               Score Chi-Sq
 Effect           Statistic   Signif       df
 6 SMOKE              5.936    0.015    1.000
 7 UI                 3.265    0.071    1.000
 8 AGE                1.019    0.313    1.000
 9 FTV                0.025    0.873    1.000
 Joint Score          9.505    0.050    4.000


Step 5
Log Likelihood: -102.449
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT      0.851    0.913     0.933     0.351
 2 PTL           0.602    0.335     1.797     0.072
 3 HT            1.745    0.695     2.511     0.012
 4 LWT          -0.017    0.007    -2.418     0.016
 5 RACE_1       -0.734    0.263    -2.790     0.005
 6 RACE_2        0.557    0.324     1.720     0.085
 7 SMOKE         0.946    0.395     2.396     0.017

Score tests on effects not in model
               Score Chi-Sq
 Effect           Statistic   Signif       df
 7 UI                 3.034    0.082    1.000
 8 AGE                0.781    0.377    1.000
 9 FTV                0.014    0.904    1.000
 Joint Score          3.711    0.294    3.000

Step 6
Log Likelihood: -100.993
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT      0.654    0.921     0.710     0.477
 2 PTL           0.503    0.341     1.475     0.140
 3 HT            1.855    0.695     2.669     0.008
 4 LWT          -0.016    0.007    -2.320     0.020
 5 RACE_1       -0.741    0.265    -2.797     0.005
 6 RACE_2        0.585    0.323     1.811     0.070
 7 SMOKE         0.939    0.399     2.354     0.019
 8 UI            0.786    0.456     1.721     0.085

Score tests on effects not in model
               Score Chi-Sq
 Effect           Statistic   Signif       df
 8 AGE                0.553    0.457    1.000
 9 FTV                0.056    0.813    1.000
 Joint Score          0.696    0.706    2.000

Log Likelihood: -100.993
 Parameter    Estimate     S.E.   t-ratio   p-value
 1 CONSTANT      0.654    0.921     0.710     0.477
 2 PTL           0.503    0.341     1.475     0.140
 3 HT            1.855    0.695     2.669     0.008
 4 LWT          -0.016    0.007    -2.320     0.020
 5 RACE_1       -0.741    0.265    -2.797     0.005
 6 RACE_2        0.585    0.323     1.811     0.070
 7 SMOKE         0.939    0.399     2.354     0.019
 8 UI            0.786    0.456     1.721     0.085

                          95.0 % bounds
 Parameter   Odds Ratio    Upper    Lower
 2 PTL            1.654    3.229    0.847
 3 HT             6.392   24.964    1.637
 4 LWT            0.984    0.998    0.971
 5 RACE_1         0.477    0.801    0.284
 6 RACE_2         1.795    3.379    0.953
 7 SMOKE          2.557    5.586    1.170
 8 UI             2.194    5.367    0.897

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 32.686 with 7 df Chi-sq p-value = 0.000
McFadden’s Rho-Squared = 0.139

Not all logistic regression programs compute the variable addition statistics in the same way, so minor differences in output are possible. Our results listed in the Chi-Square Significance column of the first step, for example, correspond to H&L’s first row in their Table 4.15; the two sets of results are very similar but not identical. While our method yields the same final model as H&L, the order in which variables are entered is not the same because intermediate p values differ slightly. Once a final model is arrived at, it is re-estimated to give true maximum likelihood estimates.


Example 11 Hypothesis Testing

Two types of hypothesis tests are easily conducted in LOGIT: the likelihood ratio (LR) test and the Wald test. The tests are discussed in numerous statistics books, sometimes under varying names. Accounts can be found in Maddala’s text (1988), Cox and Hinkley (1974), Rao (1973), Engel (1984), and Breslow and Day (1980). Here we provide some elementary examples.

Likelihood-Ratio Test

The likelihood-ratio test is conducted by fitting two nested models (the restricted and the unrestricted) and comparing the log-likelihoods at convergence. Typically, the unrestricted model contains a proposed set of variables, and the restricted model omits a selected subset, although other restrictions are possible. The test statistic is twice the difference of the log-likelihoods and is chi-squared with degrees of freedom equal to the number of restrictions imposed. When the restrictions consist of excluding variables, the degrees of freedom are equal to the number of parameters set to 0.

If a model contains a constant, LOGIT automatically calculates a likelihood-ratio test of the null hypothesis that all coefficients except the constant are 0. It appears on a line that looks like:

2*[LL(N)-LL(0)] = 26.586 with 5 df, Chi-sq p-value = 0.00007

This example line states that twice the difference between the likelihood of the estimated model and the “constants only” model is 26.586, which is a chi-squared deviate on five degrees of freedom. The p value indicates that the null hypothesis would be rejected.

To illustrate use of the LR test, consider a model estimated on the low birth weight data (see the “Binary Logit” example). Assuming CATEGORY=RACE, compare the following model

MODEL LOW = CONSTANT + LWD + AGE + RACE + PTD


with

MODEL LOW = CONSTANT + LWD + AGE

The null hypothesis is that the categorical variable RACE, which contributes two parameters to the model, and PTD are jointly 0. The model log-likelihoods are –104.043 and –112.143, and twice the difference (16.20) is chi-squared with three degrees of freedom under the null hypothesis. This value can also be calculated more conveniently by taking the difference of the LR test statistics reported below the parameter estimates and the difference in the degrees of freedom. The unrestricted model above has G = 26.587 with five degrees of freedom, and the restricted model has G = 10.385 with two degrees of freedom. The difference between the G values is 16.20, and the difference between degrees of freedom is 3.

Although LOGIT will not automatically calculate LR statistics across separate models, the p value of the result can be obtained with the command:

CALC 1-XCF(16.2,3)
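The same p value is easy to verify in Python (a check on the CALC result using scipy’s chi-squared survival function):

from scipy.stats import chi2

lr = 2 * (-104.043 - (-112.143))         # twice the log-likelihood difference
print(round(lr, 2), chi2.sf(lr, df=3))   # 16.2 and a p value near 0.001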

Wald Test

The Wald test is the best known inferential procedure in applied statistics. To conduct a Wald test, we first estimate a model and then pose a linear constraint on the parameters estimated. The statistic is based on the constraint and the appropriate elements of the covariance matrix of the parameter vector. A test of whether a single parameter is 0 is conducted as a Wald test by dividing the squared coefficient by its variance and referring the result to a chi-squared distribution on one degree of freedom. Thus, each t ratio is itself the square root of a simple Wald test. Following is an example:

USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+LWD+AGE+RACE+PTD
ESTIMATE
HYPOTHESIS
CONSTRAIN PTD=0
CONSTRAIN RACE[1]=0
CONSTRAIN RACE[2]=0
TEST

G 26.587=G 10.385=


Following is the output (minus the estimation stage):

Entering hypothesis procedure.

Linear Restriction System

               Parameter
 EQN         1         2         3         4         5
   1       0.0       0.0       0.0       0.0       0.0
   2       0.0       0.0       0.0     1.000       0.0
   3       0.0       0.0       0.0       0.0     1.000

 EQN         6       RHS         Q
   1     1.000       0.0     1.515
   2       0.0       0.0    -0.442
   3       0.0       0.0     0.464

General linear Wald test results
 ChiSq Statistic    =  15.104
 ChiSq p-value      =   0.002
 Degrees of freedom =       3

Note that this statistic of 15.104 is close to the LR statistic of 16.2 obtained for the same hypothesis in the previous section. Although there are three separate CONSTRAIN lines in the HYPOTHESIS paragraph above, they are tested jointly in a single test. To test each restriction individually, place a TEST after each CONSTRAIN. The restrictions being tested are entered with separate CONSTRAIN commands. These can include any linear algebraic expression without parentheses involving the parameters. If interactions were present on the MODEL statement, they can also appear on the CONSTRAIN statement. To reference dummies generated from categorical covariates, use square brackets, as in the example for RACE. This constraint refers to the coefficient labeled RACE_1 in the output.

More elaborate tests can be posed in this framework. For example,

CONSTRAIN 7*LWD - 4.3*AGE + 1.5*RACE[2] = -5

or

CONSTRAIN AGE + LWD = 1

For multinomial models, the architecture is a little different. To reference a variable that appears in more than one parameter vector, it is followed with curly braces around the number corresponding to the Choice Group. For example,

CONSTRAIN CONSTANT{1} - CONSTANT{2} = 0
CONSTRAIN AGE{1} - AGE{2} = 0
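The joint Wald statistic reported above has the standard general linear form W = (Rb - r)' [R V R']^(-1) (Rb - r), where b is the estimated parameter vector, V its covariance matrix, and R and r define the linear constraints. A minimal Python sketch of this computation (illustrative only; the function and argument names are ours, and you would supply R, r, b, and V from your own estimation output):

import numpy as np
from scipy.stats import chi2

def wald_test(R, r, b, V):
    # W = (Rb - r)' [R V R']^(-1) (Rb - r), chi-squared on q constraints
    diff = R @ b - r
    W = float(diff @ np.linalg.solve(R @ V @ R.T, diff))
    return W, chi2.sf(W, R.shape[0])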


Comparisons between Tests

The Wald and likelihood-ratio tests are classical testing methods in statistics. The properties of the tests are based on asymptotic theory, and in the limit, as sample sizes tend to infinity, the tests give identical results. In small samples, there will be differences between results and conclusions, as has been emphasized by Hauck and Donner (1977). Given a choice, which test should be used?

Most statisticians favor the LR test over the Wald for three reasons. First, the likelihood is the fundamental measure on which model fitting is based. Cox and Oakes (1984) illustrate this preference when they use the likelihood profile to determine confidence intervals for a parameter in a survival model. Second, Monte Carlo studies suggest that the LR test is more reliable in small samples. Finally, a nonlinear constraint can be imposed on the parameter estimates and simply tested by estimating restricted and unrestricted models. See the “Quantiles” example for an illustration involving LD50 values. Also, you can use the FUNPAR option in NONLIN to do the same thing.

Why bother with the Wald test, then? One reason is simplicity and computational cost. The LR test requires estimation of two models to final convergence for a single test, and each additional test requires another full estimation. By contrast, any number of Wald tests can be run on the basis of one estimated model, and they do not require an additional pass through the data.

Example 12 Quasi-Maximum Likelihood

When a model to be estimated by maximum likelihood is misspecified, White (1982) has shown that the standard methods for obtaining the variance-covariance matrix are incorrect. In particular, standard errors derived from the inverse matrix of second derivatives and all hypothesis tests based on this matrix are unreliable. Since misspecification may be the rule rather than the exception, is there any safe way to proceed with inference? White offers an alternative variance-covariance matrix that simplifies (asymptotically) to the inverse Hessian when the model is not misspecified and is correct when the model is misspecified. Calling the procedure of estimating a misspecified model quasi-maximum likelihood estimation (QMLE), the proper QML matrix is defined as

$$Q = H^{-1} G H^{-1}$$

where $H^{-1}$ is the covariance matrix at convergence and $G$ is the cumulated outer product of the gradient vectors.
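A minimal sketch of this sandwich calculation in Python (illustrative only, not SYSTAT's internal code; `hessian` and `scores` are assumed to hold the second-derivative matrix and the per-case gradient vectors at convergence):

import numpy as np

def qml_covariance(hessian, scores):
    # White's estimator Q = H^(-1) G H^(-1), with G the cumulated
    # outer product of the per-case gradient (score) vectors
    Hinv = np.linalg.inv(hessian)
    G = scores.T @ scores            # sum over cases of g_i g_i'
    return Hinv @ G @ Hinv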


White shows that for a misspecified model, the LR test is not asymptotically chi-squared, and the Wald and likelihood-ratio tests are not asymptotically equivalent, even when the QML matrix is used for Wald tests.

The best course of action appears to be to use only the QML version of the Wald test when misspecification is a serious possibility. If the QML covariance matrix is requested with the ESTIMATE command, a second set of parameter statistics will be printed, reflecting the new standard errors, t ratios, and p values; the coefficients are unchanged. The QML covariance matrix will replace the standard covariance matrix during subsequent hypothesis testing with the HYPOTHESIS command. Following is an example:

USE NLS
LOGIT
MODEL CULTURE = CONSTANT + IQ
ESTIMATE / QML

Following is the output:

Categorical values encountered during processing are:
CULTURE (3 levels)
 1, 2, 3

Multinomial LOGIT Analysis.

 Dependent variable: CULTURE
 Input records:            200
 Records for analysis:     200

Sample split

 Category   choices
    1            12
    2            49
    3           139
 Total :        200

 L-L at iteration 1 is   -219.722
 L-L at iteration 2 is   -148.554
 L-L at iteration 3 is   -144.158
 L-L at iteration 4 is   -143.799
 L-L at iteration 5 is   -143.793
 L-L at iteration 6 is   -143.793

Log Likelihood: -143.793

 Parameter        Estimate      S.E.   t-ratio   p-value
 Choice Group: 1
 1 CONSTANT          4.252     2.107     2.018     0.044
 2 IQ               -0.065     0.021    -3.052     0.002
 Choice Group: 2
 1 CONSTANT          3.287     1.275     2.579     0.010
 2 IQ               -0.041     0.012    -3.372     0.001

                               95.0 % bounds
 Parameter      Odds Ratio      Upper      Lower
 Choice Group: 1
 2 IQ                0.937      0.977      0.898
 Choice Group: 2
 2 IQ                0.960      0.983      0.937

Log Likelihood of constants only model = LL(0) = -153.254
2*[LL(N)-LL(0)] = 18.921 with 2 df Chi-sq p-value = 0.000
McFadden's Rho-Squared = 0.062

Covariance matrix QML adjusted.


Log Likelihood: -143.793

 Parameter        Estimate      S.E.   t-ratio   p-value
 Choice Group: 1
 1 CONSTANT          4.252     2.252     1.888     0.059
 2 IQ               -0.065     0.023    -2.860     0.004
 Choice Group: 2
 1 CONSTANT          3.287     1.188     2.767     0.006
 2 IQ               -0.041     0.011    -3.682     0.000

                               95.0 % bounds
 Parameter      Odds Ratio      Upper      Lower
 Choice Group: 1
 2 IQ                0.937      0.980      0.896
 Choice Group: 2
 2 IQ                0.960      0.981      0.939

Log Likelihood of constants only model = LL(0) = -153.254
2*[LL(N)-LL(0)] = 18.921 with 2 df Chi-sq p-value = 0.000
McFadden's Rho-Squared = 0.062

Note the changes in the standard errors, t ratios, p values, odds ratio bounds, Wald test p values, and covariance matrix.

Computation

All calculations are in double precision.

Algorithms

LOGIT uses Gauss-Newton methods for maximizing the likelihood. By default, two tolerance criteria must be satisfied: the maximum value for relative coefficient changes must fall below 0.001, and the Euclidean norm of the relative parameter change vector must also fall below 0.001. By default, LOGIT uses the second derivative matrix to update the parameter vector. In discrete choice models, it may be preferable to use a first derivative approximation to the Hessian instead. This option, popularized by Berndt, Hall, Hall, and Hausman (1974), will be noted by the program if it is used. BHHH uses the summed outer products of the gradient vector in place of the Hessian matrix and generally will converge much more slowly than the default method.
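The two updates can be contrasted in a short sketch (Python, illustrative only; `grads` is assumed to be an n-by-p array of per-case gradients of the log-likelihood and `hessian` its matrix of second derivatives):

import numpy as np

def newton_step(beta, hessian, grads):
    # Default method: update the parameter vector with the Hessian
    g = grads.sum(axis=0)                     # total gradient
    return beta - np.linalg.solve(hessian, g)

def bhhh_step(beta, grads):
    # BHHH: summed outer products of the case gradients replace the Hessian
    g = grads.sum(axis=0)
    G = grads.T @ grads                       # approximates minus the Hessian
    return beta + np.linalg.solve(G, g)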

Missing Data

Cases with missing data on any variables included in a model are deleted.



Basic Formulas

For the binary logistic regression model, the dependent variable for the ith case is $Y_i$, taking on values of 0 (nonresponse) and 1 (response), and the probability of response is a function of the covariate vector $x_i$ and the unknown coefficient vector $\beta$. We write this probability as:

$$\mathrm{Prob}(Y_i = 1 \mid x_i) = \frac{e^{x_i\beta}}{1 + e^{x_i\beta}}$$

and abbreviate it as $P_i$. The log-likelihood for the sample is given by

$$LL(\beta) = \sum_{i=1}^{n} \left[\, Y_i \log P_i + (1 - Y_i)\log(1 - P_i) \,\right]$$

For the polytomous multinomial logit, the integer-valued dependent variable $Y$ ranges from 1 to $k$, and the probability that the ith case has $Y = m$, where $1 \le m \le k$, is:

$$\mathrm{Prob}(Y_i = m \mid x_i) = \frac{e^{x_i\beta_m}}{\sum_{j=1}^{k} e^{x_i\beta_j}}$$

In this model, $k$ is fixed for all cases, there is a single covariate vector $x_i$, and $k$ parameter vectors $\beta_j$ are estimated. This last equation is identified by normalizing $\beta_k$ to 0.

McFadden's discrete choice model represents a distinct variant of the logit model based on Luce's (1959) probabilistic choice model. Each subject is observed to make a choice from a set $C_i$ consisting of $J_i$ elements. Each element is characterized by a separate covariate vector of attributes $Z_k$. The dependent variable $Y_i$ ranges from 1 to $J_i$, with $J_i$ possibly varying across subjects, and the probability that $Y_i = k$, where $1 \le k \le J_i$, is a function of the attribute vectors $Z_1, Z_2, \ldots, Z_{J_i}$ and the parameter vector $\beta$. The probability that the ith subject chooses element $m$ from his choice set is:

$$\mathrm{Prob}(Y_i = m \mid Z) = \frac{e^{Z_m\beta}}{\sum_{j \in C_i} e^{Z_j\beta}}$$


Heuristically, this equation differs from the previous one in the components that vary with alternative outcomes of the dependent variable. In the polytomous logit, the coefficients are alternative-specific and the covariate vector is constant; in the discrete choice model, while the attribute vector is alternative-specific, the coefficients are constant. The models also differ in that the range of the dependent variable can be case-specific in the discrete choice model, while it is constant for all cases in the polytomous model.

The polytomous logit can be recast as a discrete choice model in which each covariate x is entered as an interaction with an alternative-specific dummy, and the number of alternatives is constant for all cases. This reparameterization is used for the mixed polytomous discrete choice model.
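As a numerical companion to the formulas above, here is a minimal Python sketch (illustrative only; the arrays are placeholders for your own data):

import numpy as np

def binary_logit_loglik(beta, X, y):
    # P_i = exp(x_i b) / (1 + exp(x_i b)); LL = sum y log P + (1 - y) log(1 - P)
    P = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return float(np.sum(y * np.log(P) + (1 - y) * np.log(1 - P)))

def polytomous_probs(x, betas):
    # Prob(Y = m | x) = exp(x b_m) / sum_j exp(x b_j), with b_k normalized to 0
    xb = np.array([x @ b for b in betas])
    e = np.exp(xb - xb.max())        # subtract the max for numerical stability
    return e / e.sum()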

Regression Diagnostics Formulas

The SAVE command issued before the deciles of risk command (DC) produces a SYSTAT save file with a number of diagnostic quantities computed for each case in the input data set. Computations are always conducted on the assumption that each covariate pattern is unique. The following formulas are based on the binary dependent variable $y_j$, which is either 0 or 1, and fitted probabilities $p_j$, obtained from the basic logistic equation.

LEVERAGE(1) is the diagonal element of Pregibon's (1981) hat matrix, with formulas given by Hosmer and Lemeshow (1989) as their equations (5.7) and (5.8). It is defined as $h_j = v_j b_j$, where

$$b_j = x_j' (X'VX)^{-1} x_j$$

and $x_j$ is the covariate vector for the jth case, $X$ is the data matrix for the sample including a constant, and $V$ is a diagonal matrix with general element $v_j = p_j(1 - p_j)$, the fitted probability times one minus the fitted probability for the jth case. $b_j$ is our LEVERAGE(2).

Thus, LEVERAGE(1) is given by

$$h_j = v_j b_j$$


The PEARSON residual is

$$r_j = \frac{y_j - p_j}{\sqrt{p_j(1 - p_j)}}$$

The VARIANCE of the residual is

$$v_j(1 - h_j)$$

and the standardized residual STANDARD is

$$rs_j = \frac{r_j}{\sqrt{1 - h_j}}$$

The DEVIANCE residual is defined as

$$d_j = \sqrt{-2\ln(p_j)}$$

for $y_j = 1$ and

$$d_j = -\sqrt{-2\ln(1 - p_j)}$$

otherwise.

DELDSTAT is the change in deviance and is

$$\Delta D_j = d_j^2 / (1 - h_j)$$

DELPSTAT is the change in Pearson chi-square:

$$\Delta\chi_j^2 = rs_j^2$$

The final three saved quantities are measures of the overall change in the estimated parameter vector $\beta$:

$$\mathrm{DELBETA}(1) = rs_j^2\, h_j / (1 - h_j)$$


is a measure proposed by Pregibon, and

$$\mathrm{DELBETA}(2) = rs_j^2\, h_j / (1 - h_j)$$

$$\mathrm{DELBETA}(3) = rs_j^2\, h_j / (1 - h_j)^2$$
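The saved quantities above can be reproduced with a few lines of matrix algebra. A minimal Python sketch (illustrative only; it assumes, as the formulas do, that each covariate pattern is unique, and that X, y, and the fitted probabilities p come from your own fit):

import numpy as np

def logit_diagnostics(X, y, p):
    v = p * (1 - p)                                   # v_j = p_j (1 - p_j)
    XtVX_inv = np.linalg.inv(X.T @ (v[:, None] * X))
    b = np.einsum('ij,jk,ik->i', X, XtVX_inv, X)      # LEVERAGE(2)
    h = v * b                                         # LEVERAGE(1), hat diagonal
    r = (y - p) / np.sqrt(v)                          # PEARSON residual
    rs = r / np.sqrt(1 - h)                           # STANDARD residual
    d = np.where(y == 1,                              # DEVIANCE residual
                 np.sqrt(-2 * np.log(p)),
                 -np.sqrt(-2 * np.log(1 - p)))
    return {'deldstat': d**2 / (1 - h),               # change in deviance
            'delpstat': rs**2,                        # change in Pearson chi-square
            'delbeta1': rs**2 * h / (1 - h)}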

References

Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons, Inc.
Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71, 1-10.
Amemiya, T. (1981). Qualitative response models: A survey. Journal of Economic Literature, 1483-1536.
Begg, C. B. and Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71, 11-18.
Beggs, S., Cardell, N. S., and Hausman, J. A. (1981). Assessing the potential demand for electric cars. Journal of Econometrics, 16, 1-19.
Ben-Akiva, M. and Lerman, S. (1985). Discrete choice analysis. Cambridge, Mass.: MIT Press.
Berndt, E. K., Hall, B. K., Hall, R. E., and Hausman, J. A. (1974). Estimation and inference in non-linear structural models. Annals of Economic and Social Measurement, 3, 653-665.
Breslow, N. (1982). Covariance adjustment of relative-risk estimates in matched studies. Biometrics, 38, 661-672.
Breslow, N. and Day, N. E. (1980). Statistical methods in cancer research, vol. II: The design and analysis of cohort studies. Lyon: IARC.
Breslow, N., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation of multiple relative risk functions in matched case-control studies. American Journal of Epidemiology, 108, 299-307.
Carson, R., Hanemann, M., and Steinberg, S. (1990). A discrete choice contingent valuation estimate of the value of Kenai king salmon. Journal of Behavioral Economics, 19, 53-68.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic Studies, 47, 225-238.
Cook, D. R. and Weisberg, S. (1984). Residuals and influence in regression. New York: Chapman and Hall.
Coslett, S. R. (1980). Efficient estimation of discrete choice models. In C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge, Mass.: MIT Press.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269-276.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.
Cox, D. R. and Oakes, D. (1984). Analysis of survival data. New York: Chapman and Hall.
Domencich, T. and McFadden, D. (1975). Urban travel demand: A behavioral analysis. Amsterdam: North-Holland.
Engel, R. F. (1984). Wald, likelihood ratio and Lagrange multiplier tests in econometrics. In Z. Griliches and M. Intrilligator, eds., Handbook of Econometrics. New York: North-Holland.
Finney, D. J. (1978). Statistical method in biological assay. London: Charles Griffin.
Hauck, W. W. (1980). A note on confidence bands for the logistic response curve. American Statistician, 37, 158-160.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72, 851-853.
Hensher, D. and Johnson, L. W. (1981). Applied discrete choice modelling. London: Croom Helm.
Hoffman, S. and Duncan, G. (1988). Multinomial and conditional logit discrete choice models in demography. Demography, 25, 415-428.
Hosmer, D. W. and Lemeshow, S. (1989). Applied logistic regression. New York: John Wiley & Sons, Inc.
Hubert, J. J. (1984). Bioassay, 2nd ed. Dubuque, Iowa: Kendall-Hunt.
Kalbfleisch, J. and Prentice, R. (1980). The statistical analysis of failure time data. New York: John Wiley & Sons, Inc.
Kleinbaum, D., Kupper, L., and Chambliss, L. (1982). Logistic regression analysis of epidemiologic data: Theory and practice. Communications in Statistics: Theory and Methods, 11, 485-547.
Luce, D. R. (1959). Individual choice behavior: A theoretical analysis. New York: John Wiley & Sons, Inc.
Luft, H., Garnick, D., Peltzman, D., Phibbs, C., Lichtenberg, E., and McPhee, S. (1988). The sensitivity of conditional choice models for hospital care to estimation technique. Draft, Institute for Health Policy Studies. San Francisco: University of California.
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press.
Maddala, G. S. (1988). Introduction to econometrics. New York: Macmillan.
Manski, C. and Lerman, S. (1977). The estimation of choice probabilities from choice based samples. Econometrica, 8, 1977-1988.
Manski, C. and McFadden, D. (1980). Alternative estimators and sample designs for discrete choice analysis. In C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge, Mass.: MIT Press.
Manski, C. and McFadden, D., eds. (1981). Structural analysis of discrete data with econometric applications. Cambridge, Mass.: MIT Press.
McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior. In P. Zarembka, ed., Frontiers in Econometrics. New York: Academic Press.
McFadden, D. (1976). Quantal choice analysis: A survey. Annals of Economic and Social Measurement, 5, 363-390.
McFadden, D. (1979). Quantitative methods for analyzing travel behavior of individuals: Some recent developments. In D. A. Hensher and P. R. Stopher, eds., Behavioral Travel Modelling. London: Croom Helm.
McFadden, D. (1982). Qualitative response models. In W. Hildebrand, ed., Advances in Econometrics. Cambridge: Cambridge University Press.
McFadden, D. (1984). Econometric analysis of qualitative response models. In Z. Griliches and M. D. Intrilligator, eds., Handbook of Econometrics, Volume III. Elsevier Science Publishers BV.
Nerlove, M. and Press, S. J. (1973). Univariate and multivariate loglinear and logistic models. Rand Report No. R-1306EDA/NIH.
Peduzzi, P. N., Holford, T. R., and Hardy, R. J. (1980). A stepwise variable selection procedure for nonlinear regression models. Biometrics, 36, 511-516.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705-724.
Prentice, R. and Breslow, N. (1978). Retrospective studies and failure time models. Biometrika, 65, 153-158.
Prentice, R. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66, 403-412.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John Wiley & Sons, Inc.
Santner, T. J. and Duffy, D. E. (1989). The statistical analysis of discrete data. New York: Springer-Verlag.
Steinberg, D. (1987). Interpretation and diagnostics of the multinomial and binary logistic regression using PROC MLOGIT. SAS Users Group International, Proceedings of the Twelfth Annual Conference, 1071-1073. Cary, N.C.: SAS Institute Inc.
Steinberg, D. (1991). The common structure of discrete choice and conditional logistic regression models. Unpublished paper. Department of Economics, San Diego State University.
Steinberg, D. and Cardell, N. S. (1987). Logistic regression on pooled choice based samples and samples missing the dependent variable. Proceedings of the Social Statistics Section. Alexandria, Va.: American Statistical Association, 158-160.
Train, K. (1986). Qualitative choice analysis. Cambridge, Mass.: MIT Press.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.
Williams, D. A. (1986). Interval estimation of the median lethal dose. Biometrics, 42, 641-645.
Wrigley, N. (1985). Categorical data analysis for geographers and environmental scientists. New York: Longman.


Chapter 18

Loglinear Models

Laszlo Engelman

Loglinear models are useful for analyzing relationships among the factors of a multiway frequency table. The loglinear procedure computes maximum likelihood estimates of the parameters of a loglinear model by using the Newton-Raphson method. For each user-specified model, a test of fit of the model is provided, along with observed and expected cell frequencies, estimates of the loglinear parameters (lambdas), standard errors of the estimates, the ratio of each lambda to its standard error, and multiplicative effects (EXP(λ)).

For each cell, you can request its contribution to the Pearson chi-square or the likelihood-ratio chi-square. Deviates, standardized deviates, Freeman-Tukey deviates, and likelihood-ratio deviates are available to characterize departures of the observed values from expected values.

When searching for the best model, you can request tests after removing each first-order effect or interaction term one at a time individually or hierarchically (when a lower-order effect is removed, so are its respective interaction terms). The models do not need to be hierarchical.

A model can explain the frequencies well in most cells, but poorly in a few. LOGLIN uses Freeman-Tukey deviates to identify the most divergent cell, fit a model without it, and continue in a stepwise manner identifying other outlier cells that depart from your model.

You can specify cells that contain structural zeros (cells that are empty naturally or by design, not by sampling), and fit a model to the subset of cells that remain. A test of fit for such a model is often called a test of quasi-independence.


Statistical Background

Researchers fit loglinear models to the cell frequencies of a multiway table in order to describe relationships among the categorical variables that form the table. A loglinear model expresses the logarithm of the expected cell frequency as a linear function of certain parameters in a manner similar to that of analysis of variance.

To introduce loglinear models, recall how to calculate expected values for the Pearson chi-square statistic. The expected value for the cell in row i and column j is:

$$\frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{total table count}}$$

Let's ignore the denominator, because it's the same for every cell. Write:

$$R_i C_j$$

(Part of each expected value comes from the row it's in and part from the column it's in.) Now take the log:

$$\ln(R_i C_j) = \ln R_i + \ln C_j$$

and let:

$$\ln R_i = A_i \qquad \ln C_j = B_j$$

and write:

$$A_i + B_j$$

This expected value is computed under the null hypothesis of independence (that is, there is no interaction between the table factors). If this hypothesis is rejected, you would need more information than $A_i$ and $B_j$. In fact, the usual chi-square test can be expressed as a test that the interaction term is needed in a model that estimates the log of the cell frequencies. We write this model as:

$$\ln F_{ij} = \theta + A_i + B_j + AB_{ij}$$

or more commonly as:

$$\ln F_{ij} = \theta + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$$

where $\theta$ is an overall mean effect and the $\lambda$ parameters sum to zero over the levels of the row factors and the column factors. For a particular cell in a three-way table (a cell in the ith row, jth column, and kth level of the third factor) we write:

$$\ln F_{ijk} = \theta + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$

The order of an effect is the number of indices in its subscript.

Notation in publications for loglinear model parameters varies. Grant Blank summarizes:

SYSTAT                   FATHER + SON + FATHER*SON
Agresti (1984)           $\log m_{ij} = \mu + \lambda_i^F + \lambda_j^S + \lambda_{ij}^{FS}$
Fienberg (1980)          $\log m_{ij} = \mu + \mu_{1(i)} + \mu_{2(j)} + \mu_{12(ij)}$
Goodman (1978)           $\xi_{ij} = \theta + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$
Haberman (1978)          $\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$
Knoke and Burke (1980)   $G_{ij} = \theta + \lambda_i^F + \lambda_j^S + \lambda_{ij}^{FS}$

or, in multiplicative form, Goodman (1971): $F_{ij} = \eta\, r_i^A r_j^B r_{ij}^{AB}$, where $\xi_{ij} = \log(F_{ij})$, $\theta = \log\eta$, $\lambda_i^A = \log(r_i^A)$, and so on.

So, a loglinear model expresses the logarithm of the expected cell frequency as a linear function of certain parameters in a manner similar to that of analysis of variance. An important distinction between ANOVA and loglinear modeling is that in the latter, the focus is on the need for interaction terms, while in ANOVA, testing for main effects is the primary interest. Look back at the loglinear model for the two-way table: the usual chi-square tests the need for the $\lambda_{ij}^{AB}$ interaction, not for A alone or B alone.

The loglinear model for a three-way table is saturated because it contains all possible terms or effects. Various smaller models can be formed by including only selected combinations of effects (or, equivalently, testing that certain effects are 0). An important goal in loglinear modeling is parsimony, that is, to see how few effects are needed to estimate the cell frequencies. You usually don't want to test that the main effect of a factor is 0 because this is the same as testing that the total frequencies are equal for all levels of the factor. For example, a test that the main effect for SURVIVE$


(alive, dead) is 0 simply tests whether the total number of survivors equals the number of nonsurvivors. If no interaction terms are included and the test is not significant (that is, the model fits), you can report that the table factors are independent. When there are more than two second-order effects, the test of an interaction is conditional on the other interactions and may not have a simple interpretation.
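To make the two-way decomposition above concrete, the following Python sketch (illustrative only, not LOGLIN's algorithm) computes the expected counts of a two-way table under independence and recovers theta and the zero-sum row and column effects:

import numpy as np

def independence_fit(table):
    # F_ij = R_i * C_j / N; ln F_ij = theta + A_i + B_j
    table = np.asarray(table, dtype=float)
    R, C, N = table.sum(axis=1), table.sum(axis=0), table.sum()
    F = np.outer(R, C) / N              # expected counts
    logF = np.log(F)
    theta = logF.mean()                 # overall mean effect
    A = logF.mean(axis=1) - theta       # row effects, sum to zero
    B = logF.mean(axis=0) - theta       # column effects, sum to zero
    return F, theta, A, B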

Fitting a Loglinear Model

To fit a loglinear model:

• First, screen for an appropriate model to test.

• Test the model and, if it is significant, compare its results with those for models with one or more additional terms. If it is not significant, compare results with models with fewer terms.

• For the model you select as best, examine fitted values and residuals, looking for cells (or layers within the table) with large differences between observed and expected (fitted) cell counts.

How do you determine which effects or terms to include in your loglinear model? Ideally, by using your knowledge of the subject matter of your study, you have a specific model in mind—that is, you want to make statements regarding the independence of certain table factors. Otherwise, you may want to screen for effects.

The likelihood-ratio chi-square is additive under partitioning for nested models. Two models are nested if all the effects of the first are a subset of the second. The likelihood ratio chi-square is additive because the statistic for the second model can be subtracted from that for the first. The difference provides a test of the additional effects—that is, the difference in the two statistics has an asymptotic chi-square distribution with degrees of freedom equal to the difference between those for the two model chi-squares (or the difference between the number of effects in the two models). This property does not hold for the Pearson chi-square. The additive property for the likelihood ratio chi-square is useful for screening effects to include in a model.

If you are doing exploratory research and lack firm knowledge about which effects to include, some statisticians suggest a strategy of starting with a large model and, step by step, identifying effects to delete. (You compare each smaller model nested within the larger one as described above.) But we caution you about multiple testing. If you test many models in a search for your ideal model, remember that the p value associated with a specific test is valid when you execute one and only one test. That is, use p values as relative measures when you test several models.


Loglinear Models in SYSTAT

Loglinear Model Main Dialog Box

To open the Loglinear Model dialog box, from the menus choose:

Statistics
  Tables
    Loglinear Model
      Estimate Model...

The following must be specified:

Model Terms. Build the model components (main effects and interactions) by adding terms to the Model Terms text box. All variables should be categorical (either numerical or character). Click Cross to add interactions.

Define Table. The variables that define the frequency table. Variables that are used in the model terms must be included in the frequency table.

The following optional computational controls can also be specified:

• Convergence. The parameter convergence criterion.

• L Convergence. The log-likelihood convergence criterion.


• Tolerance. The tolerance limit.

• Iterations. The maximum number of iterations.

• Halvings. The maximum number of step halvings.

• Delta. The constant value added to the observed frequency in each cell.

You can save two sets of statistics to a file:

• Estimates. Saves, for each cell in the table, the observed and expected frequencies and their differences, standardized and Freeman-Tukey deviates, the contribution to the Pearson and likelihood-ratio chi-square statistics, the contribution to the log-likelihood, and the cell indices.

• Lambdas. Saves, for each level of each term in the model, the estimate of lambda, the standard error of lambda, the ratio of lambda to its standard error, the multiplicative effect (EXP(λ)), and the indices of the table of factors.

Loglinear Model Statistics

Loglinear models offer statistics for hypothesis testing, parameter estimation, and individual cell examination.

The following statistics are available:

• Chi-square. Displays Pearson and likelihood-ratio chi-square statistics for lack of fit.


• Ratio. Displays lambda divided by the standard error of lambda. For large samples, this ratio can be interpreted as a standard normal deviate (z score).

• Maximized likelihood value. The log of the model's maximum likelihood value.

• Multiplicative effects. Multiplicative parameters, EXP(lambda). Large values indicate an increased probability for that combination of indices.

• Term. One at a time, LOGLIN removes each first-order effect and each interaction term from the model. For each smaller model, LOGLIN provides a likelihood-ratio chi-square for testing the fit of the model and the difference in the chi-square statistics between the smaller model and the full model.

• Hterm. Tests each term by removing it and its higher-order interactions from the model. These tests are similar to those in Term except that only hierarchical models are tested; if a lower-order effect is removed, so are the higher-order effects that include it.

To examine the parameters, you can request the coefficients of the design variables, the covariance matrix of the parameters, the correlation matrix of the parameters, and the additive effect of each level for each term (lambda).

In addition, for each cell you can choose to display the observed frequency, the expected frequency, the standardized deviate, the standard error of lambda, the observed minus the expected frequency, the likelihood ratio of the deviate, the Freeman-Tukey deviate, the contribution to Pearson chi-square, and the contribution to the model’s log-likelihood.

Finally, you can select the number of cells to identify as outlandish. The first cell has the largest Freeman-Tukey deviate (these deviates are similar to z scores when the data are from a Poisson distribution). It is treated as a structural zero, the model is fit to the remaining cells, and the cell with the largest Freeman-Tukey deviate is identified. This process continues step by step, each time including one more cell as a structural zero and refitting the model.
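The formula for the Freeman-Tukey deviate is not given in this chapter; a common definition in the loglinear literature (see, for example, Bishop et al., 1975) is sketched below in Python (illustrative only; SYSTAT's exact computation may differ):

import numpy as np

def freeman_tukey_deviate(observed, expected):
    # Variance-stabilized deviate, roughly N(0, 1) for Poisson counts
    observed = np.asarray(observed, dtype=float)
    return np.sqrt(observed) + np.sqrt(observed + 1) - np.sqrt(4 * expected + 1)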

Structural Zeros

A cell is declared to be a structural zero when the probability is zero that there are counts in the cell. Notice that such zero frequencies do not arise because of small samples but because the cells are empty naturally (a male hysterectomy patient) or by design (the diagonal of a two-way table comparing father’s (rows) and son’s (columns) occupations is not of interest when studying changes or mobility). A model can then


be fit to the subset of cells that remain. A test of fit for such a model is often called a test of quasi-independence.

To specify structural zeros, click Zero in the Loglinear Model dialog box.

The following can be specified:

No structural zeros. No cells are treated as structural zeros.

Make all empty cells structural zeros. Treats all empty cells with zero frequency as structural zeros.

Define custom structural zeros. Specifies one or more cells for treatment as structural zeros. List the index (n1, n2, ...) of each factor in the order in which the factor appears in the table. If you want to select a layer or level of a factor, use 0’s for the other factors when specifying the indices. For example, in a table with four factors (TUMOR$ being the fourth factor), to declare the third level of TUMOR$ as structural zeros, use 0 0 0 3. Alternatively, you can replace the 0’s with blanks or periods (. . . 3).

When fitting a model, LOGLIN excludes cells identified as structural zeros, and then, as in a regression analysis with zero weight cases, it can compute expected values, deviates, and so on, for all cells including the structural zero cells.

You might consider identifying cells as structural zeros when:

• It is meaningful to the study at hand to exclude some cells; for example, the diagonal of a two-way table crossing the occupations of fathers and sons.

• You want to determine whether an interaction term is necessary only because there are one or two aberrant cells. That is, after you select the "best" model, fit a second model with fewer effects and identify the outlier cells (the most outlandish cells) for the smaller model. Then refit the "best" model declaring the outlier cells to be


structural zeros. If the additional interactions are no longer necessary, you might report the smaller model, adding a sentence describing how the unusual cell(s) depart from the model.

Frequency Tables (Tabulate)

If you want only a frequency table and no analysis, from the menus choose:

Statistics
  Tables
    Loglinear Model
      Tabulate...

Simply specify the table factors in the same order in which you want to view them from left to right. In other words, the last variable selected defines the columns of the table and cross-classifications of all preceding variables define the rows.

Although you can also form multiway tables using Crosstabs, tables for loglinear models are more compact and easy to read. Crosstabs forms a series of two-way tables stratified by all combinations of the other table factors. Loglinear models create one table, with the rows defined by factor combinations. However, loglinear model tables do not display marginal totals, whereas Crosstabs tables do.


Using Commands

First, specify your data with USE filename. Continue with:

LOGLIN
  FREQ var
  TABULATE var1*var2*…
  MODEL variables defining table = terms of model
  ZERO CELL n1, n2, …
  SAVE filename / ESTIMATES or LAMBDAS
  PRINT SHORT or MEDIUM or LONG or NONE,
        / OBSFREQ CHISQ RATIO MLE EXPECT STAND ELAMBDA,
          TERM HTERM COVA CORR LAMBDA SELAMBDA DEVIATES,
          LRDEV FTDEV PEARSON LOGLIKE CELLS=n
  ESTIMATE / DELTA=n LCONV=n CONV=n TOL=n ITER=n HALF=n

Usage Considerations

Types of data. LOGLIN uses a cases-by-variables rectangular file or data recorded as frequencies with cell indices.

Print options. You can control which report panels appear in the output by globally setting output length to SHORT, MEDIUM, or LONG. You can also use the PRINT command in LOGLIN to request individual report panels by specifying the particular options.

Short output panels include the observed frequency for each cell, the Pearson and likelihood-ratio chi-square statistics, lambdas divided by their standard errors, the log of the model’s maximized likelihood value, and a report of the three most outlandish cells.

Medium results include all of the above, plus the following: the expected frequency for each cell (current model), standardized deviations, multiplicative effects, a test of each term by removing it from the model, a test of each term by removing it and its higher-order interactions from the model, and the five most outlandish cells.

Long results add the following: coefficients of design variables, the covariance matrix of the parameters, the correlation matrix of the parameters, the additive effect of each level for each term, the standard errors of the lambdas, the observed minus the expected frequency for each cell, the contribution to the Pearson chi-square from each cell, the likelihood-ratio deviate for each cell, the Freeman-Tukey deviate for each cell, the contribution to the model’s log-likelihood from each cell, and the 10 most outlandish cells.

As a PRINT option, you can also specify CELLS=n, where n is the number of outlandish cells to identify.



Quick Graphs. LOGLIN produces no Quick Graphs.

Saving files. For each level of a term included in your model, you can save the estimate of lambda, the standard error of lambda, the ratio of lambda to its standard error, the multiplicative effect, and the marginal indices of the effect. Alternatively, for each cell, you can save the observed and expected frequencies, its deviates (listed above), the Pearson and likelihood-ratio chi-square, the contributions to the log-likelihood, and the cell indices.

BY groups. LOGLIN analyzes each level of any BY variables separately.

Bootstrapping. Bootstrapping is available in this procedure.

Case frequencies. LOGLIN uses the FREQ variable, if present, to duplicate cases.

Case weights. WEIGHT variables have no effect in LOGLIN.

Examples

Example 1 Loglinear Modeling of a Four-Way Table

In this example, you use the Morrison breast cancer data stored in the CANCER data file (Bishop et al., 1975) and treat the data as a four-way frequency table:

The CANCER data include one record for each of the 72 cells formed by the four table factors. Each record includes a variable, NUMBER, that has the number of women in the cell plus numeric or character value codes to identify the levels of the four factors that define the cell.

CENTER$    Center or city where the data were collected
SURVIVE$   Survival: dead or alive
AGE        Age groups of under 50, 50 to 69, and 70 or over
TUMOR$     Tumor diagnosis (called INFLAPP by some researchers) with levels:
           - Minimal inflammation and benign
           - Greater inflammation and benign
           - Minimal inflammation and malignant
           - Greater inflammation and malignant


For the first model of the CANCER data, you include three two-way interactions. The input is:

USE cancer
LOGLIN
FREQ = number
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over'
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = center$ + age,
      + survive$ + tumor$,
      + age*center$,
      + survive$*center$,
      + tumor$*center$
PRINT SHORT / EXPECT LAMBDAS
ESTIMATE / DELTA=0.5

The MODEL statement has two parts: table factors and terms (effects to fit). Table factors appear to the left of the equals sign and terms are on the right. The layout of the table is determined by the order in which the variables are specified; for example, specify TUMOR$ last so its levels determine the columns.

The LABEL statement assigns category names to the numeric codes for AGE. If the statement is omitted, the data values label the categories. By default, SYSTAT orders string variables alphabetically, so we specify SORT = NONE to list the categories for the other factors as they first appear in the data file.

We specify DELTA = 0.5 to add 0.5 to each cell frequency. This option is common in multiway table procedures as an aid when some cell sizes are sparse. It is of little use in practice and is used here only to make the results compare with those reported elsewhere.


The output is:

Case frequencies determined by value of variable NUMBER.

Number of cells (product of levels): 72
Total count: 764

Observed Frequencies
====================
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+---------+---------+------------------------------------------
Tokyo     Under 50  Dead     |     9.000     7.000     4.000     3.000
                    Alive    |    26.000    68.000    25.000     9.000
          50 to 69  Dead     |     9.000     9.000    11.000     2.000
                    Alive    |    20.000    46.000    18.000     5.000
          70 & Over Dead     |     2.000     3.000     1.000     0.0
                    Alive    |     1.000     6.000     5.000     1.000
---------+---------+---------+------------------------------------------
Boston    Under 50  Dead     |     6.000     7.000     6.000     0.0
                    Alive    |    11.000    24.000     4.000     0.0
          50 to 69  Dead     |     8.000    20.000     3.000     2.000
                    Alive    |    18.000    58.000    10.000     3.000
          70 & Over Dead     |     9.000    18.000     3.000     0.0
                    Alive    |    15.000    26.000     1.000     1.000
---------+---------+---------+------------------------------------------
Glamorgn  Under 50  Dead     |    16.000     7.000     3.000     0.0
                    Alive    |    16.000    20.000     8.000     1.000
          50 to 69  Dead     |    14.000    12.000     3.000     0.0
                    Alive    |    27.000    39.000    10.000     4.000
          70 & Over Dead     |     3.000     7.000     3.000     0.0
                    Alive    |    12.000    11.000     4.000     1.000
-----------------------------+------------------------------------------

Pearson ChiSquare  57.5272 df 51 Probability 0.24635
LR ChiSquare       55.8327 df 51 Probability 0.29814
Raftery's BIC    -282.7342
Dissimilarity      9.9530


Expected Values
===============
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+---------+---------+------------------------------------------
Tokyo     Under 50  Dead     |     7.852    15.928     7.515     2.580
                    Alive    |    28.076    56.953    26.872     9.225
          50 to 69  Dead     |     6.281    12.742     6.012     2.064
                    Alive    |    22.460    45.563    21.498     7.380
          70 & Over Dead     |     1.165     2.363     1.115     0.383
                    Alive    |     4.166     8.451     3.988     1.369
---------+---------+---------+------------------------------------------
Boston    Under 50  Dead     |     5.439    12.120     2.331     0.699
                    Alive    |    10.939    24.378     4.688     1.406
          50 to 69  Dead     |    11.052    24.631     4.737     1.421
                    Alive    |    22.231    49.542     9.527     2.858
          70 & Over Dead     |     6.754    15.052     2.895     0.868
                    Alive    |    13.585    30.276     5.822     1.747
---------+---------+---------+------------------------------------------
Glamorgn  Under 50  Dead     |     9.303    10.121     3.476     0.920
                    Alive    |    19.989    21.746     7.468     1.977
          50 to 69  Dead     |    14.017    15.249     5.237     1.386
                    Alive    |    30.117    32.764    11.252     2.979
          70 & Over Dead     |     5.582     6.073     2.086     0.552
                    Alive    |    11.993    13.048     4.481     1.186
-----------------------------+------------------------------------------

Log-Linear Effects (Lambda)
===========================
 THETA
-------------
    1.826
-------------
 CENTER$
    Tokyo    Boston  Glamorgn
-------------------------------------
    0.049     0.001    -0.050
-------------------------------------
 AGE
 Under 50  50 to 69 70 & Over
-------------------------------------
    0.145     0.444    -0.589
-------------------------------------
 SURVIVE$
     Dead     Alive
-------------------------
   -0.456     0.456
-------------------------
 TUMOR$
 MinMalig  MinBengn  MaxMalig  MaxBengn
-------------------------------------------------
    0.480     1.011    -0.145    -1.346
-------------------------------------------------


CENTER$  | AGE
         |  Under 50  50 to 69 70 & Over
---------+-------------------------------------
Tokyo    |     0.565     0.043    -0.609
Boston   |    -0.454    -0.043     0.497
Glamorgn |    -0.111    -0.000     0.112
---------+-------------------------------------

CENTER$  | SURVIVE$
         |      Dead     Alive
---------+-------------------------
Tokyo    |    -0.181     0.181
Boston   |     0.107    -0.107
Glamorgn |     0.074    -0.074
---------+-------------------------

CENTER$  | TUMOR$
         |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+-------------------------------------------------
Tokyo    |    -0.368    -0.191     0.214     0.345
Boston   |     0.044     0.315    -0.178    -0.181
Glamorgn |     0.323    -0.123    -0.036    -0.164
---------+-------------------------------------------------

Lambda / SE(Lambda)
===================
 THETA
-------------
    1.826
-------------
 CENTER$
    Tokyo    Boston  Glamorgn
-------------------------------------
    0.596     0.014    -0.586
-------------------------------------
 AGE
 Under 50  50 to 69 70 & Over
-------------------------------------
    2.627     8.633    -8.649
-------------------------------------
 SURVIVE$
     Dead     Alive
-------------------------
  -11.548    11.548
-------------------------
 TUMOR$
 MinMalig  MinBengn  MaxMalig  MaxBengn
-------------------------------------------------
    6.775    15.730    -1.718   -10.150
-------------------------------------------------

CENTER$  | AGE
         |  Under 50  50 to 69 70 & Over
---------+-------------------------------------
Tokyo    |     7.348     0.576    -5.648
Boston   |    -5.755    -0.618     5.757
Glamorgn |    -1.418    -0.003     1.194
---------+-------------------------------------

CENTER$  | SURVIVE$
         |      Dead     Alive
---------+-------------------------
Tokyo    |    -3.207     3.207
Boston   |     1.959    -1.959
Glamorgn |     1.304    -1.304
---------+-------------------------

CENTER$  | TUMOR$
         |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+-------------------------------------------------
Tokyo    |    -3.862    -2.292     2.012     2.121
Boston   |     0.425     3.385    -1.400    -0.910
Glamorgn |     3.199    -1.287    -0.289    -0.827
---------+-------------------------------------------------

Model ln(MLE): -160.563

The 3 most outlandish cells (based on FTD, stepwise):
======================================================
                                        CENTER$
                                        | AGE
                                        | | SURVIVE$
  ln(MLE)  LR_ChiSq   p-value Frequency | | | TUMOR$
 --------  --------  -------- --------- - - - -
 -154.685    11.755     0.001         7 1 1 1 2
 -150.685     8.001     0.005         1 2 3 2 3
 -145.024    11.321     0.001        16 3 1 1 1


Initially, SYSTAT produces a frequency table for the data. We entered cases for 72 cells. The total frequency count across these cells is 764—that is, there are 764 women in the sample. Notice that the order of the factors is the same order we specified in the MODEL statement. The last variable (TUMOR$) defines the columns; the remaining variables define the rows.

The test of fit is not significant for either the Pearson chi-square or the likelihood-ratio test, indicating that your model with its three two-way interactions does not disagree with the observed frequencies. The model statement describes an association between study center and age, survival, and tumor status. However, at each center, the other three factors are independent. Because the overall goal is parsimony, we could explore whether any of the interactions can be dropped.

Raftery’s BIC (Bayesian Information Criterion) adjusts the chi-square for both the complexity of the model (measured by degrees of freedom) and the size of the sample. It is the likelihood-ratio chi-square minus the degrees of freedom for the current model times the natural log of the sample size. If BIC is negative, you can conclude that the model is preferable to the saturated model. When comparing alternative models, select the model with the lowest BIC value.
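For the model above, the arithmetic works out as BIC = 55.8327 - 51 * ln(764) ≈ 55.8327 - 338.57 ≈ -282.73, which reproduces (up to rounding) the Raftery's BIC value of -282.7342 in the output.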

The index of dissimilarity can be interpreted as the percentage of cases that need to be relocated in order to make the observed and expected counts equal. For these data, you would have to move about 9.95% of the cases to make the expected frequencies fit.



The expected frequencies are obtained by fitting the loglinear model to the observed frequencies. Compare these values with the observed frequencies. Values for corresponding cells will be similar if the model fits well.

After the expected values, SYSTAT lists the parameter estimates for the model you requested. Usually, it is of more interest to examine these estimates divided by their standard errors. Here, however, we display them in order to relate them to the expected values. For example, the observed frequency for the cell in the upper left corner (Tokyo, Under 50, Dead, MinMalig) is 9. To find the expected frequency under your model, you add the estimates (from each panel, select the term that corresponds to your cell):

  THETA      1.826      C*A    0.565
  CENTER$    0.049      C*S   -0.181
  AGE        0.145      C*T   -0.368
  SURVIVE$  -0.456
  TUMOR$     0.480

and then use SYSTAT's calculator to sum the estimates:

CALC 1.826 + 0.049 + 0.145 - 0.456 + 0.480 + 0.565 - 0.181 - 0.368

and SYSTAT responds 2.06. Take the antilog of this value:

CALC EXP(2.06)

and SYSTAT responds 7.846. In the panel of expected values, this number is printed as 7.852 (in its calculations, SYSTAT uses more digits following the decimal point). Thus, for this cell, the sample includes 9 women (observed frequency) and the model predicts 7.85 women (expected frequency).

The ratio of the parameter estimates to their asymptotic standard errors is part of the default output. Examine these values to better understand the relationships among the table factors. Because, for large samples, this ratio can be interpreted as a standard normal deviate (z score), you can use it to indicate significant parameters; for example, for an interaction term, significant positive (or negative) associations. In the CENTER$ by AGE panel, the ratio for young women from Tokyo is very large (7.348), implying a significant positive association, and that for older Tokyo women is extremely negative (-5.648). The reverse is true for the women from Boston. If you use the Column Percent option in XTAB to print column percentages for CENTER$ by AGE, you will see that among the women under 50, more than 50% are from Tokyo (52.1), while only 23% are from Boston. In the 70 and over age group, 14% are from Tokyo and 55% are from Boston.


The Alive estimate for Tokyo shows a strong positive association (3.207) with survival in Tokyo. The relationship in Boston is negative (–1.959). In this study, the overall survival rate is 72.5%. In Tokyo, 79.3% of the women survived, while in Boston, 67.6% survived. There is a negative association for having a malignant tumor with minimal inflammation in Tokyo (–3.862). The same relationship is strongly positive in Glamorgan (3.199).

Cells that depart from the current model are identified as outlandish in a stepwise manner. The first cell has the largest Freeman-Tukey deviate (these deviates are similar to z scores when the data are from a Poisson distribution). It is treated as a structural zero, the model is fit to the remaining cells, and the cell with the largest Freeman-Tukey deviate is identified. This process continues step by step, each time including one more cell as a structural zero and refitting the model.

For the current model, the observations in the cell corresponding to the youngest nonsurvivors from Tokyo with benign tumors and minimal inflammation (Tokyo, Under 50, Dead, MinBengn) differs the most from its expected value. There are 7 women in the cell and the expected value is 15.9 women. The next most unusual cell is 2,3,2,3 (Boston, 70 & Over, Alive, MaxMalig), and so on.

Medium Output

We continue the previous analysis, repeating the same model, but changing the PRINT (output length) setting to request medium-length results:

USE cancer
LOGLIN
FREQ = number
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over'
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = age # center$,
      + survive$ # center$,
      + tumor$ # center$
PRINT MEDIUM
ESTIMATE / DELTA=0.5


Notice that we use shortcut notation to specify the model. The output includes:

Standardized Deviates = (Obs-Exp)/sqrt(Exp)
===========================================
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+---------+---------+------------------------------------------
Tokyo     Under 50  Dead     |     0.410    -2.237    -1.282     0.262
                    Alive    |    -0.392     1.464    -0.361    -0.074
          50 to 69  Dead     |     1.085    -1.048     2.034    -0.044
                    Alive    |    -0.519     0.065    -0.754    -0.876
          70 & Over Dead     |     0.774     0.414    -0.109    -0.619
                    Alive    |    -1.551    -0.843     0.507    -0.315
---------+---------+---------+------------------------------------------
Boston    Under 50  Dead     |     0.241    -1.471     2.403    -0.836
                    Alive    |     0.018    -0.077    -0.318    -1.186
          50 to 69  Dead     |    -0.918    -0.933    -0.798     0.486
                    Alive    |    -0.897     1.202     0.153     0.084
          70 & Over Dead     |     0.864     0.760     0.062    -0.932
                    Alive    |     0.384    -0.777    -1.999    -0.565
---------+---------+---------+------------------------------------------
Glamorgn  Under 50  Dead     |     2.196    -0.981    -0.255    -0.959
                    Alive    |    -0.892    -0.374     0.195    -0.695
          50 to 69  Dead     |    -0.004    -0.832    -0.977    -1.177
                    Alive    |    -0.568     1.089    -0.373     0.592
          70 & Over Dead     |    -1.093     0.376     0.633    -0.743
                    Alive    |     0.002    -0.567    -0.227    -0.171
-----------------------------+------------------------------------------

Multiplicative Effects = exp(Lambda)
====================================
 THETA
-------------
    6.209
-------------
 AGE
 Under 50  50 to 69 70 & Over
-------------------------------------
    1.156     1.559     0.555
-------------------------------------
 CENTER$
    Tokyo    Boston  Glamorgn
-------------------------------------
    1.050     1.001     0.951
-------------------------------------
 SURVIVE$
     Dead     Alive
-------------------------
    0.634     1.578
-------------------------
 TUMOR$
 MinMalig  MinBengn  MaxMalig  MaxBengn
-------------------------------------------------
    1.616     2.748     0.865     0.260
-------------------------------------------------


CENTER$  | AGE
         |  Under 50  50 to 69 70 & Over
---------+-------------------------------------
Tokyo    |     1.760     1.044     0.544
Boston   |     0.635     0.958     1.644
Glamorgn |     0.895     1.000     1.118
---------+-------------------------------------

CENTER$  | SURVIVE$
         |      Dead     Alive
---------+-------------------------
Tokyo    |     0.835     1.198
Boston   |     1.113     0.899
Glamorgn |     1.077     0.929
---------+-------------------------

CENTER$  | TUMOR$
         |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+-------------------------------------------------
Tokyo    |     0.692     0.826     1.238     1.412
Boston   |     1.045     1.370     0.837     0.834
Glamorgn |     1.382     0.884     0.965     0.849
---------+-------------------------------------------------

Model ln(MLE): -160.563

 Term tested       The model without the term      Removal of term from model
                     ln(MLE)   Chi-Sq   df  p-value    Chi-Sq   df  p-value
 ---------------   ---------  -------  ---  -------   -------  ---  -------
 AGE. . . . . .     -216.120   166.95   53   0.0000    111.11    2   0.0000
 CENTER$. . . .     -160.799    56.31   53   0.3523      0.47    2   0.7894
 SURVIVE$ . . .     -234.265   203.24   52   0.0000    147.41    1   0.0000
 TUMOR$ . . . .     -344.471   423.65   54   0.0000    367.82    3   0.0000
 CENTER$*
   AGE. . . . .     -196.672   128.05   55   0.0000     72.22    4   0.0000
 CENTER$*
   SURVIVE$ . .     -166.007    66.72   53   0.0975     10.89    2   0.0043
 CENTER$*
   TUMOR$ . . .     -178.267    91.24   57   0.0027     35.41    6   0.0000

 Term tested       The model without the term      Removal of term from model
 hierarchically      ln(MLE)   Chi-Sq   df  p-value    Chi-Sq   df  p-value
 ---------------   ---------  -------  ---  -------   -------  ---  -------
 AGE. . . . . .     -246.779   228.26   57   0.0000    172.43    6   0.0000
 CENTER$. . . .     -224.289   183.29   65   0.0000    127.45   14   0.0000
 SURVIVE$ . . .     -242.434   219.57   54   0.0000    163.74    3   0.0000
 TUMOR$ . . . .     -363.341   461.39   60   0.0000    405.56    9   0.0000

The 5 most outlandish cells (based on FTD, stepwise):
======================================================
                                        CENTER$
                                        | AGE
                                        | | SURVIVE$
  ln(MLE)  LR_ChiSq   p-value Frequency | | | TUMOR$
 --------  --------  -------- --------- - - - -
 -154.685    11.755     0.001         7 1 1 1 2
 -150.685     8.001     0.005         1 2 3 2 3
 -145.024    11.321     0.001        16 3 1 1 1
 -140.740     8.569     0.003         6 2 1 1 3
 -136.662     8.157     0.004        11 1 2 1 3

The goodness-of-fit tests provide an overall indication of how close the expected values are to the cell counts. Just as you study residuals for each case in multiple regression, you can use deviates to compare the observed and expected values for each cell. A standardized deviate is the square root of each cell's contribution to the Pearson chi-square statistic; that is, the observed frequency minus the expected frequency, divided by the square root of the expected frequency. These values are similar to z scores. For the second cell in the first row, the expected value under your model is considerably larger than the observed count (its deviate is -2.237, the observed count is 7, and the expected count is 15.9). Previously, this cell was identified as the most outlandish cell using Freeman-Tukey deviates.

Note that LOGLIN produces five types of deviates or residuals: standardized, the observed minus the expected frequency, the likelihood-ratio deviate, the Freeman-Tukey deviate, and the Pearson deviate.

Estimates of the multiplicative parameters equal e^(lambda). Look for values that depart markedly from 1.0. Very large values indicate an increased probability for that combination of indices and, conversely, a value considerably less than 1.0 indicates an unlikely combination. A test of the hypothesis that a multiplicative parameter equals 1.0 is the same as that for lambda equal to 0; so use the values of lambda/SE(lambda) to test the values in this panel. For the CENTER$ by AGE interaction, the most likely combination is women under 50 from Tokyo (1.76); the least likely combination is women 70 and over from Tokyo (0.544).

After listing the multiplicative effects, SYSTAT tests reduced models by removing each first-order effect and each interaction from the model one at a time. For each smaller model, LOGLIN provides:

• A likelihood-ratio chi-square for testing the fit of the model

• The difference in the chi-square statistics between the smaller model and the full model

The likelihood-ratio chi-square for the full model is 55.833. For a model that omits AGE, the likelihood-ratio chi-square is 166.95. This smaller model does not fit the observed frequencies (p value < 0.00005). To determine whether the removal of this term results in a significant decrease in the fit, look at the difference in the statistics: 166.95 – 55.833 = 111.117, p value < 0.00005. The fit worsens significantly when AGE is removed from the model.

From the second line in this panel, it appears that a model without the first-order term for CENTER$ fits (p value = 0.3523). However, removing any of the two-way interactions involving CENTER$ significantly decreases the model fit.

The hierarchical tests are similar to the preceding tests except that only hierarchical models are tested—if a lower-order effect is removed, so are the higher-order effects that include it. For example, in the first line, when CENTER$ is removed, the three interactions with CENTER$ are also removed. The reduction in the fit is significant (p < 0.00005). Although removing the first-order effect of CENTER$ does not significantly alter the fit, removing the higher-order effects involving CENTER$ decreases the fit substantially.




Example 2 Screening Effects

In this example, you pretend that no models have been fit to the CANCER data (that is, you have not seen the other examples). As a starting point, fit a model with all second-order interactions and find that it fits. Then fit models nested within the first, using results from the HTERM (terms tested hierarchically) panel to guide your selection of terms to remove.

Here’s a summary of your instructions: you study the output generated from the first MODEL and ESTIMATE statements and decide to remove AGE by TUMOR$. After seeing the results for this smaller model, you decide to remove AGE by SURVIVE$, too.

The output follows:

USE cancer
LOGLIN
FREQ = number
PRINT NONE / CHI HTERM
MODEL center$*age*survive$*tumor$ = tumor$..center$^2
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
      - age*tumor$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
      - age*tumor$,
      - age*survive$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
      - age*tumor$,
      - age*survive$,
      - tumor$*survive$
ESTIMATE / DELTA=0.5

All two-way interactions

Pearson ChiSquare    40.1650   df 40   Probability  0.46294
LR ChiSquare         39.9208   df 40   Probability  0.47378
Raftery’s BIC      -225.6219
Dissimilarity         7.6426

Term tested       The model without the term        Removal of term from model
hierarchically     ln(MLE)    Chi-Sq   df  p-value    Chi-Sq   df  p-value
---------------   --------   -------  ---  -------   -------  ---  -------
TUMOR$ . . . .    -361.233    457.17   58   0.0000    417.25   18   0.0000
SURVIVE$ . . .    -241.675    218.06   48   0.0000    178.14    8   0.0000
AGE. . . . . .    -241.668    218.04   54   0.0000    178.12   14   0.0000
CENTER$. . . .    -213.996    162.70   54   0.0000    122.78   14   0.0000
SURVIVE$*
  TUMOR$ . . .    -157.695     50.10   43   0.2125     10.18    3   0.0171


AGE*
  TUMOR$ . . .    -153.343     41.39   46   0.6654      1.47    6   0.9613
AGE*
  SURVIVE$ . .    -154.693     44.09   42   0.3831      4.17    2   0.1241
CENTER$*
  TUMOR$ . . .    -169.724     74.15   46   0.0053     34.23    6   0.0000
CENTER$*
  SURVIVE$ . .    -156.501     47.71   42   0.2518      7.79    2   0.0204
CENTER$*
  AGE. . . . .    -186.011    106.73   44   0.0000     66.81    4   0.0000

Remove AGE * TUMOR$

Pearson ChiSquare    41.8276   df 46   Probability  0.64757
LR ChiSquare         41.3934   df 46   Probability  0.66536
Raftery’s BIC      -263.9807
Dissimilarity         7.8682

Term tested       The model without the term        Removal of term from model
hierarchically     ln(MLE)    Chi-Sq   df  p-value    Chi-Sq   df  p-value
---------------   --------   -------  ---  -------   -------  ---  -------
TUMOR$ . . . .    -361.233    457.17   58   0.0000    415.78   12   0.0000
SURVIVE$ . . .    -242.434    219.57   54   0.0000    178.18    8   0.0000
AGE. . . . . .    -241.668    218.04   54   0.0000    176.65    8   0.0000
CENTER$. . . .    -215.687    166.08   60   0.0000    124.69   14   0.0000
SURVIVE$*
  TUMOR$ . . .    -158.454     51.61   49   0.3719     10.22    3   0.0168
AGE*
  SURVIVE$ . .    -155.452     45.61   48   0.5713      4.22    2   0.1214
CENTER$*
  TUMOR$ . . .    -171.415     77.54   52   0.0124     36.14    6   0.0000
CENTER$*
  SURVIVE$ . .    -157.291     49.29   48   0.4214      7.90    2   0.0193
CENTER$*
  AGE. . . . .    -187.702    110.11   50   0.0000     68.72    4   0.0000

Remove AGE * TUMOR$ and AGE * SURVIVE$

Pearson ChiSquare    45.3579   df 48   Probability  0.58174
LR ChiSquare         45.6113   df 48   Probability  0.57126
Raftery’s BIC      -273.0400
Dissimilarity         8.4720

Term tested       The model without the term        Removal of term from model
hierarchically     ln(MLE)    Chi-Sq   df  p-value    Chi-Sq   df  p-value
---------------   --------   -------  ---  -------   -------  ---  -------
TUMOR$ . . . .    -363.341    461.39   60   0.0000    415.78   12   0.0000
SURVIVE$ . . .    -242.434    219.57   54   0.0000    173.96    6   0.0000
AGE. . . . . .    -241.668    218.04   54   0.0000    172.43    6   0.0000
CENTER$. . . .    -219.546    173.80   62   0.0000    128.19   14   0.0000
SURVIVE$*
  TUMOR$ . . .    -160.563     55.83   51   0.2981     10.22    3   0.0168
CENTER$*
  TUMOR$ . . .    -173.524     81.75   54   0.0087     36.14    6   0.0000
CENTER$*
  SURVIVE$ . .    -161.264     57.23   50   0.2245     11.62    2   0.0030
CENTER$*
  AGE. . . . .    -191.561    117.83   52   0.0000     72.22    4   0.0000

Remove AGE * TUMOR$, AGE * SURVIVE$, and TUMOR$ * SURVIVE$

Pearson ChiSquare    57.5272   df 51   Probability  0.24635
LR ChiSquare         55.8327   df 51   Probability  0.29814
Raftery’s BIC      -282.7342
Dissimilarity         9.9530


The likelihood-ratio chi-square for the model that includes all two-way interactions is 39.9 (p value = 0.4738). If the AGE by TUMOR$ interaction is removed, the chi-square for the smaller model is 41.39 (p value = 0.6654). Does the removal of this interaction cause a significant change? No, chi-square = 1.47 (p value = 0.9613). This chi-square is computed as 41.39 minus 39.92 with 46 minus 40 degrees of freedom. The removal of this interaction results in the least change, so you remove it first. Notice also that the estimate of the maximized likelihood function is largest when this second-order effect is removed (–153.343).

The model chi-square for the second model is the same as that given for the first model with AGE * TUMOR$ removed (41.3934). Here, if AGE by SURVIVE$ is removed, the new model fits (p value = 0.5713), and the change between the model with one interaction removed and the model with two removed is not significant (p value = 0.1214).

If SURVIVE$ by TUMOR$ is removed from the current model with four interactions, the new model fits (p value = 0.2981). The change in fit is modest (p value = 0.0168) and not significant at the 0.01 level, so you remove this term as well. Should you remove any other terms? Looking at the HTERM panel for the model with three interactions, you see that a model without CENTER$ by SURVIVE$ has only a marginal fit (p value = 0.0975) and that the chi-square for the difference is clearly significant (p value = 0.0043). Although the goal is parsimony, and technically a model with only two interactions does fit, you opt for the model that also includes CENTER$ by SURVIVE$ because it is a significant improvement over the very smallest model.

Term tested       The model without the term        Removal of term from model
hierarchically     ln(MLE)    Chi-Sq   df  p-value    Chi-Sq   df  p-value
---------------   --------   -------  ---  -------   -------  ---  -------
TUMOR$ . . . .    -363.341    461.39   60   0.0000    405.56    9   0.0000
SURVIVE$ . . .    -242.434    219.57   54   0.0000    163.74    3   0.0000
AGE. . . . . .    -246.779    228.26   57   0.0000    172.43    6   0.0000
CENTER$. . . .    -224.289    183.29   65   0.0000    127.45   14   0.0000
CENTER$*
  TUMOR$ . . .    -178.267     91.24   57   0.0027     35.41    6   0.0000
CENTER$*
  SURVIVE$ . .    -166.007     66.72   53   0.0975     10.89    2   0.0043
CENTER$*
  AGE. . . . .    -196.672    128.05   55   0.0000     72.22    4   0.0000
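The screening rule used in this example is mechanical enough to automate: among the candidate terms, drop the one whose removal changes the fit least (the largest difference-test p-value), and stop when every remaining removal would significantly worsen the fit. A hedged sketch (Python with scipy, not SYSTAT's code), using the statistics from the all-two-way-interactions panel:

from scipy.stats import chi2

def next_term_to_drop(full_g2, full_df, candidates, alpha=0.05):
    # candidates maps a term to (G2, df) for the model without that term
    best = None
    for term, (g2, df) in candidates.items():
        p = chi2.sf(g2 - full_g2, df - full_df)   # LR difference test
        if p > alpha and (best is None or p > best[1]):
            best = (term, p)
    return best    # None means no term can be dropped safely

candidates = {"AGE*TUMOR$": (41.39, 46), "AGE*SURVIVE$": (44.09, 42),
              "SURVIVE$*TUMOR$": (50.10, 43)}
print(next_term_to_drop(39.92, 40, candidates))   # ('AGE*TUMOR$', 0.961...)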


Example 3 Structural Zeros

This example identifies outliers and then declares them to be structural zeros. You wonder whether any of the interactions in the model fit in the earlier four-way table example are necessary only because of a few unusual cells. To identify the unusual cells, first pull back from your “ideal” model and fit a model with main effects only, asking for the four most unusual cells. (Why four cells? Because 5% of 72 cells is 3.6, or roughly 4.)

USE cancer
LOGLIN
FREQ = number
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = tumor$ .. center$
PRINT / CELLS=4
ESTIMATE / DELTA=0.5

Of course this model doesn’t fit, but following are selections from the output:

Pearson ChiSquare   181.3892   df 63   Probability  0.00000
LR ChiSquare        174.3458   df 63   Probability  0.00000
Raftery’s BIC      -243.8839
Dissimilarity        19.3853

The 4 most outlandish cells (based on FTD, stepwise):
======================================================
                                           CENTER$
                                           | AGE
                                           | | SURVIVE$
 ln(MLE)   LR_ChiSq   p-value   Frequency  | | | TUMOR$
--------   --------   -------   ---------  -  -  -  -
-203.261     33.118     0.000          68  1  1  2  2
-195.262     15.997     0.000           1  1  3  2  1
-183.471     23.582     0.000          25  1  1  2  3
-176.345     14.253     0.000           6  1  3  2  2

Next, fit your “ideal” model, identifying these four cells as structural zeros and also requesting PRINT / HTERM to test the need for each interaction term.


Defining Four Cells As Structural Zeros

Continuing from the analysis of main effects only, now specify your original model with its three second-order effects:

MODEL center$*age*survive$*tumor$ = ,
      (age + survive$ + tumor$) # center$
ZERO CELL=1 1 2 2 CELL=1 3 2 1 CELL=1 1 2 3 CELL=1 3 2 2
PRINT / HTERMS
ESTIMATE / DELTA=0.5

Following are selections from the output. Notice that asterisks mark the structural zero cells.

Number of cells (product of levels):  72
Number of structural zero cells:       4
Total count:                         664

Observed Frequencies
====================
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig   MinBengn   MaxMalig   MaxBengn
---------+---------+---------+-------------------------------------------------
Tokyo     Under 50  Dead     |     9.000      7.000      4.000      3.000
                    Alive    |    26.000    *68.000    *25.000      9.000
          50 to 69  Dead     |     9.000      9.000     11.000      2.000
                    Alive    |    20.000     46.000     18.000      5.000
          70 & Over Dead     |     2.000      3.000      1.000      0.0
                    Alive    |    *1.000     *6.000      5.000      1.000
---------+---------+---------+-------------------------------------------------
Boston    Under 50  Dead     |     6.000      7.000      6.000      0.0
                    Alive    |    11.000     24.000      4.000      0.0
          50 to 69  Dead     |     8.000     20.000      3.000      2.000
                    Alive    |    18.000     58.000     10.000      3.000
          70 & Over Dead     |     9.000     18.000      3.000      0.0
                    Alive    |    15.000     26.000      1.000      1.000
---------+---------+---------+-------------------------------------------------
Glamorgn  Under 50  Dead     |    16.000      7.000      3.000      0.0
                    Alive    |    16.000     20.000      8.000      1.000
          50 to 69  Dead     |    14.000     12.000      3.000      0.0
                    Alive    |    27.000     39.000     10.000      4.000
          70 & Over Dead     |     3.000      7.000      3.000      0.0
                    Alive    |    12.000     11.000      4.000      1.000
-----------------------------+-------------------------------------------------
* indicates structural zero cells

Pearson ChiSquare    46.8417   df 47   Probability  0.47906
LR ChiSquare         44.8815   df 47   Probability  0.56072
Raftery’s BIC      -260.5378
Dissimilarity        10.1680


Term tested       The model without the term        Removal of term from model
hierarchically     ln(MLE)    Chi-Sq   df  p-value    Chi-Sq   df  p-value
---------------   --------   -------  ---  -------   -------  ---  -------
AGE. . . . . .    -190.460    132.87   53   0.0000     87.98    6   0.0000
SURVIVE$ . . .    -206.152    164.25   50   0.0000    119.37    3   0.0000
TUMOR$ . . . .    -326.389    404.72   56   0.0000    359.84    9   0.0000
CENTER$. . . .    -177.829    107.60   61   0.0002     62.72   14   0.0000
CENTER$*
  AGE. . . . .    -158.900     69.75   51   0.0416     24.86    4   0.0001
CENTER$*
  SURVIVE$ . .    -149.166     50.28   49   0.4226      5.40    2   0.0674
CENTER$*
  TUMOR$ . . .    -162.289     76.52   53   0.0189     31.64    6   0.0000

The model has a nonsignificant test of fit, and so does a model without the CENTER$ by SURVIVE$ interaction (p value = 0.4226).

Eliminating Only the Young Women

Two of the extreme cells are from the youngest age group. What happens to the CENTER$ by SURVIVE$ effect if only these cells are defined as structural zeros? HTERM remains in effect.

MODEL center$*age*survive$*tumor$ = ,
      (age + survive$ + tumor$) # center$
ZERO CELL=1 1 2 2 CELL=1 1 2 3
ESTIMATE / DELTA=0.5

The output follows:

Number of cells (product of levels):  72
Number of structural zero cells:       2
Total count:                         671

Pearson ChiSquare    50.2610   df 49   Probability  0.42326
LR ChiSquare         49.1153   df 49   Probability  0.46850
Raftery’s BIC      -269.8144
Dissimilarity        10.6372

Term tested       The model without the term        Removal of term from model
hierarchically     ln(MLE)    Chi-Sq   df  p-value    Chi-Sq   df  p-value
---------------   --------   -------  ---  -------   -------  ---  -------
AGE. . . . . .    -221.256    188.37   55   0.0000    139.25    6   0.0000
SURVIVE$ . . .    -210.369    166.60   52   0.0000    117.48    3   0.0000
TUMOR$ . . . .    -331.132    408.12   58   0.0000    359.01    9   0.0000
CENTER$. . . .    -192.179    130.22   63   0.0000     81.10   14   0.0000
CENTER$*
  AGE. . . . .    -172.356     90.57   53   0.0010     41.45    4   0.0000
CENTER$*
  SURVIVE$ . .    -153.888     53.63   51   0.3737      4.52    2   0.1045
CENTER$*
  TUMOR$ . . .    -169.047     83.95   55   0.0072     34.84    6   0.0000

When the two cells for the young women from Tokyo are excluded from the model estimation, the CENTER$ by SURVIVE$ effect is not needed (p value = 0.3737).


Eliminating the Older Women

Here you define the two cells for the Tokyo women from the oldest age group as structural zeros.

MODEL center$*age*survive$*tumor$ = ,
      (age + survive$ + tumor$) # center$
ZERO CELL=1 3 2 1 CELL=1 3 2 2
ESTIMATE / DELTA=0.5

The output is:

Number of cells (product of levels):  72
Number of structural zero cells:       2
Total count:                         757

Pearson ChiSquare    53.4348   df 49   Probability  0.30782
LR ChiSquare         50.9824   df 49   Probability  0.39558
Raftery’s BIC      -273.8564
Dissimilarity         9.4583

Term tested       The model without the term        Removal of term from model
hierarchically     ln(MLE)    Chi-Sq   df  p-value    Chi-Sq   df  p-value
---------------   --------   -------  ---  -------   -------  ---  -------
AGE. . . . . .    -203.305    147.41   55   0.0000     96.42    6   0.0000
SURVIVE$ . . .    -238.968    218.73   52   0.0000    167.75    3   0.0000
TUMOR$ . . . .    -358.521    457.84   58   0.0000    406.86    9   0.0000
CENTER$. . . .    -209.549    159.89   63   0.0000    108.91   14   0.0000
CENTER$*
  AGE. . . . .    -177.799     96.39   53   0.0003     45.41    4   0.0000
CENTER$*
  SURVIVE$ . .    -161.382     63.56   51   0.1114     12.58    2   0.0019
CENTER$*
  TUMOR$ . . .    -171.123     83.04   55   0.0086     32.06    6   0.0000

When the two cells for the women from the older age group are treated as structural zeros, the case for removing the CENTER$ by SURVIVE$ effect is much weaker than when the cells for the younger women are structural zeros. Here, the inclusion of the effect results in a significant improvement in the fit of the model (p value = 0.0019).

Conclusion

The structural zero feature allowed you to quickly focus on 2 of the 72 cells in your multiway table: the survivors under 50 from Tokyo, especially those with minimally inflamed benign tumors. The overall survival rate for the 764 women is 72.5%, that for Tokyo is 79.3%, and that for the most unusual cell is 90.67%. Half of the Tokyo women under age 50 have MinBengn tumors (75 out of 151), and almost 10% of the 764 women (spread across 72 cells) are concentrated here. Possibly the protocol for study entry (including the definition of a “tumor”) was executed differently at this center than at the others.
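The quoted rates follow directly from the observed frequencies shown in the Example 4 table below; as a quick arithmetic check (Python, not part of SYSTAT's output):

tokyo_alive = (68 + 9 + 26 + 25) + (46 + 5 + 20 + 18) + (6 + 1 + 1 + 5)
tokyo_dead = (7 + 3 + 9 + 4) + (9 + 2 + 9 + 11) + (3 + 0 + 2 + 1)
print(tokyo_alive / (tokyo_alive + tokyo_dead))   # 0.793, the Tokyo rate
print(68 / (68 + 7))        # 0.9067, survival in the under-50 MinBengn cell
print((68 + 7) / 151)       # 0.497, about half of Tokyo women under 50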



Example 4 Tables without Analyses

If you want only a frequency table and no analysis, use TABULATE. Simply specify the table factors in the same order in which you want to view them from left to right. In other words, the last variable defines the columns of the table, and the cross-classifications of the preceding variables define the rows.

For this example, we use the data in the CANCER file, displaying the counts of a 3 by 3 by 2 by 4 table (72 cells) in about two dozen lines. The input is:

USE cancer
LOGLIN
FREQ = number
LABEL age / 50=’Under 50’, 60=’50 to 69’, 70=’70 & Over’
ORDER center$ / SORT=NONE
ORDER tumor$ / SORT=’MinBengn’, ’MaxBengn’, ’MinMalig’, ’MaxMalig’
TABULATE age * center$ * survive$ * tumor$

The resulting table is:

Number of cells (product of levels):  72
Total count:                         764

Observed Frequencies
====================
AGE       CENTER$   SURVIVE$ | TUMOR$
                             |  MinBengn   MaxBengn   MinMalig   MaxMalig
---------+---------+---------+-------------------------------------------------
Under 50  Tokyo     Alive    |    68.000      9.000     26.000     25.000
                    Dead     |     7.000      3.000      9.000      4.000
          Boston    Alive    |    24.000      0.0       11.000      4.000
                    Dead     |     7.000      0.0        6.000      6.000
          Glamorgn  Alive    |    20.000      1.000     16.000      8.000
                    Dead     |     7.000      0.0       16.000      3.000
---------+---------+---------+-------------------------------------------------
50 to 69  Tokyo     Alive    |    46.000      5.000     20.000     18.000
                    Dead     |     9.000      2.000      9.000     11.000
          Boston    Alive    |    58.000      3.000     18.000     10.000
                    Dead     |    20.000      2.000      8.000      3.000
          Glamorgn  Alive    |    39.000      4.000     27.000     10.000
                    Dead     |    12.000      0.0       14.000      3.000
---------+---------+---------+-------------------------------------------------
70 & Over Tokyo     Alive    |     6.000      1.000      1.000      5.000
                    Dead     |     3.000      0.0        2.000      1.000
          Boston    Alive    |    26.000      1.000     15.000      1.000
                    Dead     |    18.000      0.0        9.000      3.000
          Glamorgn  Alive    |    11.000      1.000     12.000      4.000
                    Dead     |     7.000      0.0        3.000      3.000
-----------------------------+-------------------------------------------------
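The same display convention (last factor as columns, earlier factors nested down the rows) can be mimicked in other environments. A small sketch using pandas; the data frame below holds a few CANCER-style cells, one row per cell, and is only an illustration:

import pandas as pd

df = pd.DataFrame({
    "AGE":     ["Under 50"] * 4,
    "CENTER":  ["Tokyo"] * 4,
    "SURVIVE": ["Alive", "Alive", "Dead", "Dead"],
    "TUMOR":   ["MinBengn", "MaxBengn", "MinBengn", "MaxBengn"],
    "NUMBER":  [68, 9, 7, 3],
})
# The last factor (TUMOR) defines the columns; the rest define nested rows.
print(df.pivot_table(index=["AGE", "CENTER", "SURVIVE"],
                     columns="TUMOR", values="NUMBER", aggfunc="sum"))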


Computation

Algorithms

Loglinear modeling implements the algorithms of Haberman (1973).
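The AS 51 approach is built around iterative proportional fitting: expected cell counts are repeatedly rescaled to match each margin fixed by the model until they converge. A minimal sketch of the idea for the simple two-way independence model (an illustration under those assumptions, not SYSTAT's implementation):

import numpy as np

def ipf_independence(table, max_iter=100, tol=1e-8):
    """Fit expected counts for the independence model of a two-way table."""
    fitted = np.ones_like(table, dtype=float)
    for _ in range(max_iter):
        old = fitted.copy()
        # scale to match the row margins, then the column margins
        fitted *= table.sum(axis=1, keepdims=True) / fitted.sum(axis=1, keepdims=True)
        fitted *= table.sum(axis=0, keepdims=True) / fitted.sum(axis=0, keepdims=True)
        if np.abs(fitted - old).max() < tol:
            break
    return fitted

obs = np.array([[9.0, 7.0], [26.0, 68.0]])
print(ipf_independence(obs))   # equals row total * column total / grand total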

References

Agresti, A. (1984). Analysis of ordinal categorical data. New York: Wiley-Interscience.
Agresti, A. (1990). Categorical data analysis. New York: Wiley-Interscience.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, Mass.: MIT Press.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data, 2nd ed. Cambridge, Mass.: MIT Press.
Goodman, L. A. (1978). Analyzing qualitative/categorical data: Loglinear models and latent structure analysis. Cambridge, Mass.: Abt Books.
Haberman, S. J. (1973). Loglinear fit for contingency tables, algorithm AS 51. Applied Statistics, 21, 218–224.
Haberman, S. J. (1978). Analysis of qualitative data, Vol. 1: Introductory topics. New York: Academic Press.
Knoke, D. and Burke, P. S. (1980). Loglinear models. Newbury Park: Sage.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.


Index

A matrix, I-499
accelerated failure time distribution, II-538
ACF plots, II-642
additive trees, I-62, I-68
AID, I-35, I-37
Akaike Information Criterion, II-288
alpha level, II-314, II-315, II-321
alternative hypothesis, I-13, II-312
analysis of covariance, I-431
  examples, I-462, I-478
  model, I-432
analysis of variance, I-210, I-487
  algorithms, I-485
  ANOVA command, I-438
  assumptions, I-388
  between-group differences, I-394
  bootstrapping, I-438
  compared to loglinear modeling, I-619
  compared to regression trees, I-35
  contrasts, I-390, I-434, I-435, I-436
  data format, I-438
  examples, I-439, I-442, I-447, I-457, I-459, I-461, I-462, I-464, I-470, I-472, I-475, I-478, I-480
  factorial, I-387
  hypothesis tests, I-386, I-434, I-435, I-436
  interactions, I-387
  model, I-432
  multivariate, I-393, I-396
  overview, I-431
  post hoc tests, I-389, I-432
  power analysis, II-311, II-318, II-319, II-351, II-353, II-374, II-378
  Quick Graphs, I-438
  repeated measures, I-393, I-436
  residuals, I-432
  two-way ANOVA, II-319, II-353, II-378
  unbalanced designs, I-391
  unequal variances, I-388
  usage, I-438
  within-subject differences, I-394
Anderberg dichotomy coefficients, I-120, I-126
angle tolerance, II-496
anisotropy, II-500, II-508
  geometric, II-500
  zonal, II-500
A-optimality, I-246
ARIMA models, II-627, II-637, II-650
  algorithms, II-678
ARMA models, II-632
autocorrelation plots, I-374, II-630, II-633, II-642
Automatic Interaction Detection, I-35
autoregressive models, II-630
axial designs, I-242

backward elimination, I-379
bandwidth, II-460, II-465, II-472, II-496
  optimal values, II-466
  relationship with kernel function, II-466
BASIC, II-536
basic statistics
  See descriptive statistics, I-211
beta level, II-314, II-315
between-group differences
  in analysis of variance, I-394
bias, I-379
binary logit, I-550
  compared to multinomial logit, I-552
binary trees, I-33
biplots, II-298, II-299
Bisquare procedure, II-162
biweight kernel, II-463, II-472, II-473
Bonferroni inequality, I-37


Bonferroni test, I-127, I-389, I-432, I-493, II-591
bootstrap, I-19, I-20
  algorithms, I-28
  bootstrap-t method, I-19
  command, I-20
  data format, I-20
  examples, I-21, I-24, I-25, I-26
  missing data, I-28
  naive bootstrap, I-19
  overview, I-17
  Quick Graphs, I-20
  usage, I-20
box plot, I-210
Box-Behnken designs, I-238, I-264
Box-Hunter designs, I-235, I-256
Bray-Curtis measure, I-119, I-126

C matrix, I-499
candidate sets
  for optimal designs, I-245
canonical correlation analysis, I-487
  bootstrapping, II-417
  data format, II-416, II-405
  examples, II-418, II-409, II-425
  interactions, II-417
  model, II-415
  nominal scales, II-417
  overview, II-407
  partialled variables, II-415
  Quick Graphs, II-417
  rotation, II-416
  usage, II-417
canonical correlations, I-286
canonical rotation, II-299
categorical data, II-199
categorical predictors, I-35
Cauchy kernel, II-463, II-472, II-473
CCF plots, II-643
central composite designs, I-238, I-269
central limit theorem, II-588
centroid designs, I-241
CHAID, I-36, I-37
chi-square, I-160
Chi-square test for independence, I-148
circle model
  in perceptual mapping, II-297
city-block distance, I-126
classical analysis, II-604
classification functions, I-280
classification trees, I-36
  algorithms, I-51
  basic tree model, I-32
  bootstrapping, I-44
  commands, I-43
  compared to discriminant analysis, I-36, I-39
  data format, I-44
  displays, I-41
  examples, I-45, I-47, I-49
  loss functions, I-38, I-41
  missing data, I-51
  mobiles, I-31
  model, I-41
  overview, I-31
  pruning, I-37
  Quick Graphs, I-44
  saving files, I-44
  stopping criteria, I-37, I-43
  usage, I-44
cluster analysis
  additive trees, I-68
  algorithms, I-84
  bootstrapping, I-70
  commands, I-69
  data types, I-70
  distances, I-66
  examples, I-71, I-75, I-78, I-79, I-81, I-82
  exclusive clusters, I-54
  hierarchical clustering, I-64
  k-means clustering, I-67
  missing values, I-84
  overlapping clusters, I-54
  overview, I-53
  Quick Graphs, I-70
  saving files, I-70
  usage, I-70
clustered data, II-47
Cochran’s test of linear trend, I-168


coefficient of alienation, II-123, II-143
coefficient of determination
  See multiple correlation
coefficient of variation, I-211
Cohen’s kappa, I-164, I-168
communalities, I-332
compound symmetry, I-394
conditional logistic regression model, I-552
conditional logit model, I-554
confidence curves, II-156
confidence intervals, I-11, I-211
  path analysis, II-285
conjoint analysis
  additive tables, I-88
  algorithms, I-112
  bootstrapping, I-95
  commands, I-95
  compared to logistic regression, I-92
  data format, I-95
  examples, I-96, I-100, I-103, I-107
  missing data, I-113
  model, I-93
  multiplicative tables, I-89
  overview, I-87
  Quick Graphs, I-95
  saving files, I-95
  usage, I-95
constraints
  in mixture designs, I-242
contingency coefficient, I-165, I-168
contour plots, II-506
contrast coefficients, I-393
contrasts
  in analysis of variance, I-390
convex hulls, II-505
Cook’s distance, I-375
Cook-Weisberg graphical confidence curves, II-156
coordinate exchange method, I-245, I-270
correlation matrix, II-12
correlations, I-55, I-115
  algorithms, I-145
  binary data, I-126
  bootstrapping, I-129
  canonical, II-407
  commands, I-128
  continuous data, I-125
  data format, I-129
  dissimilarity measures, I-126
  distance measures, I-126
  examples, I-129, I-132, I-134, I-135, I-137, I-140, I-143, I-145
  missing values, I-124, I-146, II-12
  options, I-127
  power analysis, II-311, II-317, II-336, II-338
  Quick Graphs, I-129
  rank-order data, I-126
  saving files, I-129
  set, II-407
  usage, I-129
correlograms, II-509
correspondence analysis, II-294, II-298
  algorithms, I-156
  bootstrapping, I-150
  commands, I-150
  data format, I-150
  examples, I-151, I-153
  missing data, I-149, I-156
  model, I-149
  multiple correspondence analysis, I-149
  overview, I-147
  Quick Graphs, I-150
  simple correspondence analysis, I-149
  usage, I-150
covariance matrix, I-125, II-12
covariance paths
  path analysis, II-236
covariograms, II-495
Cramer’s V, I-165
critical level, I-13
Cronbach’s alpha, II-604, II-605
  See descriptive statistics, I-214
cross-correlation plots, II-643
crossover designs, I-487
crosstabulation
  bootstrapping, I-172
  commands, I-171
  data format, I-172


  examples, I-173, I-175, I-177, I-178, I-179, I-181, I-186, I-188, I-189, I-192, I-194, I-196, I-197, I-199, I-200
  multiway, I-170
  one-way, I-158, I-160, I-166
  overview, I-157
  Quick Graphs, I-172
  standardizing tables, I-159
  two-way, I-158, I-161, I-167, I-168
  usage, I-172
cross-validation, I-37, I-280, I-380

D matrix, I-499
D SUB-A (da), II-433
dates, II-536
degrees-of-freedom, II-586
dendrograms, I-57, I-70
dependence paths
  path analysis, II-235
descriptive statistics, I-1
  basic statistics, I-211, I-212
  bootstrapping, I-215
  commands, I-215
  Cronbach’s alpha, I-214
  data format, I-215
  overview, I-205
  Quick Graphs, I-215
  stem-and-leaf plots, I-213
  usage, I-215
design of experiments, I-92, I-250, I-251
  axial designs, I-242
  bootstrapping, I-252
  Box-Behnken designs, I-238
  central composite designs, I-238
  centroid designs, I-241
  commands, I-252
  examples, I-253, I-254, I-256, I-258, I-260, I-263, I-264, I-265, I-266, I-269, I-270
  factorial designs, I-231, I-232
  lattice designs, I-241
  mixture designs, I-232, I-239
  optimal designs, I-232, I-244
  overview, I-227
  Quick Graphs, I-252
  response surface designs, I-232, I-236
  screening designs, I-242
  usage, I-252
determinant criterion
  See D-optimality
dichotomy coefficients
  Anderberg, I-126
  Jaccard, I-126
  positive matching, I-126
  simple matching, I-126
  Tanimoto, I-126
difference contrasts, I-498
difficulty, II-621
discrete choice model, I-554
  compared to polytomous logit, I-555
discrete gaussian convolution, II-470
discriminant analysis, I-487
  bootstrapping, I-288
  commands, I-287
  compared to classification trees, I-36
  data format, I-288
  estimation, I-284
  examples, I-288, I-293, I-298, I-306, I-313, I-315, I-321
  linear discriminant function, I-280
  linear discriminant model, I-276
  model, I-283
  multiple groups, I-282
  options, I-284
  overview, I-275
  prior probabilities, I-282
  Quick Graphs, I-288
  statistics, I-286
  stepwise estimation, I-284
  usage, I-288
discrimination parameter, II-621
dissimilarities
  direct, II-121
  indirect, II-121
distance measures, I-55, I-115
distances
  nearest-neighbor, II-503
distance-weighted least squares (DWLS) smoother, II-470
dit plots, I-15


D-optimality, I-246
dot histogram plots, I-15
D-PRIME (d’), II-444
dummy codes, I-490
Duncan’s test, I-390
Dunnett test, I-493
Dunn-Sidak test, I-127, II-603

ECVI, II-287
edge effects, II-517
effect size
  in power analysis, II-315
effects codes, I-383, I-490
efficiency, I-244
eigenvalues, I-286
ellipse model
  in perceptual mapping, II-298
EM algorithm, I-362
EM estimation, II-8
  for correlations, I-127, II-12
  for covariances, II-12
  for SSCP matrix, II-12
endogenous variables
  path analysis, II-236
Epanechnikov kernel, II-475, II-484, II-485
equamax rotation, I-333, I-337
Euclidean distances, II-121
exogenous variables
  path analysis, II-236
expected cross-validation index, II-287
exponential distribution, II-550
exponential model, II-510, II-520
exponential smoothing, II-650
external unfolding, II-296

F distribution
  noncentrality parameter, II-327, II-356
factor analysis, I-331, II-294
  algorithms, I-362
  bootstrapping, I-339
  commands, I-339
  compared to principal components analysis, I-334
  convergence, I-335
  correlations vs covariances, I-331
  data format, I-339
  eigenvalues, I-335
  eigenvectors, I-338
  examples, I-341, I-344, I-348, I-350, I-353, I-356
  iterated principal axis, I-335
  loadings, I-338
  maximum likelihood, I-335
  missing values, I-362
  number of factors, I-335
  overview, I-327
  principal components, I-335
  Quick Graphs, I-339
  residuals, I-338
  rotation, I-333, I-337
  save, I-338
  scores, I-338
  usage, I-339
factor loadings, II-616
factorial analysis of variance, I-387
factorial designs, I-231, I-232
  analysis of, I-235
  examples, I-253
  fractional factorials, I-234
  full factorial designs, I-234
Fedorov method, I-245
Fieller bounds, I-581
filters, II-652
Fisher’s LSD, I-389, I-493
Fisher’s exact test, I-164, I-168
Fisher’s linear discriminant function, II-294
fixed variance
  path analysis, II-238
fixed-bandwidth method
  compared to KNN method, II-479
  for smoothing, II-477, II-479, II-484
Fletcher-Powell minimization, II-632
forward selection, I-379
Fourier analysis, II-651, II-664
fractional factorial designs, I-487


  Box-Hunter designs, I-235
  examples, I-254, I-256, I-258, I-260, I-263
  homogeneous fractional designs, I-235
  Latin square designs, I-235
  mixed-level fractional designs, I-235
  Plackett-Burman designs, I-235
  Taguchi designs, I-235
Freeman-Tukey deviates, I-622
frequencies, I-20, I-44, I-95, I-129, I-150, I-172, I-215, I-288, I-339, I-403, I-438, I-501, I-565, I-626, II-15, II-64, II-127, II-165, II-206, II-223, II-249, II-301, II-359, II-389, II-417, II-437, II-475, II-515, II-548, II-592, II-609, II-653, II-685
frequency tables, See crosstabulation
Friedman test, II-202

gamma coefficients, I-126
Gaussian kernel, II-463, II-472, II-473
Gaussian model, II-498, II-508
Gauss-Newton method, II-155, II-156
general linear models
  algorithms, I-546
  bootstrapping, I-501
  categorical variables, I-490
  commands, I-501
  contrasts, I-495, I-497, I-498, I-499
  data format, I-501
  examples, I-503, I-510, I-512, I-513, I-515, I-518, I-520, I-523, I-532, I-535, I-536, I-540, I-544, I-545
  hypothesis tests, I-495
  mixture model, I-492
  model estimation, I-488
  overview, I-487
  post hoc tests, I-493
  Quick Graphs, I-501
  repeated measures, I-491
  residuals, I-488
  stepwise regression, I-492
  usage, I-501
generalized least squares, II-247, II-683
generalized variance, II-410
geostatistical models, II-494, II-495
Gini index, I-38, I-41
GLM
  See general linear models
global criterion
  See G-optimality
Goodman-Kruskal gamma, I-126, I-165, I-168
Goodman-Kruskal lambda, I-168
G-optimality, I-246
Graeco-Latin square designs, I-235
Greenhouse-Geisser statistic, I-395
Guttman mu2 monotonicity coefficients, I-119, I-126
Guttman’s coefficient of alienation, II-123
Guttman’s loss function, II-143
Guttman-Rulon coefficient, II-605

Hadi outlier detection, I-123, I-127
Hampel procedure, II-162
Hanning weights, II-626
hazard function
  heterogeneity, II-541
heteroskedasticity, II-682
heteroskedasticity-consistent standard errors, II-683
hierarchical clustering, I-56, I-64
hierarchical linear models
  See mixed regression
hinge, I-207
histograms
  nearest-neighbor, II-515
hole model, II-499, II-508
Holt’s method, II-638
Huber procedure, II-162
Huynh-Feldt statistic, I-395
hyper-Graeco-Latin square designs, I-235
hypothesis
  alternative, I-13
  null, I-13
  testing, I-12, I-371


ID3, I-37
incomplete block designs, I-487
independence, I-161
  in loglinear models, I-618
INDSCAL model, II-119
inertia, I-148
inferential statistics, I-7, II-312
instrumental variables, II-681
internal-consistency, II-605
interquartile range, I-207
interval censored data, II-534
inverse-distance smoother, II-470
isotropic, II-495
item-response analysis
  See test item analysis
item-test correlations, II-604

Jaccard dichotomy coefficients, I-120, I-126
jackknife, I-18, I-20
jackknifed classification matrix, I-280

k nearest-neighbors method
  compared to fixed-bandwidth method, II-467
  for smoothing, II-465, II-472
Kendall’s tau-b coefficients, I-126, I-165, I-168
kernel functions, II-460, II-462
  biweight, II-463, II-472, II-473
  Cauchy, II-463, II-472, II-473
  Epanechnikov, II-463, II-472, II-473
  Gaussian, II-463, II-472, II-473
  plotting, II-463
  relationship with bandwidth, II-466
  tricube, II-463, II-472, II-473
  triweight, II-463, II-472, II-473
  uniform, II-463, II-472
k-exchange method, I-245
k-means clustering, I-60, I-67
Kolmogorov-Smirnov test, II-200
KR20, II-605
kriging, II-506
  ordinary, II-501, II-512
  simple, II-501, II-512
  trend components, II-501
  universal, II-502
Kruskal’s loss function, II-142
Kruskal’s STRESS, II-123
Kruskal-Wallis test, II-198, II-199
Kukoc statistic, 7
Kulczynski measure, I-126
kurtosis, I-211

lags
  number of lags, II-496
latent trait model, II-604, II-606
Latin square designs, I-235, I-258, I-487
lattice, II-220
lattice designs, I-241
Lawley-Hotelling trace, I-286
least absolute deviations, II-154
Levene test, I-388
leverage, I-376
likelihood ratio chi-square, I-164, I-620
  compared to Pearson chi-square, I-620
likelihood-ratio chi-square, I-168, I-622
Lilliefors test, II-217
linear contrasts, I-390
linear discriminant function, I-280
linear discriminant model, I-276
linear models
  analysis of variance, I-431
  general linear models, I-487
  hierarchical, II-47
  linear regression, I-399
linear regression, I-11, I-371, II-330
  bootstrapping, I-403
  commands, I-403
  data format, I-403
  estimation, I-401


  examples, I-404, I-407, I-410, I-413, I-417, I-420, I-424, I-426, I-427, I-428, I-429
  model, I-400
  overview, I-399
  Quick Graphs, I-403
  residuals, I-373, I-400
  stepwise, I-379, I-401
  tolerance, I-401
  usage, I-403
  using correlation matrix as input, I-381
  using covariance matrix as input, I-381
  using SSCP matrix as input, I-381
listwise deletion, I-362, II-3
Little MCAR test, II-1, II-11, II-12, II-31
loadings, I-330, I-331
LOESS smoothing, II-470, II-472, II-476, II-477, II-480, II-489
logistic item-response analysis, II-620
  one-parameter model, II-606
  two-parameter model, II-606
logistic regression
  algorithms, I-609
  bootstrapping, I-565
  categorical predictors, I-558
  compared to conjoint analysis, I-92
  compared to linear model, I-550
  conditional variables, I-557
  confidence intervals, I-581
  convergence, I-560
  data format, I-565
  deciles of risk, I-561
  discrete choice, I-559
  dummy coding, I-558
  effect coding, I-558
  estimation, I-560
  examples, I-566, I-568, I-569, I-574, I-579, I-582, I-591, I-598, I-600, I-604, I-607
  missing data, I-609
  model, I-557
  options, I-560
  overview, I-549
  post hoc tests, I-563
  prediction table, I-557
  print options, I-565
  quantiles, I-562, I-582
  Quick Graphs, I-565
  simulation, I-563
  stepwise estimation, I-560
  tolerance, I-560
  usage, I-565
  weights, I-565
logit model, I-551
loglinear modeling
  bootstrapping, I-626
  commands, I-626
  compared to analysis of variance, I-619
  compared to Crosstabs, I-625
  convergence, I-621
  data format, I-626
  examples, I-627, I-638, I-641, I-645
  frequency tables, I-625
  model, I-621
  overview, I-617
  parameters, I-622
  Quick Graphs, I-626
  saturated models, I-619
  statistics, I-622
  structural zeros, I-623
  usage, I-626
log-logistic distribution, II-538
log-normal distribution, II-538
longitudinal data, II-47
loss functions, I-38, II-151
  multidimensional scaling, II-142
LOWESS smoothing, II-627
low-pass filter, II-640
LSD test, I-432, I-493

madograms, II-509
Mahalanobis distances, I-276, I-286
Mann-Whitney test, II-198, II-199
MANOVA
  See analysis of variance
Mantel-Haenszel test, I-170
MAR, II-9
Marquardt method, II-159
Marron & Nolan canonical kernel width, II-466, II-472


mass, I-148
matrix displays, I-57
maximum likelihood estimates, II-152
maximum likelihood factor analysis, I-334
maximum Wishart likelihood, II-247
MCAR, II-9
MCAR test, II-1, II-12, II-31
McFadden’s conditional logit model, I-554
McNemar’s test, I-164, I-168
MDPREF, II-298, II-299
MDS
  See multidimensional scaling
mean, I-3, I-207, I-211
mean smoothing, II-468, II-472, II-474
means coding, I-384
median, I-4, I-206, I-211
median smoothing, II-468
meta-analysis, I-382
MGLH, See general linear models
midrange, I-207
minimum spanning trees, II-503
Minkowski metric, II-123
MIS function, II-20
missing value analysis
  algorithms, II-46
  bootstrapping, II-15
  casewise pattern table, II-20
  data format, II-15
  EM algorithm, II-8, II-12, II-25, II-33, II-42
  examples, II-15, II-20, II-25, II-33, II-42
  listwise deletion, II-3, II-25, II-33
  MISSING command, II-14
  missing value patterns, II-15
  model, II-12
  outliers, II-12
  overview, II-1
  pairwise deletion, II-3, II-25, II-33
  pattern variables, II-2, II-42
  Quick Graphs, II-15
  randomness, II-9
  regression imputation, II-6, II-12, II-25, II-42
  saving estimates, II-12, II-15
  unconditional mean imputation, II-4
  usage, II-15
mixed regression
  algorithms, II-117
  bootstrapping, II-64
  commands, II-64
  data format, II-64
  examples, II-65, II-74, II-84, II-104
  overview, II-47
  Quick Graphs, II-64
  usage, II-64
mixture designs, I-232, I-239
  analysis of, I-243
  axial designs, I-242
  centroid designs, I-241
  constraints, I-242
  examples, I-265, I-266
  lattice designs, I-241
  Scheffé model, I-243
  screening designs, I-242
  simplex, I-241
models, I-10
  estimation, I-10
mosaic plots, II-506
moving average, II-465, II-625, II-631
moving-averages smoother, II-470
mu2 monotonicity coefficients, I-126
multidimensional scaling, II-294
  algorithms, II-142
  assumptions, II-120
  bootstrapping, II-127
  commands, II-127
  configuration, II-123, II-126
  confirmatory, II-126
  convergence, II-123
  data format, II-127
  dissimilarities, II-121
  distance metric, II-123
  examples, II-128, II-130, II-132, II-136, II-140
  Guttman method, II-143
  individual differences, II-119
  Kruskal method, II-142
  log function, II-123
  loss functions, II-123
  matrix shape, II-123


  metric, II-123
  missing values, II-144
  nonmetric, II-123
  overview, II-119
  power function, II-123
  Quick Graphs, II-127
  residuals, II-123
  Shepard diagrams, II-123, II-127
  usage, II-127
multilevel models
  See mixed regression
multinomial logit, I-552
  compared to binary logit, I-552
multiple correlation, I-372
multiple correspondence analysis, I-148
multiple regression, I-376
multivariate analysis of variance, I-396
mutually exclusive, I-160

Nadaraya-Watson smoother, II-470
nesting, I-487
Newman-Keuls test, I-390
Newton-Raphson method, I-617
nodes, I-33
nominal data, II-199
noncentral F distribution, II-327, II-356
noncentrality parameters, II-327
nonlinear models, II-147
  algorithms, II-194
  bootstrapping, II-165
  commands, II-165
  computation, II-159, II-194
  convergence, II-159
  data format, II-165
  estimation, II-155
  examples, II-166, II-169, II-172, II-174, II-177, II-179, II-180, II-182, II-185, II-190, II-192, II-193
  functions of parameters, II-161
  loss functions, II-151, II-156, II-163, II-164
  missing data, II-195
  model, II-156
  parameter bounds, II-159
  problems in, II-155
  Quick Graphs, II-165
  recalculation of parameters, II-160
  robust estimation, II-162
  starting values, II-159
  usage, II-165
nonmetric unfolding model, II-119
nonparametric statistics
  algorithms, II-217
  bootstrapping, II-206
  commands, II-201, II-203, II-205
  data format, II-206
  examples, II-206, II-208, II-209, II-211, II-212, II-214, II-216
  Friedman test, II-202
  independent samples tests, II-199, II-200
  Kolmogorov-Smirnov test, II-200, II-203
  Kruskal-Wallis test, II-199
  Mann-Whitney test, II-199
  one-sample tests, II-205
  overview, II-197
  Quick Graphs, II-206
  related variables tests, II-201, II-202
  sign test, II-201
  usage, II-206
  Wald-Wolfowitz runs test, II-205
  Wilcoxon signed-rank test, II-202
  Wilcoxon test, II-199
normal distribution, I-207
NPAR model, II-432
nugget, II-499
null hypothesis, I-12, II-312

oblimin rotation, I-333, I-337
observational studies, I-229
Occam’s razor, I-91
odds ratio, I-168
omni-directional variograms, II-496
optimal designs, I-232, I-244
  analysis of, I-246
  A-optimality, I-246
  candidate sets, I-245
  coordinate exchange method, I-245, I-270
  D-optimality, I-246


  efficiency criteria, I-246
  Fedorov method, I-245
  G-optimality, I-246
  k-exchange method, I-245
  model, I-247
  optimality criteria, I-246
optimality, I-244
ORDER, II-537
ordinal data, II-198
ordinary least squares, II-247
orthomax rotation, I-333, I-337

PACF plots, II-642
pairwise deletion, I-362, II-3
pairwise mean comparisons, I-389
parameters, I-10
parametric modeling, II-538
partial autocorrelation plots, II-632, II-633, II-642
partialing
  in set correlation, II-411
partially ordered scalogram analysis with coordinates
  algorithms, II-232
  bootstrapping, II-223
  commands, II-222
  convergence, II-222
  data format, II-223
  displays, II-221
  examples, II-224, II-225, II-228
  missing data, II-232
  model, II-222
  overview, II-219
  Quick Graphs, II-223
  usage, II-223
path analysis
  algorithms, II-284
  bootstrapping, II-249
  commands, II-249
  confidence intervals, II-247, II-285
  covariance paths, II-236
  covariance relationships, II-245
  data format, II-249
  dependence paths, II-235
  dependence relationships, II-243
  endogenous variables, II-236
  estimation, II-247
  examples, II-250, II-255, II-268, II-274
  exogenous variables, II-236
  fixed parameters, II-243, II-245
  fixed variance, II-238
  free parameters, II-243, II-245
  latent variables, II-247
  manifest variables, II-247
  measures of fit, II-285
  model, II-241, II-243
  overview, II-233
  path diagrams, II-233
  Quick Graphs, II-249
  starting values, II-247
  usage, II-249
  variance paths, II-236
path diagrams, II-233
Pearson chi-square, I-161, I-166, I-168, I-618, I-622
  compared to likelihood ratio chi-square, I-620
Pearson correlation, I-117, I-123, I-125
perceptual mapping
  algorithms, II-308
  bootstrapping, II-301
  commands, II-301
  data format, II-301
  examples, II-302, II-303, II-304, II-306
  methods, II-299
  missing data, II-308
  model, II-299
  overview, II-293
  Quick Graphs, II-301
  usage, II-301
periodograms, II-639
permutation tests, I-160
phi coefficient, I-38, I-41, I-165, I-168
Pillai trace, I-286
Plackett-Burman designs, I-235, I-263
point processes, II-494, II-502
polynomial contrasts, I-390, I-393, I-498
polynomial smoothing, II-468, II-472, II-474
pooled variances, II-588
populations, I-7


POSET, II-219
positive matching dichotomy coefficients, I-120
power, II-314
power analysis
  analysis of variance, II-311
  bootstrapping, II-359
  commands, II-358
  correlation coefficients, II-317, II-336, II-338
  correlations, II-311
  data format, II-359
  examples, II-360, II-363, II-366, II-369, II-374, II-378
  generic, II-327, II-356, II-374
  one sample t-test, II-318, II-345
  one sample z test, II-341
  one-way ANOVA, II-318, II-351, II-374
  overview, II-311
  paired t-test, II-318, II-346, II-363
  power curves, II-359
  proportions, II-311, II-317, II-332, II-333, II-360
  Quick Graphs, II-359
  randomized block designs, II-311
  t-tests, II-311, II-318, II-345, II-346, II-348, II-369
  two sample t-test, II-318, II-348, II-369
  two sample z test, II-342
  two-way ANOVA, II-319, II-353, II-378
  usage, II-359
  z tests, II-311, II-341, II-342
power curves, II-359
  overlaying curves, II-363
  response surfaces, II-363
Power model, II-498, II-508
preference curves, II-296
preference mapping, II-294
PREFMAP, II-299
principal components analysis, I-327, I-328, I-487
  coefficients, I-330
  compared to factor analysis, I-334
  compared to linear regression, I-329
  loadings, I-330
prior probabilities, I-282
probability plots, I-15, I-373
probit analysis
  algorithms, II-392
  bootstrapping, II-389
  categorical variables, II-387
  commands, II-388
  data format, II-389
  dummy coding, II-387
  effect coding, II-387
  examples, II-389, II-391
  interpretation, II-386
  missing data, II-392
  model, II-385, II-386
  overview, II-385
  Quick Graphs, II-389
  saving files, II-389
  usage, II-389
Procrustes rotations, II-298, II-299
proportional hazards models, II-539
proportions
  power analysis, II-311, II-317, II-332, II-333, II-360
p-value, II-312

QSK coefficient, I-126
quadrat counts, II-493, II-505, II-506
quadratic contrasts, I-390
quantile plots, II-540
quantitative symmetric dissimilarity coefficient, I-119
quartimax rotation, I-333, I-337
quasi-independence, I-623
Quasi-Newton method, II-155, II-156

random coefficient models
  See mixed regression
random effects
  in mixed regression, II-47
random fields, II-494
random samples, I-8
random variables, I-370
random walk, II-631
randomized block designs, I-487, II-330
  power analysis, II-311


range, I-207, I-211, II-499
rank-order coefficients, I-126
Rasch model, II-606
receiver operating characteristic curves
  See signal detection analysis
regression
  linear, I-11, I-399
  logistic, I-549
  two-stage least squares, II-681
  rank, II-395
  ridge, II-401
regression trees, I-35
  algorithms, I-51
  basic tree model, I-32
  bootstrapping, I-44
  commands, I-43
  compared to analysis of variance, I-35
  compared to stepwise regression, I-36
  data format, I-44
  displays, I-41
  examples, I-45, I-47, I-49
  loss functions, I-38, I-41
  missing data, I-51
  mobiles, I-31
  model, I-41
  overview, I-31
  pruning, I-37
  Quick Graphs, I-44
  saving files, I-44
  stopping criteria, I-37, I-43
  usage, I-44
reliability, II-605, II-607
repeated measures, I-393, I-491
  assumptions, I-394
response surface designs, I-232, I-236
  analysis of, I-239
  Box-Behnken designs, I-238
  central composite designs, I-238
  examples, I-264, I-269
  rotatability, I-237, I-238
response surfaces, I-92, II-156
right censored data, II-534
RMSEA, II-287
robust smoothing, II-468, II-472, II-474
robustness, II-199
ROC curves, II-431, II-432, II-437
root mean square error of approximation, II-286
rotatability
  in response surface designs, I-237
rotatable designs
  in response surface designs, I-238
rotation, I-333
running median smoothers, II-626
running-means smoother, II-470

Sakitt D, II-433
sample size, II-315, II-323
samples, I-8
sampling
  See bootstrap
saturated models
  loglinear modeling, I-619
scalogram
  See partially ordered scalogram analysis with coordinates
scatterplot matrix, I-117
Scheffé model
  in mixture designs, I-243
Scheffé test, I-389, I-432, I-493
screening designs, I-242
SD-RATIO, II-433
seasonal decomposition, II-637
second-order stationarity, II-495
semi-variograms, II-496, II-509
set correlations, II-407
  assumptions, II-408
  measures of association, II-409
  missing data, II-429
  partialing, II-408
  See canonical correlation analysis
Shepard diagrams, II-123, II-127
Shepard’s smoother, II-470
sign test, II-201
signal detection analysis
  algorithms, II-456
  bootstrapping, II-437
  chi-square model, II-433


  commands, II-436
  convergence, II-433
  data format, II-437
  examples, II-433, II-434, II-447, II-450, II-453, II-454
  exponential model, II-433
  gamma model, II-433
  logistic model, II-433
  missing data, II-457
  nonparametric model, II-433
  normal model, II-433
  overview, II-431
  Poisson model, II-433
  Quick Graphs, II-437
  ROC curves, II-431, II-437
  usage, II-437
  variables, II-433
sill, II-499
similarity measures, I-115
simple matching dichotomy coefficients, I-120
simplex, I-241
Simplex method, II-155, II-156
simulation, II-502
singular value decomposition, I-147, II-298, II-308
skewness, I-209, I-211
  positive, I-4
slope, I-376
smoothing, II-472, II-624
  bandwidth, II-460, II-465, II-472
  biweight kernel, II-463, II-472, II-473
  bootstrapping, II-475, II-477
  Cauchy kernel, II-463, II-472, II-473
  commands, II-474
  confidence intervals, II-477
  data format, II-475
  discontinuities, II-470
  discrete gaussian convolution, II-470
  distance-weighted least squares (DWLS), II-470
  Epanechnikov kernel, II-463, II-472, II-473
  examples, II-476, II-477, II-480, II-489
  fixed-bandwidth method, II-465, II-472
  Gaussian kernel, II-463, II-472, II-473
  grid points, II-471, II-472, II-489
  inverse-distance, II-470
  k nearest-neighbors method, II-465
  kernel functions, II-460, II-462, II-472, II-473
  LOESS smoothing, II-470, II-472, II-476, II-477, II-480, II-489
  Marron & Nolan canonical kernel width, II-466, II-472
  mean smoothing, II-468, II-472, II-474
  median smoothing, II-468
  methods, II-460, II-468, II-474
  model, II-472
  moving-averages, II-470
  Nadaraya-Watson, II-470
  nonparametric vs. parametric, II-460
  overview, II-459
  polynomial smoothing, II-468, II-472, II-474
  Quick Graphs, II-475
  residuals, II-471, II-475
  robust smoothing, II-468, II-472, II-474
  running-means, II-470
  saving results, II-472, II-475, II-476
  Shepard’s smoother, II-470
  step, II-470
  tied values, II-471
  tricube kernel, II-463, II-472, II-473
  trimmed mean smoothing, II-472, II-474
  triweight kernel, II-463, II-472, II-473
  uniform kernel, II-463, II-472
  usage, II-475
  window normalization, II-466, II-472
Somers’ d coefficients, I-165, I-168
sorting, I-5
Sosa statistic, 21, 66, 98
spaghetti plot, II-84
spatial statistics, II-493
  algorithms, II-530
  azimuth, II-509
  bootstrapping, II-515
  commands, II-513
  data, II-515
  dip, II-509
  examples, II-515, II-522, II-523, II-529
  grid, II-511
  kriging, II-501, II-506, II-512
  lags, II-509
  missing data, II-530
  model, II-493
  models, II-508


  nested models, II-500
  nesting structures, II-508
  nugget, II-508
  nugget effect, II-499
  plots, II-506
  point statistics, II-506
  Quick Graphs, II-515
  sill, II-499, II-508
  simulation, II-502, II-506
  trends, II-506
  variograms, II-496, II-506, II-509
Spearman coefficients, I-119, I-126, I-165
Spearman-Brown coefficient, II-605
specificities, I-332
spectral models, II-624
spherical model, II-497, II-508
split plot designs, I-487
split-half reliabilities, II-607
SSCP matrix, II-12
standard deviation, I-3, I-207, I-211
standard error of estimate, I-371
standard error of kurtosis, I-211
standard error of skewness, I-211
standard error of the mean, I-11, I-211
standardization, I-55
standardized alpha, II-605
standardized deviates, I-147, I-622
standardized values, I-6
stationarity, II-495, II-633
statistics
  defined, I-1
  descriptive, I-1
  inferential, I-7
  See descriptive statistics
stem-and-leaf plots, I-3, I-206
  See descriptive statistics, I-213
step smoother, II-470
stepwise regression, I-379, I-392, I-556
stochastic processes, II-494
stress, II-122, II-142
structural equation models
  See path analysis
Stuart’s tau-c coefficients, I-165, I-168
studentized residuals, I-373
subpopulations, I-209
subsampling, I-18
sum of cross-products matrix, I-125
sums of squares
  type I, I-391, I-396
  type II, I-409
  type III, I-392, I-397
  type IV, I-397
surface plots, II-506
survival analysis
  algorithms, II-572
  bootstrapping, II-548
  censoring, II-534, II-541, II-576
  centering, II-573
  coding variables, II-541
  commands, II-547
  convergence, II-578
  Cox regression, II-545
  data format, II-548
  estimation, II-543
  examples, II-549, II-552, II-553, II-557, II-560, II-562, II-567, II-569
  exponential model, II-545
  graphs, II-545
  logistic model, II-545
  log-likelihood, II-574
  log-normal model, II-545
  missing data, II-573
  model, II-541
  models, II-575
  overview, II-533
  parameters, II-573
  plots, II-535, II-577
  proportional hazards models, II-576
  Quick Graphs, II-548
  singular Hessian, II-575
  stepwise, II-578
  stepwise estimation, II-543
  tables, II-545
  time varying covariates, II-546
  usage, II-548
  variances, II-579
  Weibull model, II-545
symmetric matrix, I-117


t distributions, II-584
  compared to normal distributions, II-586
t tests
  assumptions, II-588
  Bonferroni adjustment, II-591
  bootstrapping, II-592
  commands, II-592
  confidence intervals, II-591
  data format, II-592
  degrees of freedom, II-586
  Dunn-Sidak adjustment, II-591
  examples, II-593, II-595, II-597, II-599, II-601
  one-sample, II-318, II-345, II-586, II-590
  overview, II-583
  paired, II-318, II-346, II-363, II-587, II-590
  power analysis, II-311, II-318, II-345, II-346, II-348
  Quick Graphs, II-592
  separate variances, II-588
  two-sample, II-318, II-348, II-369, II-587, II-589
  usage, II-592
Taguchi designs, I-235, I-260
Tanimoto dichotomy coefficients, I-120, I-126
tau-b coefficients, I-126, I-168
tau-c coefficients, I-168
test item analysis
  algorithms, II-620
  bootstrapping, II-609
  classical analysis, II-604, II-605, II-607, II-620
  commands, II-609
  data format, II-609
  examples, II-613, II-614, II-617
  logistic item-response analysis, II-606, II-608, II-620
  missing data, II-621
  overview, II-603
  Quick Graphs, II-609
  reliabilities, II-607
  scoring items, II-607, II-608
  statistics, II-609
  usage, II-609
tetrachoric correlation, I-120, I-121, I-126
theory of signal detectability (TSD), II-431
time domain models, II-624
time series, II-623
  algorithms, II-678
  ARIMA models, II-627, II-650
  bootstrapping, II-653
  clear series, II-645
  commands, II-644, II-646, II-649, II-650, II-651, II-653
  data format, II-653
  examples, II-654, II-655, II-656, II-658, II-661, II-662, II-663, II-665, II-666, II-670, II-676
  forecasts, II-648
  Fourier transformations, II-652
  missing values, II-623
  moving average, II-625, II-646
  overview, II-623
  plot labels, II-641
  plots, II-640, II-641, II-642, II-643
  Quick Graphs, II-653
  running means, II-626, II-646
  running medians, II-626, II-646
  seasonal adjustments, II-637, II-649
  smoothing, II-624, II-646, II-647, II-648
  stationarity, II-633
  transformations, II-644, II-645
  trends, II-648
  usage, II-653
tolerance, I-380
T-plots, II-640
trace criterion
  See A-optimality
transformations, I-209
tree clustering methods, I-37
tree diagrams, I-57
triangle inequality, II-120
tricube kernel, II-463, II-472, II-473
trimmed mean smoothing, II-472, II-474
triweight kernel, II-463, II-472, II-473
Tukey pairwise comparisons test, I-389, I-432, I-493
Tukey’s jackknife, I-18
twoing, I-38
two-stage least squares
  algorithms, II-692
  bootstrapping, II-685
  commands, II-685
  data format, II-685


  estimation, II-681
  examples, II-686, II-688, II-691
  heteroskedasticity-consistent standard errors, II-683
  lagged variables, II-683
  missing data, II-692
  model, II-683
  overview, II-681
  Quick Graphs, II-685
  usage, II-685
Type I error, II-314
type I sums of squares, I-391, I-396
Type II error, II-314
type II sums of squares, I-397
type III sums of squares, I-392, I-397
type IV sums of squares, I-397

unbalanced designs
  in analysis of variance, I-391
uncertainty coefficient, I-168
unfolding models, II-295
uniform kernel, II-463, II-472

variance, I-211
  of estimates, I-237
variance component models
  See mixed regression
variance of prediction, I-238
variance paths
  path analysis, II-236
varimax rotation, I-333, I-337
variograms, II-496, II-506, II-515
  model, II-497
vector model
  in perceptual mapping, II-297
Voronoi polygons, II-493, II-504, II-506

Wald-Wolfowitz runs test, II-205
wave model, II-499
Weibull distribution, II-538
weight, II-515
weighted running smoothing, II-626
weights, I-20, I-44, I-95, I-129, I-150, I-172, I-215, I-288, I-339, I-403, I-438, I-501, I-565, I-626, II-15, II-64, II-127, II-165, II-206, II-223, II-249, II-301, II-389, II-417, II-437, II-475, II-548, II-592, II-609, II-653, II-685
Wilcoxon signed-rank test, II-202
Wilcoxon test, II-199
Wilks’ lambda, I-280, I-286
Wilks’ trace, I-286
Winter’s three-parameter model, II-638
within-subjects differences
  in analysis of variance, I-394

Yates’ correction, I-164, I-168
y-intercept, I-376
Young’s S-STRESS, II-123
Yule’s Q, I-165, I-168
Yule’s Y, I-165, I-168

z tests
  one-sample, II-341
  power analysis, II-311, II-341, II-342
  two-sample, II-342
