principles and procedures of exploratory data analysis€¦ · exploratory data analysis (eda) is a...

30
Psychological Methods 1997, Vol. 2, No. 2, 131-160 Copyright 1997 by the Am n Psychological Association, Inc. 1082-989X/97/$3.00 Principles and Procedures of Exploratory Data Analysis John T. Behrens Arizona State University Exploratory data analysis (EDA) is a well-established statistical tradition that pro- vides conceptual and computational tools for discovering patterns to foster hypoth- esis development and refinement. These tools and attitudes complement the use of significance and hypothesis tests used in confirmatory data analysis (CDA). Al- though EDA complements rather than replaces CDA, use of CDA without EDA is seldom warranted. Even when well-specified theories are held, EDA helps one interpret the results of CDA and may reveal unexpected or misleading patterns in the data. This article introduces the central heuristics and computational tools of EDA and contrasts it with CDA and exploratory statistics in general. EDA tech- niques are illustrated using previously published psychological data. Changes in statistical training and practice are recommended to incorporate these tools. The widespread availability of software for graphi- cal data analysis and calls for increased use of explor- atory data analysis (EDA) on epistemic grounds (e.g. Cohen, 1994) have increased the visibility of EDA. Nevertheless, few psychologists receive explicit train- ing in the beliefs or procedures of this tradition. Huberty (1991) remarked that statistical texts are likely to give cursory references to common EDA techniques such as stem-and-leaf plots, box plots, or residual analysis and yet seldom integrate these tech- niques throughout a book. A survey of graduate train- ing programs in psychology corroborates such an im- pression (Aiken, West, Securest, & Reno, 1990). In this investigation, 37 (20%) of the 186 responding departments reported teaching some aspect of EDA in introductory graduate courses. However, the percent- age of institutions indicating that most or all students could apply a learned technique was as follows: (a) detection and treatment of influential data, 8%; (b) modern graphical display, 15%; (c) data transforma- tions, 31%; (d) alternatives to ordinary least squares (OLS) regression, 3%. These low levels of competen- I gratefully acknowledge comments and criticisms of ear- lier versions of this article, which were provided byRay- mond Miller, Joe Rodgers, Larry Toothaker, Alex Yu, and Dan Huston. Correspondence concerning this article should be ad- dressed to John T. Behrens, Methodological Studies, Divi- sion of Psychology in Education, Arizona State University, Tempe, Arizona 85287-0611. Electronic mail may be sent via Internet to [email protected]. cies and the generally bleak picture of methodological instruction presented by Aiken et al. (1990) indicate that little EDA makes its way into graduate training and even less makes its way out as usable skills. This essay introduces researchers to the philosoph- ical underpinnings and general heuristics of EDA in three sections. First, the background, rationale, and basic principles of EDA are presented. Next, a primer covers heuristics, prototypical beliefs, and procedures of EDA using examples from psychological research. The final section addresses implications of this analy- sis for psychological method and training. Background and First Principles What Is EDA? Unaware of historical precedent, researchers may develop their own definition of EDA from denotations of its name. Sometimes the term is used to mean exploratory analysis in general. Mulaik (1984), for example, discussed a long history of generic "explor- atory statistics" in response to an article concerning EDA (Good, 1983), and yet scarcely mentioned the specific tradition of EDA to be discussed in this essay. Sometimes the model-building approach of Box (e.g., 1980) is considered exploratory, although it relies more heavily on probabilistic measures than does EDA. In this article, EDA refers to a specific tradition of data analysis that stems from the work of John Tukey and his associates, which dates back to the early 1960s. This tradition of EDA can be loosely charac- terized by (a) an emphasis on the substantive under- 131

Upload: others

Post on 21-Jul-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

Psychological Methods1997, Vol. 2, No. 2, 131-160

Copyright 1997 by the Am n Psychological Association, Inc.1082-989X/97/$3.00

Principles and Procedures of Exploratory Data Analysis

John T. BehrensArizona State University

Exploratory data analysis (EDA) is a well-established statistical tradition that pro-vides conceptual and computational tools for discovering patterns to foster hypoth-esis development and refinement. These tools and attitudes complement the use ofsignificance and hypothesis tests used in confirmatory data analysis (CDA). Al-though EDA complements rather than replaces CDA, use of CDA without EDA isseldom warranted. Even when well-specified theories are held, EDA helps oneinterpret the results of CDA and may reveal unexpected or misleading patterns inthe data. This article introduces the central heuristics and computational tools ofEDA and contrasts it with CDA and exploratory statistics in general. EDA tech-niques are illustrated using previously published psychological data. Changes instatistical training and practice are recommended to incorporate these tools.

The widespread availability of software for graphi-cal data analysis and calls for increased use of explor-atory data analysis (EDA) on epistemic grounds (e.g.Cohen, 1994) have increased the visibility of EDA.Nevertheless, few psychologists receive explicit train-ing in the beliefs or procedures of this tradition.Huberty (1991) remarked that statistical texts arelikely to give cursory references to common EDAtechniques such as stem-and-leaf plots, box plots, orresidual analysis and yet seldom integrate these tech-niques throughout a book. A survey of graduate train-ing programs in psychology corroborates such an im-pression (Aiken, West, Securest, & Reno, 1990). Inthis investigation, 37 (20%) of the 186 respondingdepartments reported teaching some aspect of EDA inintroductory graduate courses. However, the percent-age of institutions indicating that most or all studentscould apply a learned technique was as follows: (a)detection and treatment of influential data, 8%; (b)modern graphical display, 15%; (c) data transforma-tions, 31%; (d) alternatives to ordinary least squares(OLS) regression, 3%. These low levels of competen-

I gratefully acknowledge comments and criticisms of ear-lier versions of this article, which were provided by Ray-mond Miller, Joe Rodgers, Larry Toothaker, Alex Yu, andDan Huston.

Correspondence concerning this article should be ad-dressed to John T. Behrens, Methodological Studies, Divi-sion of Psychology in Education, Arizona State University,Tempe, Arizona 85287-0611. Electronic mail may be sentvia Internet to [email protected].

cies and the generally bleak picture of methodologicalinstruction presented by Aiken et al. (1990) indicatethat little EDA makes its way into graduate trainingand even less makes its way out as usable skills.

This essay introduces researchers to the philosoph-ical underpinnings and general heuristics of EDA inthree sections. First, the background, rationale, andbasic principles of EDA are presented. Next, a primercovers heuristics, prototypical beliefs, and proceduresof EDA using examples from psychological research.The final section addresses implications of this analy-sis for psychological method and training.

Background and First Principles

What Is EDA?Unaware of historical precedent, researchers may

develop their own definition of EDA from denotationsof its name. Sometimes the term is used to meanexploratory analysis in general. Mulaik (1984), forexample, discussed a long history of generic "explor-atory statistics" in response to an article concerningEDA (Good, 1983), and yet scarcely mentioned thespecific tradition of EDA to be discussed in this essay.Sometimes the model-building approach of Box (e.g.,1980) is considered exploratory, although it reliesmore heavily on probabilistic measures than doesEDA.

In this article, EDA refers to a specific tradition ofdata analysis that stems from the work of John Tukeyand his associates, which dates back to the early1960s. This tradition of EDA can be loosely charac-terized by (a) an emphasis on the substantive under-

131

Page 2: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

132 BEHRENS

standing of data that address the broad question of"what is going on here?" (b) an emphasis on graphicrepresentations of data; (c) a focus on tentative modelbuilding and hypothesis generation in an iterative pro-cess of model specification, residual analysis, andmodel respecification; (d) use of robust measures, re-expression, and subset analysis; and (e) positions ofskepticism, flexibility, and ecumenism regardingwhich methods to apply.

The goal of EDA is to discover patterns in data.Tukey often likened EDA to detective work. The roleof the data analyst is to listen to the data in as manyways as possible until a plausible "story" of the datais apparent, even if such a description would not beborne out in subsequent samples. Finch (1979) as-serted that "we claim for exploratory investigation nomore than that it is an activity directed toward theformation of analogy. The end of it is simply a state-ment that the data look as if they could reasonably bethought of in such and such a way" (p. 189).

Classical works in this tradition are Tukey's Ex-ploratory Data Analysis (1977); Mosteller andTukey's Data Analysis and Regression: A SecondCourse in Statistics (1977); Hoaglin, Mosteller, andTukey's studies (1983b, 1985, 1991); volumes three,four, and five of Tukey's collected works (Cleveland,1988; Jones, 1986a, 1986b); and Velleman and Hoa-glin's work (1981, 1992). Summaries of EDA havebeen presented by Hartwig and Dearing (1979), Lein-hardt and Leinhardt (1980), Leinhardt and Wasser-man (1979), and more recently by Behrens and Smith(1996) and Smith and Prentice (1993). Erickson andNosanchuk's (1992) text is for a first course in dataanalysis that presents a balanced presentation of bothEDA and confirmatory data analysis (CDA). Behrens(1996) provided on-line materials for teaching EDA.Although exploratory techniques have been devel-oped by others, Tukey and his associates began theendeavor and continue to lead the articulation of thepurpose and constraints necessary for reasonableEDA (cf. Hoaglin et al., 1991). Tukey (1969) recom-mended the EDA approach to psychologists at the1968 meeting of the American Psychological Asso-ciation in a paper entitled, "Analyzing Data: Sancti-fication or Detective Work?" Since that time, surpris-ingly few have responded.

The Need for EDA

Most psychologists are well trained in testing sta-tistical hypotheses at the end of an investigation. Nev-

ertheless, the scientific process of model building andtesting often requires learning from the data at allstages of research. For example, while conducting aregression analysis, one may be interested in assess-ing the specific hypothesis that a particular (3j = 0 ina model with X, and X2. When assessing the status ofprespecified statistical issues, the researcher is work-ing in what Mayer (1980) called the confirmatorymode. More often, however, researchers are con-cerned with a broader range of questions about thedata than the statistical significance of the partialedslopes: What if responses on X2 occurred only at twolevels rather than across all possible levels of thescale? Are there extreme values that unduly affect theestimation of the slopes? Is the shape of the data in thescatter plot like an ellipse, a horseshoe, or a banana?Is there something misleading me? When addressingsuch a broad set of questions, a researcher is workingin an exploratory mode. Because the goals of the twomodes of data analysis are different, the modes arecomplementary rather than antagonistic.

In contrast to EDA, most training in CDA fails toaddress the early and messy stages of data analysis.This practice constitutes what McGuire (1989) calledthe hypothesis testing myth. He argued that we do adisservice to training and practice by glossing over orignoring preliminary data analyses during which werefine hypotheses, evaluate and clarify our auxiliaryassumptions, and simply make sure our mental modelof the data is well aligned with reality.

Exploratory and Confirmatory

CDA is often likened to Anglo-Saxon jury trials:Researchers play the role of prosecutor, data collec-tion serves as the trial proceeding, and statisticalanalysis plays the role of jury decision (Kraemer &Thiemann, 1987; Tukey, 1977). The detective anal-ogy for EDA fits well with this jurisprudence modelbecause the role of the detective is to establish pre-trial evidence and hunches, the veracity of whichare tested at the trial. Kraemer and Thiemann pointedout that prosecutors examine preliminary evidencebefore deciding whether to prosecute or not. Theyequate this process with EDA and other pretrialevidence gathering such as power- or meta-analysis.By using both exploratory and confirmatory tech-niques, a data analyst collects complete pretrial evi-dence and brings the full weight of CDA to bear atthe trial.

In a trial, rules of presenting and evaluating evi-

Page 3: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 133

dence are as well established as the rules of statisticalinference. To make the strong claim of innocence orguilt (significance or nonsignificance), one uses spe-cific rules and procedures with strict interpretations ofthe data. In EDA, the goal is not to draw conclusionsregarding guilt and innocence but rather to investigatethe actors, generate hunches, and provide preliminaryevidence. EDA is more like an interrogation in whichclean and corrupted stories are told, whereas CDA istestimony regarding evidence that fits carefully laid-out trial procedures. The goal of EDA is indictment;the goal of CDA is conviction (Behrens & Smith,1996).

There is, however, a point at which the trial analogybreaks down. In a jury trial a witness may be used toboth formulate and test hunches. Alternatively, in sci-entific practice different data must be used for modelformulation (EDA) and testing (CDA). Failure to rec-ognize this important fact will lead to inflation ofType I error and overfilling. Along these lines Giere(1984, cited in Howson & Urbach, 1993) argued:

If the known facts were used in constructing the modeland were thus built into the resulting hypothesis . .. thenthe fit between these facts and the hypothesis providesno evidence that the hypothesis is true [since] these factshad no chance of refuting the hypothesis, (p. 408)

When sufficiently large samples are available, Ihe ex-ploratory dala analysl is likely to conducl EDA on onedala sel to generate hypotheses and assess Ihe modelon anolher. The importance of dislinguishing betweenmodel building and testing led Mosteller and Tukey(1977) lo state that "we plan to cross-validate care-fully wherever we can" (p. 40). Cross-validationmeans that when patterns are discovered, they areconsidered provisional (consistent with the EDAmode) until their presence is tested in different data.

Tukey (1972/1986b) discussed data analysis as acontinuum from EDA to CDA. Between the two isan intermediate mode called rough confirmatoryanalysis. In EDA the researcher entertains numeroushypotheses, looks for patterns, and suggests hypo-theses based on the data, with or without theoreticalgrounding. Working in this mode, the researcherbegins to delineate a set of plausible models andseeks rich descriptions of the data. In rough CDA,the researcher undertakes initial assessment of theplausible models using probabilistic approaches suchas confidence intervals or significance tests (cf. Be-hrens & Smith, 1996). In this step the researcher an-swers the question, "With what accuracy are the ap-

pearances already found to be believed?" (Tukey,1972/1986b, p. 760). In the confirmatory mode, re-searchers work to test specific hypotheses using astrict probabilistic framework following a decisiontheoretic approach.

When trained in all three modes of data analysis, aresearcher is likely to move fluidly between modesand work in multiple modes on the same problem. Forexample, a researcher may have strict hypothesesabout main effects in a factorial analysis of variance(ANOVA) and yet have no hypotheses concerningpossible interactions. Working in a strict confirmatorymode, the researcher would compute only the maineffects test and ignore possible interactions. Workingin multiple modes, a researcher would likewise statethe hypothesis for main effects and test them usingstrict CDA. At the same time, however, the researcherworking in multiple modes would explore possibleinteractions with statistical graphics, resistant sum-mary statistics, and even loosely interpreted signifi-cance tests. Patterns of unexpected outcomes wouldbe regarded as starting points for hypothesis genera-tion and future testing rather than as statistical con-clusions. In addition, the researcher familiar withEDA will also explore data patterns associated withthe hypothesized mam effect to make sure the CDAwas not misled by unrecognized patterns that can leadto conclusions inconsistent with the data.

Tukey summarized the relation between thesemodes of data analysis, arguing "(a) both explorationand confirmation are important, (b) exploration comesfirst, (c) any given study can, and usually should,combine both" (Tukey, 1980/1986e, p. 822; cf.Tukey, 1980). Tukey (1982/1986d) presented a moredetailed analysis of levels and types of data analysisfollowing this framework. Joreskog and Sorbom(1993) presented a similar discussion of situations of(a) model generating, (b) analysis of competing mod-els, and (c) strictly confirmatory analysis of a singlemodel, all in the context of structural equation mod-eling (p. 115).

Researchers who want to know more about theirdata than they have hypothesized sometimes use con-firmatory methods while working in a pseudoconfir-matory mode. Examples of this kind of behavior in-clude interpreting unexpected interactions as ifhypothesized and computing t tests or chi-square ho-mogeneity tests on myriad possibilities. This is notEDA. In these cases, researchers have exploratorygoals but are approaching them by using confirmatorytools, assumptions, and conclusions improperly. EDA

Page 4: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

134 BEHRENS

helps avoid these improper approaches by being clearabout hypothesis specificity and conclusion strengthand by providing a language for different stages andpurposes of data analysis.

EDA and Other Exploratory Methods

Exploratory statistics is an interesting point ofconvergence between classical CDA and EDA. Anumber of apparently confirmatory techniques areexploratory in their goals, including stepwise regres-sion, some forms of factor analysis, cluster analysis,discriminant analysis, and many applications ofstructural equation modeling. These and other meth-ods are exploratory when the researcher is trying todetermine a "best" set of variables or the "best"model for a sample rather than testing a prespecifiedmodel for a specific population. For example, so-called confirmatory factor analysis via structuralequation models becomes exploratory when a numberof alternate models are assessed. The exploratory na-ture of these techniques underscores the idea that dataexploration and the integration of empirical and theo-retical knowledge are well-established aspects of sci-entific psychology.

Given that EDA is not simply a set of techniquesbut an attitude toward the data (Tukey, 1977), areresearchers conducting EDA when they compute ex-ploratory factor analysis or other exploratory statis-tics? The answer depends on how the analysis is con-ducted. A researcher may conduct an exploratoryfactor analysis without examining the data for pos-sible rogue values, outliers, or anomalies; fail to plotthe multivariate data to ensure the data avoid patho-logical patterns; and leave all decision making up tothe default computer settings. Such activity would notbe considered EDA because the researcher may beeasily misled by many aspects of the data or the com-puter package. Any description that would come fromthe factor analysis itself would rest on too many un-assessed assumptions to leave the exploratory dataanalyst comfortable. Henderson and Velleman (1981)demonstrated how an interactive (EDA based) ap-proach to stepwise regression can lead to markedlydifferent results than would be obtained by automatedvariable selection. This occurs because the researcherplots the data and residuals at each stage and therebyconsiders numerous patterns in the data while thecomputer program is blind to all aspects of the dataexcept the R2.

Although holding exploratory goals alone does notnecessarily imply EDA, use of exploratory proceduressuch as the plotting of simple summaries or the tabu-lation of simple descriptive statistics does not neces-sarily imply EDA either. In many cases, simple de-scriptive statistics or plots may hide importantpatterns as much as they reveal others.

Summary

EDA emphasizes that at different stages of researchthere are different types of questions, different levelsof hypothesis specificity used, and different levels ofconclusion specificity that are warranted. EDA doesnot call for the abandonment of CDA but rather forthe broadening of data analysis to incorporate a widerange of attitudes and techniques appropriate to thedifferent stages and questions in scientific work. Atthe same time, EDA is seen as indispensable in anyinvestigation: "Exploratory data analysis can neverbe the whole story, but nothing else can serve asthe foundation stone—as the first step" (Tukey, 1977,p. 3).

Beliefs, Heuristics, and Trademarks

Although Tukey often argues that EDA is an atti-tude rather than a set of tools, a number of heuristicshave been devised for EDA. To find patterns, revealstructure, and make tentative model assessments,EDA emphasizes the use of graphics and the processof iterative model fit and residual analysis. To avoidbeing fooled by unwarranted assumptions about thedata, EDA is a much more data-driven approach todata analysis than CDA. Because a complete catalog-ing of techniques is beyond the scope of this article,this section discusses major themes of EDA andpresents examples.

It cannot be overemphasized that an appropriatetechnique for EDA is determined not by computationbut rather by a procedure's purpose and use. Whetherresiduals are obtained from a computer program in-tended for CDA or EDA is not important. What isimportant is to obtain a rich description of the dataand to understand the relationship between the modeland patterns of residuals. The techniques describednext have been helpful in EDA, but techniques aresecondary to the goal of building rich mental modelsof the data. The reader may note that the proceduresdescribed are highly related, not simply a laundry list.

Page 5: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 135

Each aspect of EDA is used in concert with otheraspects so that a single isolated procedure is seldomused. Recommendations presented here are not nec-essarily unique to EDA. What is unique is the con-figuration of beliefs and procedures.

Understand the Context

To some, the analogy of the data analyst as detec-tive connotes someone entering an unknown arenaand cleverly finding patterns that may or may notreflect "true" effects. This connotation has led someto characterize EDA as naive empiricism gone amok(MacDonald, 1983). This description is inappropriatefor both a detective and someone conducting dataanalysis in an exploratory mode.

EDA shares a view of the interaction of priorknowledge and data analysis similar to the postposi-tivist position put forward by Donald Campbell.Campbell (1988) argued that quantitative knowing isdependent on qualitative knowing. This view holdsthat, in quantitative data analysis, numbers map ontoaspects of reality. Numbers themselves are meaning-less unless the data analyst understands the mappingprocess and the nexus of theory and categorization inwhich objects under study are conceptualized. Thisapproach closely matches Tukey's concern about ster-ilized, decontextualized approaches to data analysis.In his satirical list of "badmandments," Tukey(1986a) chided, "Never tell your statistical consultantabout the two most important recent papers in thefield of your own RESEARCH" (p. 205, emphasis inthe original). Similarly, Tukey (1979) argued that sub-stantive concerns must take precedence over statisti-cal convenience.

Data analysts working in either a confirmatory orexploratory mode are sometimes considered techni-cians who provide recipes for "number crunching."Countering this characterization, Tukey and others(Bode, Mosteller, Tukey, & Winsor, 1986) suggestedthe training of scientific generalists. They pointed outthat "statistics, as the doctrine of planning experi-ments and observations and of interpreting data, has acommon relation to all sciences" (p. 3). They con-cluded that the complexity of science requires trainingof generalists with broad depth and experience in ar-eas such as economics, psychology, and natural sci-ences. Boring (1919) expressed similar sentimentswhen he wrote that "statistical ability, divorced froma scientific intimacy with the fundamental observa-tions, leads nowhere" (p. 332).

Use Graphic Representations of Data

Graphical analysis is central to EDA. Tukey (1977)summed up the role of graphics in EDA by saying that"the greatest value of a picture is when it forces us tonotice what we never expected to see'' (p. vi). Graphi-cal summaries are almost universally sought to aug-ment algebraic summaries because graphics can por-tray numerous data values simultaneously, whilealgebraic summaries often sum over important attri-butes of the data or fail to suggest important patterns.For example, Cleveland (1985, reprinted in Behrens& Smith, 1996) provided the data in Table 1, whichrelates the average intelligence quotient for fathersand sons at each level of a number of social classes asreported by Burt (1961). Although the general posi-tive trend is evident in the table, the plot of the data inFigure 1 shows that the function underlying these datais so straight that it calls the veracity of the data intoquestion. Even this simple plot is striking because itserves Tukey's function of showing what we did notexpect. For an exploratory data analyst, graphical rep-resentation is the primary language.

Classical references in the area of statistical graph-ics include Benin (1983); Chambers, Cleveland,Kleiner, and Tukey (1983); Cleveland (1985, 1993);Cleveland and McGill (1988); Wainer and Thissen(1981, 1993); and Tufte (1983, 1990). Cook andWeisberg (1994) presented a comprehensive treat-ment of graphics for regression analysis.

Tukey (1977) developed a number of graphical de-vices used by exploratory analysts that are gainingwidespread use because of their incorporation in com-mon statistical packages. For example, Figure 2 is astem-and-leaf plot of effect sizes from studies exam-ining sex differences reported by Feingold (1994). Inthis meta-analysis, negative effect sizes indicate stud-ies in which males have average scores lower than

Table 1Intelligence Quotient (IQ> Data Provided by Burt (1961)

Social class Adult mean IQ Child mean IQ

Higher professionalLower professionalClericalSkilledSemiskilledUnskilled

139.7130.6115.9108.297.884.9

120.8114.7107.8104.698.992.6

Note. From The Elements of Graphing Data (p. 97). by W. S.Cleveland, 1985, Monterey. CA: Wadsworth. Copyright 1985 byW. S. Cleveland. Reprinted with permission.

Page 6: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

136 BEHRENS

ChildIQ

Higher Professional

Lower Professional .

Skilled

Clerical

Semi-skilled

90 105 120 135

Parent IQ

Figure 1. Plot of average intelligence quotient (IQ) forsons versus average IQ for fathers across different socialclasses. Original data are from Hurt (1961), and the originalplot is from The Elements of Graphing Data (p. 98), byW. S. Cleveland, 1985, Monterey, CA: Wadsworth. Copy-right 1985 by W. S. Cleveland. Reprinted with permission.

females, and positive effect sizes indicate studies inwhich males have higher average scores than females.Meta-analytic studies are especially well suited forEDA because the small number of effects and vari-ability in sample sizes can lead to wild imbalances inthe influence of individual observations. The proce-dures for sensitivity analysis (Greenhouse & lyengar,

-6-5-4-3-2-1-00123

4488135

0 2 2 3 5 568

0Figure 2. Stem-and-leaf plot of effect sizes for studiesexamining differences in anxiety across sexes as reported byFeingold (1994).

1994) and visual analysis (Light, Singer, & Willett,1994) of meta-analytic data described in The Hand-book of Research Synthesis (Cooper & Hedges, 1994)rely extensively, and explicitly, on EDA.

The stem-and-leaf plot shown in Figure 2 repre-sents a type of frequency table organized graphicallyto resemble a histogram while retaining informationabout the exact value of each observation. The leftside of the plot are the ' "stems" that mark intervals orbins; the right side of the plot contains "leaves,"which represent the detail of numbers occurring ineach bin. In this case the numbers on the left-handside of the plot indicate the lOths place value of effectsizes from the study, and those on the right indicatethe lOOths place that occur in each bin. In this way theplot indicates that the smallest effect sizes are -0.64and -0.54 whereas the largest is 0.30. The alignmentof the numbers allows a quick sense of frequency anddistribution shape, and close examination of indi-vidual values (the leaves) provides additional infor-mation about distributions within each bin. Stem-and-leaf plots will vary in appearance according to thenumber of bins used. In this case, as in all cases ofgraphic analysis, there is no single "plot of the data"but rather only one of many possible plots. When alarge number of data points are examined, the stem-and-leaf plot may become cumbersome.

A dot plot can be an effective tool to examine asingle distribution or compare a number of distribu-tions. Figure 3 is a dot plot of effect sizes character-izing the differences between males and femalesacross four personality traits from Feingold (1994).Each dot in the display represents the value of oneobservation. The general pattern is clear. Overall,males tend to be less anxious than females, are gen-erally more assertive, and have higher locus of controland self-esteem scores. Great variability across thesemeasures and an obvious high outlier for assertivenessare also easy to detect.

In the early stages of data analysis, it is often pref-erable to plot data directly because summaries mayhide data or distort one's visual impression of data.This maxim, however, needs to be balanced againstthe need for multiplicity of graphic summaries and theneed for parsimonious representation of numerousdata points (Yu & Behrens, 1995). When seeking ad-ditional structure in univariate distributions or when anumber of distributions need to be compared, a boxplot is often used. This is a shortened name for theoriginal box-and-whisker plot. Although box plotscome in many varieties (see Frigge, Hoaglin, & Ig-

Page 7: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 137

& Iw (?)

LocusAnxiety Assertiveness of Self Esteem

Control

Variable Measured

Figure 3. Dot plot of sex difference effect sizes for different affective and cognitive vari-ables reported by Feingold (1994).

lewicz, 1989), a common form is shown in Figure 4,which portrays the data from Figure 3. The box plotoffers a five-number summary in schematic form. Theends of a box mark the first and third quartiles, andthe median is indicated with a line positioned withinthe box.1 The ranges of most or all of the data in thetails of the distribution are marked using lines extend-ing away from the box, creating "whiskers" or"tails." Rules governing the construction of the whis-kers vary. One method suggested by Tukey (1977)was to extend the whisker to the most extreme value,not exceeding a distance of 1.5 times the interquartilespread (interquartile spread is the scale value of the75th percentile minus the value at the 25th percentile).In this scheme the tails will cover the middle 99.3% ofa Gaussian distribution. Data values occurring pastthis point are typically displayed individually, asshown in Figure 4.

Comparing the boxplot to the dotplot, one cansee that the box plot offers information about the lo-cation of key elements in the distribution (includingoutliers) and omits more subtle details. The sum-marizing function of this plot is especially usefulwhen a number of distributions are being compared.Other forms of the boxplot have been developed toindicate the confidence interval of the median byshading the center of the box or indenting the boxalong the length of the interval (cf. McGill, Tukey,& Larsen, 1978). Other modifications superimposedots over the boxes (Berk, 1994) or alter the appear-ance of the box (Stock & Behrens, 1991). Emerson

and Strenio (1983) presented a complete treatment ofbasic boxplot design.

Kernel density smoothers are graphic devices thatprovide estimates of a population shape, as seen inFigure 5. This smooth shape is arrived at by taking therelative frequency of data at each x value and aver-aging it with that of the surrounding data (Scott,1992). Figure 5 is a kernel density smooth of theFeingold anxiety data depicted in Figures 3 and 4.What is clear from this graphic is the near bimodalityof the data that is hidden in the boxplot and not ob-vious in the dotplot. By varying the type of averagingacross the data at each point, the size of the windowof averaging, or the weighting function used aroundeach point, the appearance of the plot can be varied tomake the plot appear more jagged or more smooth.Overlaying density functions from each distributionallows direct comparison of their shapes, as shown inFigure 6. From this we can see that the underlyingdistributions are quite similar, with the exception ofthe second mode of the anxiety distribution. Furtheranalysis of these data is warranted to ascertain wheth-er there are unique study characteristics associatedwith this group. This example underscores the impor-tance of multiple depictions of data and the impor-

1 More precisely, the key elements of the box plot arebased on statistics that Tukey called "hinges." These arerobust measures that generally match the quartiles, althoughslight differences may occur in some cases. See Frigge et al.(1989) for a discussion of these issues.

Page 8: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

138 BEHRENS

: *£ «,

Anxiety AssertivenessLocus

ofControl

Variable Measured

Figure 4. Box plot of effect sizes displayed in Figure 3. Data in the tails of the distributionare marked using lines extending away from the box. The O indicates outliers that deservespecial attention.

tanee of rejecting the notion of "the plot" of a set ofdata. Scott (1992) provided a full treatment of uni-variate and multivariate density-smoothing functions.

A major component of the detective work of EDAis the rough assessment of hunches, a quick look atthe question "could it be that.. ." or "what if it is thecase that " As Tukey and Wilk (1986) argued,citing Chamberlain (1965), "science is the holding ofmultiple working hypotheses." To hold and assessmultiple working hypotheses, data analysts dependheavily on interactive computer graphics. Interactivegraphics can be acted on directly by touching themwith the cursor or other pointing device. Another im-portant innovation, linked plots, are organized so thata change made to the color or shape of a point repre-senting an observation in one plot automatically changesthe appearance of the observation in all other plots.

In the case of the Feingold data, interactive graph-ics allow the selection of the observations in thesecond mode of the anxiety effect size data by draw-ing a rectangle around the observations of interest.When this is done, the observations are highlightedin all the linked windows. To determine whetherthere is a common effect in these data based on thecountry in which the study occurred, a plot of thedata organized by country is opened. Figure 7 is anillustration of how linking the two plots allows quickdetermination of covariation across variables. Thisfigure is a picture of a computer screen obtainedusing Data Desk (Data Description, Inc., 1995), al-though linking is common in most EDA-oriented

programs. The small window on top of the dot plotis a palette that allows rapid change of observationshape by selecting observations and pointing to thedesired shape. The highlighted portions of the barchart reflect the highlighted (second mode) por-tions of the dot plot. Of the nine highlighted obser-vations, three are from the United States and twoeach are from Israel, Canada, and Sweden. All ofthe positive effect sizes are from studies conductedin the United States, suggesting the possibility ofcountry-related effects. These data are limited in sizebut are consistent with the tentative hypothesis thatsex differences in reported anxiety vary as a func-tion of country of origin. Such an idea provides di-rection for conducting evaluations of other data sets andfor conducting cross-cultural studies in the future.

Because it is impossible to anticipate all relevantaspects of data in either experimental or nonexperi-mental work, it is difficult to overstate the value ofgraphics. The multiplicity of data patterns that canmatch a single mean led early psychologists to con-sistently report means with histograms. Changes fromthis convention were heatedly discussed. By 1935, aneditorial in Comparative Psychology (Dunlap, 1935)asked, "... Should we not exclude reports in whichthe group averages of performance are presentedwithout interpretive distributions?" (p. 3). Even thestaunchest proponents of CDA argued for balance inexploratory and confirmatory methods. Fisher's Sta-tistical Methods for Research Workers (1925) includedan entire chapter on "diagrams" that begins noting:

Page 9: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 139

Effect Size for Male Minus Female Scores on Anxiety

Figure 5. Density estimation plot of the anxiety effect size data indicating bimodality in theanxiety measures.

The preliminary examination of most data is facilitatedby the use of diagrams. Diagrams prove nothing, butbring outstanding features readily to the eye; they aretherefore no substitute for such critical tests as may beapplied to the data, but are valuable in suggesting suchtests, and in explaining the conclusions founded uponthem. (p. 24 of the llth edition)

Develop Models in an Iterative Process ofTentative Model Specification andResidual Assessment

When working in the exploratory mode, the dataanalyst takes the goal of developing a plausible de-

scription of the data using the framework: data = fit+ residual. Following a graphical analogy it is some-times said that data = smooth + rough.

These formulas reflect the fact that the aim ofdata analysis is to fit or summarize the data and thatall description fails to some degree as reflected inthe residuals. Even the use of the mean or medianis a fit in this view. The boxplot is valuable becauseit indicates both the fit and residual of a single setof data. The fit is the median or mean, single valuesthat describe the data well. Residuals are devia-tions from that point. The singling out of outliers ispart of the important process of identifying observa-

Esteem

Anxiety First Mode

Effect Size for Male Minus Female Scores

Figure 6. Density estimation plot for the distributions depicted in Figure 4.

Page 10: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

140 BEHRENS

tions that are very poorly described by summary sta-tistics.

To create quantitative descriptions of data, the ex-ploratory data analyst conducts an iterative process ofsuggesting a tentative model, examining residualsfrom the model to assess model adequacy, and modi-fying the model in view of the residual analysis. Thisoccurs in a cyclical process that should lead the ana-lyst, by successive approximation, toward a good de-scription. This process was inherent in the examina-tion of the meta-analysis data discussed previously. Asingle-mode model of the anxiety data was assumedas a starting point, but graphical analysis of residuals(data around the second mode) suggested that such afit would hide important structure.

Although most psychologists are familiar withresidual analysis from the regression literature,workers in EDA extend this framework to conceptu-alize all model development (Goodall, 1983a).Consider, for example, the data presented in Table 2from Lauver and Jones (1991). These authors ex-tended previous work in career—self-efficacy theoryfollowing Lent and Hackett (1987), who notedthe need to collect occupational preference datafrom ethnically diverse groups. Lauver and Jonesconducted a typical analysis consisting of a seriesof chi-square tests of homogeneity across ethnici-ties at each level of occupation. This approach ig-nores the effect of career differences, inflates Type Ierror, and does not address the possibility of inter-actions in addition to main effect. After findingoccupations with "significant" differences, Lauverand Jones noted the ethnicity with the highest

Table 2Percentage of Respondents Perceiving Each Career asan Option by Ethnicity as Reported by Lauver andJones (1991)

Ethnicity

NativeOccupation American %

X-ray technicianMedical technicianPhysical therapistSocial workerBookkeeperSecretaryFashion shop managerReceptionistLibrarianElectricianElectronics technicianVeterinarianProbation officerArmed forcesAccountantLawyerAuto salespersonPhotographer

283537644752473536495434587647633466

Hispanic%

234040563856564723475138407151683365

White%

193441543748454118424355296249622667

percentage of students considering that career anoption. This overlooks the pattern of differencesacross the ethnicities and fails to build a descrip-tive model of the structure of the data. Becausethis type of data was being published for the firsttime, a rich description of the structure of thedata from an exploratory view would have been

50

40

30

20

10

ny__A C H HK 1 1 isl N S T

country

1u

r

o.a

0.4

d 0.0

-0.4

• -1 s

1 " 1 '1

1 » 1

i i ",

I!

1.1 .

Anxiety Assert veness Locus/Control Self Esteem

Type [~~

Figure 7. Image of computer screen during interactive graphic session with brushingand linking.

Page 11: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 141

preferred. After these and other data have beenpublished and specific hypotheses generated, moreconfirmatory approaches may be appropriate.

Building a Two-Way Fit

To build a tentative model of the two dimensions ofthe table, one may apply the fit-plus-residual frame-work iteratively in both dimensions to form a two-way fit. In this approach, each value in the table ismodeled as the sum of ethnicity, occupation, andoverall effects. First, a tentative fit or description ofthe level of options seen in each ethnicity is found bycalculating the median percentage of options in eachcolumn. This provides fits of 47, 47, and 42.5 for theNative American, Hispanic, and White groups, re-spectively. After this first step, unexpected patternsbegin to appear: On average, fewer White studentsrate occupations as options than their Native Ameri-can or Hispanic counterparts. To complete the initialpass at decomposing data into fit and residual, residu-als are computed by subtracting each data value fromthe median value for the corresponding ethnicity. Thebottom of Table 3 displays fits for each ethnicity with

Table 3Residual Percentage of Individuals Viewing Each Careeras an Option After a First Pass at Removing theEthnicity Fit

Ethnicity

OccupationNative

American Hispanic White

X-ray technicianMedical technicianPhysical therapistSocial workerBookkeeperSecretaryFashion shop managerReceptionistLibrarianElectricianElectronics technicianVeterinarianProbation officerArmed forcesAccountantLawyerAuto salespersonPhotographer

Ethnicity fits

-19-12-10

17050

-12-11

27

-1311290

16-13

19

47

-24

-7-79

-9990

-2404

-9-7244

21-14

18

47

-23.5-8.5-1.511.5-5.5

5.52.5

-1.5-24.5-0.5

0.512.5

-13.519.56.5

19.5-16.5

24.5

42.5

residuals in the center. Each fit plus residual equalsthe original cell data.

Having subtracted out the ethnicity effects from therows, patterns of occupation-related effects begin toemerge in the residuals. An occupation's residualsreflect how different it is from the median occupationwithin each ethnicity. Negative numbers indicate low-option occupations and positive residuals indicatehigh-option occupations. X-ray technician, librarian,and auto salesperson stand out as low-option profes-sions. The armed forces leads as a high-option pro-fession.

To model the values associated with these occu-pation effects, the two-way fit continues by calcu-lating the median residual for each occupation. Thisprovides a summary of the occupation effects, simi-lar in form to the initial summary of the ethnicityeffects. Next, cell values from Table 3 are subtractedfrom the occupation medians, allowing each cell ofthe original table to be recreated by adding the occu-pation and ethnicity fits to the residual. This processis then extended to find an overall fit by fitting theethnicity and occupation fits. After these fits havebeen obtained, the process is repeated by iterativelyrefitting the residuals in each direction of the tableuntil additional patterns cannot be extracted (Tukey,1977).

The final results of such an analysis are presentedin Table 4, with occupations reordered by the size oftheir fits. Each datum in the original table can berecreated by adding the overall, ethnicity, and occu-pation fits to the residual. When the overall, occupa-tion, and ethnicity fits are perfectly additive, residualsequal zero. Residuals indicate interaction effects overand above the main effects modeled in the ethnicityand occupation fits. For example, 29% of the Whiterespondents consider work as a probation officer anoption. This is nine percentage points less than thepredicted value of 38 obtained by adding the overallfit (45) plus the White fit (-2) plus the probationofficer fit (-5). In contrast, the Native Americangroup considers this occupation as an option 17%more often than one would expect given the overall fit(45), the Native American fit (+1), and the probationofficer fit (-5).

The use of residuals in EDA differs from that inCDA in several ways. First, although the logic ofreducing residuals by use of an improved model isinherent in ANOVA and regression, it is seldom dis-cussed explicitly outside of model comparison ap-proaches to these techniques (e.g., Maxwell &

Page 12: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

142 BEHRENS

Table 4Additive Occupation and Ethnicity Fits With Residuals and Overall Fit FromMedian Smoothing of Table 2

Residuals by ethnicity

Occupation

Armed forcesPhotographerLawyerSocial workerSecretaryAccountantElectronics technicianElectricianFashion shop managerReceptionistProbation officerPhysical therapistBookkeeperVeterinarianMedical technicianAuto salespersonLibrarianX-ray technician

Ethnicity fits

NativeAmerican

40

-270

-521

-1-917-4

7-5-2

0124

1

Hispanic

004050009490

-104000

0

White

-7400

-10

-6-300

-930

190

-5-3-2

-2

Occupationfits

2620191166622

-2-5-5-6-7-9

-12-22-22

45(Overall fit)

Note. Rows are reordered by size of fit.

Delaney, 1990). Second, CDA generally assesses thesize of residuals in global summary statistics such asthe mean squared error (MSE). Because the MSE isbased on the sum of squared residuals, the size ofindividual residuals is aggregated and the pattern ofresiduals obscured. After data are well understood andCDA is asking the constrained question concerningthe relative size of residuals compared with modeleffects, F statistics and related techniques may be ap-propriate. However, when the underlying form of thedata is not well understood, an exploratory data ana-lyst is more likely to ask "Where are the good andbad fits and why?" rather than the more specific ques-tions addressed in CDA.

This analysis represents a valuable start for under-standing how perceptions of occupations vary acrossethnicity. A bivariate structure of the table is sug-gested that offers detail about the size of effects wellbeyond noting the ethnicity with the highest options ineach of the six significant chi-square tests reported byLauver and Jones (1991). In sharp contrast to mostapplications of CDA, detailed analysis of residualswas used both to assess the model and to understand

the data by examining their departure from the model.In EDA these residuals represent important deviationsfrom expectations that inform us about the structure ofthe data rather than simply "error" that should beminimized.

In this example, the table consisted of percentages,yet the two-way fit is general enough to apply to othertypes of values, including frequencies and means.Tukey (1986c) noted that such decompositions oftables based on multiplicative or, as in this case,additive models were long considered a standard toolin data analysis. He noted the additive model un-derlies ANOVA for crossed and nested factors,whereas the multiplicative model underlies the chi-square test of independence in contingency tables.This accounts for the fact that, when using meansmoothing on cell means, the two-way fit providesthe same results as the procedures recommended byRosenthal and Rosnow (1991) for interpreting inter-action effects in ANOVA. From the perspective ofEDA, Rosenthal and Rosnow are recommending theuse of F ratios for hypothesis testing and two-wayfits with residual analysis for parallel EDA to help

Page 13: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 143

build a rich description. Most programs for computinglog-linear models will give similar results of pa-rameter estimates and cell residuals following a mul-tiplicative model. Hoaglin et al. (1991) discussed thetwo-way fit in detail using mean smoothing for anumber of ANOVA designs.

An elegant graphic representation of two-way fitsand residuals is available, although its presentation islengthy and beyond the scope of this article. Inter-ested readers may consult Tukey (1977) for its origi-nal treatment or Becker, Chambers, and Wilks (1988)or Statistical Sciences, Inc. (1993) for some computerimplementations. Behrens and Smith (1996) shouldbe consulted for an example using data from instruc-tional psychology.

Use Robust and Resistant Methods

In the analysis of the two-way table, fits were basedon medians rather than means. In EDA, robustestimators such as the median are generally pre-ferred. Hoaglin, Mosteller, and Tukey (1983a) de-fined robustness as a concern for the degree to whichstatistics are insensitive to underlying assumptions.Mallows (1979) discussed three aspects of robust-ness: resistance, smoothness, and breadth. Resis-tance concerns being insensitive to minor pertur-bations in the data and weaknesses in the modelused. Smoothness concerns the degree to which tech-niques are affected by gradual introduction of baddata. Breadth is the degree to which a statistic isapplicable in a wide range of situations. Robustnessis important in EDA because the underlying form ofthe data cannot always be presumed, and statisticsthat can be easily fooled (like the mean) may mis-lead.

Several approaches are available to assess the re-sistance of a statistic (cf. Goodall, 1983b), includ-ing the breakdown point (Hampel, 1971). Hampel(1974) defined the breakdown point as "the smallestpercentage of free contamination which can carry thevalue of the estimator over all bounds" (p. 388). Dis-cussing the resistance of regression lines, Emersonand Hoaglin (1983) explained a breakdown point asfollows:

Operationally, we can think of dispatching data points"to infinity" in haphazard or even troublesome direc-tions until the calculated slope and intercept can tolerateit no longer and break down by going off to infinity aswell. We ask how large a fraction of the data—no matterhow they are chosen—can be so drastically changedwithout greatly changing the fitted line. (p. 159).

Other statistics can be assessed in a similar manner.For example, the percentage of data points that can bearbitrarily changed in a set of data without changingthe mean is 0. In contrast, half the data of a distribu-tion can be altered to infinity before the medianchanges, thereby giving the median a breakdownpoint of 0.5.

Additional resistant measures include the trimean,which is a measure of central tendency based on thearithmetic average of the value of the first quartile, thethird quartile, and the median counted twice. The me-dian absolute distance from the median is a measureof dispersion that follows its name exactly. Winsoriz-ing (pulling tail values of a distribution in to match apreset extreme score) or trimming (dropping valuespast a preset extreme score) may also be used. Someresearchers object to the differential weighting af-forded data in these cases. This differential weightingis, however, no different from procedures commonlyused by instructors who drop a student's lowest scoreor Olympic judging that is based on a mean scorefollowing the elimination of the highest and lowestscores. As in psychological work, these strategiesseem justified if the results downplay errant valueswhile offering an otherwise expected summary. Inone of his most influential papers, Fisher (1922) ar-gued that "assuredly an observer need be exposed tono criticism, if after recording data which are notprobably normal in distribution, he prefers to adoptsome value other than the arithmetic mean" (p. 323).Lind and Zumbo (1993) presented an overview ofrobustness issues in psychological research as didWainer (1977a).

Although problematic, data requiring resistant sum-maries are not uncommon in psychological work. Forexample, Paap and Johansen (1994) reported the re-sults of reaction time (RT) experiments aimed atevaluating their memory model of word verification.Among the data reported is the frequency with whicheach word used in the experimental task occurs in astandard corpus. The distribution of this variable isdepicted in Figure 8. In addition to the extreme skew,the distribution is marked by an extreme outlier rep-resenting the word "that" with word frequency of10,595 in the reference corpus. The second most fre-quent word used from this corpus is "than" with afrequency of 1,789. The mean word frequency is 267,and the median frequency is 47. The failure of themean and median to give a common indication un-derscores the value of resistant measures. The graphicdisplay and these numbers suggest that the mean can

Page 14: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

144 BEHRENS

uICT<U

in

I

CD_D

3 Lseee iee«e ise«e

Word Frequency in Lexicon

Figure 8. A dot plot of the distribution of word frequen-cies used by Paap and Johansen (1994).

easily mislead the researcher from the bulk of the dataand that the median is a good fit for most of the datapoints.

Pay Attention to Outliers

Although resistant measures guard against misin-formation from small perturbations, sometimes per-turbations are so great that inclusion of the bulk ofdata along with well-documented oddities leads tomeaningless summary statistics. In EDA, extreme orotherwise unusual data are noted as outliers so theymay be treated differently or call increased attentionto a phenomenon. The problem of outliers has a longhistory. Hampel, Ronchetti, Rousseeuw, and Stahel(1986) noted that discussion of the omission of out-liers goes back as far as Bernoulli (1777/1961) andBessel and Baeyer (1838). Hampel et al. providedadditional references and notes, including Bernoulli'sremark that rejection of outliers was commonplaceamong astronomers of his time. The discussion con-cerning the separation of extreme values has notended (cf. Barnett & Lewis, 1994; Hawkins, 1980;Hoaglin & Iglewicz, 1987).

When working in an exploratory mode, comparisonof patterns in data that include all or only a subset ofdata is considered acceptable if the actions taken andthe rationale are documented. This intrusion of sub-jectivity is deemed important because failure to seekoutliers supposes all data are of equal importance andsimilar to the underlying process being observed.Hawkins (1980) defined an outlier as "an observationwhich deviates so much from other observations as toarouse suspicions that it was generated by a differentmechanism" (p. 1), whereas Barnett and Lewis(1994) defined an outlier as "an observation (or sub-set of observations) which appears to be inconsistentwith the remainder of that set of data" (p. 7). As thedetective analogy suggests, the outlying data are tell-ing a different story from the rest of the data, and totry to summarize all of the data with a single model orstatistic leads to a case of combining apples and or-anges.

Temporarily setting aside an observation allows adiagnostic assessment of the role of the value in thesummary statistics. For example, the effect of theword "that" in the Paap and Johansen experimentscan be assessed by computing the mean both with andwithout the word included. When the word is re-moved, the mean of the data drops to 183 from theoriginal 267. This change is considerable because theobservation comprises only 1/128 or 0.8% of the data.This temporary diagnostic setting aside may lead thedata analyst to set the observation aside for the re-mainder of the analysis or continue with it in the dataset. A common extension of "setting one aside" isthe generalized jackknife procedures (Efron, 1982;Mosteller & Tukey, 1977). When conducting a jack-knife procedure, the data analyst repeatedly removessubsets of the data and recomputes a statistic of in-terest with the eye for deviations in the statistic acrosssubsamples. Homogeneity of the statistics reflects ho-mogeneity of information in the data, whereas vari-ability in the statistics reflects variability in the data,as seen previously. Although once considered only asEDA techniques, such procedures have become main-stream methods in areas including regression diagnos-tics that use a "leave one out" approach in measuressuch as Cook's distance and diffitts (cf. Atkinson,1985; Cook & Weisberg, 1994).

An idea closely related to outliers is that of fringe-liers. Fringeliers are unusual points that are not asclearly deviant as outliers but may appear with un-usual frequency in unexpected ways (Wainer, 1977a).A group of observations clumped three standard de-

Page 15: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 145

viations from the mean would be one example. Aswith outliers, relating the structure of fringeliers to thephenomenon being studied is the best possible out-come. Hadi and Simonoff (1993) discussed a numberof similar issues for outlier detection in multivariatemodels. Although the data analyst working in the ex-ploratory mode is likely to set an observation aside ifit allows sensible description of the remaining data,such a decision must consider other possible repre-sentations of the data, weigh gains and losses of in-formation, and document the status of any data thatwere set aside.

Identification of outliers serve not only to improvethe model of the remaining data but also to call at-tention to important aspects of the data that were notoriginally considered. Sometimes the outliers provideinformation about research-related processes such astypographical errors, malfunctions in recording ma-chinery, or programming errors. On other occasions,outliers may point to important aspects of the phe-nomenon being studied that were unanticipated. Al-though outliers may be set aside from the bulk of theanalysis, they are not simply dismissed. On the con-trary, they should be considered in as much detail aspossible to understand the process that generatedthem. Insofar as outliers tell us our original expecta-tions were wrong, they provide feedback for correct-ing our mental and computational models. Beveridge(1950) discussed a number of examples of scientificdiscovery motivated by the quest to understand suchanomalous results.

Reexpress the Original Scales

Reexpression is deemed an important part of thedata analyst's toolbox because, as Mosteller andTukey (1977) argued, "numbers are primarily re-corded or reported in a form that reflects habit orconvenience rather than suitability for analysis" (p.89). The term "reexpression" is preferred in EDA tothe more common usage of "transformation," be-cause it avoids the connotation of radical change ofthe underlying information.

Reexpressions of data have long been used in ex-perimental psychology. Arcsine transformations ofpercentages are typically recommended in theANOVA context (e.g., Winer, 1971), and raw scoresare often reexpressed as standard scores. Neverthe-less, transformations are suspect in many subdisci-plines of psychology and outright rejected in others.The most common concern is that reexpression leavesdata analysis as a subjective process where a trans-

formation can be chosen to "prove anything." Theseconcerns can be mollified by noting (a) reexpressionof numerical values is common in everyday life andpsychological work and (b) the goal of reexpression isto find a scale that represents the phenomenon in ameaningful way.

Reexpressing data as standard scores or log func-tions is a familiar practice in psychological research.Daily life holds experience of reexpressed scales aswell. Hoaglin (1988) noted a number of examples ofreexpression from everyday experience, including theRichter scale for earthquakes (a logarithmic scale);gasoline consumption (the reciprocal of the gas usedtimes the number of miles driven); and camera lenses(f-stops are spaced in a logarithmic scale). Many dataanalysts reexpress their data without concern. Sumsare commonly multiplied by the reciprocal of thesample size to obtain a mean. When frequency dataare not interpretable in a straightforward manner, thefrequency in a particular category times the reciprocalof the total possible frequency (the relative frequency)is commonly used. Sometimes this scale is furtheraltered by multiplying the relative frequencies by 100to obtain percentages.

In a complete treatment of reexpression, Emersonand Stoto (1983; cf. Emerson, 1991) gave several rea-sons why transformations are desirable, including thefact that they may (a) "facilitate interpretation in anatural way," (b) "promote symmetry in a batch,"(c) "promote stable spread in several batches," and(d) "promote a straight line relationship between twovariables" (p. 104).

Most of the examples just given work toward thegoal of facilitating interpretation in a natural way.Promoting symmetry is often desirable because it al-lows for comparison of scores across different parts ofa single distribution. Stable spread across batches ofdata likewise allows comparison of scores across mul-tiple distributions. Promoting a straight line relation-ship is also valuable because most regression analysesfollow a linear form. In many cases, an appropriatereexpression solves several of these goals. Wainer(1977b) showed how different conclusions from verysimilar experimental data were likely caused by thepresence of an outlier in data when organized as RT,although when it was reexpressed as speed data (I/RT) the extreme values were nonproblematic. By pro-moting symmetry, this reexpression also facilitatedinterpretation.

Reexpression is essential in EDA because it ad-dresses the often neglected issue of scaling. Research-

Page 16: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

146 BEHRENS

ers may focus on precisely aligning their theoreticalhypotheses with statistical tests and yet fail to col-lect preliminary evidence concerning the distribu-tional form their measurements take. Although arelationship between two variables may be hypoth-esized on the basis of theoretical work, the statisticalanalysis occurs on an empirical realization that is aresult of the underlying form of the constructs and theway in which the constructs are measured. Failure todelve into a detailed analysis of the form of the dis-tributions and the reexpressions that make them mostinterpretable can lead to glaring misinterpretations ofthe data.

Putting It All Together: A Reexamination of thePaap and Johansen Data

The preceding sections have examined a numberof foundational tools in the EDA toolbox. The themecommon to all the procedures described previouslyis not the use of canonical technique but a willing-ness to use any technique that helps ensure a richmental model of the data that fits closely with thetrue form of the data. This requires a high degreeof interactivity with the data and a familiarity witha wide range of techniques. To illustrate how theseprinciples and procedures interact, data publishedin the Journal of Experimental Psychology: HumanPerception and Performance, which were intro-duced previously in the discussion of resistance, arenow reexamined from an EDA perspective. In thisarticle, Paap and Johansen (1994) reported numerousanalyses of several data sets used to support theirtheory of word verification, including RT data for128 words from a standard experimental lexicon.Variables associated with each word include theaverage RT to respond to each word in an experi-mental task, the number of high-frequency neigh-bors (HFN), neighborhood size (NS), word fre-quency in the lexicon (WF), and the summed bi-gram frequency (SBF). A neighbor is a word createdby changing a single letter in an original word.HFN are words similar to the original word thatoccur often in standard use. The SBF is a measureof position-specific bigrams that occur in the word.The authors summarized their expectations towardthese data arguing "the only variable that directlydetermines word RT is the number of HFNs . . .Thus, NS, SBF and WF are all indirectly effectsthat should not account for any of the variance inword RT once the effects of HFN were partialedout" (pp. 144-145).

Paap and Johansen tested this hypothesis by us-ing OLS multiple linear regression, which is com-monly referred to as "multiple regression." Thefuller name, however, reminds us of the assumptionof linearity inherent in that procedure and the sen-sitivity of the regression line to extreme valueswhen the OLS approach is used. While keeping aneye on their original hypothesis, working in an ex-ploratory mode allows broader questions such as:How are the independent variables related to eachother and the dependent variable? What patterns un-derlie the results reported by Paap and Johansen?What can be done to improve the model? What canwe find that we did not expect? How might we befooled by the summaries?

A first look. When working with multivari-ate data such as these, a common strategy is toexamine variables individually and then in bivariateand higher order configurations. Figure 9 depictsthe shapes of distributions from this analysis usingboxplots. An analyst working on these data shouldview histograms, density plots, and dot plots aswell. Before suggesting first aid for these messydistributions using outlier handling or reexpression,it is often helpful to assess how the shapes of thesedistributions affect assessment of bivariate and higherorder relationships in the data. This can be donegraphically using a scatter plot matrix (also calleda generalized draftsman's display) described inChambers et al. (1983) and shown in Figure 10.The plot presents all pairwise combinations of thefive variables of interest. The graphic may be thoughtof as a pictorial correlation matrix with scatterplots replacing correlation coefficients. In this ver-sion of a scatter plot matrix, normal probabilityplots are presented in the matrix diagonals, withvariable labels indicating the associated row andcolumn scales. For example, on the top row of plots,the RT measure is plotted on each vertical axis whilethe *-axes vary from HFN to NS to WF and SBF asone moves from left to right in that row. The HFNvariable is plotted on the vertical axis of all plots inthe second row and on the horizontal axis of plots inthe second column. The top right-most plot indicatesRT on the vertical axis and SBF on the horizontalaxis.

Normal probability plots are a diagnostic aid usedto assess the degree to which the empirical distribu-tion matches the Gaussian distribution. This is accom-plished by calculating the fraction of data below eachdata value (i.e., the quantile) and computing the z

Page 17: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

1050

975-

900

825

750

675

600

525

450

EXPLORATORY DATA ANALYSIS

O

-r-

JL

-

IHFN

24-

21 •

IS-

15-

12-

9

6

3

12000-

10500

9000-

7500-

6000-

4500

J 3000

1500

0

NS

147

w

52500

45000

37500

30000

22500-

15000

7500

SBF

Figure 9. Box plots of word verification variables from Paap and Johansen (1994) indicat-ing severe nonnormality and outliers. The locations of data in the upper and lower quartilesare marked using lines extending away from the box. The O indicates outliers that deservespecial attention. The * indicates extreme outliers. (RT = reaction time; HFN = high-frequency neighbors; NS = neighborhood size; WF = word frequency; SBF = summedbigram frequency.)

score for points with corresponding quantiles in theGaussian distribution. When the scale values are plot-ted against the expected z score, a straight line isobtained if the distribution is Gaussian. Curves in thenormal probability plot indicate skew, whereas Sshapes indicate shorter than expected tail regions.Cleveland (1993) presented a complete discussion ofthe normal probability plot and the more generalquantile-quantile plot. The farthest left plots in Rows1 and 2 of Figure 10 indicate moderate positive skewin RT and HFN while the NS plot indicates relativenormality, and WF and SBF plots indicate markeddeviation from Gaussian shape. Individual outliers aremarked as "xs." Although these patterns were visiblein the box plots, normal probability plots are an im-portant adjunct because they are compared directlyagainst the normal distribution and display each pieceof datum rather than the five-number summary of thebox plot. The extremity of the word ' 'that'' in WF canbe seen in the bivariate plots.

One natural method for summarizing the bivariaterelationships between variables is to use the formulafor a line as a fit from which to derive residuals. If anOLS fit is used (as is the default in most computerpackages), the line is easily affected by extreme val-ues such as the outliers in WF and SBF, which rep-resent the common words "that" and "than." In in-teractive data analysis environments common to

EDA, quick assessment of such effects is straightfor-ward. In this case, regression lines were added to anumber of the scatter plots in Figure 10 to indicate theOLS predictions that would be computed with andwithout the two outlying values. This was accom-plished by selecting options from pull-down menusaccessible on the scatter plots themselves (Data De-scription, Inc., 1995). In each case, the regression linenearest the outlier indicates prediction lines with theoutliers included.

It is clear from these plots that the extreme outliersare dramatically different from the bulk of the dataand disproportionately influence the fit from the leastsquares line. Likewise, these extreme points are arti-ficially inflating or deflating the correlation thatholds in the mass of the data. For the case of WF,the outliers pull the line toward a slope of zero whencompared against the negative slope that exists whenthe outliers are set aside. In addition to the difficultywith the regression lines being disproportionally af-fected by these points, their presence compressesthe variability in the bulk of the data and may hideimportant patterns. Because these two outliers ap-pear to be qualitatively different from the bulk ofthe data, unduly influence the OLS summary, andmay distort the visual impression of the data, it isadvisable to set them aside for some portion of theanalysis.

Page 18: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

148 BEHRENS

IP- RT"ns... UfcLHJ p KT-W...UEJUP KTTI5...Ub!

IP HS-ns...LJHi IP rc.-VF..4Jfc!

Figure 10. Scatter plot matrix of word verification data from Paap and Johansen (1994).Outliers are indicated with the "at" symbol. Regression lines have been added to indicatepredicted values as they would occur with and without the outliers included. (RT = reactiontime; NS = neighborhood size; HFN = high-frequency neighbors; SBF = summed bigramfrequency, WF = word frequency.)

A Better Description. Temporarily setting asidethe two outlying data points and reconstructing thescatter plot matrix leads to the display in Figure 11. Inthis plot the relationships with SBF are clearer (al-though not very strong), and curvilinear relationshipsbetween WF and both RT and HFN are visible. Thesecurvilinear relationships are not completely unex-pected. The curved form of the data is reflected in thebunching up of the data in the lower left corner of thetwo-dimensional plot. This is likely to occur given thebunching of data in the lower part of each of theunivariate distributions.

A straightforward way to find an appropriate de-scription for the curved function is to find a reexpres-sion of the univariate distributions that leads them to

a roughly Gaussian shape. By finding the degree towhich the univariate distributions need to be reex-pressed to be Gaussian, one also finds the degree towhich the line of fit must be bent to meet the data.When reexpressing variables in EDA, one may usethe notion of a ladder of reexpression. A number ofversions of the ladder exist. In each case the rungs ofthe ladder represent an exponential value to whichscores may be raised. In the simplest case, movementup the ladder refers to raising scores to a higherpower. Moving down the ladder refers to raisingscores to decreasing negative exponents (reciprocals).Exponents along this ladder and the correspondingreexpression are listed below for the range of expo-nents from —2 to +2.

Page 19: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 149

It WN-n»Pto...ntlI It- HFNKSPM...UU

Figure 11. Scatter plot matrix of word verification variables from Paap and Johansen (1994)with two outliers removed and curvilinear relation between WF and RT as well as WF andHFN exposed. (RT = reaction time; NS = neighborhood size; HFN = high-frequencyneighbors; SBF = summed bigram frequency, WF = word frequency.)

Exponent Reexpression

-1

7-iX

-1

x

x2

logic (*)

V*(no change)

The center of the ladder is an exponent of 1, whichleaves a score unchanged. An exponent of 0 alwaysreturns a value of 1 so this is typically replaced by aIog10 transformation, which fits in well between X~'/2

and X1'2 transformations (Mosteller & Tukey, 1977).Because negative exponentiation alone reverses theordering of observations, most analysts prefer using-X~" rather than simply X~", as reflected above. Tofacilitate reexpression, the most advanced EDA pro-grams provide menu options for transformationsalong the ladder of reexpression as well as slider barsthat can be moved up and down similar ladders (suchas the Box-Cox family of reexpressions). Histograms,box plots, normal probability plots, and any othergraphics or summary tables are updated automatically

Page 20: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

150 BEHRENS

as the slider values change. Such systems allow quickassessment of a large number of reexpressions.

A choice of transformation is recommended bymoving up or down the ladder in the direction of thebulk of the data on the scale. Positively skewed dis-tributions with the bulk of the data lower on the scalecan be normalized by moving down the ladder ofreexpression; distributions with the bulk of the datahigh on the scale can be normalized by moving up theladder of reexpression. In the present case, the WFvariable has the bulk of the data in the lower portionof the distribution, so moving down the ladder is ap-propriate. Starting with WF1 (the unchanged data), wemove down to WF1'2, which is the square root of WF,followed by WF°, which is assigned the value oflogj0(WF), and -WF~I/2, which is equal to minus oneover the square root of WF. Box plots of each of thesetransformations for all data, including the outliers, arepresented in Figure 12. As the reader may see, reex-pression to a log transformation provides an approxi-mately Gaussian distribution, whereas more extremereexpression leads to distortion in the opposite direc-tion and less extreme reexpression fails to correct theshape. In practice, normal probability plots rather thanbox plots would be used to assess normality. Boxplots, however, effectively and compactly communi-cate the effect of the reexpressions.

Panel a of Figure 13 is a scatter plot of RT re-gressed on logjo(WF) with the two outliers indicatedby "Xs" and regression lines for models with all thedata as well as from outlier-deleted data only. Panel bis a plot of the regression residuals versus log10(WF).

12000

10000

BOOO-

6000

4000-

2000*

105

90

75

60 .

45

30

15 .

0

These plots indicate that the log reexpression im-proves the fit dramatically, whereas the similarity ofregression lines with and without the outliers (Panel a)indicates that setting aside the outliers does little tochange the regression line. In this dimension the out-liers may be considered natural extensions of the tailof this log-normal distribution. A very misleading fitwould occur in any model assuming RT = a + i(WF)rather than RT = a + fc(log(WF)). The residual plotsalso indicate three words whose residuals are excep-tionally large as indicated by their positions above thebulk of the data in the left, center, and right side of theresidual plots. These points indicate the values of thewords "oaf," "mere," and "came" respectively,with RTs much longer than otherwise expected giventheir word frequency. Analyses of the SBF variablelead to the conclusion that the SBF variable is roughlyGaussian with the exception of the two outliers.

To properly specify a linear model using the WFvariable, it should be reexpressed to log(WF). To ap-propriately include SBF, the two extreme pointsshould be set aside and noted for their impact on theanalysis. Including the two outliers in subsequentanalyses would serve no purpose but to demonstratethat the majority of the SBF pattern cannot be wellmodeled because of two rogue points. Setting themaside will allow appropriate modeling of the bulk ofthe data. This is a practical application of the principlethat it is better to be somewhat right than preciselywrong. All of this information suggests the corpus ofwords used in this study requires additional attention.

How did we do ? To assess the total effect of the

4.50

3.75

3.00

2.25

1.50

0.75

0.00

-0.00

]

-0.15

-0.30

-0.45 .

-0.60 •

-0.75

[ -0.90

-1.05

I

~

L

J_

VTOF -1/VW

Figure 12. Box plots of reexpressions of word frequency (WF) moving down the ladder ofreexpression. The location of data in the upper and lower quartiles are marked using linesextending away from the box. The O indicates outliers that deserve special attention. The *indicates extreme outliers. (RT = reaction time; HFN = high-frequency neighbors; NS =neighborhood size; WF = word frequency; SBF = summed bigram frequency.)

Page 21: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 151

0 1 2 3

Log(Word Frequency)

0 1 2 3

LogfWord Frequency)

(a) (b)

Figure 13. Panel a: Scatter plot of reaction time (RT) versus log (word frequency [WF]) withregression lines for all data and outlier deleted data. Panel b: Residuals from regression ofdata in Panel a.

work we have done up to this point, we replot thescatter plot matrix with the SBF outliers removed andthe reexpressed WF variable as shown in Figure 14.Comparing the quality of regression lines for predict-ing data in Figure 14 with that of Figure 10 under-scores the value of EDA in steering researchers to-ward an appropriate model. The reexpressioncorrected the curvilinearity in the WF-RT relation-ship as well as in the WF-HFN relationship. Becauseit is sometimes difficult to understand the reexpres-sion being used, readers may benefit from seeing pre-dicted values from reexpressed variables plotted in thescale of the original variables as shown in Figure 15.Panel a portrays the predicted values of RT fromlog(WF) plotted against log(WF), and Panel b por-trays the same values plotted against their correspond-ing WF values. When viewed in conjunction with theRT versus WF plot in Figure 11, Figure 15 reveals therelation between log(WF) and WF is a natural reex-pression that catches the curve in the data that is oth-erwise missed by failing to reexpress the data.

Interestingly, Paap and Johansen (1994) noted thatlog transformations have been computed in other re-search labs and have led to substantive conclusionsdifferent from their own. They therefore reanalyzedthe data described here using the log(WF) transfor-mation on the grounds of historical precedence in RTexperiments. Focusing primarily on the size of thecorrelations and the role of the logged variable in amultiple regression analysis predicting RT, the im-provement in fit observed here was interpreted quitedifferently.2

In summary, when plain WF was entered as a predictor,the number of HFNs and NS were significant predictors.However, when log WF was used, only log WF was asignificant predictor. The skittishness of the variables in

these analyses may have occurred because the collinear-ity problem between the predictors actually becameworse when the log transform was applied. The correla-tion of -.23 between plain WF and the number of HFNsballooned to -.65 for the log WF. The greater the col-linearity between two predictors, the less confident onecan be that the statistical model has identified the realwinner.... Because of the collinearity problem, somewill see the hole (effects of log WF) where others see thedoughnut (effects of NS and the number of HFNs) in ourdata. (pp. 1145-1146)

Without the log transformation, these authors foundwhat they predicted: a significant relationship be-tween RT and HFN and a nonsignificant relation be-tween RT and WF. Alternatively, the logarithmic re-expression led to a nonsignificant correlation betweenRT and HFN and a significant correlation between RTand log(WF), results inconsistent with their theory.Without understanding the shapes of the distributionsinvolved and the effect of curvilinearity and outliers,these authors were left to hypothesize "skittishness"and "ballooning" variables, collinearity, and a posi-tive-thinking bakery theory for choosing among sta-tistical models. The simple graphics used here, how-ever, explain the situation quite well. WF has acurvilinear (logarithmic) relationship with RT andHFN. This curvilinearity is a violation of an assump-tion of the linear regression model used. Therefore, nosignificant slopes can be found, as indicated in Figure

2 The analysis discussed in the passage quoted here dis-cusses a model with a term for the summed log bigramfrequency rather than the summed bigram frequency dis-cussed in this article. This difference did not affect therelationship among RT, HFN, and WF discussed here and isomitted for the sake of simplicity.

Page 22: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

lt> RT-roo Plot fCTL..LII-l l ib RT-HFNPtotlor ..Dill Ifr KTH8 Plot for I.. f]fl| Ib HTTP Plot forl.. Hpl'l l> RT'SeF Plg< for ..

Ib LO-n»PWfcrL..ljPI Ib LO-SSfPkXtor ..

Figare K Scatter plot matrix of the Paap and Johansen (1994) data with log reexpressionof WF and two outliers removed. (RT = reaction time; NS = neighborhood size; HFN =high-frequency neighbors; SBF = summed bigram frequency, WF = word frequency.)

10. The log transformation specifies the degree ofbend in the data so it can be accommodated by theregression model RT = a + i(HFN) + fc(NS) +6(SBF) + fo(log(WF)). The correct log(WF) modelspecification reveals a strong curvilinear relationshipbetween RT and WF as well as HFN and WF.

The disappearance of the HFN effect needs to beunderstood in the context of the multiple regressionmodels used. In such models, relationships betweeneach predictor variable and the criterion are adjustedfor the presence of all the other predictor variables.

When WF is included in the equation in its originalform, the suppressed measure of relationship leads tolittle correction in the HFN-RT relationship. When,however, WF is appropriately reexpressed to accountfor the curvilinearity, its relation with HFN is prop-erly expressed as high linear relationship and the re-lation between HFN and RT is adjusted downward totake into account the now large correlation betweenHFN and log(WF).

These important aspects of the analysis can be in-ferred from Figure 14, correlation matrices, and the

Page 23: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 153

0.00 0.75 1.50 225Log(Word Frequency)

720

680

640

600-

560

i

v,300 600 900

Word Frequency

(b)

Figure 75. Plot of predicted values from linear regression of reaction time (RT) on high-frequency neighbor, neighborhood size, log(word frequency), and summed bigram frequencyon (Panel a) scale of log(WF) and (Panel b) scale of WF. Note how the predicted valuesproperly model the curve of the data in the original scale.

slope estimates of the multiple regressions. Neverthe-less, the exploratory analyst may want additional in-formation because partial correlations and conditionalslopes may also be distorted by aberrant data patterns.To obtain a more detailed description of the data asrepresented in the machinery of the multiple regres-sion and to assess the validity of the computationalmodel, partial regression plots can be used. A partialregression plot takes advantage of the fact that adjust-ing one predictor, such as HFN, for its relationshipwith another predictor, such as log(WF), is equivalentto regressing HFN onto log(WF) and using the residu-als for subsequent analyses. Because the residuals arethe data after the effect of the model have been sub-tracted, the residual from HFN = a + fc(log(WF)) areWF-corrected HFN data. Any analysis with these re-siduals would be equivalent to a partial regression ofHFN. According to Velleman (1992), "A partial re-gression plot graphs y with the linear effects of theother x-variables removed against x with the lineareffects of the other variables removed"' (pp. 23-24).

Panel a of Figure 16 is the partial regression plotbetween RT and HFN when each is adjusted for NS,WF, and SBF. Note that the slope of the line indicatesthat a relationship exists between these two variablesafter their adjustment for relations with other vari-ables. Panel b is a partial regression plot between RTand HFN when both are adjusted for NS, log(WF),and SBF. The absence of relationship between the twosets of residuals in Panel b reflects the small partialcorrelation between these variables that has occurredbecause the properly specified model leads to appro-priate measures of relationship with WF and RT andHFN. The WF variable is not skittish but stronglycurvilinear in a world in which these data analystsassumed all relationships are linear.

Life without EDA. The interactive analysis de-scribed here contrasts Paap and Johansen's (1994) useof CDA alone. Working in an exploratory mode, aninitial model was sought, transformations were at-tempted, residuals were used for model evaluation,and the cycle of model searching continued. This pro-

200

100-

0

-100

rlfei-

- 4 0 4

Residual HFN

-2.5 0.0 2.5

Residual HFN

(a) W

Figure 16. Partial regression plots of reaction time (RT) and high-frequency neighbor(HFN). Each variable is adjusted for linear relations with all other predictors. Additionalexplanatory variables are neighborhood size (NS), word frequency (WF), and summed big-ram frequency (SBF) for Panel a and NS, log(WF), and SBF for Panel b.

Page 24: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

154 BEHRENS

cess quickly revealed numerous unexpected aspects ofthe data with important consequences for model de-velopment. Without the detailed description and openattitude available in EDA, Paap and Johansen wereleft with seemingly conflicting statistics from theblack box of the hypothesis tests. They began withspecific hypotheses concerning what variables wouldbe related under what conditions, but, because of alack of detailed familiarity with the data, they failed tospecify a model even close to the empirical outcome.This underscores the idea that theoretical hypothesesneed to be balanced with rich knowledge of the databeing examined. Even when firm hypotheses are helda priori, working in the exploratory mode is alwaysuseful to find out what we did not expect. The onlycaveat required is that conclusions obtained as theresult of exploratory analyses are considered explor-atory and that confirmation of such conclusions willoccur only when CDA is undertaken on different data.

The analysis reported here is a small part of anexploratory analysis of this data. Although the loga-rithmic transformation appears to lead to quite goodmodel specification, the RT variable has some depar-ture from symmetry and may benefit from a 1/RTreexpression that would put it in the scale of speed.Other types of regression diagnostics and plots couldhave been used such as three-dimensional rotatingplots and the assessment of other outliers.

Conclusion: Psychological Method and EDA

EDA is a well-established tradition in the statisticalliterature. The goal of EDA is to find patterns in thedata that allow researchers to build rich mental mod-els of the phenomenon being examined. Examiningthe Feingold (1994) and Lauver and Jones (1991)data, we found EDA useful when there is little explicittheoretical background to guide prediction and thefirst stages of model building is desired. Examiningthe Paap and Johansen data, we also saw that, evenwhen a priori hypotheses exist, EDA can perform avaluable service by providing rich descriptions of thedata that can inform the research whether their mentalmodels are even close enough to the underlying datapatterns to consider CDA. In either case, tools forEDA provide a much wider range of information thanthe answers to a specific probabilistic question.

Theory development and testing are hallmarks ofscientific psychology. Good theory development andtesting integrate a wide range of information about thedata being evaluated, so researchers can be certain

they have not missed important aspects of the phe-nomenon and have not been fooled by pathologicaldata patterns or model misspecification. EDA pro-motes good theory development and testing by help-ing researchers ensure their models are aligned withreality and they are not being misled by more re-moved summaries. As long as researchers are clearabout what activity is exploratory and what is confir-matory, and the strength of conclusions from eachmode are appropriate, EDA will facilitate, rather thanretard, theory development and testing. An increase inour knowledge about the data is always beneficial aslong as its limits are clear. From this analysis a num-ber of recommendations can be made.

First, EDA should be recognized as an importantaspect of data analysis whose conduct and publicationare valued. By admitting EDA as an acceptable set ofprocedures, researchers can avoid the improper use ofCDA techniques for the purposes of data exploration.As long as EDA remains a covert activity, researcherswill continue to improperly use CDA for data explo-ration through model underspecification and overtest-ing. An increase in EDA will focus more resources atthe preliminary stages of investigations and less at theadvanced stages. In so doing, the number of irrepro-ducible results may be reduced by the substitution ofadequate model building for the cataloging of signifi-cant effects. Further, the detail in modeling affordedby EDA may improve our understanding of phenom-enon otherwise hidden behind simple summary statis-tics and tests, as seen in the Paap and Johansen (1994)data. In this regard, editors and reviewers should fol-low the lead of Loftus (1993), whose first editorialstatement for Memory and Cognition included head-ings of "Figures are Good" and "Data Analysis: APicture is Worth a Thousand Words."

This is not to say that all exploratory work shouldbe published, but rather that all published and initialwork should be explored. The field would greatlybenefit if all published reports included the statement"we examined the data in detail and found the pat-terns underlying the summary statistics were not ob-viously pathological.'' More detailed reporting wouldalso be welcome. When auxiliary exploratory analysiscannot fit into a standard journal format, additionalgraphics and reports may be distributed over the In-ternet or by other electronic means. Behrens andDugan (1996) provides an example of such supple-mental graphic reporting.

Second, quantitative analysis should be thought of"more as applied epistemology and less as applied

Page 25: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 155

mathematics" (Behrens & Smith, 1996). When con-sidering statistics as applied mathematics rather thanapplied epistemology, many messy real-world issuesare often swept under the rug. Instruction addressingthe assumptions that must be met for a statistic to bemeaningful almost always focuses on assumptionsabout theoretical distributions rather than assumptionsabout the world. Sharing this value with EDA, Box(1976) labeled the overemphasis on theoretical issues"mathematistry" for which he prescribed practicalexperience and trust of the scientist's intuitions. Byfocusing on understanding the data in whatever way isreasonable (not only probabilistically), EDA opensthe data analyst to consider the wide range of ways ofknowing about data. This ecumenical view leaves re-searchers considering mathematics as an epistemictool rather than a complete answer in itself. Math-ematics should be used based on how helpful it is inunderstanding data, not simply on its syntactical cor-rectness. Such a position will minimize what has beenreferred to as Type III error: "precisely solving thewrong problem, when you should have been workingon the right problem" (Mitroff, Kilmann, & Barabba,1979, p. 140, cited in Barabba, 1991).

Third, graduate programs should integrate instruc-tion in confirmatory statistics with alternative dataanalytic methods. Instruction in EDA offers students aview of data analysis from outside traditional statis-tics. Such an alternate view may allow new apprecia-tion and understanding of CDA. Other complemen-tary methods include meta-analysis (Glass, 1976;Glass, McGaw, & Smith, 1981), Bayesian analysis(Howson & Urbach, 1993; Winkler, 1993), intervalestimation approaches, and hybrid combinations(Box, 1980). Just as history and systems of psychol-ogy are taught in psychology, might students not ben-efit from a history and systems of data analysis? Theappropriate size of such curricular additions will varyacross programs. At the very least, the idea of multi-conceptual approaches could be incorporated in al-ready existing classes.

Fourth, data analysts should recognize that subjec-tivity and potential bias are inherent in all data analy-sis, exploratory or otherwise. One great danger inovermathematizing data analysis is believing that thereliability and precision of mathematics itself imbuereliability and precision to the data and the data analy-sis. The artifactual nature of psychological investiga-tion has been well established by Rosenthal (1966),Rosnow (1981), Danziger (1990), and others. Under-standing of the role of cognitive, historical, and social

artifact in data analysis is also emerging. Rosenthal(Cooper & Rosenthal, 1980; Rosenthal & Gaito,1963, 1964) demonstrated a consistent overweighingof "significance" in light of varying sample sizes,and Bar-Hillel (1989; Bar-Hillel & Falk, 1982) illus-trated the subjectivity inherent in translating math-ematical concepts into natural language. Flow chartsand expert systems suggest data analysis is a purelyrational process, yet choice of data analytic behavioris ultimately dependent on the same psychologicalfactors that affect cognition and behavior in otherspheres of life. Bias in data analysis will not be mol-lified by assent to stricter design and control of TypeI error, but by the detailed analysis of data that ex-cludes alternate statistical explanations as demon-strated previously.

Fifth, psychologists should consider the possibilitythat their craft can improve the conduct of EDA anddata analysis in general. For example, Simon (1973;Simon, Langley, & Bradshaw, 1981) has long heldthat logics of discovery are possible and psychologi-cally tractable, a position supported by the construc-tion of the BACON program (cf. Langley, Simon,Bradshaw, & Zytkow, 1987). Gigerenzer (1991)noted that the heuristics encoded in BACON are quitesimilar to those of EDA and mentioned Tukey (1977)specifically. Investigations are still needed to examinethe processes involved in comprehending commonstatistical graphics (cf. Simkin & Hastie, 1987; Koss-lyn, 1989; Lewandowsky & Spence, 1989, 1990) aswell as those specific to EDA (cf. Behrens, Stock, &Sedgwick, 1990; Stock & Behrens, 1991). The statis-tical community recognizes the potential of transdis-ciplinary work and has provided open invitations tothe psychological community (Kruskal, 1982; Mo-steller, 1988; Tukey & Wilk, 1986).

Given dramatic improvements in computationalability and increased sensitivity to the psychologicaland social aspects of data analysis, the time is ripe fora broad conceptualization of data analysis that in-cludes the principles and procedures of EDA. Lestthese recommendations seem dogmatic, the finalword is left for Neyman and Pearson (1928) from"On the Use and Interpretation of Certain Test Cri-teria for Purposes of Statistical Inference." This ar-ticle represented the first great break from the Fish-erian view (introducing alternative distributions andType II error) and the beginning of current practice.Their attitude toward mechanized inference can easilybe deduced. It is, in fact, good counsel for consider-ation of any method:

Page 26: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

156 BEHRENS

The process of reasoning, however, is necessarily anindividual matter, and we do not claim that the methodwhich has been most helpful to ourselves will be ofgreatest assistance to others. It would seem to be a casewhere each individual must reason out for himself hisown philosophy, (p. 230).

References

Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R.(1990). Graduate training in statistics, methodology, andmeasurement in psychology: A review of Ph.D. programsin North America. American Psychologist, 45, 721-734.

Atkinson, A. C. (1985). Plots, transformation, and regression:An introduction to graphical methods of diagnostic regres-sion analysis. Oxford, England: Oxford University Press.

Bar-Hillel, M. (1989). Discussion: How to solve probabilityteasers. Philosophy of Science, 56, 348-358.

Bar-Hillel, M., & Falk, R. (1982). Some teasers concerningconditional probabilities. Cognition, 11, 109—122.

Barabba, V. P. (1991). Through a glass less darkly. Journalof the American Statistical Association, 86, 1-8.

Barnett, V., & Lewis, T. (1994). Outliers in statistical data(2nd ed.). New York: Wiley.

Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). TheNew S Language: A programming environment for dataanalysis and graphics. Monterey, CA: Wadsworth &Brooks/Cole.

Behrens, J. T. (1996). Course materials for EDP 691:Graphical and exploratory data analysis. Availablehttp://research.ed.asu.edu/classes/eda.

Behrens, J. T., & Dugan, J. G. (1996). A graphical tour of theWhite Racial Identity Attitude Scale data in hyper-text andVRML. Available http://research.ed.asu.edu/reports/wrias.

Behrens, J. T., & Smith, M. L. (1996). Data and data analy-sis. In D. Berliner & B. Calfee (Eds.), The handbook ofeducational psychology (pp. 945—989). New York: Mac-millan.

Behrens, J. T., Stock, W. A. & Sedgwick, C. E. (1990).Judgment errors in elementary box-plot displays. Com-munications in Statistics B: Simulation and Computation,19, 245-262.

Berk, K. N. (1994). Data analysis with student Systat. Cam-bridge, MA: Course Technology, Inc.

Bernoulli, D. (1961). The most probable choice betweenseveral discrepant observations and the formation there-from of the most likely induction. Biometrika, 48, 3-13.(Original work published 1777)

Berlin, J. (1983). Semiology of graphics (W. J. Berg,Trans.). Madison: University of Wisconsin Press.

Bessel, F. W., & Baeyer, J. J. (1838). Gradmessung inOstpreussen und ihre Verhindung mil Preussischen

und Russischen Dreiecksketten. Berlin: Druckerei derKoniglichen Akademie der Wissenshaften.

Beveridge, W. I. B. (1950). The art of scientific investiga-tion. New York: Vintage Books.

Bode, H., Mosteller, F., Tukey, J. W., & Winsor, C. (1986).The education of scientific generalist. In L. V. Jones (Ed.),The collected works of John W. Tukey, Volume HI: Philoso-phy and principles of data analysis: 1949-1964. PacificGrove, CA: Wadsworth. (Original work published 1949)

Boring, E. G. (1919). Mathematical vs. scientific signifi-cance. Psychological Bulletin, 16, 335-339.

Box, G. E. (1976). Science and statistics. Journal of theAmerican Statistical Association, 71, 791-799.

Box, G. E. (1980). Sampling and Bayes' inference in sci-entific modeling and robustness. Journal of the RoyalStatistical Society (A), 143, 383^130.

Burt, C. (1961). Intelligence and social mobility. BritishJournal of Statistical Psychology, 14, 3-23.

Campbell, D. T. (1988). Descriptive epistemology: Psycho-logical, sociological, and evolutionary. In E. S. Overman(Ed.), Methodology and epistemology for social science:Selected papers of Donald T. Campbell (pp. 435-486).Chicago: University of Chicago Press.

Chamberlain, T. C. (1965). The method of multiple workinghypotheses. Science, 148, 754-759.

Chambers, J. M., Cleveland, W. S., Kleiner, B., & Tukey,P. A. (1983). Graphical methods for data analysis. Bel-mont, CA: Wadsworth.

Cleveland, W. S. (1985). The elements of graphing data.Monterey, CA: Wadsworth.

Cleveland, W. S. (Ed.). (1988). The collected works of JohnW. Tukey: Vol. V. Graphics. Belmont, CA: Wadsworth.

Cleveland, W. S. (1993). Visualizing data. Summit, NJ: Ho-bart Press.

Cleveland, W. S., & McGill, M. E. (1988). Dynamic Graph-ics for Statistics. Monterey, CA: Wadsworth.

Cohen, J. (1994). The earth is round (p < .05). AmericanPsychologist. 49, 997-1003.

Cook, R. D., & Weisberg, S. (1994). An introduction toregression graphics. New York: Wiley.

Cooper, H. M., & Hedges, L. V. (Eds.). (1994). The hand-book of research synthesis. New York: Russell SageFoundation.

Cooper, H. M., & Rosenthal, R. (1980). Statistical versustraditional procedures for summarizing research findings.Psychological Bulletin, 87, 422-449.

Danziger, K. (1990). Constructing the subject: Historicalorigins of psychological research. New York: CambridgeUniversity Press.

Data Description, Inc. (1995). Data desk, 5.0 (computersoftware). Ithaca, NY: Data Description.

Page 27: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 157

Dunlap, K. (1935). The average animal. Journal of Com-parative Psychology, 19, 1—3.

Efron, B. E. (1982). The jackknife, the bootstrap, and otherresampling methods. Philadelphia, PA: Society for Indus-trial and Applied Mathematics.

Emerson, J. D. (1991). Introduction to transformation. InD. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Fun-damentals of exploratory analysis of variance (pp. 365—400). New York: Wiley.

Emerson, J. D., & Hoaglin, D. C. (1983). Resistant lines fory versus x. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey(Eds.), Understanding robust and exploratory dataanalysis (pp. 129-165). New York: Wiley.

Emerson, J. D., & Stoto, M. A. (1983). Transforming data.In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.),Understanding robust and exploratory data analysis (pp.97-128). New York: Wiley.

Emerson, J. D., & Strenio, J. (1983). Boxplots and batchcomparisons. In D. Hoaglin, F. Mosteller, & J. W. Tukey(Eds.), Understanding robust and exploratory dataanalysis (pp. 58-96). New York: Wiley.

Erickson, B. H., & Nosanchuk, T. A. (1992). Understand-ing data (2nd ed.). Toronto, Ontario, Canada: Universityof Toronto Press.

Feingold, A. (1994). Gender differences in personality: Ameta-analysis. Psychological Bulletin, 116, 429—4-56.

Finch, P. D. (1979). Description and analogy in the practiceof statistics. Biometrika, 66, 195-208.

Fisher, R. A. (1922). On the mathematical foundations oftheoretical statistics. Philosophical Transactions of theRoyal Society of London. Series A, 222, 309-368.

Fisher, R. A. (1925). Statistical methods for research work-ers. London: Oliver and Boyd.

Frigge, M., Hoaglin, D. C., & Iglewicz, B. (1989). Someimplementations of the boxplot. The American Statisti-cian, 43, 50-54.

Giere, R. N. (1984). Understanding Scientific Reasoning.(2nd ed.). New York: Holt, Rinehart & Winston.

Gigerenzer, G. (1991). From tools-to-theories: A heuristicof discovery in cognitive psychology. Psychological Re-view, 98, 254-267.

Glass, G. V. (1976). Primary, secondary, and meta-analysisof research. Educational Researcher, S, 3-8.

Glass, G. V., McGaw, B., & Smith, M.L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.

Good, I. J. (1983). The philosophy of exploratory dataanalysis. Philosophy of Science, 50, 238-295.

Goodall, C. (1983a). Examining residuals. In D. C. Hoaglin,F. Mosteller, & J. W. Tukey (Eds.), Understanding ro-bust and exploratory data analysis (pp. 211-246). NewYork: Wiley.

Goodall, C. (1983b). M-Estimators of location: An outlineof the theory. In D. C. Hoaglin, F. Mosteller, & J. W.Tukey (Eds.), Understanding robust and exploratorydata analysis (pp. 339^03). New York: Wiley.

Greenhouse, J. B., & lyengar, S. (1994). Sensitivity analysisand diagnostics. In H. Cooper & L. V. Hedges (Eds.), Thehandbook of research synthesis (pp. 383-398). NewYork: Russell Sage Foundation.

Hadi, A. S., & Simonoff, J. S. (1993). Procedures for theidentification of multiple outliers in linear models. Jour-nal of the American Statistical Association, 88, 1264-1272.

Hampel, F. R. (1971). A general qualitative definition of ro-bustness. Annals of Mathematical Statistics, 42, 1887-1896.

Hampel, F. R. (1974). The influence curve and its role inrobust estimation. Journal of the American Statistical As-sociation, 69, 383-393.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Sta-hel, W. A. (1986). Robust statistics: The approach basedon influence functions. New York: Wiley.

Hartwig, F., & Bearing, B. E. (1979). Exploratory dataanalysis. Beverly Hills, CA: Sage.

Hawkins, D. M. (1980). Identification of outliers. NewYork: Chapman & Hall.

Henderson, H. V., & Velleman, P. F. (1981). Building mul-tiple regression models interactively. Biometrics, 37,391^11.

Hoaglin, D. C. (1988). Transformations in everyday expe-rience. Chance, I, 40-45.

Hoaglin, D. C., & Iglewicz, B. (1987). Fine-tuning someresistant rules for outlier labeling. Journal of the Ameri-can Statistical Association, 82, 1147-1149.

Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983a). In-troduction to more refined estimators. In D. C. Hoaglin,F. Mosteller, & J. W. Tukey (Eds.), Understanding ro-bust and exploratory data analysis (pp. 283—296). NewYork: Wiley.

Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.).(1983b). Understanding robust and exploratory dataanalysis. New York: Wiley.

Hoaglin, D. C., Mosteller, F., & Tukey. J. W. (Eds.). (1985).Exploring data tables, trends, and shapes. New York:Wiley.

Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1991).Fundamentals of exploratory analysis of variance. NewYork: Wiley.

Howson, C., & Urbach, P. (1993). Scientific reasoning: TheBayesian approach (2nd ed.). Peru, IL: Open Court.

Huberty, C. J. (1991). Introduction to the practice of statistics[Review]. Journal of Educational Statistics, 16, 77-81.

Jones, L. V. (Ed.). (1986a). The collected works of John W.

Page 28: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

158 BEHRENS

Tukey. Volume III: Philosophy and principles of dataanalysis: 1949-1964. Belmont, CA: Wadsworth.

Jones, L. V. (Ed.). (1986b). The collected works of John W.Tukey. Volume IV: Philosophy and principles of dataanalysis (1965-1986). Belmont, CA: Wadsworth.

Joreskog, K., & Sorbom, D. (1993). L1SREL S: Structuralequation modeling with the S1MPL1S command lan-guage. Chicago: Scientific Software International.

Kosslyn, S. M. (1989). Understanding charts and graphs.Applied Cognitive Psychology, 3, 185-225.

Kraemer, H. C., & Thiemann, S. (1987). How many sub-jects?: Statistical power analysis in research. BeverlyHills, CA: Sage.

Kruskal, W. H. (1982). Criteria for judging statistical graph-ics. Utilitas Mathematics, 21B, 283-310.

Langley, P., Simon, H. A., Bradshaw, G. L., & Zytkow, J. M.(1987). Scientific discovery. Cambridge, MA: MTT Press.

Lauver, P. J., & Jones, R. M. (1991). Factors associatedwith perceived career options in American Indian, White,and Hispanic rural high school students. Journal of Coun-seling Psychology, 38, 159-166.

Leinhardt, G., & Leinhardt, S. (1980). Exploratory dataanalysis: New tools for the analysis of empirical data. InD. Berliner (Ed.), Review of Research in Education (Vol.8, pp. 85-157).

Leinhardt, S., & Wasserman, S. S. (1979). Exploratory dataanalysis: An introduction to selected methods. In K. F.Schuessler (Ed.), Sociological methodology (pp. 311-365). San Francisco: Jossey-Bass.

Lent, R. W., & Hackett, G. (1987). Career self-efficacy:Empirical status and future directions. Journal of Voca-tional Behavior, 30, 347-382.

Lewandowsky, S., & Spence, I. (1989). Discriminatingstrata in scatterplots. Journal of the American StatisticalAssociation, 84, 682-688.

Lewandowsky, S., & Spence, I. (1990). The perception ofstatistical graphs. Sociological Methods and Research,18, 200-242.

Light, R. J., Singer, J. D., & Willett, J. B. (1994). The visualpresentation and interpretation of meta-analyses. In H.Cooper & L. V. Hedges, (Eds.), The handbook of re-search synthesis (pp. 439-454). New York: Russell SageFoundation.

Lind, J. C., & Zumbo, B. D. (1993). The continuity prin-ciple in psychological research: An introduction to robuststatistics. Canadian Psychology, 34, 407-414.

Loftus, G. R. (1993). Editorial comment. Memory and Cog-nition, 21, 1-3.

MacDonald, K. I. (1983). Exploratory data analysis: A pro-cess and a problem. In D. McKay, N. Schofield, & P.

Whiteley (Eds.), Data analysis and the social sciences(pp. 256-284). London: Pinter.

Mallows, C. L. (1979). Robust methods—Some examplesof their use. The American Statistician, 33, 179.

Maxwell, S. E., & Delaney, H. D. (1990). Designing experi-ments and analyzing data. Belmont, CA: Wadsworth.

Mayer, L. S. (1980). The use of exploratory methods ineconomic analysis: Analyzing residential energy demand.In J. Kmenta & J. B. Ramsey (Eds.), Evaluation of econ-ometric models (pp. 15—45). New York: Academic Press.

McGill, R., Tukey, J. W., & Larsen, W. A. (1978). Varia-tions of box plots. The American Statistician, 32, 12-16.

McGuire, W. J. (1989). A perspectivist approach to the stra-tegic planning of programmatic scientific research. In B.Gholson, W. R. Shadish, Jr., R. A. Neimeyer, & A. C.Houts (Eds.), Psychology of science: Contributions to meta-science. Cambridge, England: Cambridge University Press.

Mitroff, 1.1., Kilmann, R. K., & Barabba, V. P. (1979).Management information versus misinformation systems.In G. Zaltman (Ed.), Management principles for non-profit agencies and organizations (p. 104). New York:AMACOM.

Mosteller, F. (1988). Broadening the scope of statistics andstatistics education. The American Statistician, 42, 93-99.

Mosteller, F., & Tukey, J. W. (1977). Data analysis andregression: A second course in statistics. Reading, MA:Addison-Wesley.

Mulaik, S. A. (1984). Empiricism and exploratory statistics.Philosophy of Science, 52, 410-430.

Neyman, J., & Pearson, W. S. (1928). On the use and in-terpretation of certain test criteria for purposes of statis-tical inference: Part I. Biometrika, 20a, 175-240.

Paap, K. R., & Johansen, L. S. (1994). The case of the van-ishing frequency effect: A retest of the verificationmodel. Journal of Experimental Psychology: HumanPerception and Performance, 20, 1129-1157.

Rosenthal, R. (1966). Experimenter effects in behavioralresearch. New York: Appleton-Century-Crofts.

Rosenthal, R., & Gaito, J. (1963). The interpretation of lev-els of significance by psychological researchers. Journalof Psychology, 55, 33-38.

Rosenthal, R., & Gaito, J. (1964). Further evidence for thecliff effect in the interpretation of levels of significance.Psychological Reports, 15, 570.

Rosenthal, R., & Rosnow, R. L. (1991). Essentials of be-havioral research: Methods and analysis. New York:McGraw-Hill.

Rosnow, R. L. (1981). Paradigms in transition: The meth-odology of social inquiry. New York: Oxford UniversityPress.

Page 29: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

EXPLORATORY DATA ANALYSIS 159

Scott, D. W. (1992). Multivariate density estimation:

Theory, practice, and visualization. New York: Wiley.

Simkin, D., & Hastie, R. (1987). An information processing

analysis of graph perception. Journal of the American

Statistical Association, 82, 454-465.

Simon, H. A. (1973). Does scientific discovery have a

logic? Philosophy of Science, 40, 471^*80.

Simon, H. A., Langley, P. W., & Bradshaw, G. L. (1981).

Scientific discovery as problem solving. Synthese, 47,

1-27.

Smith, A. F., & Prentice, D. A. (1993). Exploratory data

analysis. In G. Keren & C. Lewis (Eds.), A handbook for

data analysis in the behavioral science: Statistical issues

(pp. 349-390). Hillsdale, NJ: Erlbaum.

Statistical Sciences, Inc. (1993). S-PLUSfor Windows Us-

er's Manual, Version 3.1. Seattle, WA: Statistical Sci-

ences, Inc.

Stock, W. A., & Behrens, J. T. (1991). Box, line, and mid-

gap plots: Effects of display characteristics on the accu-

racy and bias of estimates of whisker length. Journal of

Educational Statistics, 16, 1-20.

Tierney, L. (1990). LISP-STAT: An object-oriented environ-

ment for statistical computing and dynamic graphics.

New York: Wiley.

Tufte, E. R. (1983). The visual display of quantitative in-

formation. Cheshire, CT: Graphics Press.

Tufte, E. R. (1990). Envisioning information. Cheshire, CT:

Graphics Press.

Tukey, J. W. (1969). Analyzing data: Sanctification or de-

tective work? American Psychologist, 24, 83—91.

Tukey, J. W. (1977). Exploratory data analysis. Reading,

MA: Addison-Wesley.

Tukey, J. W. (1979). Methodology, and the statistician's

responsibility for both accuracy and relevance. Journal of

the American Statistical Association, 74, 786—793.

Tukey, J. W. (1980). We need both exploratory and confir-

matory. The American Statistician, 34, 23-25.

Tukey, J. W. (1986a). Data analysis and behavioral science

or learning to bear the quantitative man's burden by shun-

ning badmandments. In L. V. Jones (Ed.), The collected

works of John W. Tukey. Volume III: Philosophy and

principles of data analysis: 1949-1964 (pp. 187-389).

Belmont, CA: Wadsworth.

Tukey, J. W. (1986b). Data analysis, computation and math-

ematics. In L. V. Jones (Ed.), The collected works of John

W. Tukey. Volume IV: Philosophy and principles of data

analysis: 1965-1986 (pp. 753-775). Belmont, CA: Wad-

sworth. (Original work published 1972)

Tukey, J. W. (1986c). The future of processes of data analy-

sis. In L. V. Jones (Ed.), The collected works of John W.

Tukey: Vol. 4. Philosophy and principles of data analy-

sis: 1965-1986 (pp. 517-547). Monterey, CA: Wad-

sworth & Brooks/Cole.

Tukey, J. W. (1986d). Introduction to styles of data analysis

techniques. In L. V. Jones (Ed.), The collected works of

John W. Tukey. Volume IV: Philosophy and principles of

data analysis: 1965-1986 (pp. 969-983). Belmont, CA:

Wadsworth. (Original work published 1982)

Tukey, J. W. (1986e). Methodological comments focused

on opportunities. In L. V. Jones (Ed.), The collected

works of John W. Tukey. Volume IV: Philosophy and

principles of data analysis: 1965-1986 (pp. 819-867).

Belmont, CA: Wadsworth. (Original work published

1980)Tukey, J. W., & Wilk, M. B. (1986). Data analysis and sta

tistics: An expository overview. In L. V. Jones (Ed.), The

collected works of John W. Tukey. Volume IV: Philoso-

phy and principles of data analysis: 1965-1986 (pp. 549-

578). Belmont, CA: Wadsworth.

Velleman, P. F. (1992). Data desk: The new power of sta-

tistical vision. Ithaca, NY: Data Description, Inc.

Velleman, P. F., & Hoaglin, D. C. (1981). Applications, ba-

sics, and computing of exploratory data analysis. Boston:

Duxbury Press.

Velleman, P. F., & Hoaglin, D. C. (1992). Data analysis. In

D. C. Hoaglin & D. S. Moore (Eds.), MAA notes number

21: Perspectives on contemporary statistics (pp. 19—39).

Washington, DC: Mathematical Association of America.

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal,

interval and ratio typologies are misleading. The Ameri-

can Statistician, 47, 65-68.

Wainer, H. (1977a). Robust statistics: A survey and some

prescriptions. Journal of Educational Statistics, 1, 285-

312.

Wainer, H. (1977b). Speed vs. reaction time as a measure of

cognitive performance. Memory and Cognition, 5, 278—

280.

Wainer, H., & Thissen, D. (1981). Graphical data analysis.

Annual Review of Psychology, 32, 191-241.

Wainer, H., & Thissen, D. (1993). Graphical data analysis.

In G. Keren & C. Lewis (Eds.), A handbook for data

analysis in the behavioral sciences: Statistical issues (pp.

391-457). Hillsdale, NJ: Erlbaum.

Winer, B. J. (1971). Principles of experimental design. New

York: McGraw-Hill.

Winkler, R. L. (1993). Bayesian statistics: An overview. In

G. Keren & C. Lewis (Eds.), A handbook for data analy-

sis in the behavioral sciences: Methodological issues (pp.

201-232). Hillsdale, NJ: Erlbaum.

Yu, C., & Behrens, J. T. (1995). Applications of multivari-

ate visualization to behavioral sciences. Behavior Re-

search Methods, Instruments & Computers, 27, 264-271.

(Appendix follows)

Page 30: Principles and Procedures of Exploratory Data Analysis€¦ · Exploratory data analysis (EDA) is a well-established statistical tradition that pro - vides conceptual and computational

160 BEHRENS

Appendix

Software for EDA

As computer graphic capabilities widen in commonlyused machines, software for graphic and exploratory analy-ses gain in popularity. Nevertheless, a few programs have adecisively strong EDA emphasis. In this article, all thegraphics with the exception of kernel density estimates wereproduced in Data Desk on an Apple Power Macintosh com-puter. As illustrated previously, Data Desk is an exceptionaltool for EDA, having been designed from its inception to bean EDA technology. The documentation that accompaniesthe software may be the single best source of technical andpractical information concerning EDA. Data Desk offers acompletely graphical interface for EDA and requires noprogramming. Another heavily EDA oriented program isS-plus, the only software package I know of that supportsgraphics for the two-way fit. S-plus is a completely exten-sible object-oriented programming language available forUNIX and MS-Windows environments; however, it has lessgraphical interactivity than Data Desk. S-plus is commonlyused in the statistical graphics research community, andthere is a large archive of user-created S-plus functionsavailable at the Carnegie Mellon Statlib at http://lib.stat.cmu.edu.

The kernel density estimate plots shown previously wereproduced in XLISP-STAT (Tiemey, 1990), a LISP-basedsystem of statistical functions and graphics that is com-pletely extensible and highly interactive. XLISP-STAT isalso gaining a wide following in the statistical graphicscommunity. XLISP-STAT does not require LISP program-

ming skills for most tasks and is available for UNIX, Mac-intosh, and Windows environments for free. Copies of theprogram can be obtained at http://stat.umn.edu.

Other programs incorporate EDA procedures as well. Sy-stat has a wide variety of graphics and some interactivity asdoes SAS-JMP, although SAS-JMP reflects some moreCDA philosophies than may be convenient for strong EDAwork such as the strong association of levels of measure-ment with types of analyses (see Velleman & Wilkinson,1993, for a discussion of this concept). Almost all softwarepackages are now emphasizing the strength and beauty ofgraphical analysis with access to box plots, stem-and-leafplots, and so on. Consumers should beware that EDA func-tions best in highly interactive environments that supportquick question assessment. This involves complex interfaceissues that cannot be solved by the simple inclusion of anew plot in the list of options.

Readers interested in learning more about current issuesin data visualization and advances in statistical computingshould consult the Journal of Computational and GraphicalStatistics, which was inaugurated in 1992. When consider-ing specific products, readers may want to consult TheAmerican Statistician, which periodically contains softwarereviews by statistical computing experts.

Received June 7, 1996Revision received August 26, 1996

Accepted October 12, 1996 •