
A New Approach to Fitting Linear Models in High Dimensional Spaces

Yong Wang

This thesis is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at The University of Waikato

November 2000

© 2000 Yong Wang

Abstract

This thesis presents a new approach to fitting linear models, called "pace regression", which also overcomes the dimensionality determination problem. Its optimality in minimizing the expected prediction loss is theoretically established when the number of free parameters is infinitely large. In this sense, pace regression outperforms existing procedures for fitting linear models. Dimensionality determination, a special case of fitting linear models, turns out to be a natural by-product. A range of simulation studies are conducted; the results support the theoretical analysis.

Throughout the thesis, a deeper understanding is gained of the problem of fitting linear models, and many key issues are discussed. Existing procedures, namely OLS, AIC, BIC, RIC, CIC, CV(d), BS(m), RIDGE, NN-GAROTTE and LASSO, are reviewed and compared, both theoretically and empirically, with the new methods.

Estimating a mixing distribution is an indispensable part of pace regression. A measure-based minimum distance approach, including probability measures and nonnegative measures, is proposed, and strongly consistent estimators are produced. Of all minimum distance methods for estimating a mixing distribution, only the nonnegative-measure-based one solves the minority cluster problem, which is vital for pace regression.

Pace regression has striking advantages over existing techniques for fitting linear models. It also has more general implications for empirical modeling, which are discussed in the thesis.

Acknowledgements

First of all, I would like to express my sincerest gratitude to my supervisor, Ian Witten, who has provided a variety of valuable inspiration, assistance, and suggestions throughout the period of this project. His detailed constructive comments, both technical and linguistic, on this thesis (and lots of other things) are very much appreciated. I have benefited enormously from his erudition and enthusiasm in various ways in my research. He carefully considered my financial situation, often before I had done so myself, and offered research and teaching assistant work to help me out; thus I was able to study without any financial worry.

I would also like to thank my co-supervisor, Geoff Holmes, for his productive suggestions to improve my research and its presentation. As the chief of the machine learning project Weka, he financially supported my attendance at several international conferences. His kindness and encouragement throughout the last few years are much valued.

I owe a particular debt of gratitude to Alastair Scott (Department of Statistics, University of Auckland), who became my co-supervisor in the final year of the degree because of the substantial statistics component of the work. His kindness, expertise, and timely guidance helped me tremendously. He provided valuable comments on almost every technical point, as well as on the writing in the thesis; encouraged me to investigate a few further problems; and made it possible for me to use computers in his department and hence to implement pace regression in S-PLUS/R.

Thanks also go to Ray Littler and Thomas Yee, who spent their precious time discussing pace regression with me and offered illuminating suggestions. Kai Ming Ting introduced me to the M5 numeric predictor, the re-implementation of which was taken as the initial step of the PhD project and eventually led me to become interested in the dimensionality determination problem.

Mark Apperley, the head of the Department of Computer Science, University of Waikato, helped me financially, along with many other things, by offering a teaching assistantship, a part-time lectureship, funding for attending international conferences, and funding for visiting the University of Calgary for six months (an arrangement made by Ian Witten). I am also grateful to the Waikato Scholarships Office for offering the three-year University of Waikato Full Scholarship.

I have also received various help from many others in the Department of Computer Science, in particular Eibe Frank, Mark Hall, Stuart Inglis, Steve Jones, Masood Masoodian, David McWha, Gordon Paynter, Bernhard Pfahringer, Steve Reeves, Phil Treweek, and Len Trigg. It was so delightful playing soccer with these people.

I would like to express my appreciation to many people in the Department of Statistics, University of Auckland, in particular Paul Copertwait, Brian McArdle, Renate Meyer, Russell Millar, Geoff Pritchard, and George Seber, for allowing me to sit in on their courses, providing lecture notes, and marking my assignments and tests, which helped me to learn how to apply statistical knowledge and express it gracefully. Thanks are also due to Jenni Holden, John Huakau, Ross Ihaka, Alan Lee, Chris Triggs, and others for their hospitality.

Finally, and most of all, I am grateful to my family and extended family for their tremendous tolerance and support, in particular my wife, Cecilia, and my children, Campbell and Michelle. It is a pity that I often have to leave them, although staying with them is so enjoyable. The time-consuming nature of this work looks so fascinating to Campbell and Michelle that they plan to go to their daddy's university first, instead of a primary school, after they leave their kindergarten.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
1.1 Introduction
1.2 Modeling principles
1.3 The uncertainty of estimated models
1.4 Contributions
1.5 Outline

2 Issues in fitting linear models
2.1 Introduction
2.2 Linear models and the distance measure
2.3 OLS subset models and their ordering
2.4 Asymptotics
2.5 Shrinkage methods
2.6 Data resampling
2.7 The RIC and CIC
2.8 Empirical Bayes
2.8.1 Empirical Bayes
2.8.2 Stein estimation
2.9 Summary

3 Pace regression
3.1 Introduction
3.2 Orthogonal decomposition of models
3.2.1 Decomposing distances
3.2.2 Decomposing the estimation task
3.2.3 Remarks
3.3 Contributions and expected contributions
3.3.1 $A_j$ and $\hat{A}_j$
3.3.2 $h$ and $H$
3.3.3 $h(\hat{A}; A)$
3.3.4 $H(\hat{A})$
3.4 The role of $h$ and $H$ in modeling
3.5 Modeling with known $G(A)$
3.6 The estimation of $G(A)$
3.7 Summary
3.8 Appendix
3.8.1 Reconstructing the model from updated absolute distances
3.8.2 The Taylor expansion of $h(\hat{A}; A)$ with respect to $\hat{A}$
3.8.3 Proof of Theorem 3.11
3.8.4 Identifiability for mixtures of noncentral $\chi^2$ distributions

4 Discussion
4.1 Introduction
4.2 Finite $k$ vs. $k$-asymptotics
4.3 Collinearity
4.4 Regression for partial models
4.5 Remarks on modeling principles
4.6 Orthogonalization selection
4.7 Updating signed $\hat{A}_j$?

5 Efficient, reliable and consistent estimation of an arbitrary mixing distribution
5.1 Introduction
5.2 Existing estimation methods
5.2.1 Maximum likelihood methods
5.2.2 Minimum distance methods
5.3 Generalizing the CDF-based approach
5.3.1 Conditions
5.3.2 Estimators
5.3.3 Remarks
5.4 The measure-based approach
5.4.1 Conditions
5.4.2 Approximation with probability measures
5.4.3 Approximation with nonnegative measures
5.4.4 Considerations for finite samples
5.5 The minority cluster problem
5.5.1 The problem
5.5.2 The solution
5.5.3 Experimental illustration
5.6 Simulation studies of accuracy
5.7 Summary

6 Simulation studies
6.1 Introduction
6.2 Artificial datasets: An illustration
6.3 Artificial datasets: Towards reality
6.4 Practical datasets
6.5 Remarks

7 Summary and final remarks
7.1 Summary
7.2 Unsolved problems and possible extensions
7.3 Modeling methodologies

Appendix: Help files and source code
A.1 Help files
A.2 Source code

References


List of Figures

3.1 $h(\hat{A}; A)$ and $H(\hat{A}; A)$, for several values of $A$
3.2 Two-component mixture $H$-function
4.1 Uni-component mixture
4.2 Asymmetric mixture
4.3 Symmetric mixture
6.1 Experiment 6.1
6.2 Experiment 6.2
6.3 Experiment 6.3
6.4 $h(\hat{A}; A)$ in Experiment 6.4
6.5 Mixture $H$-function (…)
6.6 Mixture $H$-function (…)


List of Tables

5.1 Clustering reliability
5.2 Clustering accuracy: mixtures of normal distributions
5.3 Clustering accuracy: mixtures of $\chi^2$ distributions
6.1 Experiment 6.1
6.2 Experiment 6.2
6.3 Experiment 6.3, for varying $k$
6.4 Experiment 6.4
6.5 (…) in Experiment 6.5
6.6 Experiment 6.5 (…)
6.7 Experiment 6.5 (…)
6.8 Experiment 6.6
6.9 Practical datasets
6.10 Experiments for practical datasets


Chapter 1

Introduction

1.1 Introduction

Empirical modeling builds models from data, as opposed to analytical modeling, which derives models from mathematical analysis on the basis of first principles, background knowledge, reasonable assumptions, etc. One major difficulty in empirical modeling is the handling of the noise embedded in the data. By noise here we mean the uncertainty factors, which change from time to time and are usually best described by probabilistic models. An empirical modeling procedure generally utilizes a model prototype with some adjustable, free parameters and employs an optimization criterion to find the values of these parameters in an attempt to minimize the effect of noise. In practice, "optimization" usually refers to the best explanation of the data, where the explanation is quantified in some mathematical measure; for example, least squares, least absolute deviation, maximum likelihood, maximum entropy, minimum cost, etc.

The number of free parameters or, more precisely, the degrees of freedom, plays an important role in empirical modeling. One well-known phenomenon concerning these best data explanation criteria is that any given data can be increasingly explained as the number of free parameters increases: in the extreme case, when this number is equal to the number of observations, the data is fully explained. This phenomenon is intuitive, for any effect can be explained given enough reasons. Nevertheless, the model that fully explains the data in this way usually has little predictive power on future observations, because it explains the noise so much that the underlying relationship is obscured. Instead, a model that includes fewer parameters may perform significantly better in prediction.
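As a concrete illustration, the following minimal R sketch (ours, not part of the thesis; all names are illustrative) fits an increasing number of pure-noise predictors to a pure-noise response. The training fit improves monotonically, and is perfect once the parameter count reaches the number of observations, while prediction on a fresh response only deteriorates.

    # Adding pure-noise predictors always "explains" the training data better,
    # but hurts prediction on future observations.
    set.seed(1)
    n <- 40
    x <- matrix(rnorm(n * 39), n, 39)      # 39 pure-noise predictors
    y <- rnorm(n)                          # response unrelated to all of them
    y.new <- rnorm(n)                      # a "future" response at the same design
    train.rss <- test.rss <- numeric(39)
    for (j in 1:39) {
      fit <- lm(y ~ x[, 1:j, drop = FALSE])
      train.rss[j] <- sum(resid(fit)^2)            # decreases monotonically
      test.rss[j]  <- sum((y.new - fitted(fit))^2) # tends to grow with j
    }
    # At j = 39 the model has 40 parameters (including the intercept) for 40
    # observations, so train.rss[39] is essentially zero: the data is "fully
    # explained", yet the fit is worthless for prediction.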

While free parameters are inevitably required by the model prototype, it seems natural to ask the question: how many parameters should enter the final estimated model? This is a classic issue, formally known as the dimensionality determination problem (or the dimensionality reduction, model selection, subset selection, variable selection, etc., problem). It was first brought to wide awareness by H. Akaike (1969) in the context of time series analysis. Since then, it has drawn research attention from many scientific communities, in particular statistics and computer science (see Chapter 2).

Dimensionality determination is a key issue in empirical modeling. Although most research interest in this problem focuses on fitting linear models, it is widely acknowledged that the problem plagues almost all types of model structure in empirical modeling.[1] It is inevitably encountered when building various types of model, such as generalized additive models (Hastie and Tibshirani, 1990), tree-based models (Breiman et al., 1984; Quinlan, 1993), projection pursuit (Friedman and Stuetzle, 1981), multiple adaptive regression splines (Friedman, 1991), locally weighted models (Cleveland, 1979; Fan and Gijbels, 1996; Atkeson et al., 1997; Loader, 1999), rule-based learning (Quinlan, 1993), and instance-based learning (Li, 1987; Aha et al., 1991), to name a few. There is a large literature devoted to the issue of dimensionality determination, and many procedures have been proposed to address the problem, most of them applicable to both linear and other types of model structure (see Chapter 2). Monographs that address the general topics of dimensionality determination, in particular for linear regression, include Linhart and Zucchini (1989), Rissanen (1989), Miller (1990), Burnham and Anderson (1998), and McQuarrie and Tsai (1998).

[1] It does not seem to be a problem for the estimation of a mixing distribution (see Chapter 5), since a larger set of candidate support points does not jeopardize, and may even improve, the estimation of the mixing distribution. This observation, however, does not generalize to every kind of unsupervised learning. For example, the loss function employed in a clustering application does take the number of clusters into account for, say, computational reasons.

So far this problem has often been considered as an issue independent of modeling principles. Unfortunately, it turns out to be so fundamental in empirical modeling that it casts a shadow over the general validity of many existing modeling principles, such as the least squares principle, the (generalized) maximum likelihood principle, etc. This is because it takes but one counter-example to refute the correctness of a generalization, and all these principles fail to solve the dimensionality determination problem. We believe its solution relates closely to fundamental issues of modeling principles, which have been the subject of controversy throughout the history of theoretical statistics, without any consensus being reached.

R. A. Fisher (1922) identifies three problems that arise in data reduction: (1) model specification, (2) parameter estimation in the specified model, and (3) the distribution of the estimates. In much of the literature, dimensionality determination is considered as belonging to the first type of problem, while modeling principles belong to the second. However, the considerations in the preceding paragraph imply that dimensionality determination belongs to the second problem type. To make this point more explicit, we modify Fisher's first two problems slightly, into (1) specifying the candidate model space, and (2) determining the optimal model from the model space. This modification is consistent with modern statistical decision theory (Wald, 1950; Berger, 1985), and will be taken as the starting point of our work.

This thesis proposes a new approach to solving the dimensionality determination problem, which turns out to be a special case of general modeling, i.e., of determining the values of free parameters. The key idea is to consider the uncertainty of estimated models (introduced in Section 1.3), which results in a methodology similar to empirical Bayes (briefly reviewed in Section 2.8). The issue of modeling principles is briefly covered in Section 1.2. Two principles will be employed, which naturally lead to a solution to the dimensionality determination problem or, more generally, to fitting linear models. We theoretically establish the optimality of the proposed procedures in the sense of k-asymptotics (i.e., an infinite number of free parameters). This implies a tendency of increasing performance as k increases, and justifies applying our procedures in high-dimensional spaces. Like many other approaches, we focus for simplicity on fitting linear models with normally distributed noise, but the ideas can probably be generalized to many other situations.

1.2 Modeling principles

In the preceding section, we discussed the optimization criterion used in empirical modeling. This relates to the fundamental question of modeling: among the candidate models, which one is the best? This is a difficult and controversial issue, and there is a large literature devoted to it; see, for example, Berger (1985) and Brown (1990) and the references cited therein.

In our work, two separate but complementary principles are adopted for successful modeling. The first follows utility theory in decision analysis (von Neumann and Morgenstern, 1944; Wald, 1950; Berger, 1985), where utility is a mathematical measure of the decision-maker's satisfaction. The principle for best decision-making, including model estimation, is to maximize the expected utility. Since we are more concerned with loss, the negative of utility, we rephrase this as the fundamental principle for empirical modeling as follows.

The Minimum Expected Loss Principle. Choose the model from the candidate model space with the smallest expected loss in predicting future observations.

To minimize the expected loss over future observations, we often need to make assumptions so that future observations can be related to the observations used for model construction. By assumptions we mean statistical distributions, background knowledge, loss information, etc. Of course, the performance of the resulting model over future observations also depends on how far these assumptions are from future reality.

Further, we define a better model to be one with smaller expected loss, and equivalent models to have equal expected loss. The principle is then applied to find the best model among all candidates.
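Stated symbolically (in our notation, anticipating the distance measure of Section 2.2), the two definitions combine into selecting

    $\hat{M} = \arg\min_{M \in \mathcal{M}} E\,[\,D(M, M^*)\,]$,

where $\mathcal{M}$ is the candidate model space, $M^*$ the true model, and $D$ a distance whose expectation orders the models; equivalent models are exactly those for which this expectation is equal.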

It may seem that the principle does not address the issue of model complexity (e.g., the number of parameters), which is important in many practical applications. This is in fact not true, because the definition of the loss could also include the loss incurred due to model complexity, for example, the loss for comprehension and computation. Thus model complexity can be naturally taken into account when minimizing the expected loss. In practice, however, this kind of loss is often difficult to quantify appropriately, and even more difficult to manipulate mathematically. When this is the case, a second, parsimony-oriented principle becomes helpful.

The Simplicity Principle. Of the equivalent models, choose the simplest.

We will apply this principle to approximately equivalent models, i.e., when the difference between the expected losses of the models under consideration is tolerable. It is worth pointing out that, though similar, this principle differs from the generally understood Occam's razor, in that the former considers equal expected losses, while the latter considers equal data explanation. We believe that it is inappropriate to compare models based on data explanation, and we return to this point in the next section.

"Tolerable" here is meant only in the sense of statistical significance. In practice, model complexity may be further reduced when other factors, such as data precision, prediction significance and the cost of model construction, are considered. These issues, however, are less technical and will not be used in our work.

Although the above principles may seem simple and clear, their application is complicated in many situations by the fact that the distribution of the true model is usually unknown, and hence the expected loss is uncomputable. Partly for this reason, a number of modeling principles exist which are widely used and extensively investigated. Some of the most important are the least squares principle (Legendre, 1805; Gauss, 1809), the maximum likelihood principle (Fisher, 1922, 1925; Barnard, 1949; Birnbaum, 1962), the maximum entropy principle (Shannon, 1948; Kullback and Leibler, 1951; Kullback, 1968), the minimax principle (von Neumann, 1928; von Neumann and Morgenstern, 1944; Wald, 1939, 1950), the Bayes principle (Bayes, 1763), the minimum description length (MDL) principle (Rissanen, 1978, 1983), Epicurus's principle of multiple explanations, and Occam's razor. Instead of discussing these principles here, we examine a few major ones retrospectively in Section 4.5, after our work has been presented.

In contrast to the above analytical approach, model evaluation can also be undertaken empirically, i.e., using data. The data used for evaluation can be obtained from another independent sample or from data resampling (for the latter, see Efron and Tibshirani, 1993; Shao and Tu, 1995). While empirical evaluation has found wide application in practice, it is more an important aid in evaluating models than a general modeling principle. One significant limitation is that results obtained from experimenting with data rely heavily on the experimental design and hence are inconclusive. Further, such evaluations are often computationally expensive and apply only to finitely many candidates. Almost all major existing data resampling techniques are still under intensive theoretical investigation, and some theoretical results in fitting linear models have already revealed their limitations; see Section 2.6.

1.3 The uncertainty of estimated models

It is known that in empirical modeling the noise embedded in the data is propagated into the estimated models, causing estimation uncertainty. Fisher's third problem concerns this kind of uncertainty; an example is the estimation of confidence intervals. The common approach to decreasing the uncertainty of an estimated model is to increase the number of observations, as suggested by the Central Limit Theorem and the Laws of Large Numbers.

Traditionally, as mentioned above, a modeling procedure outputs the model that best explains the data among the candidate models. Since each estimated model is uncertain to some extent, the model that optimally explains the data is the winner of a competition based on the combined effect of both true predictive power and the uncertainty effect. If there are many independent candidate models in the competition, the "optimum" can be significantly influenced by the uncertainty effect, and hence the optimal data explanation may be good only by chance. An extreme situation is that in which no candidate has any predictive power, i.e., only the uncertainty effect is explained. Then, the more a candidate explains the data, the farther it is from the true model and hence the worse it performs over future observations. In this extreme case, the expected loss of the winner increases without bound as the number of candidates increases, a conclusion that can easily be established from order statistics. The competition phenomenon is also investigated by Miller (1990), who defines the selection, or competition, bias of the selected model. This bias generally increases with the number of candidate models.
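The effect is easy to reproduce. In the following R simulation (ours; purely illustrative), every candidate variable is pure noise, yet the apparent explanatory power of the winning candidate grows steadily with the size of the field, exactly as order statistics predict.

    # Competition among m pure-noise candidates: the winner's apparent
    # (training) R^2 grows with m, although its true predictive power is nil.
    set.seed(1)
    n <- 50
    for (m in c(5, 50, 500)) {
      winner.r2 <- replicate(200, {
        y <- rnorm(n)
        x <- matrix(rnorm(n * m), n, m)
        max(cor(x, y)^2)        # best single-variable training R^2
      })
      cat(sprintf("m = %3d candidates: mean winning R^2 = %.3f\n",
                  m, mean(winner.r2)))
    }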

This implies that a modeling procedure based solely on the maximization of data explanation is not appropriate. To take an extreme case, it could always find the existence of "patterns" in a completely random sequence, provided only that enough candidate models are used. This kind of modeling practice is known in unsavory terms as "data dredging", "fishing expeditions", "data mining" (in the old sense[2]), and "torturing the data until they confess" (Miller, 1990, p. 12).

An appropriate modeling procedure, therefore, should take the uncertainty effect into consideration. It would be ideal if this effect could be separated from the model's data explanation, so that the comparison between models is based on their predictive power only. Inevitably, without knowing the true model, it is impossible to filter the uncertainty effect out of the observed performance completely. Nevertheless, since we know that capable participants tend to perform well and incapable ones tend to perform badly, we should be able to gain some information about the competition from the observed performance of all participants, use this information to generate a better estimate of the predictive power of each model, and even update the model estimate to a better one by stripping off the expected uncertainty effect.

[2] Data mining now denotes "the extraction of previously unknown information from databases that may be large, noisy and have missing data" (Chatfield, 1995).

Eventually, model selection or, more generally, the modeling task can be based on the estimated predictive power, not simply the data explanation. In short, the modeling procedure relies not only on random data, but also on random models.

This methodological consideration forms the basis of our new modeling approach. The practical situation may be so difficult that no general mathematical tools are available to provide solutions. For example, there may be many candidate models, generated by different types of modeling procedure, which are not statistically independent at all. The competition can be very hard to analyze.

In this thesis, the idea is applied to solving the dimensionality determination problem in fitting linear models, which seems a less difficult problem than the above general formulation. To be more precise, it is applied to fitting linear models in the general sense, which in fact subsumes the dimensionality determination problem. We transform the full model that contains all parameters into an orthogonal space to make the dimensional models independent. The observed dimensional models serve to provide an estimate of the overall distribution of the true dimensional models, which in turn helps to update each observed dimensional model to a better estimate, in expectation, of the true dimensional model. The output model is re-generated by an inverse transformation.
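The skeleton of this procedure can be sketched in a few lines of R (ours, and only a skeleton: the distribution-based updating rules are the subject of Chapter 3, and the crude global shrinkage used below is merely a placeholder standing where those updates belong).

    # Structural sketch: orthogonalize, update the dimensional components,
    # transform back. NOT the pace regression estimator itself.
    pace.sketch <- function(X, y) {
      qr.X <- qr(X)
      Q <- qr.Q(qr.X)                     # orthogonal basis: dimensional models
      z <- as.vector(crossprod(Q, y))     # independent dimensional components
      sigma2 <- sum((y - Q %*% z)^2) / (nrow(X) - ncol(X))
      # Chapter 3: estimate the distribution of the true components from the
      # observed z, then update each component accordingly. Placeholder:
      z.new <- z * max(0, 1 - ncol(X) * sigma2 / sum(z^2))
      backsolve(qr.R(qr.X), z.new)        # inverse transformation
    }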

Although it started from a different viewpoint, namely that of competing models, our approach turns out to be very similar to the empirical Bayes methodology pioneered by Robbins (1951, 1955, 1964, 1983); see Section 2.8.

1.4 Contributions

Although the major ideas in this thesis could be generalized to other types of model structure, we focus on fitting linear models, with a strong emphasis on the dimensionality determination problem which, as we will demonstrate, turns out to be a special case of modeling in the general sense.

As we have pointed out, the problem under investigation relates closely to the fundamental principles of modeling. Therefore, despite the fact that we would like to focus simply on solving mathematical problems, arguments in some philosophical sense seem inevitable. The danger is that they may sound too "philosophical" to be sensible.

The main contributions of the work are as follows.

The competition viewpoint. Throughout the thesis, the competition viewpoint is adopted systematically. This not only provides the key to fitting linear models quantitatively, including a solution to the dimensionality determination problem, but also helps to answer, at least qualitatively, some important questions in the general case of empirical modeling.

The phenomenon of competing models in fitting linear models has been investigated by other authors; for example, Miller (1990), Donoho and Johnstone (1994), Foster and George (1994), and Tibshirani and Knight (1997). However, as shown later, these existing methods can still fail in some particular cases. Furthermore, none of them achieves k-asymptotic optimality generally, a property enjoyed by our new procedures.

The competition phenomenon pervades various contexts of empirical modeling; it arises, for example, in orthogonalization selection (see Section 4.6).

Predictive objective of modeling. Throughout the work, we emphasize the performance of constructed models over future observations, by explicitly incorporating it into the minimum expected loss principle (Section 1.2) and relating it analytically to our modeling procedures (see, e.g., Sections 2.2 and 3.6). Although this goal is often used in simulation studies of modeling procedures in the literature, it is rarely addressed analytically, which seems ironic. Many existing modeling procedures do not directly examine this relationship, although it is acknowledged that they are derived from different considerations.

Employing the predictive objective emphasizes that it is reality that governs theory, not vice versa. The determination of the loss function according to specific applications is another example.

In Section 5.6, we use this same objective for evaluating the accuracy of clustering procedures.

Pace regression. A new approach to fitting linear models, pace regression, is proposed, based on the consideration of competing models, where "pace" stands for "Projection Adjustment by Contribution Estimation." Six related procedures are developed, denoted PACE1 to PACE6, which share a common fundamental idea: estimating the distribution of the effects of variables from the data and using this to improve modeling. The optimality of these procedures, in the sense of k-asymptotics, is theoretically established. The first four procedures utilize OLS subset selection, and outperform existing OLS methods for subset selection, including OLS itself (by which we mean the OLS-fitted model that includes all parameters). By abandoning the idea of selection, PACE5 achieves the highest prediction accuracy of all. It even surpasses the OLS estimate when all variables have large effects on the outcome, in some cases by a substantial margin. Unfortunately, the extensive numerical calculations that PACE5 requires may limit its application in practice. However, PACE6 is a very good approximation to PACE5, and is computationally efficient.

In deriving the pace regression procedures, some new terms are introduced, such as the "contribution" of a model and the h- and H-functions. These indicate whether or not a model, or a dimensional model, contributes, in expectation, to a prediction of future observations, and they are employed to obtain the pace regression procedures.

We realize that this approach is similar to the empirical Bayes methodology (see Section 2.8) which, to the author's knowledge, has not previously been used in this way for fitting linear models (or other model structures), despite the fact that empirical Bayes has long been recognized as a breakthrough in statistical inference (see, e.g., Neyman, 1962; Maritz and Lwin, 1989; Kotz and Johnson, 1992; Efron, 1998; Carlin and Louis, 2000).

Review and comparisons. The thesis also provides, in Chapter 2, a general review that covers a wide range of key issues in fitting linear models. Comparisons of the new procedures with existing ones are placed appropriately throughout the thesis.

Throughout the review, the comparisons and some of the theoretical work, we also attempt to provide a unified view of most approaches to fitting linear models, such as OLS, OLS subset selection, model selection based on data resampling, shrinkage estimation, empirical Bayes, etc. Further, our work in fitting linear models seems to imply that supervised learning requires unsupervised learning (more precisely, the estimation of a mixing distribution) as a fundamental step for handling the competition phenomenon.

Simulation studies. Simulation studies are conducted in Chapter 6. The experiments range from artificial to practical datasets, from illustration to application, from high to low noise levels, from large to small numbers of variables, from independent to correlated variables, etc. The procedures used in the experiments include OLS, AIC, BIC, RIC, CIC, PACE2, PACE4, PACE6, LASSO and NN-GAROTTE.

Experimental results generally support our theoretical conclusions.

Efficient, reliable and consistent estimation of a mixing distribution. Pace regression requires an estimator of the mixing distribution of an arbitrary mixture distribution. In Chapter 5, a new minimum distance approach based on a nonnegative measure is shown to provide efficient, reliable and consistent estimation of the mixing distribution. The existing minimum distance methods are efficient and consistent, yet not reliable, due to the minority cluster problem (see Section 5.5), while the maximum likelihood methods are reliable and consistent, yet not efficient in comparison with the minimum distance approach.

Reliable estimation is a finite-sample issue, but it is vital in pace regression, since the misclustering of even a single isolated point could result in unbounded prediction loss. Efficiency may also be desired in practice, especially when it is necessary to fit linear models repeatedly. Strong consistency of each new estimator is established.

We note that the estimation of a mixing distribution is an independent topic with its own value and a large literature.

Implementation. Pace regression, together with some other ideas in the thesis, is implemented in the programming languages S-PLUS (tested with versions 3.4 and 5.1, in a way compatible with R version 1.1.1) and FORTRAN 77. The source code is given in the Appendix, and also serves to provide full details of the ideas in the thesis. Help files in S format have been written for the most important functions.

All the source code and help files are freely available from the author, and may be redistributed and/or modified under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.

1.5 Outline

The thesis consists of seven chapters and one appendix.

Chapter 1 briefly introduces the general idea of the whole work. It defines the dimensionality determination problem, and points out that a shadow is cast on the validity of many empirical modeling principles by their failure to solve this problem. Our approach to this problem adopts the viewpoint of competing models, due to estimation uncertainty, and utilizes two complementary modeling principles: minimum expected loss, and simplicity.

Some important issues in fitting linear models are reviewed in Chapter 2, in an attempt to provide a solid basis for the work presented later. They include linear models and distances, OLS subset selection, asymptotics, shrinkage, data resampling, the RIC and CIC procedures, and empirical Bayes. The basic method of applying the general ideas of competing models to linear regression is embedded in this chapter.

Chapter 3, the core of the thesis, formally proposes and theoretically evaluates pace regression. New concepts, such as the contribution and the h- and H-functions, are defined in Section 3.3, and their roles in modeling are illustrated in Section 3.4. Six pace regression procedures are formally defined in Section 3.5, while k-asymptotic optimality is established in Section 3.6.

A few important issues are discussed in Chapter 4, some completing the definition of the pace regression procedures in special situations, others expanding their implications into a broader arena, and still others raising open questions.

In Chapter 5, the minimum distance estimation of a mixing distribution is investigated, along with a brief review of maximum likelihood estimation. The chapter focuses on establishing consistency of the proposed probability-measure-based and nonnegative-measure-based estimators. The minority cluster problem, vital for pace regression, is discussed in Section 5.5. Section 5.6 further investigates empirically the accuracy of these clustering estimators in situations of overlapping clusters.

Results of simulation studies of pace regression and other procedures for fitting linear models are given in Chapter 6.

The final chapter briefly summarizes the whole work and provides some topics for future work and open questions.

The source code in S and FORTRAN, and some help files, are listed in the Appendix.

Chapter 2

Issues in fitting linear models

2.1 Introduction

The basic idea of regression analysis is to fit a linear model to a set of data. The classical ordinary least squares (OLS) estimator is simple, computationally cheap, and has well-established theoretical justification. Nevertheless, the models it produces are often less than satisfactory. For example, OLS does not detect redundancy in the set of independent variables that are supplied, and when a large number of variables are present, many of which are redundant, the model produced usually has worse predictive performance on future data than simpler models that take fewer variables into account.

Many researchers have investigated methods of subset selection in an attempt to neutralize this effect. The most common approach is OLS subset selection: from a set of OLS-fitted subset models, choose the one that optimizes some pre-determined modeling criterion. Almost all these procedures are based on the idea of thresholding variation reduction: calculating how much the variation of the model increases when each variable in turn is taken away, setting a threshold on this amount, and discarding variables that contribute less than the threshold. The rationale is that a noisy variable, i.e., one that has no predictive power, usually reduces the variation only marginally, whereas the variation accounted for by a meaningful variable is larger and grows with the variable's significance.
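In R, the thresholding idea can be expressed directly with the built-in drop1 function (our illustration; the threshold lambda anticipates the criteria of Section 2.4):

    # How much does the residual sum of squares rise when each variable is
    # dropped in turn? Discard those below a threshold.
    set.seed(1)
    dat <- data.frame(matrix(rnorm(50 * 5), 50, 5))
    names(dat) <- c("y", paste0("x", 1:4))
    dat$y <- dat$y + 2 * dat$x1                 # only x1 is meaningful
    fit <- lm(y ~ ., data = dat)
    d1 <- drop1(fit)                            # RSS increase per dropped term
    rss.inc <- setNames(d1$"Sum of Sq", rownames(d1))[-1]
    sigma2.hat <- summary(fit)$sigma^2
    lambda <- 2                                 # an AIC-like threshold
    rss.inc[rss.inc > lambda * sigma2.hat]      # variables surviving the threshold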


Many well-known procedures, including FPE (Akaike, 1970), AIC (Akaike, 1973), $C_p$ (Mallows, 1973) and BIC (Schwarz, 1978), follow this approach. While these certainly work well for some datasets, extensive practical experience and many simulation studies have exposed shortcomings in them all. Often, for example, a certain proportion of redundant variables is included in the final model (Derksen and Keselman, 1992). Indeed, there are datasets for which a full regression model outperforms the selected subset model unless most of the variables are redundant (Hoerl et al., 1986; Roecker, 1991).

Shrinkage methods offer an alternative to OLS subset selection. Simulation studies show that the technique of biased ridge regression can outperform OLS subset selection, although it generates a more complex model (Frank and Friedman, 1993; Hoerl et al., 1986). The shrinkage idea was further explored by Breiman (1995) and Tibshirani (1996), who were able to generate models that are less complex than ridge regression models yet still enjoy higher predictive accuracy than OLS subset models. Empirical evidence presented in these papers suggests that shrinkage methods yield greater predictive accuracy than OLS subset selection when a model has many noisy variables, or at most a moderate number of variables with moderate-sized effects, whereas they perform worse when there are a few variables that have a dramatic effect on the outcome.

These problems are systematic: the performance of modeling procedures can be related to the effects of the variables and the extent of these effects. Researchers have sought to understand these phenomena and use them to motivate new approaches. For example, Miller (1990) investigated the selection bias that is introduced when the same data is used both to estimate the coefficients and to choose the subsets. New procedures have been proposed, including the little bootstrap (Breiman, 1992), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1997). While these undoubtedly produce good models for many datasets, we will see later that no single approach solves these systematic problems in a general sense.

Our work in this thesis will show that these problems are the tip of an iceberg. They are manifestations of a much more general phenomenon that can be understood by examining the expected contributions that individual variables make in an orthogonal decomposition of the estimated model. This analysis leads to a new approach called "pace regression"; see Chapter 3.

We investigate model estimation in a general sense that subsumes subset selection. We do not confine our efforts to finding the best of the subset models; instead we address the whole space of linear models and regard subset models as a special case. But it is an important special case, because simplifying the model structure has wide applications in practice, and we will use it extensively to help sharpen our ideas.

In this chapter, we introduce and review some important issues in linear regression to provide a basis for the analysis that follows. We briefly review the major approaches to model selection, focusing on their failure to solve the systematic problems raised above. A common thread emerges: the key to solving the general problem of model selection in linear regression lies in the distribution of the effects of the variables that are involved.

2.2 Linear models and the distance measure

Given $k$ explanatory variables, the response variable, and $n$ independent observations, a linear model can be written in the following matrix form:

    $\mathbf{y} = X \beta^* + \epsilon$,    (2.1)

where $\mathbf{y}$ is the $n$-dimensional response vector, $X$ the $n \times k$ design matrix, $\beta^*$ the $k$-dimensional parameter vector of the true, underlying model, and each element of the noise component $\epsilon$ is independently and identically distributed according to $N(0, \sigma^2)$. We assume for the most part that the variance $\sigma^2$ is known; if not, it can be estimated using the OLS estimator $\hat\sigma^2$ of the full model.

With the variables defined, a linear model is uniquely determined by a parameter vector $\beta$. Therefore, we use $M$ to denote any model, $M(\beta)$ the model with parameter vector $\beta$, and $M^*$ as shorthand for the underlying model $M(\beta^*)$. The entire space of linear models is $\mathcal{M}_k = \{ M(\beta) : \beta \in \mathbb{R}^k \}$. Note that the models considered (and produced) by the OLS method, OLS subset selection methods, and shrinkage methods are all subclasses of $\mathcal{M}_k$. Any zero entry in $\beta^*$ corresponds to a truly redundant variable. In fact, variable selection is not a problem independent of parameter estimation or, more generally, of modeling; it is just a special case in which the discarded variables correspond to zero entries in the estimated parameter vector.

Further, we need a general distance measure between any two models in the space, one which supersedes the usually used loss function in the sense that minimizing the expected distance between the estimated and true models also minimizes (perhaps approximately) the expected loss. We use a general distance measure rather than the loss function only, because other distances are also of interest. Defining a distance measure coincides, too, with the notion of a metric space, which is how we envisage the model space.

Now we define the distance measure by relating it to the commonly used quadratic loss. Given a design matrix $X$, the prediction vector of the model $M(\beta)$ is $\mathbf{y}_{M(\beta)} = X\beta$. In particular, the true model $M^*$ predicts the output vector $\mathbf{y}^* = \mathbf{y}_{M^*} = X\beta^*$. The distance between two models is defined as

    $D(M(\beta_1), M(\beta_2)) = \| \mathbf{y}_{M(\beta_1)} - \mathbf{y}_{M(\beta_2)} \|^2 / \sigma^2$,    (2.2)

where $\|\cdot\|$ denotes the $L_2$ norm. Note that our real interest, the quadratic loss $\| \mathbf{y}_M - \mathbf{y}^* \|^2$ of the model $M$, is directly related to the distance $D(M, M^*)$ by the scaling factor $\sigma^2$. Therefore, in the case that $\sigma^2$ is unknown and $\hat\sigma^2$ is used instead, any conclusion concerning the distance using $\sigma^2$ remains true asymptotically (Section 3.6).
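In code, the distance (2.2) is a one-liner (our R transcription; the function name is illustrative):

    # Distance (2.2) between the models with parameter vectors beta1 and beta2,
    # given the design matrix X and the noise standard deviation sigma.
    model.distance <- function(X, beta1, beta2, sigma)
      sum((X %*% (beta1 - beta2))^2) / sigma^2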

Of course, $\| \mathbf{y}_M - \mathbf{y}^* \|^2$ is only the loss under the X-fixed assumption, i.e., that the design matrix $X$ remains unchanged for future data. Another possible assumption used in fitting linear models is X-random, i.e., each explanatory variable is a random variable in the sense that both training and future data are drawn independently from the same distribution. The implications of the X-fixed and X-random assumptions in subset selection are discussed by Thompson (1978); see also Miller (1990), Breiman (1992) and Breiman and Spector (1992) for different treatments of the two cases, in particular when $n$ is small. Although the work presented later does not directly investigate the X-random situation, it is well known that the two assumptions converge as $n \to \infty$, and hence procedures optimal in one situation are also optimal in the other, asymptotically.

The performance of models (produced by modeling procedures) is thus naturally ordered in terms of their expected distances from the true model. The modeling task is hence to find a model that is as close as possible, in expectation, to the true one. Two equivalent models, say $M(\beta_1)$ and $M(\beta_2)$, denoted by $M(\beta_1) = M(\beta_2)$, have equal expected values of this distance.

2.3 OLS subset models and their ordering

An OLS subset model is one that uses a subset of the $k$ candidate variables and whose parameter vector is an OLS fit. When determining the best subset to use, it is common practice to generate a sequence of $k+1$ nested models $\{M_j\}$ with increasing numbers $j$ of variables. $M_0$ is the null model with no variables and $M_k$ is the full model with all variables included. The OLS estimate of model $M_j$'s parameter vector is $\hat\beta_{M_j} = (X_{M_j}' X_{M_j})^{-1} X_{M_j}' \mathbf{y}$, where $X_{M_j}$ is the $n \times j$ design matrix for model $M_j$. Let $P_{M_j} = X_{M_j} (X_{M_j}' X_{M_j})^{-1} X_{M_j}'$ be the orthogonal projection matrix from the original $k$-dimensional space onto the reduced $j$-dimensional space. Then $\hat{\mathbf{y}}_{M_j} = P_{M_j} \mathbf{y}$ is the OLS estimate of $\mathbf{y}^*_{M_j} = P_{M_j} \mathbf{y}^*$.
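These quantities are straightforward to compute; the following R sketch (ours, with illustrative names) produces the predictions of all $k+1$ nested models for a given variable ordering:

    # OLS predictions of the nested subset models M_0, ..., M_k.
    nested.fits <- function(X, y) {
      n <- nrow(X); k <- ncol(X)
      yhat <- matrix(0, n, k + 1)    # column j+1 holds the fit of M_j; M_0 is 0
      for (j in 1:k) {
        Xj <- X[, 1:j, drop = FALSE]
        Pj <- Xj %*% solve(crossprod(Xj), t(Xj))   # projection matrix P_{M_j}
        yhat[, j + 1] <- Pj %*% y
      }
      yhat
    }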

One way of determining subset models is to include the variables in a pre-defined order using prior knowledge about the modeling situation. For example, in time series analysis it usually makes good sense to give preference to closer points when selecting autoregressive terms, while when fitting polynomials, lower-degree terms are often included before higher-degree ones. When the variable sequence is pre-defined, a total of $k+1$ subset models are considered.

In the absence of prior ordering, a data-driven approach must be used to determine appropriate subsets. The final model could involve any subset of the variables. Of course, computing and evaluating all $2^k$ models rapidly becomes computationally infeasible as $k$ increases. Techniques that are used in practice include forward, backward, and stepwise ranking of variables based on partial-$F$ ratios (Thompson, 1978).

The difference between the prior-ordering and data-driven approaches affects the subset selection procedure. If the ordering of variables is pre-defined, subsets are determined independently of the data, which implies that the ratio between the residual sum of squares and the estimated variance can be assumed to be $F$-distributed. The subset selection criteria FPE, AIC, and $C_p$ all make this assumption. However, data-driven ordering complicates the situation. Candidate variables compete to enter and leave the model, causing competition bias (Miller, 1990). It is certainly possible to use FPE, AIC and $C_p$ in this situation, but they lack theoretical support, and in practice they perform worse than when the variable order is correctly pre-defined. For example, suppose underfitting is negligible and the number of redundant variables increases without bound. Then the predictive accuracy of the selected model and its expected number of redundant variables both tend to constant values when the variable order is pre-defined (Shibata, 1976), whereas in the data-driven scenario they both increase without bound.

Pre-defining the ordering makes use of prior knowledge of the underlying model. As is only to be expected, this will improve modeling if the information is basically correct, and hinder it otherwise. In practice, a combination of pre-defined and data-driven ordering is often used. For example, when certain variables are known to be relevant, they should definitely be kept in the model; also, it is common practice to always retain the constant term.

2.4 Asymptotics

We will be concerned with two asymptotic situations: n-asymptotics, where the number of observations increases without bound, and k-asymptotics, where the number of variables increases without bound. Usually k-asymptotics implies n-asymptotics; for example, our k-asymptotic conclusion implies, at least, $n - k \to \infty$ (see Section 3.6). In this section we review some n-asymptotic results. The remainder of the thesis is largely concerned with k-asymptotics, which justify the applicability of the proposed procedures to large datasets with many possible variables, i.e., the situation in which subset selection makes sense.

The model selection criteria FPE, AIC and $C_p$ are n-asymptotically equivalent (Shibata, 1981) in the sense that they depend on threshold values of an $F$-statistic that become the same (in this case, 2) as $n$ approaches infinity. With reasonably large sample sizes, the performances of different n-asymptotically equivalent criteria are hardly distinguishable, both theoretically and experimentally. When discussing asymptotic situations, we use AIC to represent all three criteria. Note that this definition of equivalence is less general than ours, as defined in Sections 1.2 and 2.2; ours will be used throughout the thesis.

Asymptoticallyspeaking,thechangein theresidualsumof squaresof a significant

variableis ����ER� , whereasthatof a redundantvariablehasa upperbound ����6'� , in

probability, andaupperbound�����t�"�~�n���kER� , almostsurely;see,e.g.,Shibata(1986),

Zhaoet al. (1986),andRao& Wu (1989). The modelestimatorgeneratedby a

thresholdfunctionboundedbetween����6'� and ����ER� is weaklyconsistentin terms

of modeldimensionality, whereasonewhosethresholdfunctionisboundedbetween�����n�"�~�t�"�~E�� and ����E�� is stronglyconsistent.

Somemodelselectioncriteriaare E -asymptoticallystronglyconsistent.Examples

includeBIC (Schwarz,1978), � (HannanandQuinn,1979),GIC (Zhaoetal.,1986),

andRaoandWu (1989). Theseall replaceAIC’s thresholdof ! by an increasing

functionof E boundedbetween�����t�"�~�n�"��ER� and ����E�� . Thefunctionvalueusually

21

Page 36: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

exceeds2 (unlessE is verysmall),giving athresholdthatis largerthanAIC’s. How-

ever, employing therateof convergencedoesnot guaranteea satisfactorymodelin

practice.For any finite dataset,a higherthresholdrunsa greaterrisk of discarding

a non-redundantvariablethat is only barelycontributive. CriteriasuchasAIC that

are E -asymptoticallyinconsistentdo not necessarilyperformworsethanconsistent

onesin finite samples.

TheseOLS subsetselectioncriteria all minimize a quantity that becomes,in the

senseof E -asymptoticequivalence,ltl F p F `�� lnl � yQ � K ?�� (2.3)

with respectto thedimensionalityparameter� , where? is thethresholdvalue.De-

notethe E -asymptoticequivalenceof proceduresormodelsby -�� (and # -asymptotic

equivalenceby - > ). We write (2.3) in parameterizedform as OLSC ��?�� , where

OLS -+� OLSC ��/�� , AIC -+� OLSC ��!"� and BIC -+� OLSC ���t�"�~E�� . The model se-

lectedby criterion(2.3)is denotedby S OLSCa@��d

. Sinceequivalentproceduresimply

equivalentestimatedmodels(andviceversa),wecanalsowrite S OLS -���S OLSCa � d ,S AIC -+��S OLSC

a � d and S BIC -+��S OLSCa;� �¢¡ � d .

2.5 Shrinkage methods

Shrinkagemethodsprovideanalternativeto OLS subsetselection.Ridgeregression

givesa biasedestimateof themodel’sparametervectorthatdependson a ridgepa-

rameter. IncreasingthisquantityshrinkstheOLS parameterstowardzero.Thismay

give betterpredictionsby reducingthevarianceof predictedvalues,thoughat the

costof a slight increasein bias. It oftenimprovestheperformanceof theOLS esti-

matewhensomeof thevariablesare(approximately)collinear. Experimentsshow

that ridge regressioncanoutperformOLS subsetselectionif mostvariableshave

small to moderateeffects (Tibshirani,1996). Although standardridge regression

doesnot reducemodeldimensionality, its lesserknown variantsdo (Miller, 1990).

22

Page 37: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Thenn-garrote(Breiman,1995)andlasso(Tibshirani,1996)procedureszerosome

parametersand shrink othersby defining linear inequality constraintson the pa-

rameters.Experimentsshow that thesemethodsoutperformridge regressionand

OLS subsetselectionwhenpredictorshavesmallto moderatenumbersof moderate-

sizedeffects, whereasOLS subsetselectionbasedon CD prevails over othersfor

smallnumbersof largeeffects(Tibshirani,1996).

All shrinkagemethodsrely onaparameter:theridgeparameterfor ridgeregression,

thegarroteparameterfor thenn-garrote,andthetuningparameterfor thelasso.In

eachcasetheparametervaluesignificantlyinfluencestheresult.However, thereis

no consensuson how to determinesuitablevalues,which may partly explain the

unstableperformanceof thesemethods.In Section3.4,we offer a new explanation

of shrinkagemethods.

2.6 Data resampling

Standardtechniquesof dataresampling,suchascross-validationandthebootstrap,

canbe appliedto the subsetselectionproblem. Theoreticalwork hasshown that,

despitetheir computationalexpense,thesemethodsperformno betterthantheOLS

subsetselectionprocedures.For example,underweak conditions,Shao(1993)

showsthatthemodelselectedby leave-£ -out cross-validation,denotedCV( £ ), is E -

asymptoticallyconsistentonly if £( yEf| 6 andE p £)| } asEf| } . Thissuggests

that, perhapssurprisingly, the training set in eachfold shouldbe chosento be as

small aspossible,if consistency is desired.Zhang(1993)further establishesthat

undersimilarconditions,CV ��£(�~- OLSC ����!yE p £(�j w��E p £(����E -asymptotically. This

meansthatAIC - CV ��6'� andBIC - CV ��E~���n�"�~E p !��� w���n�"�~E p 6'�j�¤E -asymptotically.

The behavior of the bootstrapfor subsetselectionis examinedby Shao(1996),

who provesthat if thebootstrapusingsamplesize ¥ , BS ��¥¦� , satisfies¥¦ uE§| / ,it is E -asymptoticallyequivalentto CV ��E p ¥¦� ; in particular, BS ��E���- CV �¨6'�©E -

asymptotically. Therefore,BS ��¥¦��- OLSC ����E K ¥¦�� y¥¦��E -asymptotically.

23

Page 38: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Oneproblemwith thesedataresamplingmethodsis the difficulty of choosingan

appropriatenumberof folds £ for cross-validation,or anappropriatesamplesize ¥for thebootstrap—bothaffect thecorrespondingprocedures.More fundamentally,

their asymptoticequivalenceto OLSC seemsto imply that thesedataresampling

techniquesfail to take the competitionphenomenoninto accountand henceare

unableto solve theproblemsraisedin Section2.1.

2.7 The RI C and CI C

Whenthereis no pre-definedorderingof variables,it is necessaryto take account

of the processby which a suitableorderingis determined.The expectedvalueof

the ª th largestsquared« -statisticof # noisy variablesapproaches!¬�n���­��#= yª¢� as #increasesindefinitely. Thispropertycanhelpwith variableselection.

Thesoft thresholdingproceduredevelopedin thecontext of wavelets(Donohoand

Johnstone,1994)andthe RIC for subsetselection(FosterandGeorge,1994)both

aim to eliminateall non-contributory variables,up to the largest,by replacingthe

threshold! in AIC with !¬�t�"�k# ; that is, RIC - OLSC ��!¬�n�"�k#=� . Themorevariables,

the higherthe threshold.Whenthe true hypothesisis the null hypothesis(that is,

thereareno contributive variables),or thecontributive variablesall have large ef-

fects,RIC findsthecorrectmodelby eliminatingall noisyvariablesupto thelargest.

However, whentherearesignificantvariableswith small to moderateeffects,these

canbeerroneouslyeliminatedby thehigherthresholdvalue.

TheCIC procedure(TibshiraniandKnight, 1997)adjuststhetrainingerrorby tak-

ing into accounttheaveragecovarianceof thepredictionsandresponses,basedon

thepermutationdistribution of thedataset.In anorthogonallydecomposedmodel

24

Page 39: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

space,thecriterionsimplifiesto

CIC �t�(�®- lnl F p F `�� ltl �TK ! E�Z¯ &° ±³² � « � a ±n´ > dnµ $Q � (2.4)¶ lnl F p F `�� ltl � K§· &° ±³² � �n������#w uª¢��$Q � 1 (2.5)

where «¨� a ±n´ > d is the ª th largestsquared« -statisticout of # , andE� is theexpectation

over the permutationdistribution. As this equationshows, CIC usesa threshold

valuethatis twice theexpectedsumof thesquared« statisticsof the � largestnoisy

variablesoutof # . Because�n¸t¹ >,º¼» ��½ «¨� a � ´ > d¿¾ ! E«¨� a � ´ > d�À -0/ , thismeansthat,for the

null hypothesis,even the largestnoisy variableis almostalwayseliminatedfrom

themodel. Furthermore,this hastheadvantageover RIC thatsmallercontributive

variablesaremorelikely to berecognizedandretained,dueto theuneven,stairwise

thresholdof CIC for eachindividualvariable.

Nevertheless,shortcomingsexist. For example,if mostvariableshavestrongeffects

andwill certainlynotbediscarded,theremainingnoisyvariablesaretreatedby CIC

asthoughthey werethesmallestout of # noisyvariables—whereasin reality, the

numbershouldbe reducedto reflect the smallernumberof noisy variables. An

overfittedmodelwill likely result.Analogously, underfittingwill occurwhenthere

are just a few contributive variables(Chapter6 givesan experimentalillustration

of this effect). CIC is basedon anexpectedorderingof thesquared« -statisticsfor

noisy variables,and doesnot dealproperly with situationswherevariableshave

mixedeffects.

2.8 Empirical Bayes

Although our work startedfrom the viewpoint of competingmodels,the solution

discussedin the later chaptersturnsout to be very similar to the methodologyof

empiricalBayes.In this section,we briefly review this paradigm,in anattemptto

provideasimpleandsolidbasisfor introducingtheideasin ourapproach.However,

25

Page 40: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Sections4.6and7.3show thatthecompetingmodelsviewpointseemsto fall beyond

thescopeof theexistingempiricalBayesmethods.

2.8.1 Empirical Bayes

The empirical Bayesmethodologyis due to Robbins(1951,1955,1964,1983).

After its appearance,it wasquickly consideredto beabreakthroughin thetheoryof

statisticaldecisionmaking(cf. Neyman,1962).Thereis alargeliteraturedevotedto

it. Booksthatprovide introductionsanddiscussionsincludeBerger(1985),Maritz

andLwin (1989),CarlinandLouis(1996),andLehmannandCasella(1998).Carlin

andLouis (2000)givea recentreview.

The fundamentalideaof empiricalBayes,encompassingthe compounddecision

problem(Robbins,1951)1, is to usedatato estimatethe prior distribution (and/or

theposteriordistribution) in Bayesiananalysis,insteadof resortingto a subjective

prior in theconventionalway. Despitetheterm“empiricalBayes”it is a frequentist

approach,andLindley notesin hisdiscussionof Copas(1969)that“thereis noone

lessBayesianthananempiricalBayesian.”

A typicalempiricalBayesprobleminvolvesobservations{ � 1323232Á1�{ > , independently

sampledfrom distributions ���{ ± �,Â��± � , where Â��± may be completelydifferent for

eachindividual ª . DenoteÃg-���{ � 1323232i1�{ > �¢Ä and Å � -���Â��� 132�232�1,Ây�> �¨Ä . Let Å be an

estimatorof Å � . Givena lossfunction,say,Æ ��ÅT1�Å � ��-mlnlÇÅ p Å � lnl � - >° ±È² � �� ± p  �± � � 1 (2.6)

theproblemis to find theoptimalestimatorthatminimizestheexpectedloss.

As we know, the usualmaximumlikelihoodestimatoris Å ML -ÉÃ , which is also

the Bayesestimatorwith respectto the non-informative constantprior. The em-

pirical Bayesestimator, however, is different. Since { � 1�23232�1j{ > are independent,1Throughoutthethesis,weadoptthewidely-usedtermempiricalBayes, althoughthetermcom-

pounddecisionprobablymakesmoresensein ourcontext of competingmodels.

26

Page 41: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

we canconsiderthat Ây���1323232Á1,Â��> arei.i.d. from a commonprior distribution ����Â��,� .Therefore,it is possibleto estimate���Ây��� from { � 1�23232�1j{ > basedontherelationship� > ��{¤� ¶\Ê ���{��, � �©Ë­���� � ��2 (2.7)

Fromthis relationship,it is possibleto find anestimator ����� � � of ��� � � from the

known � > ��{¤� and ����{��,Â���� (we will discusshow to do this in Chapter5). Once���� � � hasbeenfound,theremainingstepsareexactly like Bayesiananalysiswith

the subjective prior replacedby the estimated ����� � � , resulting in the empirical

Bayesestimator  EB

± ��ª�1�Ã��k-ÍÌuÎ Ây�,Ϭ��{ ± �,Â��,�©ËÐ����Ây���Ì Î Ï¿��{ ± �� � �©Ë¼���� � � 2 (2.8)

Note that the dataareusedtwice: oncefor the estimationof �������� andagainfor

updatingeachindividual  ML

±(or { ± ).

TheempiricalBayesestimatorcanbetheoreticallyjustifiedin theasymptoticsense.

TheBayesrisk of theestimatorconvergesto thetrueBayesrisk as #�| } , i.e.,as

if thetruedistribution ��� � � wereknown. Sinceno estimatorcanreducetheBayes

risk below the true Bayesrisk, it is asymptoticallyoptimal (Robbins,1951,1955,

1964).Clearly Å ML is inferior to Å EB in this sense.

Empirical Bayescan be categorizedin differentways. The bestknown division

is betweenparametric empirical Bayesand nonparametricempirical Bayes, de-

pendingon whetheror not the prior distribution assumesa parametricform—for

example,anormaldistribution.

Applicationsof empirical Bayesare surveyed by Morris (1983) and Maritz and

Lwin (1989).Themethodis oftenusedeitherto generatea prior distributionbased

on pastexperience(see,e.g.,Berger, 1985),or to solve compounddecisionprob-

lemswhena lossfunction in theform of (2.6) canbeassumed.In mostcases,the

parametricempiricalBayesapproachis adopted(see,e.g.,Robbins,1983;Maritz

andLwin, 1989),wherethemixing distribution hasa parametricinterpretation.To

27

Page 42: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

my knowledge,it is rarelyusedin thegeneralsettingof empiricalmodeling,which

is the main concernof the thesis. Further, the lessfrequentlyusednonparametric

empiricalBayesappearsmoreappropriatein our approach.

2.8.2 Steinestimation

Shrinkageestimationor Steinestimationis anotherfrequentisteffort, andhasclose

ties to parametricempiricalBayes. It wasoriginatedby Stein(1955),who found,

surprisingly, that theusualmaximumlikelihoodestimatorfor themeanof a multi-

variatenormaldistribution,i.e., Å ML -^Ã , is inadmissiblewhen #ÒÑg! . Heproposed

anew estimator Å JS -ÔÓ¨6 p ��# p !"��Qq�lnl ÃÕlnl � Ö Ã~1 (2.9)

which waslater shown by JamesandStein(1961) to have smallermeansquared

error thantheusualestimatorfor every Å � , when #8Ñ×! , althoughthenew estima-

tor is itself inadmissible.This new estimator, known astheJames-Steinestimator,

improvesovertheusualestimatorby shrinkingit towardzero.Subsequentdevelop-

mentof shrinkageestimationgoesfar beyondthis case,for example,positive-part

James-Steinestimator, shrinking toward the commonmean,toward a linear sub-

space,etc.(cf. LehmannandCasella,1998).

The connectionof the James-Steinestimatorto parametricempirical Bayeswas

establishedin the seminalpapersof Efron and Morris (1971,1972a,b,1973a,b,

1975,1977). Among other things, they showed that the James-Steinestimatoris

exactlytheparametricempiricalBayesestimatorif �������� is assumedto beanormal

distributionandtheunbiasedestimatorof theshrinkagefactoris used.

Althoughour work doesnot adopttheSteinestimationapproachandfocusesonly

onasymptoticresults,it is interestingto notethatwhile thegeneralempiricalBayes

estimatorachievesasymptoticoptimality, theJames-Steinestimator, andafew other

similarones,uniformallyoutperformtheusualML estimatorevenfor finite #��¨Ñ^!"� .28

Page 43: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Sofar theredoesnot seemto have beenfoundanestimatorwhich bothdominates

the usualML estimatorfor finite # and is # -asymptoticallyoptimal for arbitrary������� .The James-Steinestimatoris appliedto linear regressionby Sclove (1968) in the

form ØIb-Ù��6 p ��# p !"��Q �Ú�Û�Û � �I~1 (2.10)

where �I is theOLS estimateof theregressioncoefficientsandRSSis theregression

sumof squares.Note that whenRSSis large in comparisonwith the numerator��# p !"�¨QR� , shrinkagehasalmostno effect. Alternativederivationsof this estimator

aregivenby Copas(1983).

2.9 Summary

In this chapter, we reviewed the major issuesin fitting linear models,aswell as

proceduressuchasOLS, OLS subsetselection,shrinkage,asymptotics,x-fixedvs.x-

random,dataresampling,etc.Fromthediscussion,acommonthreadhasemerged:

eachprocedure’s performanceis closelyrelatedto thedistribution of theeffectsof

the variablesit involves. This considerationis analogousto the idea underlying

empiricalBayesmethodology, andwill beexploitedin laterchapters.

29

Page 44: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,
Page 45: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Chapter 3

Paceregression

3.1 Intr oduction

In Chapter2, we reviewedmajor issuesin fitting linearmodelsandintroducedthe

empiricalBayesmethodology. Fromthisbrief review, it becomesclearthatthema-

jor proceduresfor linearregressionall fail to solve thesystematicproblemsraised

in Section2.1 in any generalsense.It hasalsoemergedthateachprocedure’s per-

formancerelatescloselyto theproportionsof thedifferenteffectsof theindividual

variables. It seemsthat the essentialfeatureof any particularregressionproblem

is thedistribution of theeffectsof thevariablesit involves.This raisesthreeques-

tions:how to definethisdistribution;how to estimateit from thedata,if indeedthis

is possible;andhow to formulatesatisfactorygeneralregressionproceduresif the

distribution is known. Thischapteranswersthesequestions.

In Section3.2, we introducethe orthogonaldecompositionof models. In the re-

sultingmodelspace,theeffectsof variables,whichcorrespondto dimensionsin the

space,arestatisticallyindependent.Oncethe effectsof individual variableshave

beenteasedout in this way, thedistributionof theseeffectsis easilydefined.

Thesecondquestionaskswhetherwe canestimatethis distribution from thedata.

The answeris “yes.” Moreover, estimatorsexist which arestronglyconsistent,in

thesenseof # -asymptotics(andalso E -asymptotics).In fact,theestimationis sim-

31

Page 46: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

ply a clusteringproblem—tobe more precise,it involvesestimatingthe mixing

distributionof amixture.Section3.6showshow to performthis.

We answerthe third questionby demonstratingthreesuccessively morepowerful

techniques,andestablishoptimality in thesenseof # -asymptotics.First, following

conventionalideasof modelselection,thedistributionof theeffectsof thevariables

canbe usedto derive an optimal thresholdfor OLS subsetmodelselection. The

resultingestimatoris provably superiorto all existing OLS subsetselectiontech-

niquesthatarebasedon a thresholdingprocedure.Second,by showing that there

are limitations to the useof thresholdingvariationreduction,we develop an im-

provedselectionprocedurethatdoesnot involve thresholding.Third, abandoning

theuseof selectionentirelyresultsin anew adjustmenttechniquethatsubstantially

outperformsall otherprocedures—outperformingOLS evenwhenall variableshave

largeeffects.Section3.5 formally introducestheseprocedures,which areall based

on analyzingthedimensionalcontributionsof the estimatedmodelsintroducedin

Section3.3, anddiscussestheir properties.Section3.4 illustratesthe procedures

throughexamples.

Twooptimalities,correspondingto theminimumprior andposteriorexpectedlosses

respectively, aredefinedin Section3.4. The optimality of eachproposedestima-

tor is theoreticallyestablishedin thesenseof # -asymptotics,which further impliesE -asymptotics.Whenthenoisevarianceis unknown andreplacedby the OLS un-

biasedestimator $Qq� , the realizationof the optimality requires ��E p #=�| } ; see

Corollary3.12.1in Section3.6.

Sometechnicaldetailsarepostponedto theappendixin Section3.8.

3.2 Orthogonal decompositionof models

In this section,we discussthe orthogonaldecompositionof linear models,and

definesomespecialdistancesthat areof particularinterestin our modelingtask.

We will alsobriefly discussin Section3.2.3 the advantagesof usingan orthogo-

32

Page 47: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

nal modelspace,the OLS subsetmodels,andthe choiceof an orthonormalbasis.

We assumethatno variablesarecollinear—that is, a modelwith # variableshas #degreesof freedom.Wereturnto theproblemof collinearvariablesin Section4.3.

Following thenotationintroducedin Section2.2,givenamodel S ��IT� with param-

etervector I , its predictionvectoris F ` -HGJI whereG is the E�N�# designmatrix.

This vectoris locatedin thespacespannedby the # separateE -vectorsthat repre-

sentthevaluesof theindividual variables.For any orthonormalbasisof this spaceÜ � 1 Ü � 13232�2�1 Ü > (satisfying lnl Ü & ltl�-Ý6 ), let � � 1,� � 1323232Á1,� > bethecorrespondingprojec-

tion matricesfrom this spaceonto the axes. F ` decomposesinto # components� � F ` 1�� � F ` 13232�2Á1,� > F ` , eachbeinga projectionontoa differentaxis. Clearly the

wholeis thesumof theparts: F ` -ßÞ >& ² � � &iF ` .

3.2.1 Decomposingdistances

Thedistanceh ��S ��I � �i1�S ��I � ��� betweenmodelsS ��I � � and S ��I � � hasbeende-

finedin (2.2).AlthoughthismeasureinvolvesthenoisevarianceQ � for convenience

of bothanalysisandcomputation,it is lnl F `ba@c3o�d p F `ba@cirsd ltl � that is thecenterof in-

terest.

Givenanorthogonalbasis,thedistancebetweentwo modelscanbedecomposedas

follows: h ��S ��I � ��1�S ��I � �j��- >° & ² � h & ��S ��I � ��1�S ��I � �j�i1 (3.1)

where h & ��S ��I � �i1�S ��I � ���~-àlnl � &,F `ba@c3o�d¤p � &,F `ba@cir¢d lnl � yQ � (3.2)

is the � th dimensionaldistancebetweenthe models. The propertyof additivity

of distancein this orthogonalspacewill turn out to be crucial for our analysis:

the distancebetweenthe modelsis equalto the sumof the distancesbetweenthe

33

Page 48: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

models’projections.

Denoteby S�� thenull model S ��/"� , whoseevery parameteris zero.Thedistance

betweenS andthenull modelis theabsolutedistanceof S , denotedby á¦��Sâ� ;thatis, á¦��Sâ�~- h ��Sz1�S���� . Decomposingtheabsolutedistanceyields

á¦��Sâ�~- >° & ² � á & ��Sâ�i1 (3.3)

whereá & ��Sâ�~-ãlnl � &�F ` lnl �� yQR� is the � th dimensionalabsolutedistanceof S .

3.2.2 Decomposingthe estimation task

Two modelsareof centralinterestin theprocessof estimation:thetruemodel SU�andtheestimatedmodel S . In Section2.2,we definedthedistancebetweenthem

asthelossof theestimatedmodel,denotedbyÆ ��Sâ� ; thatis,

Æ ��Sâ�~- h ��Sz1�S � � .We also relatedthis loss function to the quadraticloss incurredwhenpredicting

futureobservations,asrequiredby theminimumexpectedlossprinciple.

Beingadistance,thelosscanbedecomposedinto dimensionalcomponentsÆ ��Sâ�~- >° & ² � Æ & ��Sâ�i1 (3.4)

whereÆ & ��Sâ�~- h & ��Sz1�SU�,�~-ãltl � &,F ` p � &ÁF `fe lnl �� yQR� is the � th dimensionalloss

of model S .

Orthogonaldecompositionbreakstheestimationtaskdown into individualestima-

tion tasksfor eachof the # dimensions.á & ��SU��� is theunderlyingabsolutedistance

in the � th dimension,and á & ��Sâ� is anestimateof it. Thelossincurredby this es-

timateisÆ & ��Sâ� . Thesumof the lossesin eachdimensionis thetotal loss

Æ ��Sâ�of themodel S . This reducesthemodelingtaskto estimatingá & ��S � � for all � .Oncetheseestimateddistanceshave beenfound for eachdimension�¦-ä6"1�23232�1,# ,theestimatedmodelcanbereconstructedfrom them.

34

Page 49: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Our estimationprocesshastwo steps:an initial estimateof á & ��SU�,� followedby

a furtherrefinementstage.This implies that thereis someinformationnot usedin

the initial estimatethatcanbeexploited to improve it; we usethe lossfunction to

guiderefinement.Thefirst stepis to find a relationshipbetweentheinitial estimateá & ��Sâ� and á & ��S � � . The classicOLS estimate åS , which hasparametervector�IJ-Ý��G � Gæ� � � G � F , providesabasisfor sucharelationship,becauseit is well known

thatfor all � , á & � åS�� areindependently, noncentrally��� distributedwith onedegree

of freedomandnoncentralityparameterá & ��SU���� "! (Schott,1997,p.390).Wewrite

this as á & � åSU��ç9� � � ��á & ��S � �� "!�� independentlyfor all �­2 (3.5)

WhenQR� is unknownandthereforereplacedby theunbiasedOLS estimate$QR� , the ���distribution in (3.5)becomesan � distribution: á & � åSU��çß���¨6"1�E p #è1�á & ��SU�,�j "!"�(asymptotically)independentlyfor all � . The � -distribution canbeaccuratelyap-

proximatedby (3.5)when E p #�é / .This relationshipformsacornerstoneof ourwork. In Section3.8.1wediscusshow

to re-generatethemodelfrom theupdatedabsolutedistances.

Updating signedprojections. An alternativeapproachto updatingabsolutedis-

tancesis to updatesignedprojections,wherethe signsof the projectionscanbe

determinedby the directionsof theaxis in an orthogonalspace.Both approaches

sharethesamemethodology, anduseaverysimilaranalysisandderivationto obtain

correspondingmodelingprocedures.

Our work throughoutthethesisfocuseson updatingabsolutedistances.It seemsto

us thatupdatingabsolutedistancesis generallymoreappropriatethanthealterna-

tive,althoughwe do not excludethepossibilityof usingthelatter in practice.Our

argumentsfor this preferencearegivenin Section4.7.

35

Page 50: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

3.2.3 Remarks

Advantagesof an orthogonal model space. Thereareat leasttwo advantages

to usingorthogonallydecomposedmodelsfor estimation.Thefirst is additivity: the

distancebetweentwo modelsis thesumof their distancesin eachdimension.This

convenientpropertyis inheritedby two specialdistances:theabsolutedistanceof

themodelitself andthelossfunctionof theestimatedmodel.Of course,any quan-

tity that involvesadditionandsubtractionof distancesbetweenmodelsis additive

too.

Thesecondadvantageis theindependenceof themodel’scomponentsin thediffer-

entdimensions.Thismakesthedimensionaldistancesbetweenmodelsindependent

too. In particular, theabsolutedistancesin eachdimension,andthelossesincurred

by an estimatedmodel in eachdimension—aswell asany measurederived from

theseby additiveoperators—areindependentbetweenonedimensionandanother.

Thesetwo featuresallow the processof estimatingthe overall underlyingmodel

to be broken down into estimatingits componentsin eachdimensionseparately.

Furthermore,the distribution of the effectsof variables—preciselydimensions—

canbeaccuratelydefined.

A specialcase: OL S subsetmodels. It is well known that OLS subsetmodels

area specialcaseof orthogonallydecomposedmodelsin which eachvariableis

associatedwith anorthogonaldimension.Thismakesit easyto simplify themodel

structure:discardingvariablesin acertainorderis thesameasdeletingdimensions

in thesameorder. If thediscardedvariablesareactuallyredundant,deletingthem

makesthemodelmoreaccurate.

Denotethe nestedsubsetmodelsof S , from the null model up to the full one,

by S��31�S � , 232�231�S > respectively. Denoteby F `�� the predictionvector of the� -dimensionalmodel S & , and by � `�� -ÔG `�� ��G �`�� G `�� � � � G �`�� the orthogo-

nal projectionmatrix from the spaceof # dimensionsto the � -dimensionalsub-

36

Page 51: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

spacecorrespondingto S & . Then F `�� - � `�� F ` and � & - � `�� p � `��¨ê o .Furthermore,the � th orthonormalbasecan be written as

Ü & - � &�F èltl � &�F lnlZ-� F `�� p F `��¨ê o �j èlnlÈ� F `�� p F `��¨ê o �'ltl .Choiceof an orthonormal basis. Thechoiceof anorthonormalbasisdepends

ontheparticularapplication.In somecases,theremayexist anobvious,reasonable

orthonormalbasis—forexample,orthogonalpolynomialsin polynomialregression,

or eigenvectorsin principal componentanalysis.In othersituationswhereno ob-

viousbasisexists, it is alwayspossibleto constructonefrom thegivendatausing,

say, a partial-� test(Section2.3). The latter, moregeneral,data-driven approach

is employed in our implementationandsimulationstudies(Chapter6). However,

wenotethatthepartial-� testneedsto observe F -valuesbeforemodelconstruction,

andhencecancausea problemof orthogonalizationselection(Section4.6),which

is ananalogousphenomenonto modelselection.

3.3 Contrib utions and expectedcontributions

Wenext exploreanew measurefor amodel:its contribution. An estimatedmodel’s

contribution is zerofor thenull modelandreachesa maximumwhenthemodelis

thesameastheunderlyingone.It canbedecomposedinto # independent,additive

componentsin # -dimensionalorthogonalspace—the“dimensionalcontributions.”

In practice,thesequantitiesarerandomvariables,andwe candefinebotha cumu-

lative expectedcontribution functionandits derivative of of the estimatedmodel.

Thesetwo functionscanbe estimatedfor any particularregressionproblem,and

will turn out to play key rolesin understandingthemodelingprocessandin build-

ing actualmodelsin practice.

37

Page 52: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

3.3.1 ëíìjî8ï and ðñëíìjîJïDefinition 3.1 Thecontribution of an estimateS of theunderlyingmodel SU� is

definedto be ò ��Sâ�k- Æ ��S��,� p Æ ��Sâ��2 (3.6)

The contribution is actually the differencebetweenthe lossesof two models: the

null andtheestimated.Therefore,its signandvalueindicatethemeritsof theesti-

matedmodelagainstthenull one:a positivecontribution meansthat theestimated

modelis betterthanthenull onein termsof predictive accuracy; a negative value

meansit is worse;andzeromeansthey areequivalent.

Since��S � � is aconstant,maximizingthe(expected)contribution

ò ��Sâ� is equiv-

alentto minimizing the(expected)lossÆ ��Sâ� , wherethelatter is our goalof mod-

eling. Therefore,wenow convertthisgoalinto maximizingthe(expected)contribu-

tion. Usingthecontribution, ratherthandealingdirectly with losses,is particularly

usefulin understandingthetaskof modelselection,aswill soonbecomeclear.

Givena # -dimensionalorthogonalbasis,thecontribution functiondecomposesinto# componentsthatretainthepropertiesof additivity anddimensionalindependence:ò ��Sâ��- >° & ² �ò & ��Sâ� (3.7)

where ò & ��Sâ�~- Æ & ��S���� p Æ & ��Sâ� (3.8)

is the � th dimensionalcontributionof themodel S .

In thesubsetselectiontask,eachdimensionis eitherretainedor discarded.It is clear

that this decisionshouldbe basedon the sign of the correspondingdimensional

contribution. If a dimension’s contribution is positive, retainingit will give better

38

Page 53: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

predictiveaccuracy thandiscardingit, andconversely, if thecontributionis negative

thendiscardingit will improveaccuracy. If thedimensionalcontribution is zero,it

makesno differenceto predictive accuracy whetherthat dimensionis retainedor

not.

In following we focus on a single dimension,the � th. The resultsof individual

dimensionscaneasilybe combinedbecausedimensionsare independentand the

contribution measureis additive. Focusingon the contribution in this dimension,

we write %'& � -óá & ��Sâ� and % �& � -ôá & ��S � � . Without lossof generality, assume

that % �& ¾ / . If theprojectionof F ` in the � th dimensionis in thesamedirection

as that of F ` e , then %'& is the positive squareroot of á & ��Sâ� ; otherwiseit is the

negativesquareroot. In eithercase,thecontributioncanbewrittenò & ��Sâ��- % �& � p � %'& p % �& � � 2 (3.9)

Clearly,

ò & ��Sâ� is zerowhen %'& is / or ! % �& . When %'& liesbetweenthesevalues,the

contribution is positive. For any othervaluesof %'& , it is negative. The maximum

contribution is achievedwhen %u& - % �& , andhasvalue % �& � , which occurswhen—in

this dimension—theestimatedmodelis the truemodel. This is obviously thebest

thattheestimatedmodelcando in this dimension.

In practice,however, only %'& � is available. Neither the valueof % �& � nor the di-

rectionalrelationshipbetweenthe two projectionsareknown. Denote

ò & ��Sâ� by�� %'& � � , alteringthenotionof thecontribution of S in this dimensionto thecon-

tribution of %'& � . ��� %'& � � is usedbelow asshorthandfor �� %'& � � % �& � 1�õ & ��- ò & ��Sâ� ,where õ & is thesignof %'& . In thefollowing we dropthesubscript� whenonly one

dimensionis underconsideration,giving % � and % � � for %'& � and % �& � respectively.

Wealsouse � for % � sinceit is this, ratherthan % , thatis available;likewiseweuse� � for % � � .We have arguedthat the performanceof an estimatedmodel can be analyzedin

termsof thevalueof its contribution in eachdimension.Unfortunately, this value

is unavailablein practice. What canbe computed,aswe will show, is the condi-

39

Page 54: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

tional expectedvalue ö~÷¬½Ç��������+�31�õ'�'l �13v À wherethe expectationis taken over all

theuncertaintiesconcerning�+� and õ . The“ v ” hererepresentstheprior knowledge

that is availableabout ��� and õ (if any). In our work presentedlater, the“ v ” is re-

placedby threequantities:��� , ����+�,� (distributionof ��� ), and �� plus �������,� . The# -asymptotic“availability” of ���� � � (i.e., stronglyconsistentestimators)ensures

our # -asymptoticconclusions.

Note that theexpectationis alwaysconditionalon � . The valueof � is available

in practice,for example,thedimensionalabsolutedistanceof the OLS full model.

Therefore,theexpectedcontribution is a functionof � , andwe simply denoteit by��������� (equivalently, ö~÷q������ ) asa generalrepresentationfor all possiblesitua-

tions.

3.3.2 øÍìjîJï and ùúì�îJïIn practice,theestimate� is a randomvariable. For example,accordingto (3.5),

theOLS estimateis ��9ç\� � ����� � �!"� . Analogouslyto thedefinitionsof CDF andpdf

of a randomvariable,we definethecumulativeexpectedcontribution functionand

its derivative of � , denotedby ��� and ������� respectively. For convenience,we

call ������ and ������� the - and � -functionsof � .

Definition 3.2 Thecumulativeexpectedcontribution functionof � is definedas ñ������- Ê ÷� ����«j�jϬ��«j�=Ë�«�1 (3.10)

where Ͽ����� is thepdfof � .

Further, ��������- Ë­ ñ�����Ëw� 2 (3.11)

40

Page 55: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Immediately, wehave ����������- �������Ϭ����� 2 (3.12)

Note that thedefinitionsof ������ and ������� , aswell astherelationship(3.12),en-

capsulateconditionalsituations.Forexample,with known �+� , wehave ����������+���~-��������� � �j "Ϭ������ � � , with known ���� � � , ���������©-×�������,���� "Ϭ����,��� , andwith

known �� plus ����� � � , ��������q���1,����-Í������q���1,��j "Ϭ����¤��û1,��� . Further, and �aremutuallyderivableunderthesamecondition.

3.3.3 ùúìRüîþýuîbÿuïNow wederiveanexpressionfor ���(���� given ��� , where �� is theOLS estimateof the

dimensionalabsolutedistance.Let ��ã-�$% � and % �û- K � � � , where $% is a signed

randomvariabledistributedon therealline accordingto thepdf ����$% � % �,� . Notethat

from the property(3.5), we have a specialsituationthat $% ç P � % �3136'� , and thus��0-à$% � ç9� � � � % � � "!"� , where

����$% � % � �k- 6� !�� � � ���� ê � e� rr 2 (3.13)

It is this specialcasethatmotivatestheutilization of thepdf of $% , insteadof thatof�� directly. This is becausethepdf of thenoncentralchi-squareddistributionhasan

infinite seriesform, which is inconvenientfor bothanalysisandcomputation.

Throughoutthethesis,weonly use����$% � % � � in theform (3.13),althoughthefollow-

ing derivationworksgenerallyonce����$% � % �,� is known.

With known ���,$% � % �,� , theCDF of �� given ��� is

��*������ � �~- Ê � ÷� � ÷ ����«�� � � � �=Ë=«�1 (3.14)

41

Page 56: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

hence Ï¿�*����� � ��- £*��*��������,�£ �� - ��� � ���� � � � � K ��� p � ���� � � � �! � �� 2 (3.15)

Using (3.9), rewrite the contribution of �� given �+� by a two-argumentfunction� ��$% � % �,� ���(�����- � ��$% � % � ��- % � � p ��$% p % � � � 2 (3.16)

Only the sign of $% can affect the value of the contribution, and so the expected

contributionof �� given ��� is

����� ����� � ��- � � � ��� � � � � ��� � ��)� � � � � K � � p � ���� � � � � ��� p � ��� � � � ���� � ��� � � � � K ��� p � ��� � � � � 2(3.17)

Using(3.12),

���*����� � ��- � � � ��� � � � � ��� � ��� � � � � K � � p � ���� � � � � ��� p � ���� � � � �! � �� 2 (3.18)

In particular, ����/=���+�,��-ä/ for every ��� (seeAppendix3.8.2). This givesthe fol-

lowing theorem.

Theorem 3.3 The ���*�����+�,� of theOLS estimate�� given �+� is determinedby(3.18),

while thepdf Ͽ�*�������,� is determinedby (3.15).

Because����������- �������� "Ͽ����� by (3.12),and Ϭ����� is alwayspositive, the value

of ������� hasthe samesign as ������� . Thereforethe sign of � canbe usedasa

criterion to determinewhethera dimensionshouldbe discardedor not. Within a

positiveinterval, where �������8ÑU/ , � is expectedto contribute positively to the

predictiveaccuracy, whereaswithin a negativeone, where ���������9/ , it will do the

opposite.At azeroof ������� theexpectedcontribution is zero.

42

Page 57: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

0 5 10 15 20

h(A ; 0.0)h(A ; 0.5)h(A ; 1.0)h(A ; 2.0)h(A ; 5.0)

-20

-15

-10

-5

0

5

0 5 10 15 20

EC(A ; 0.0)EC(A ; 0.5)EC(A ; 1.0)EC(A ; 2.0)EC(A ; 5.0)

(a) (b)

Figure3.1: ���(�����+��� and ������������� , for �+�©-H/=132;4�136"1�!(1�4 .Figure3.1(a)shows examplesof ���*����� � � where � � is /�1,/=2;4�136"1�! and 4 . All the

curvesstartfrom theorigin. When � � -9/ , thecurvefirst decreasesas �� increases,

and thengraduallyincreases,approachingthe horizontalaxis asymptoticallyand

never rising above it. As thevalueof ��� grows, the ���(�����+�,� curvegenerallyrises.

However, it always lies below the horizontalaxis until �+� becomes0.5. When� � Ñ^/=2;4 , thereis onepositiveinterval. A maximumis reachednot far from � � (the

maximumapproaches��� asthe latter increases),andthereaftereachcurve slowly

descendsto meettheaxisataround· ��� (theordinateapproachesthisvalueas�+� in-

creases).Thereaftertheinterval remainsnegative;within it, each� -functionreaches

aminimumandthenascendsto approachtheaxisasymptoticallyfrom below.

Most observationsin thelastparagraphareincludedin thefollowing theorem.For

convenience,denotethe zerosof � by� � , � � , ��� in increasingorder along the

horizontalaxis,andassumethat � is properlydefinedat } . Since �+� in practiceis

alwaysfinite, weonly considerthesituation� � �^} .

Theorem 3.4 Propertiesof ���(�����+��� .1. Every �����������,� hasthreezeros(twoof which maycoincide).

2. When ����� /=2;4 , � � - � � - / , ���§- } ; when �+�íÑ /�254 , � � - / ,/���� � �^} , ���Õ-ß} .

43

Page 58: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

3. �t¸n¹)÷ e ºú» ��� � p · �����~-9/ .4. ���*����� � �©Ñí/ for ��×[O��/�1�� � � and ���(������ � ���í/ for ��\[§��� � 1�}^� .5. When ���ßÑ /=254 , ���(��)������� has a unique maximum,at ������� , say. Then�t¸n¹)÷ e ºú» ��������� p ������-H/ .6. ��� ����� � � is continuousfor every �� and � � .

Theproof canbeeasilyestablishedusingformulae(3.13)–(3.18).Almost all these

propertiesareevident in Figure3.1(a).Thecritical value /=2;4 —thelargestvalueof�+� for which ���*���������� hasno positive interval—canbeobtainedby settingto zero

the first derivative of ���*�����+�,� with respectto �� at point ��ä-ô/ . The derivatives

around ��0-9/ canbeobtainedusingtheTaylor expansionof ��� ������ � � with respect

to� �� (seeAppendix3.8.2).

As notedearlier, the sign of � canbe usedasa criterion for subsetselection. In

Section3.4,Figure3.1(a)is interpretedfrom asubsetselectionperspective.

Figure3.1(b)shows theexpectedcontribution curves �����(�����+��� . The locationof

themaximumconvergesto �+� as �+��| } , andit convergesvery quickly—when� � -0! , thedifferenceis almostunobservable.

3.3.4 ù ìjîñý � ïWhenderiving the - and � -functions,wehaveassumedthattheunderlyingdimen-

sionalabsolutedistance�+� is given. However, ��� , is, in practice,unknown—only

thevalueof � is known. To allow the functionsto be calculated,we consider�+�to bearandomvariablewith adistribution ����+��� , adistributionwhich is estimable

from asampleof � . This is exactly theproblemof estimatingamixing distribution.

In our situation,the pdf of the correspondingmixture distribution hasthe general

form Ï¿�������~- Ê"!"# Ϭ������� � �=Ë­���� � �i1 (3.19)

44

Page 59: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

where ����[g]%$ , the nonnegative half real line. �������,� is the mixing distribution

function, Ͽ��������,� thepdf of thecomponentdistribution,and Ϭ�����,�� thepdf of the

mixturedistribution. Section3.6discusseshow to estimatethemixing distribution

from a sampleof � .

If themixing distribution ����+�,� is given,the � -functionof � canbeobtainedfrom

thefollowing theorem.

Theorem 3.5 Let ��� bedistributedaccording to ����+��� and � bea randomvari-

ablesampledfromthemixturedistributiondefinedin (3.19).Then������,���~- Ê !"# ��������� � �=Ë­����� � ��1 (3.20)

where �����������,� is the � -functiondeterminedby Ϭ���������,� .Theproof followseasilyfrom thedefinitionof � .No matterwhetherthe underlyingmixing distribution ����+��� is discreteor con-

tinuous,it canalwaysbe estimatedby a discreteonewhile reachingthe same# -asymptoticconclusions. Further, in practice,a discreteestimateof an arbitrary���� � � is often easierto obtainwithout sacrificingpredictionaccuracy, andeasier

to manipulatefor furthercomputation.Supposethecorrespondingpdf &q������� of a

discrete�������� is definedas

&q��7 �± �~-(' ± 1 where

)° ±³² � ' ± - 6"2 (3.21)

Then(3.19)canbere-writtenasϬ�����,���~- )° ±³² � ' ± Ϭ����,7 �± ��1 (3.22)

45

Page 60: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

and(3.20)as �������,���- )° ±³² � ' ± �������7 �± ��2 (3.23)

Althoughthegeneralforms(3.19)and(3.20)areadoptedin thefollowing analysis,

it is (3.22)and(3.23)thatareusedin practicalcomputations.Notethat if �������,��and Ϭ�����,��� areknown, the expectedcontribution of � given ��������� is given by

(3.12)as ������,���� �Ϭ�����,�� .Sinceall �� & ’s of an OLS estimatedmodelwhich is decomposedin an orthogonal

spacearestatisticallyindependent(see(3.5)), they form a samplefrom the mix-

turedistribution with a discretemixing distribution ���� � � . Thepdf of themixture

distribution takesthe form (3.22), while the pdf of the componentdistribution is

provided by (3.15). Likewise the � -function of the mixture hasthe form (3.23),

while thecomponent� -functionis givenby (3.18).

Fromnow on,themixing distribution ���� � � becomesourmajorconcern.If thetwo

functions Ϭ������+�,� and ������������ arewell defined(asthey arein OLS estimation),����+�,� uniquelydeterminesϬ�����,��� and ��������� by (3.19)and(3.20)respectively.

The following sectionsanalyzethe modelingprocesswith known ����+��� , show

how to build thebestmodelwith known �������� , andfinally tacklethequestionof

estimating����+�,� .3.4 The role of * +-,/. and 01+2,/. in modeling

The - and � -functionsilluminateour understandingof themodelingprocess,and

helpwith building modelstoo. Hereweusethemto illustrateissuesassociatedwith

theOLS subsetselectionproceduresdescribedin Section2.1,andto elucidatenew,

hitherto unreported,phenomena.While we concentrateon OLS subsetselection

criteria,we touchonshrinkagemethodstoo.

We alsoillustratewith examplesthebasisfor thenew proceduresthatareformally

46

Page 61: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

definedin thenext section.Not only doessubsetselectionby thresholdingvariation

reductionseverelyrestrictthemodelingspace,but thevery ideaof subsetselection

is a limited one—whena wider modelingspaceis considered,betterestimators

emerge. Consequentlywe expandour horizonfrom subsetselectionto thegeneral

modelingproblem,producinga final model that is not a least-squaresfit at all.

This improveson OLS modelingeven whenthe projectionson all dimensionsare

significant.

Themethodologywe adoptis suggestedby contrastingthemodel-basedselection

problemthat we have studiedso far with the “dimension-basedselection”that is

usedin principal componentanalysis. Dimension-basedselectiontestseachor-

thogonaldimensionindependentlyfor elimination,whereasmodel-basedselection

analyzesasetof orthogonalnestedmodelsin sequence(asdiscussedin Section2.3,

the sequencemay be defineda priori or computedfrom the data). In dimension-

basedselection,deletinga dimensionremovesthetransformedvariableassociated

with it, andalthoughthis reducesthenumberof dimensions,it doesnotnecessarily

reducethenumberof original variables.

Chapter2 showedthatall theOLS subsetselectioncriteriasharetheideaof thresh-

olding the amountby which the variationis reducedin eachdimension.CIC sets

differentthresholdvaluesfor eachindividualdimensiondependingonits rankin the

orderingof all dimensionsin the model,whereasothermethodsusefixed thresh-

olds.While straightforwardfor dimension-basedselection,this needssomeadjust-

mentin model-basedselectionbecausethevariationreductionsof thenestedmodels

maynotbein thedesireddecreasingorder. Thenecessaryadjustment,if a few vari-

ablesaretested,is to comparethereductionsof thevariationby thesevariableswith

thesumof theircorrespondingthresholdvalues.Thecentralidearemainsthesame.

Therefore,thekey issuein OLS subsetselectionis thechoiceof threshold.We de-

notetheseschemesby OLSC ��?�� , where? is thethreshold(see(2.3)). Theoptimum

valueof ? is denotedby ?w� . We now considerhow ?­� canbedeterminedusingthe -function,assumingthatall dimensionshave thesameknown . We begin with

dimension-basedselectionandtacklethenestedmodelsituationlater.

47

Page 62: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

As discussedearlier, our modelinggoal is to minimizetheexpectedloss. Herewe

considertwo expectedlosses,theprior expectedloss ö ` ö ` ½ Æ ��Sâ� À andthepos-

terior expectedloss ö ` ½ Æ ��Sâ� À —or equivalentlyin our context ö ÷ ö~÷~½ Æ ��Sâ� À andö~÷¿½ Æ ��Sâ� À respectively—whichcorrespondto the situationsbeforeandafter see-

ing the data. Note that by � herewe meanall � ’s of the model S . We call the

correspondingoptimalitiesprior optimality andposterioroptimality. In practice,

posterioroptimality is usuallymoreappropriateto useafter the datais observed.

However, prior optimality, without taking specificobservationsinto account,pro-

videsa generalpictureaboutthemodelingproblemunderinvestigation.More im-

portantly, both optimalitiesconverge to eachotherasthe numberof observations

approachesinfinity, i.e., # -asymptoticallyhere.In thefollowing theorem,we relate? � to prior optimality, while in the next sectionwe seehow the estimatorsof two

optimalitiesconvergeto eachother.

Theorem 3.6 Givenan orthogonaldecompositionof a modelspaceV > , let SU�ú[V > beanyunderlyingmodeland S [�V > beits estimate. Assumethat all dimen-

sionalabsolutedistancesof S havethesamedistribution function ������� andthus

thesame -function.Rankthe á & ��Sâ� ’s in decreasingorder as � increases.Then

the estimator S OLSCa@� e d

of SU� possessesprior optimality amongall S OLSCa@��d

if

andonly if ? � -4365��k¹¸87÷:9(� ������i2 (3.24)

Proof. For dimension� , weknow from (3.8) thatÆ & ��Sâ�~- Æ & ��S���� p ò & ��Sâ�i2OLSC ��?�� discardsdimension� if á & ��Sâ�;�í? , soò & ��S OLSC

a@��d ��- <=?> / if á & ��Sâ���í?ò & ��Sâ� if á & ��Sâ�ÕÑM? ,48

Page 63: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

where�0-gá & ��Sâ� is a randomvariablewhich is distributedaccordingto theCDF������ . Thus Ê !@# ö~÷~½ Æ & ��S OLSCa;��d � À Ë­�������- Ê ! # ö~÷¿½ Æ & ��S���� À Ëw������� p Ê »� ö~÷¬½ ò & ��Sâ� À Ë­������- Æ & ��S���� p ���}^� K ���?���2

Takingadvantageof additivity anddimensionalindependence,sumtheaboveequa-

tion overall # dimensions:ö ÷ ö~÷¿½ Æ ��S OLSCa@��d � À - Æ ��S��,� p #= ñ��}^� K #= ���?­�i2 (3.25) ���?�� is theonly termon theright-handsideof (3.25)thatvarieswith ? . Thusmin-

imizing the prior expectedlossis equivalentto minimizing ���?�� , andvice versa.

Thiscompletestheproof. ATheorem3.6 requiresthedimensionalabsolutedistancesof the initially estimated

model—e.g.,theOLS full model—tobesortedinto decreasingorder. This is easily

accomplishedin the dimension-basedsituation,but not in the model-basedsitu-

ation. However, the nestedmodelsareusually invariably generatedin a way that

attemptsto establishsuchanorder. If so,thisconditionis approximatelysatisfiedin

practiceandthus S OLSCa;� e d

is agoodapproximationto theminimumprior expected

lossestimatorevenin themodel-basedsituation.

FromTheorem3.6,wehave

Corollary 3.6.1 Propertiesof ?­� .1. ?w�Õ-B3C5���¹¸87EDGFIH�D@J8K­ ��L��� , where W�� ± _ is thesetof zerosof � .2. If ? � Ñ^/ , ����? � p �©Ñ^/ ; if ? � �í} , ����? � K ���^/ .3. ���? � �;�g/ and ���}^� p ñ��? � � ¾ / .

49

Page 64: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

4. If thereexists � such that ñ��}^� p ñ�����©Ñ / , then ?­���í} .

Properties1 and2 show that theoptimum ?­� mustbea zeroof � —moreover, one

that separatesa negative interval to the left from a positive interval to the right

(unless?w��-Ù/ or } ). Properties3 and4 narrow thesetof zerosthat includesthe

optimalvalue ?­� , andthushelpto establishwhich oneis theoptimum.

Four examplesfollow. The first two illustrateTheorem3.6 in a dimension-based

situationin which eachdimensionis processedindividually. In the first example,

eachdimension’sunderlying��� is known—equivalently, its ������������� is known. In

the second,the underlyingvalueof eachdimensionalabsolutedistanceis chosen

from two possibilities,andonly themixing distributionof thesetwo valuesandthe

corresponding� -functionsareknown.

Thelast two examplesintroducetheideasfor building modelsthatwe will explore

in thenext section.

Throughouttheseexamples,noticethatthedimensionalcontributionsareonly ever

usedin expected-valueform, andthe component� -function is the OLS ��� ����� � � .Further, we alwaystake the # -asymptoticviewpoint, implying that thedistribution���� � � canbeassumedknown.

Example3.1 Subsetselectionfrom a single mixture. Considerthe function���*������ � � illustratedin Figure3.1(a).Wesupposethatall dimensionshavethesame��� ������ � � .Noisydimensions,andoneswhoseeffectsareundetectable. A noisydimension,for

which ��������/�� is alwaysnegative,will beeliminatedfrom themodelnomatterhow

large its absolutedistance �� . Since �n¸n¹÷ e º �è���(��)���+�,��-����*���,/"� , non-redundant

dimensionsbehave more like noisy onesas their underlyingeffect decreases—in

otherwords,their contribution eventuallybecomesundetectable.When � � �×/=2;4 ,any contribution is completelyoverwhelmedby thenoise,andno subsetselection

procedurecandetectit.

50

Page 65: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Dimensionswith small vs. large effects. Whenthe estimateresidesin a negative

interval of � , its contribution is negative. All � s, no matterhow large their �+� ,have at leastonenegative interval � · �+��1�}^� . This invalidatesall subsetselection

schemesthateliminatedimensionsbasedon thresholdingtheir variationreductions

with a fixed threshold,becausea large estimate �� doesnot necessarilymeanthat

the correspondingvariableis contributive—its contribution also dependson � � .The reasonthat threshold-typeselectionworks at all is that the estimate �� in a

dimensionwhoseeffect is largeis lesslikely to fall into anegativeinterval thanone

whoseeffect is small.

The OLSC ��?�� criterion. The OLS subsetselectioncriterion OLSC ��?�� eliminates

dimensionswhoseOLS estimatefalls below thethreshold? , where ?b-Ù! for AIC,�n���~E for BIC, !¬�n�"�k# for RIC, andtheoptimalvalueis ? � asdefinedin (3.24).Since

dimensionsshouldbediscardedbasedonthesignof theirexpectedcontribution,we

considerthreecases:dimensionswith zeroandsmalleffects,thosewith moderate

effects,andthosewith largeeffects.

Whena dimensionis redundant,i.e. ���b-®/ , it shouldalways be discardedno

matterhow largetheestimate �� . Thiscanonly bedoneby OLSC ��?w��� , with ?w�©-0}in this case. Whenever ?M��} , dimensionswhose �� exceeds? arekept inside

the model: thusa certainproportionof redundantvariablesareincludedin thefi-

nal model. Dimensionswith small effectsbehave similarly to noisyones,andthe

thresholdvalue ?w�Õ-ß} is still best—whichresultsin thenull model.

Supposethatdimensionshave moderateeffects.As thevalueof �+� increasesfrom

zero, the valueof the cumulative expectedcontribution ��� will at somepoint

changesign.At thispoint, themodelfoundby OLSC ��?�� , whichheretoforehasbeen

betterthan the full model,becomesworsethan it. Hencethereis a valueof � �for which the predictive ability of S OLSC

a;�,dis the sameasthat of the full model.

Furthermore,thereexists a valueof ��� at which the predictive ability of the null

model is the sameas that of the full model. In thesecases,model S OLSCa@� e d

is

either the null modelor the full one,since ?w� is either / or } dependingon the

51

Page 66: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

valueof ��� .Wheneachdimensionhasa largeeffect—largeenoughthatthepositionof thesec-

ondzeroof ���(��)��� � � is at least ? —any OLSC ��?­� with fixed ?§ÑÙ/ will inevitably

eliminatecontributive dimensions.This meansthat the full model is a betterone

than S OLSCa@��d

. Furthermore,OLSC ��?w��� with ?­�¦-U/ will alwayschoosethe full

model,which is theoptimalmodelfor every S OLSCa@��d

.

Shrinkagemethodsin orthogonalspace. In orthogonalregression,when G � G is a

diagonalmatrix,contribution functionshelpexplain why shrinkagemethodswork.

Thesemethodsshrink theparametervaluesof OLS modelsandusesmallervalues

thantheOLS estimates.Thismayor maynotchangethesignsof theOLS estimated

parameters;however, for orthogonalregressions,the signsof the parametersare

left unchanged.In this situation,therefore,shrinkingparametersis tantamountto

shrinkingthe �� ’s. Ridgeregressionshrinksall theparameterswhile NN-GAROTTE

andLASSO shrinkthelargerparametersandzerothesmallerones.

When ��� is small, it is possibleto choosea shrinkageparameterthat will shrink�� ’s that lie between� � and · � � to around � � , andshrink the negative contribu-

tionsoutside· � � to becomepositively contributive—despitethefactthat �� around

the maximumpoint �+� are shrunkto smallervalues. This may give the result-

ing model lower predictive error thanany modelselectedby OLSC ��?�� , including?^-�?w� . Zeroing the smaller �� ’s by NN-GAROTTE and LASSO doesnot guaran-

teebetterpredictive accuracy thanridgeregression,for thesedimensionsmight be

contributive. When �+� is large,shrinkagemethodsperformbadlybecausethedis-

tributionof �� tendsto besharperaround��� . This is why OLS subsetselectionoften

doesbetterin this situation.

Example3.2 Subsetselectionfroma doublemixture. Suppose���������- # �# ���(��)�,7 � ��� K # �# ���*����,7 �� �i1 (3.26)

where #\- # � K # � . For # � dimensionsthe underlying �+� is 7T�� , while for the

52

Page 67: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

-0.14

-0.12

-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0 5 10 15 20-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0 5 10 15 20

(a) (b)

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

0 5 10 15 20

5:95

50:50

95:5

-0.03

-0.02

-0.01

0

0.01

0.02

0.03

0.04

0.05

0 5 10 15 20

(c) (d)

Figure3.2: ���*����- > o> ���(���,7¿�� � K > r> ���*����7¿�� � . (a). 7¿�� -�/=1�7¿�� -ON=1�# � Y©# � -:"/æY¿!y/ . (b). 7¿���- /=1,7T�� -PN=1�# � Y¿# � -RQ"4æYT!"4 . (c). 7¿��- /�1,7¿�� -ó!�/ , # � YT# �arerespectively 4bYTS�4�1�4�/fY�4�/=1�S"4fY�4 . (d). 7 � � and 7 �� arecarefullysetwith fixed# � Y�# � -BS�4)Y(4 suchthat ��L���i�~-0/ .remaining # � dimensionsit is 7 �� . �� is an observation sampledfrom the mixture

distribution Ϭ�*�����- > o> Ϭ�*����,7¿���� K > r> Ϭ�(����,7¿�� � . Altering the valuesof # � , 7T�� , # �and 7 �� yieldsthedifferent � s illustratedin Figure3.2. Theoptimalthreshold? � of

OLSC ��?�� is discussed.

In Figure3.2(a),where 7¿��Ð- / , 7¿�� -/N and # � Y�# � - :"/ Y�!�/ , no positive interval

existsdespitethefact that thereare !�/ non-redundantdimensions.This is because

the effect of all the non-redundantdimensionsis overwhelmedby the noisyones.

In principle,no matterhow largeits effect, any dimensioncanbeoverwhelmedby

a sufficient numberof noisyones. In this case?­�- } andOLSC ��?­��� selectsthe

null model.

53

Page 68: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

In Figure3.2(b),which is obtainedfrom theprevioussituationby altering # � Y�# �to Q"4bY�!"4 , thereis a positive interval. But ñ��}^� is theminimum of all zeros,so

that ?w� remains} and the modelchosenby OLSC ��?w��� is still the null one. The

contributivedimensionsarestill submergedby thenoisyones.

If a finite thresholdvalue ?­� exists,it mustsatisfy ���}^� p ���?­���©Ñ^/ (Property4

of Corollary 3.6.1). It can take on any nonnegative value by adjustingthe four

parameters# � , 7¿�� , # � and 7¿�� . Figure3.2(c)shows threefunctions � , obtainedby

setting 7 � � and 7 �� to / and !�/ respectively andmakingtheratio between# � and # �4)Y"S�4 , 4�/ZY(4�/ , and S�4ZY*4 . As thesecurvesshow, thecorrespondingvaluesof ?w� are

about ! , · and : .In Figure3.2(d), # � Y(# � -BS�4ZY�4 and 7 � � and 7 �� aresetto make ��L������-9/ , where�U� is the third zeroof � . In this case,therearetwo possibilitiesfor ?­� : theorigin� � , and ��� . �U� givesa simplermodel. Notice that the numberof parametersof

thetwo modelsis in theapproximateratio 4¦Y�6 /�/ . However, thebalanceis easily

broken—forexample,if 7¿�� increasesslightly, thenthereis a singlevalue0 for ?w� .Althoughthelargermodelhasslightly smallerpredictiveerrorthanthesmallerone,

it is muchmorecomplex. Hereis a situationwherea far moresuccinctmodelcan

beobtainedwith asmallsacrificein predictiveaccuracy.

Example3.3 Subsetselectionbasedon thesignof � . AlthoughOLSC ��? � � is op-

timal amongevery OLSC ��?�� , it haslimitations. It alwaysdeletesdimensionswhose�� ’s fall below the threshold,andretainsthe remainingones. Thus it may delete

dimensionsthatlie in thepositive intervalsof � , andretainonesin thenegative in-

tervals.Weknow thatthesignof � is thesignof theexpectedcontribution,andthe

selectioncanbe improvedby usingthis fact: we simply retaindimensionswhose���*���� is positiveanddiscardtheremainder.

In the next sectionwe formalize this idea and prove that its predictive accuracy

always improvesupon OLSC ��? � � . Becausethe positive andnegative intervals of� canlie anywherealongthe half real line, this proceduremay retaindimensions

with smaller �� anddiscardoneswith larger �� . Or it maydeleteadimensionwhose

54

Page 69: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

�� lies betweenthoseof the otherdimensions.For example,in Figure3.2(b), the

new procedurewill keepall thedimensionswhose �� ’s lie within thesmallpositive

interval, despitethefactthatOLSC ��?w��� choosesthenull model.

Example3.4 General modeling. Sofar we have only discussedsubsetselection,

wheretheestimate �� is eitheralteredto zeroor remainsthesame.A naturalques-

tion is whetherbetterresultsmight be obtainedby relaxing this constraint—and

indeedthey canbe. For example,in Example3.1,whereall dimensionsarenoisy,

OLSC ��? � � deletesthemall andchoosesthenull model. This in effect replacesthe

estimate �� by theunderlying ��� , which is / in this case.Similarly, whenall esti-

mates �� have thesameunderlying ��� (which is non-zero),andall �� ’s areupdated

to ��� , the estimatedmodel improvessignificantly—even thoughno dimensionis

redundant.

A similar thing happenswhen � is sampledfrom a mixture distribution. In Fig-

ure 3.2(a),the � -function of the mixture hasno positive interval, although20 out

of 100dimensionshave ���¼-VN . Thebestthatsubsetselectioncando is to discard

all dimensions.However, a dimensionwith ��\- !�/ —despite���*���� beinglessthan

0—is unlikely to be a redundantone; it is more likely to belongto the groupfor

which � � -WN . Altering its OLS estimate �� from 20 to 3 is likely to convert the

dimensioninto acontributiveone.

Thenext sectionformulatesa formalestimationprocedurebasedon this idea.

3.5 Modeling with known XY+2,/ZE.We now assumethat the underlying �������� is known and definea group of six

procedures,collectively calledpaceregression, which build modelsby adjusting

the orthogonalprojectionsbasedon estimationsof theexpecteddimensionalcon-

tributions.They aredenotedPACE � 1 PACE � 13232�2�1 PACE A , andthemodelproducedby

PACE

±is written S PACEJ . Wewill discussin thenext sectionhow to estimate����� � � .

55

Page 70: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

����+�,� is the distribution function of a randomvariable ��� that representsthe di-

mensionalabsolutedistanceof theunderlyingmodel SU� . Thesedistances,denoted

by �+���1323232Á1����> , form asampleof size # from ��������� . Whenthemixing distribution����+�,� is known,(3.19)canbeusedto obtainthemixturepdf Ϭ����,��� fromthecom-

ponentdistribution pdf Ͽ������ � � . Fromthis, alongwith thecomponent� -function��������� � � , thefunctions ������,��� and �����,��� canbefoundfrom (3.20)and(3.12)

respectively. As Theorem3.3 shows, OLS estimationprovidestheexpressionsfor

both Ϭ����������� and ��������+��� .Thepaceproceduresconsistof two steps.Thefirst, which is thesamefor all pro-

cedures,generatesan initial model S . We alwaysusethe OLS full model åS for

this, althoughany initial modelcanbeusedso long asthecomponentdistribution

andthecomponent� -functionareavailable. DecomposingåS in a givenorthogo-

nalspaceyieldsthemodel’sdimensionalabsolutedistances,say �� � 1�23232�1q�� > . These

arein fact a samplefrom a mixture distribution ��*��û�,��� with known component

distribution �� ��û��� � � andmixing distribution ���� � � . In thesecondstep,thefinal

modelis generatedfrom either(3.37)or (3.38),where

Ø� � 1323232Á1 Ø� > areobtainedby

updating �� � 13232�2�1q�� > . Thenew proceduresdiffer in how theupdatingis done.

To characterizetheresultingperformance,wedefineaclassof estimatorsandshow

that paceestimatorsare optimal within this class,or a subclassof it. The class

of estimators[ > (where # is thenumberof orthogonaldimensions)is asfollows:

giventheinitial model åS andits absolutedistancesin anorthogonaldecomposed

modelspace,every memberof [ > is anupdatingof åS by (3.37)andviceversa,

wherethe updatingis entirely dependenton the set of absolutedistancesof åS .

Clearly, S OLSCa@��d [\[ > for any ? .

Variouscorollariesbelow establishthat thepaceestimatorsarebetterthanothers.

Eachestimator’s optimality is establishedby a theoremthat appliesto a specific

subclass,andproving eachcorollary reducesto exhibiting an examplewherethe

performanceis actuallybetter. Theseexamplesareeasilyobtainablefrom the il-

lustrationsin Section3.4. When we say that one estimatoris “better than” (or

“equivalentto”) another, wemeanin thesenseof posteriorexpectedloss.With one

56

Page 71: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

exception,betterestimatorshave lowerprior expectedloss(theexception,asnoted

below, is Corollary3.8.1,which comparesPACE � andPACE � ). Referto Section3.4

for definitionsof prior andposteriorexpectedlosses.

Of thesix procedures,PACE � andPACE � performmodel-basedselection,thatis, they

selecta subsetmodelfrom a sequence.ProceduresPACE � andPACE C addressthe

dimension-basedsituation,whereeachorthogonaldimensionis testedandselected

(or not) individually. If theseproceduresare usedfor a sequenceof orthogonal

nestedmodels,the resultingmodelmaynot belongto thesequence.The last two

procedures,PACE B and PACE A , arenot selectionprocedures.Instead,they update

theabsolutedistancesof theestimatedmodelto valueschosenappropriatelyfrom

thenonnegativehalf realline.

Theorem3.6showsthatthereis anoptimalthresholdfor threshold-typeOLS subset

selectionthatminimizestheprior expectedloss.Weusethis ideafor nestedmodels.

Procedure 1 � PACE � � . Given a sequenceof orthogonal nestedmodels,let ? -365j�k¹�¸�7 DGF]H�D"J8K ������ , where W�� ± _ is thesetof zerosof � . Outputthemodelin the

sequenceselectedby OLSC ��?�� .Accordingto Corollary3.6.1,Procedure1 findstheoptimalthreshold?­� . Therefore

PACE � - OLSC ��?­�,� (and S PACEo -0S OLSC

a@� e d). Sincethesequenceof dimensional

absolutedistancesof themodel åS is notnecessarilyalwaysdecreasing,S OLSCa@� e d

is not guaranteedto be the minimum prior expectedlossestimator. However, in

practicethesedistancesdo generallydecrease,andsoin thenestedmodelsituationS OLSCa@� e d

is anexcellentapproximationto theminimumprior expectedlossestima-

tor. In thissense,PACE � is superiorto otherOLS subsetselectionprocedures—OLS,

AIC, BIC, RIC, andCIC—for thesedonotusetheoptimalthreshold?­� . In particular,

theprocedureCIC usesa thresholdthatdependson thenumberof variablesin the

subsetmodelaswell as the total numberof variables. The relative performance

of theseproceduresdependson the particularexperimentsusedfor comparison,

sincedatasetsexist for which any selectioncriterion’s thresholdcoincideswith the

optimalvalue ? � .57

Page 72: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Insteadof approximatingthe optimum as PACE � does,PACE � always selectsthe

optimalmodelfrom a sequenceof nestedmodels,whereoptimality is in thesense

of posteriorexpectedloss.

Procedure 2 � PACE � � . Amonga sequenceof orthogonalnestedmodels,outputthe

onewhich hasthelargestvalueof Þ & ±³² � ���(�� ± �j "Ϭ�*�� ± � for �)-�6"1�!(1323232i1�# .Theorem 3.7 Given �������,� , S PACE

rhasthe smallestposteriorexpectedlossin a

subclassof [ > in which each estimatorcan only selectfrom the sequenceof or-

thogonalnestedmodelsthat is provided.

Proof. Let S & betheselectedmodel,then

Ø� ± - / if ªúÑg� and

Ø� ± - �� ± if ª��0� .Fromthedefinitionof dimensionalcontribution (3.8),wehaveö H ÷EJ�K ½ Æ ��S & � À - Æ ��S���� p &° ±È² � �����*�� ± ��2 (3.27)

Because��� �� ± ��-ä��� �� ± �� �Ϭ� �� ± � by (3.12),minimizing ö H ÷EJ�K ½ Æ ��S & � À is equiva-

lent to maximizing Þ & ±³² � ���*�� ± �� �Ϭ�*�� ± � with respectto � . This completestheproof.ACorollary 3.7.1 Given����� � � andasequenceof orthogonalnestedmodels,S PACE

ris a betterestimatorthan S OLSC

a@��dfor any ? . This includesS PACE

o, S OLS, S AIC,S BIC, S RIC and S CIC.

Since, accordingto Section2.6, the cross-validation subsetselectionprocedure

CV ��£���-+� OLSC ����!uE p £��j w��E p £��j� andthe bootstrapsubsetselectionprocedure

BS ��¥¦��-+� OLSC �j��E K ¥¦�j y¥¦� , wehave

Corollary 3.7.2 Given ���� � � and a sequenceof orthogonal nestedmodels, E -

asymptoticallyS PACEr

is a better estimatorthan S CVa_^�d

for any £ and S BSa ) d

for any ¥ .

58

Page 73: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

In fact,thedifferencebetweenthemodelsgeneratedby PACE � andPACE � is small,

becausewehave

Corollary 3.7.3 Given �������� , if theelementsof W��� & �¢�Z-×6"13232�2�1�# _ arein decreas-

ing order as � increases,S PACEo - > S PACE

ra.s.

Proof. Since W~�� & �s�^- 6"132�232�1�# _ are in decreasingorderas � increases,S PACEr

is in effect the model that minimizes Ì !@# ö ÷ ½ Æ ��S & � À Ë­� > �*���� with respectto � ,where � > �*���� is the Kolmogorov empirical CDF of �� , and S PACE

ois the model

that minimizes Ì !@# ö ÷ ½ Æ ��S & � À Ë­��*��+� with respectto � . Since, almost surely,� > � ���� | �� ��+� uniformally as #g| } (Glivenko-Cantelli theorem),it implies

that,almostsurely, Ì ! # ö ÷ ½ Æ ��S & � À Ë­� > �(����~| Ì ! # ö ÷ ½ Æ ��S & � À Ë­���*���� as #�| } ,

dueto the Helly-Bray theorem(see,e.g.,Galambos,1995). Therefore,minimiz-

ing theprior expectedlossis equivalentto minimizing theposteriorexpectedloss,# -asymptotically. Wecompletetheproof with S PACEo - > S PACE

ra.s. A

Thesetwo proceduresshow how to selectthe bestmodel in a sequenceof # K 6nestedmodels.However, no modelsequencecanguaranteethattheoptimalmodel

is one of the nestedmodels. Thuswe now considerdimension-basedmodeling,

wherethe final model canbe a combinationof any dimensions.With selection,

the numberof potentialmodelsgiven the projectionson # dimensionsis as large

as ! > . Whenthe orthogonalbasisis providedby a sequenceof orthogonalnested

models,this kind of selectionmeansthat the final modelmay not be oneof the

nestedmodels,and its parametervector may not be the OLS fit in termsof the

original variables.

Procedure 3 � PACE �i� . Let ?�-(3C5��k¹¸87EDGF]H�D"J8K­ ������ , where W�� ± _ is thesetof zeros

of � . Set

Ø� & -Ù/ if �� & ��? ; otherwise

Ø� & - �� & . OutputthemodeldeterminedbyW Ø� � 1323232Á1 Ø� > _ .Theorem 3.8 Given ���� � � , S PACE is theminimumprior expectedlossestimator

of SU� in an estimatorclasswhich is thesubclassof [ > in which everyestimator

is determinedby W � � 1323232i1 � > _ where

Ø� & [þWu/=1q�� & _ for all � .59

Page 74: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Theproof is omitted;it is similar to thatof Theorem3.6.

Since S PACEo

is the modeldeterminedby W Ø� � 1323232Á1 Ø� > _ where

� & is either �� & or/ dependingon whetherthe associatedvariableis includedor not, this estimator

belongsto theclassdescribedin Theorem3.8.Thisgivesthefollowing corollary.

Corollary 3.8.1 Given ����+��� , S PACE is a betterestimator(in the senseof prior

expectedloss)than S PACEo.

The differencebetweenthe modelsgeneratedby PACE � and PACE � is alsosmall,

because

Corollary 3.8.2 Given ����+�,� , if theelementsof W~�� & �¢�- 6"132�232Á1,# _ arein decreas-

ing orderas � increases,S PACEo -HS PACE .

As wehaveseen,whetheror notadimensionis contributiveis indicatedby thesign

of thecorresponding� -function.This leadsto thenext procedure.

Procedure 4 � PACE Ci� . Set

Ø� & -m/ if ��� �� & �a�Ý/ ; otherwise

� & - �� & . Outputthe

modeldeterminedby W � � 1323232i1 � > _ .PACE C doesnot rankdimensionsin orderof absolutedistanceandeliminatethose

with smallerdistances,asdo conventionalsubsetselectionproceduresandthepre-

cedingpaceprocedures.Instead,it eliminatesdimensionsthatarenot contributive

in theestimatedmodelirrespective of themagnitudeof their dimensionalabsolute

distance.It mayeliminatea dimensionwith a largerabsolutedistancethananother

dimensionthat is retained. (In fact the other proceduresmay end up doing this

occasionally, but they do soonly becauseof incorrectrankingof variables.)

Theorem 3.9 Given ���� � � , S PACEb hasthesmallestposteriorexpectedlossin the

subclassof [ > in which everyestimatoris determinedby W � � 1�23232�1 � > _ where

Ø� & [Wu/=1¤�� & _ for all � .60

Page 75: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Theproof is omitted;it is similar to thatof Theorem3.7.

Becausethe estimatorclassdefinedin Theorem3.9 coversthe classesdefinedin

Theorems3.7and3.8,

Corollary 3.9.1 Given ����+�,� , S PACEb isabetterestimatorthan S PACErand S PACE .

PACE � , PACE � , PACE � andPACE C areall selectionprocedures:eachupdateddimen-

sionalabsolutedistanceof theestimatedmodelmustbeeither / or �� & . Theoptimal

valueof

� & is often neitherof these. If the possiblevaluesarechosenfrom ] $instead,thebestupdatedestimate

� & is theonethatmaximizestheexpectedcon-

tribution of the � th dimensiongiven �� & and ���� � � . The optimality is achieved

over an uncountablyinfinite setof potentialmodels. This relaxationcanimprove

performancedramaticallyevenwhenthereareno noisydimensions.

Procedure 5 � PACE Bi� . Outputthemodeldeterminedby W Ø� � 1�23232i1 Ø� > _ , whereØ� & -4365��k¹c3]d÷:F ! # Ê"!"# �������������Ï¿������ � � Ϭ� �� & ��� � �=Ë­���� � �i2 (3.28)

Theorem 3.10 Given �������,� , S PACEe has the smallestposteriorexpectedlossof

all estimators in [ > .Proof. Each OLS estimate �� & is an observation sampledfrom the mixture pdfϬ� ���,��� determinedby the componentpdf Ï¿� ������ and the mixing distribution�������� . If �� & is replacedby any �®[�]%$ , the expectedcontribution of � given�� & and ����+��� is��������q�� & 1���~- Ì ! # ����������+����Ϭ�*�� & �������=Ë­�������,�Ì ! # Ϭ�*�� & ��� � �=Ë­����� � � 2 (3.29)

Since ���������� � �b- ��������� � �j "Ϭ������ � � from (3.12) and Ì !@# Ϭ� �� & ��� � �=Ëw����� � �is constantfor every � , the integration on the right-handside of (3.28) actually

maximizesover the expectedcontribution of � . From (3.8), this is equivalentto

61

Page 76: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

minimizing the expectedloss in the � th dimension. Becauseall dimensionsare

independent,the posteriorexpectedlossof the updatedmodelgiven the set Wk�� & _and ��������� is minimized. ACorollary 3.10.1 Given ��������� , S PACEe is a betterestimatorthan S PACEb .This motivatesa new shrinkagemethod:shrink the magnitude(or the sumof the

magnitude)of the orthogonalprojectionsof the model åS . This is equivalent to

updatingthe OLS estimate åS to a modelthatsatisfies

� & �z�� & for every � . Since

all shrinkageestimatorsof this typeobviously yield a memberof [ > , S PACEe is a

betterestimatorthanany of them.

Althoughwe do not pursuethis directionbecauseof theoptimality of S PACEe , this

helpsus to understandothershrinkageestimators.Modelsproducedby shrinkage

methodsin theliterature—ridgeregression(includingridgeregressionfor subsetse-

lection),NN-GAROTTE andLASSO—donotnecessarilyrequireanorthogonalspace

andhencedo not alwaysbelongto this subclass,andso we cannotshow that the

new estimatoris superiorto themin general.However, in theimportantspecialcase

of orthogonalregression,whenthecolumnvectorsof G aretakenastheorthogonal

axis,themodelsproducedby all theseshrinkagemethodsdobelongto thissubclass

(seeExample3.1). Therefore,

Corollary 3.10.2 Given ���� � � , S PACEe is a betterestimatorfor orthogonalregres-

sionthan S RIDGE, S NN-GAROTTE and S LASSO.

A generalexplicit solutionto (3.28)doesnotseemto exist. Ratherthanresortingto

numericaltechniques,however, a goodapproximatesolutionis available. Consid-

eringthat�������� � �Ϭ������� � � - � �s� �)�i� � � � ���s� ��i� � � � K � � p � ��Á� � � �f��� p � ���Á� � � ���� � �� � � � � K ��� p � �� � � � � 1 (3.30)

62

Page 77: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

thedominantpartontheright-handsideis � � � �� � � � �f��� � �� � � � � —andits dom-

inanceincreasesdramaticallyas ��� increases.Replacing ��������+�,�� �Ϭ���������,� in

(3.28)by � � � ��� � � � � , weobtainthefollowing approximationto PACE B .Procedure 6 � PACE Ai� . Outputthemodeldeterminedby W Ø� � 1�23232i1 Ø� > _ whereØ� & -4365��k¹c3]d÷:F ! # Ê !"# � � � �� � � � �jϬ�*�� & ��� � �=Ë­���� � �i2 (3.31)

Equation(3.31)canbesolvedby settingthefirst derivativeof theright-handsideto

zero,resultingin Ø� & - ÓiÌ !"# � � � Ï¿�*�� & ���+�,�=Ë­��������Ì !@# Ϭ�*�� & ��� � �=Ëw����� � � Ö � 2 (3.32)

In (3.28),(3.31)and(3.32)thetruedistribution ����+�,� is discrete(asin (3.21)),so

they becomerespectivelyØ� & -B365���¹g3�d÷hF ! # )° ±³² � ������,7T�± �Ϭ�����,7 �± � Ϭ�(�� & �,7 �± �i' ± 1 (3.33)Ø� & -B3C5���¹c3]d÷:F ! # )° ±È² � � � � �� � 7 �± ��Ϭ�*�� & �,7 �± �j' ± 1 (3.34)

and Ø� & - Ó Þ )±³² � � 7 �± Ï¿� �� & �,7 �± �j' ±Þ )±³² � Ϭ�*�� & ��7 �± �i' ± Ö � 2 (3.35)

Thefollowing looseboundcanbeobtainedfor theincreasedposteriorexpectedloss

sufferedby thePACE A approximation.

Theorem 3.11/��göú½ Æ & ��S PACEk � l��� & 1,���� � � À p öú½ Æ & ��S PACEe � l��� & 1,���� � � À �í! � � � 2 (3.36)

63

Page 78: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Proof. SeeAppendix3.8.3.

Appendix3.8.3actuallyobtainsthetighterbound ·ml Ø� PACEk& 7 � � � ��n o÷ PACE k� p e, where7 � is thesupportpointof thedistribution ����� � � thatmaximizes� � l Ø� PACEk& �Á� � � � p����� Ø� PACEk& ���+��� . This boundrisesfrom zero at the origin in termsof

� PACEk& 7T� ,achievesthemaximum ! � � � at thepoint

Ø� PACEk& 7¿�û-ã/�254 , andthereafterdropsex-

ponentiallyto zeroas

Ø� PACEk& 7¿� increases.It follows that the increasedposterior

expectedlosscausedby approximatingPACE A is usuallycloseto zero.

Remarks

All thesepaceproceduresadjust the magnitudeof the orthogonalprojectionsof

the OLS estimate åS , basedon an estimateof the expecteddimensionalcontri-

butions. Among them, PACE B and PACE A go the furthest: eachprojectionof åSontotheorthogonalaxiscanbeadjustedto any nonnegativevalueandtheadjusted

valueachieves(or approximatelyachieves)thegreatestexpectedcontribution,cor-

respondingto the minimum posteriorexpectedloss. Thesetwo procedurescan

shrink,retainor evenexpandthevaluesof theabsolutedimensionaldistances.Sur-

prisingthoughit maysound,increasinga zerodistanceto a muchhighervaluecan

improvepredictiveaccuracy.

Of thesix paceprocedures,PACE � , PACE C andPACE A aremostappropriatefor prac-

tical applications.PACE A generatesa very goodapproximationto the modelfrom

PACE B , which is thebestof thesix procedures.ProcedurePACE � choosesthebest

memberof a sequenceof subsetmodelsthat is provided to it, which is useful if

prior informationdictatesthesequenceof subsetmodels.PACE � andPACE � involve

numericalintegrationandhave higher (posterior)expectedloss thanotherproce-

dures. PACE C , which is a lower expectedlossprocedurethanPACE � , is usefulfor

dimension-basedsubsetselection.

64

Page 79: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

3.6 The estimationof XY+2, Z .Now it is time to considerhow to estimate���� � � from �� � 1q�� � 13232�2�1q�� > , which are

the dimensionalabsolutedistancesof the OLS estimate åS . Oncethis is accom-

plished,the proceduresdescribedin the last sectionbecomefully definedby re-

placingthetrue �������,� , which we assumedin thelastsectionwasknown, with the

estimate.Theestimationof ����+�,� is anindependentstepin thesemodelingproce-

dures,andcanbe investigatedindependently. It critically influencesthequality of

thefinal model—betterestimatesof ����� � � givebetterestimatorsfor theunderlying

model.�� � 1��� � 132�232�1q�� > areactuallyasamplefrom amixturedistributionwhosecomponent

distribution ��� ����� � � is known andwhosemixing distribution is ���� � � . (Strictly

speaking,this sampleis taken without replacement.However, this is asymptoti-

cally thesameassamplingwith replacement,andsodoesnot affectour theoretical

resultssincethey arebe establishedin the asymptoticsense.)Estimating �������,�from datapoints �� � 1q�� � 132�232�1q�� > is tantamountto estimatingthemixing distribution.

Notethatthemixturehereis acountableone—theunderlying����+��� hassupportat� � �,1�� �� 132�232�1�� �> , andthenumberof supportpointsis unlimitedas #�| } . Chapter5

tacklesthis problemin ageneralcontext.

Thefollowing theoremguaranteesthat if themixing distribution is estimatedsuffi-

cientlywell, thepaceregressionprocedurescontinueto enjoy thevariousproperties

provedabove in thelimit of large # .Theorem 3.12 Let Wu� > ��� � � _ be a sequenceof CDF estimators. If � > ��� � � |rq�������� a.s.as # | } , andtheknown ��������� is replacedby theestimator� > �����,� ,Theorems3.6–3.11(andall their corollaries)hold # -asymptoticallya.s.

Proof. Accordingto theHelly-Bray theorem,� > ���+����|rqO��������� a.s.as #æ| }implies the almostsurepointwiseconvergenceof all the objective functionsused

in thesetheorems(andtheir corollaries)to theunderlyingcorrespondingfunctions,

becausethesefunctionsarecontinuous.Thisfurtherimpliesthealmostsureconver-

65

Page 80: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

genceof theoptimalvaluesandof the locationswheretheseoptimaareachieved,

as #�| } . Thiscompletestheproof. AAll of theaboveresultsutilize thelossfunction lnl F ` p F � ltl � uQ � . However, our real

interestis ltl F ` p F � lnl � . Thereforeweneedthefollowing corollary.

Corollary 3.12.1 If thelossfunction lnl F ` p F � lnl � yQ � is replacedby lnl F ` p F � lnl � in

Theorems3.6–3.11(andall their corollaries),

1. Theorem3.12continuesto hold, if Q � is known;

2. Theorem3.12 holdsalmostsurely as E×| } if QR� is replacedwith an E -

asymptoticallystronglyconsistentestimator.

It is well known that both the unbiasedOLS estimator $Q � (for ��E p #w��| } asE | } ) and the biasedmaximumlikelihood estimator(for #w yEó| / ) are E -

asymptoticallystronglyconsistent.

In view of Theorem3.12,any estimatorof themixing distributionis ableto provide

thedesiredtheoreticresultsin thelimit if it is stronglyconsistentin thesensethat,

almostsurely, it convergesweaklyto theunderlyingmixing distributionas #| } .

From Chapter5, the maximumlikelihoodestimatoranda few minimum distance

estimatorsareknown to bestronglyconsistent.If any of theseestimatorsareusedto

obtain � > ��� � � , Theorem3.12is secured.This,finally, closesourcircleof analysis.

However, it is pointedout in Chapter5 that all the minimum distancemethods

exceptthenonnegative-measure-basedonesuffer from a seriousdefectin a finite-

samplesituation: they maycompletelyignoresmallnumbersof datapointsin the

estimatedmixture, no matterhow distantthey arefrom the dominantdatapoints.

This severely impactstheir usein our modelingprocedures,becausethe valueof

onedimensionalabsolutedistanceis frequentlyquite different to all the others—

andthis implies that the underlyingabsolutedistancehasa very high probability

of beingdifferenttoo. In addition,themaximumlikelihoodapproach,which does

66

Page 81: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

not seemto have this minority clusterproblem,canbeusedfor paceregressionif

computationalcostis notanissue.

For all theseestimators,thefollowing threeconditions,dueto Robbins(1964),must

besatisfiedin orderto ensurestrongconsistency (seealsoSection5.3.1).

(C1) ����{��,Â"� is continuouson s Nut .

(C2) Define v to be theclassof CDFson t . If �xw o -ã�xw r for � � 1,� � [�v , then� � -H� � .(C3) Either t is a compactsubsetof ] , or �n¸t¹zy ºU{­»}| y~F Î ���{���Â"� exists for each{b[�s andis notadistribution functionon s .

Paceregressioninvolvesmixturesof ��� � ���+�, "!"� , where �+� is themixing parameter.

ConditionsC1 and C3 are clearly satisfied. C2, the identifiability condition, is

verifiedin Appendix3.8.4.

3.7 Summary

This chapterexploresandformally presentsa new approachto linear regression.

Not only doesthisapproachyield accuratepredictionmodels,it alsoreducesmodel

dimensionality. It outperformsothermodelingproceduresin the literature,in the

senseof # -asymptoticsand,aswe will seein Chapter6, it producessatisfactory

resultsin simulationstudiesfor finite # .We have limited our investigationto linearmodelswith normallydistributednoise,

but the ideasareso fundamentalthatwe believe they will soonfind applicationin

otherrealmsof empiricalmodeling.

67

Page 82: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

3.8 Appendix

3.8.1 Reconstructingmodel fr om updated absolutedistances

Oncefinal estimatesfor theabsolutedistancesin eachdimensionhave beenfound,

themodelneedsto bereconstructedfromthem.Considerhow to build amodelfrom

a setof absolutedistances,denotedby � � 1323232Á1�� > . Let 7 -m��« � � � � 132�232�1�« > � � > � � ,where « & is either K 6 or p 6 dependingon whetheror not the � th projectionof the

predictionvector Fx�` hasthesamedirectionastheorthogonalbaseÜ & . This choice

of « & ’s valueis basedon thefactthattheprojectionsof F �` and F `fe aremostlikely

in thesamedirection—i.e.,any otherchoicewoulddegradetheestimate.

Ourestimateof theparametervectorisIJ-Ý��G � Gæ� � � G ��� 7�Qq1 (3.37)

where � is a column matrix formed from the basesÜ � 13232�2�1 Ü > . For example, ifW'� � 1323232Á1�� > _ arethe OLS estimatesof theabsolutedistancesin thecorresponding

dimensions,(3.37)gives I thevalueof theOLS estimate�I .

It maybe that not all the � & ’s areavailable,but a predictionvectoris known that

correspondsto all missing � & ’s. This situationwill occurif somedimensionsare

forcedto bein thefinal model—forexample,theconstantterm,or dimensionsthat

give very greatreductionsin variation(very large � & ’s). Supposethe numberof

dimensionswith known � & ’s is # � , and call the overall predictionvector for the

remaining# p # � dimensionsF���� �8� . ThentheestimatedparametervectorisIJ-��G � G8� � � G � � F���� �8� K � 7�Q��i1 (3.38)

where� is an EæNJ# � matrixand 7 a # � -vector.

The estimationof I from � & ’s is fully describedby (3.37)and(3.38). However,

in practicethecomputationtakesa different,moreefficient, route.Oncethe E Næ#68

Page 83: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

approximationequationof theoriginal least-squaresproblemhasbeenorthogonally

transformed,finding theleastsquaressolutionreducesto solvingamatrixequation

� IJ-0£­1 (3.39)

where�

is a #�N�# upper-triangularmatrixand £ is a # -vector(LawsonandHanson,

1974,1995). As a matterof fact, thesquareof the � th elementin £ is exactly the

OLS estimate�� & QR� . Whenanew setof estimates,say

� & (�Z-�6"1323232i1�# ), is obtained,

thecorrespondingestimateof I � is thesolutionof (3.39)with the � th elementin £replacedby l

Ø� & Q without changingsign. If not all �� & ’s areknown, sothat(3.38)

is usedinsteadof (3.37),only dimensionswith known �� & ’s arereplaced.

3.8.2 The Taylor expansionof ùúìjîñýuîbÿ'ï with respectto � îLet % - � � ¾ / and % �©- � � � ¾ / . Denote�§-×6' � !]� . Because

� � % � % � ��- % � � p � % p % � � � -0! %*% � p % �and

��� % � % � �~-(� � � � � ê � e � rr -4� � r �� e ê � rr � � � e rrhence � � % � % ��� ��� % � % �,�� � � � e rr- ��! % � p % � %@� r � e ê �rP�- ��! % � p % � % K ��! % � p % �¨�! % �TK ��! % � p % � �: % � K ��� % C �- ! % � % K � p 6 K ! % � � � % � K � p ! % � K % � � � % � K ��� % C �

69

Page 84: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Similarly

� � p % � % � �~- % � � p � % K % � � � - p ! %*% � p % �and

��� p % � % � ��-4� � � � � # � e� rr -(� � ê � r ê r �� er � � � e rrhence � � p % � % � �f��� p % � % � �� � � � e rr- � p ! % � p % � %@� ê r � e ê �r �- p ��! % � K % � % K ��! % � K % ���! % � p ��! % � K % � �: % � K ��� % C �- p ! % � % K � p 6 K ! % � � � % �¬K ��! % � p % � � � % � K ��� % C �Therefore �������������� � � � e rr - � � % � % ���f��� % � % �,� K � � p % � % ���f��� p % � % ���! % � � � � e rr- � p 6 K ! % � � � % K ��� % � �thatis, �������� � �k- 6� !�� � ��� er ½ � p 6 K !y� � � � � K ����� `r � À 2 (3.40)

3.8.3 Proof of Theorem 3.11

To proveTheorem3.11,weneedasimplelemma.

Lemma 1 For any � ¾ / and ��� ¾ / ,/�� � � � �)� � � � � p �������� � �;� · � �+� � � � Cj� ÷=÷ e �H! � � � 2 (3.41)

70

Page 85: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Proof. For every � ¾ / and �+� ¾ / , � � � ��� � � � � ¾ / and � � p � �� � � � �z�Í/ ,hence

� � � �� � � � �- � � � �� � � � �f��� � �� � � � � K � � � �� � � � �f��� p � �� � � � ���� � �� � � � � K ��� p � �� � � � �¾ � � � �� � � � �f��� � �� � � � � K � � p � ��� � � � � ��� p � �� � � � ���� � �� � � � � K ��� p � ��� � � � �- ���������� � �i1and

� � � ��� � � � � p ��������� � �- ½ � �¢� ��Á� � � � p � � p � ���Á� � � � À ��� p � ���Á� � � ���� � �� � � � � K ��� p � ��� � � � �- · � �+� � ��� p � ���Á� � � ���� � �� � � � � K ��� p � ��� � � � �� · � � � � ��� p � �� � � � ����s� ���Á� � � �- · � � � � � � � � ÷�÷ e� ! � � � 1thuscompletingtheproofof thelemma. AProof of Theorem3.11. Accordingto thedefinitionof dimensionalcontribution, to

proveTheorem3.11we needto show that/��g����� Ø� PACEe& �q�� & 1,���� � ��� p ����� Ø� PACEk& �q�� & 1������ � �j���í! � � � 1 (3.42)

where �������,� , no mattercontinuousor discrete,is themixing distribution function

usedin bothprocedures.Thefirst inequalityin (3.42)is obviousbecause

� PACE e& is

theoptimalsolution.For thesecondinequality, from theabove lemmaandthefact

71

Page 86: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

that

Ø� PACEk& is theoptimalsolutionof (3.34),we have����� Ø� PACEe& �q�� & 1,���� � ���- Ì !@# ���Ø� PACE e& �����,�jϬ�*�� & �������=Ë­����+�,�Ì ! # Ϭ� �� & ��� � �=Ëw����� � �� Ì ! # � � lØ� PACEe& �i� � � �jϬ�*�� & ��� � �=Ëw����� � �Ì !@# Ϭ�*�� & ��� � �=Ëw����� � �� Ì !@# � � lØ� PACEk& � � � � �jϬ�*�� & �����,�=Ëw�������,�Ì ! # Ϭ�*�� & ��� � �=Ëw����� � � 2

Hence ��� Ø� PACE e& ���� & 1,���� � ��� p ��� Ø� PACEk& �q�� & 1,���� � ���� Ì ! # W � � l

Ø� PACEk& � � � � � p ����� Ø� PACEk& ��� � � _ Ϭ�*�� & ��� � �=Ë­���� � �Ì !@# Ϭ�*�� & ��� � �=Ëw����� � � 2Accordingto the lemma, � � l Ø� PACEk& �Á� � � � p ����� Ø� PACEk& ��� � � ¾ / for every � � ,andlet 7 � -(3C5��k¹c3�dH�÷ e K � � l Ø� PACEk& � � � � � p ��� Ø� PACEk& ��� � ��1where W'� � _ is thesetof supportpointsof ����� � � . Therefore,��� Ø� PACE e& �q�� & 1,���� � ��� p ��� Ø� PACEk& �R�� & 1,���� � ���� � � l Ø� PACEk& �Á� 7 � � p ����� Ø� PACEk& �,7 � �� · l Ø� PACEk& 7 � � � � n o÷ PACE k� p e

� ! � � � 1whichfinishestheproof of Theorem3.11. A

72

Page 87: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

3.8.4 Identifiability for mixtur esof �}� �(ìjî�ÿ��G�Rï distrib utions

Although we have beenunableto locatethe following theoremin the literature,

it seemsunlikely to be original. Note that here,identifiability appliesonly to the

situationwherethemixing function is limited to beinga CDF. Note that the iden-

tifiability of mixturesusingCDFsasmixing functionsimpliestheidentifiability of

mixturesusingany finite nonnegative functionsasmixing functions(Lemma4 in

Section5.4.3),asrequiredby (5.19).

Theorem 3.13 Themixture of ��� �������, �!"� distributions,where ��� is themixingpa-

rameter, is identifiable.

Proof. Let ����{������ be the CDF of the distributionP �L�¬1�6'� , and �������� � � be the

CDFof thedistribution � � ����� � �!"� . Fromthedefinitionof the � � �Á��� � "!"� distribution

andthesymmetryof thenormaldistribution,�������� � �U- �� � �)� � � � � p �� p � ��� � � � �- �� � �)� � � � � K �� � �� p � � � � p 6"1 (3.43)

where �� � �� � � � � is theCDF ofP � � � � 136'� and �� � ��� p � � � � is theCDF ofP � p � � � 136 � . Therefore,if themixtureof ������������ wereunidentifiable,themix-

ture of normaldistributionswherethe mean � is the mixing parameterwould be

unidentifiable.Clearlythiscontradictsthewell-known identifiability resultfor mix-

turesof normaldistributions(Teicher, 1960),thuscompletingtheproof. A

73

Page 88: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,
Page 89: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Chapter 4

Discussion

4.1 Intr oduction

Several issuesrelatedto paceregressiondeserve furtherdiscussion.Somearenec-

essaryto completethedefinitionof thepaceregressionproceduresin specialsitua-

tions. Othersexpandtheir implicationsinto a broaderarena,while yet othersraise

interestingopenquestions.

4.2 Finite � vs. � -asymptotics

We have seenthat the paceregressionproceduresare optimal in a # -asymptotic

sense.Larger numbersof variablestend to produceestimatorsthat arecloserto

optimal. If thereareonly a few candidatevariables,paceregressionwill not neces-

sarily outperformothermethods.Since # is inevitably finite in practice,it is worth

expandingon this.

The central idea of paceregressionis to decomposethe predictionvector of an

estimatedmodel into # orthogonalcomponents,and then adjusteachaccording

to aggregatedmagnitudeinformationfrom all components.The morediversethe

magnitudesof thedifferentcomponents,thelessthey caninform theadjustmentof

75

Page 90: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

any particularone. If onecomponent’s magnitudediffers greatly from that of all

others,thereis little basison which to alterits value.

Paceregressionshineswhenmany variableshave similar effects—acommonspe-

cial situationis whenmany variableshavezeroeffect. As theeffectsof thevariables

disperse,paceregression’ssuperiorityoverotherproceduresfades.Whentheeffect

of eachvariableis isolatedfrom thatof all others,thepaceestimatoris exactly the

OLS one.In principle,theworstcaseis whenno improvementoverOLS is possible.

Althoughpaceregressionis # -asymptoticallyoptimal, this doesnot meanthat in-

creasingthe numberof candidatevariablesnecessarilyimprovesprediction. New

contributivevariablesshouldincreasepredictiveaccuracy, but new redundantvari-

ableswill decreaseit. Pre-selectionof variablesbasedon backgroundknowledge

will alwayshelpmodeling,if suitablevariablesareselected.

4.3 Collinearity

Paceregression,like almostany otherlinear regressionprocedure,fails whenpre-

sentedwith (approximately)collinearvariables.Hencecollinearity shouldbe de-

tectedandhandledbeforeapplyingtheprocedure.Wesuggesteliminatingcollinear-

ity by discardingvariables.Thenumberof candidatevariables# shouldbereduced

accordingly, becausecollinearitydoesnot provide new independentdimensions,a

prerequisiteof paceregression.In otherwords,themodelhasthesamedegreesof

freedomwithout collinearityaswith it. Appropriatevariablescanbe identifiedby

examiningthematrix G � G , or QR-transformingG (see,e.g.,LawsonandHanson,

1974,1995;Dongarraet al., 1979;Andersonet al., 1999).

Notethat OLS subsetselectionproceduresaresometimesdescribedasa protection

againstcollinearity. However, the fact is that noneof theseautomaticprocedures

canreliably eliminatecollinearity, for collinearity doesnot necessarilyimply that

theprojectionsof thevector F aresmall.

76

Page 91: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

4.4 Regressionfor partial models

The full OLS modelforms the basisfor paceregression.In somesituations,how-

ever, the full model may be unavailable. For example, theremay be too many

candidatevariables—perhapsmorethanthenumberof observations.Althoughthe

largestavailable partial model can be substitutedfor the full model, this causes

practicaldifficulties.

The clusteringstepin paceregressionmust take all candidatevariablesinto ac-

count. This is possibleso long asthe statisticaltestusedto determinethe initial

partialmodelcansupplyanapproximatedistribution for thedimensionalabsolute

distancesof the variablesthat do not participatein it. For example,the partial-�testmaybeusedto discardvariablesbasedon forwardselection;typically, dimen-

sionalabsolutedistancesaresmallerfor thediscardedvariablesthanfor thoseused

in the model. It seemslikely that a sufficiently accurateapproximatedistribution

for thedimensionalabsolutedistancesof thediscardedvariablescanbederivedfor

this test,thoughfurtherinvestigationis necessaryto confirmthis.

Insteadof providing an approximatedistribution for the discardedvariables,it is

alsopossibleto estimatethemixing distributiondirectlybyassumingthattheeffects

of thesevariablesarelocatedin anintervalof smallvalues.Estimationof themixing

distribution canthenproceedasusual,provided theboundariesof fitting intervals

areall chosento beoutsidethis interval. Our implementationadoptsthis approach.

In many practicalapplicationsof datamining, somekind of feature selection(see,

e.g.Liu andMotoda,1998;Hall, 1999)is performedbeforeaformalmodelingpro-

cedureis invoked,which is helpful whentherearetoo many features(or variables)

availablefor theprocedureto handlecomputationally. However, it is generallynot

acknowledgedthatbiasis introducedby discardingvariableswithout passingrele-

vantinformationon to themodelingprocedure—thoughadmittedlymostmodeling

procedurescannotmakeuseof this kind of information.

Estimatingthenoisevariance,shouldit be unknown a priori, is anotherissuethat

77

Page 92: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

is affectedwhen only a partial initial model is available. The noisecomponent

will containthe effectsof all variablesthat arenot includedin the partial model.

Moreover, becauseof competitionbetweenvariables,the OLS estimateof QR� from

thepartialmodelis biaseddownwards.How to compensatefor this is aninteresting

topicworthyof furtherinvestigation.

4.5 Remarks on modelingprinciples

As notedin Section1.2, paceregressionmeasuresthe successof modelingusing

two separateprinciples. The primary oneis accuracy, or the minimizationof ex-

pectedloss.Thesecondaryoneis parsimony, or apreferencefor thesmallestmodel,

andis only usedwhenit doesnot conflict with thefirst—thatis, to decidebetween

severalmodelsthat have aboutthe sameaccuracy. The secondaryprinciple is es-

sentialif dimensionalityreductionis of interest.For example,if thesupportpoints

foundby theclusteringsteparenotall zero,noneof theupgraded

Ø� & ’sby PACE B and

PACE A will bezero. But many maybetiny, andeliminatingtiny

� & hasnegligible

influenceonpredictiveability.

In the following, we discussfour widely-acceptedgeneralprinciplesof modeling

by fitting linear models. Whenthe goal is to minimize the expectedloss, the ( # -asymptotic)superiorityof paceregressioncastsdoubton theseprinciples. In fact,

so doesthe existing empirical Bayesmethodology, including Stein estimation—

however, it doesnotseemto havebeenappliedto fitting linearmodelsin awaythat

is directlyevaluatedbasedon thelossfunction;seethereview in Section2.8.

First, paceregressionchallengesthe generalleastsquaresprinciple—or perhaps

moreprecisely, theunbiasednessprinciple.All six paceproceduresoutperformOLS

estimation:PACE � andPACE � in thesenseof prior expectedlossandtheremainder

in thesenseof posteriorexpectedloss. Accordingto theGauss-Markov Theorem,

OLS yieldsa“bestlinearunbiasedestimator”(or BLUE). OLS’s inferiority, however,

is duepreciselyto the unbiasednessconstraint,which fails to utilize all the infor-

78

Page 93: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

mationimplicit in thedata. It is well known thatbiasedestimatorssuchassubset

selectionandshrinkagecanoutperformthe OLS estimatorin particularsituations.

We have shown thatpaceregressionestimators,which arealsobiased,outperform

OLS in all situations.In fact, in modernstatisticaldecisiontheory, therequirement

of unbiasedestimationis usuallydiscarded.

Second,paceregressionchallengesthemaximumlikelihoodprinciple. In thelinear

regressionsituation,the maximumlikelihoodestimatoris exactly the sameasthe

OLS one.

Third, someusesof Bayes’ rule comeunderattack. The BayesianestimatorBIC

is threshold-basedOLS subsetselection,wherethe a priori densityis usedto de-

terminethe threshold.Yet thesix paceregressionproceduresequalor outperform

the bestOLS subsetselectionestimator. This improvementis not basedon prior

information—thegeneralvalidity of which haslongbeenquestioned—but on hith-

erto unexploited informationthat is implicit in the very samedata. Furthermore,

whenthe non-informative prior is used—andthe useof the non-informative prior

is widely accepted,evenby many non-Bayesians—theBayesestimatoris thesame

asthe maximumlikelihoodestimator, i.e., the OLS estimator, and thereforeinfe-

rior to the paceregressionestimators.Although hierarchical Bayes(alsoknown

asBayesempiricalBayes), which involvessomehigherlevel, subjectiveprior, can

producesimilar resultsto empiricalBayes(see,e.g.,Berger, 1985;Lehmannand

Casella,1998),bothdiffer essentiallyin whetherto utilize prior informationor data

informationfor thedistributionof trueparameters(or models).

Fourth, questionsariseconcerningcomplexity-basedmodeling. Accordingto the

minimum descriptionlength(MDL) principle (Rissanen,1978), the bestmodel is

the onethat minimizesthe sumof the modelcomplexity andthe datacomplexity

giventhemodel.In practicethefirst partis anincreasingfunctionof thenumberof

parametersrequiredto definethemodel,while thesecondis theresubstitutionerror.

Ouranalysisandexperimentsdonotsupportthisprinciple.Wehavefound—seethe

examplesandanalysisin Sections3.4 and3.5 andtheexperimentsin Chapter6—

thatpaceregressionmaychoosemodelsof thesameor even larger size,andwith

79

Page 94: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

largerresubstitutionerrors,thanthoseof otherprocedures,yet givesmuchsmaller

predictionerrorson independenttestsets. In addition,the MDL estimatorderived

by Rissanen(1978)is thesameastheBIC estimator, whichhasalreadybeenshown

inferior to thepaceregressionestimatorsin termsof ourpredictioncriterion.

4.6 Orthogonalization selection

Our analysisassumesthat the dimensionalmodelsassociatedwith the axesin an

orthogonalspaceareindependent.This statement,however, is worth furtherexam-

ination.Whena suitableorthogonalizationis givenbeforeseeingthedata(or more

precisely, the F values),this is clearly true. Theproblemis thatwhentheorthog-

onalizationis generatedfrom thedata,say, by a partial-� test,it correspondsto a

selectionfrom a groupof candidateorthogonalizations.Suchselectionwill defi-

nitely affect thepaceestimators(andmany othertypesof estimatorwhich employ

orthogonalization).

Theeffectof selectingorthogonalizationcanbeshown by extremeexamples.After

eliminatingcollinearity(seeSection4.3),wecouldalwaysfind anorthogonalization

suchthat �� � - �� � - v v v - �� > , in which casethe updatingof the PACE B and

PACE A estimatorswill choosethesamevaluesof the OLS estimates.Or, possibly,

these �� & valuescould be adjustedby selectingan appropriateorthogonalbasisto

be “distributed” like a uni-component� � sample,thereforePACE B andPACE A will

updatethemto the centerof this distribution. Generallyspeaking,suchupdating

will not achieve the goal of betterestimation,becauseit manipulatesthe datatoo

much.

Despitethesephenomena,it is not suggestedthat the proceduresareonly useful

whenanorthogonalizationis providedin advance.Rather, employing thedatato set

upanorthogonalbasiscanhelpto eliminateredundantvariables.Thisphenomenon

resemblestheissueof modelselection—hereit is justorthogonalizationselection—

becauseboth involve manipulatingthe dataand thus affect the estimation. The

80

Page 95: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

selectionof anorthogonalizationis acompetitionwith uncertaintyfactorsin which

thewinneris theonethatbestsatisfiestheselectioncriterion,e.g.,thepartial-� test.

Orthogonalizationselectioninfluencesthe proposedestimators,yet its effect re-

mainsunclearto us. Althoughexperimentswith paceregressionproducesatisfac-

tory results(seeChapter6) without taking this effect into account,the theoretical

analysisof paceregressionis incompletewhentheorthogonalizationis basedupon

thedata.We leave this problemopen:it is a futureextensionof paceestimation.

Despitethe optimality of paceregressiongiven an orthogonalbasis,differentor-

thogonalsystemsresult in differentestimatedmodels. This obviously contradicts

theinvarianceprinciple.Weconsiderthisquestionasfollows.

An orthogonalsystem,like any otherassumptionusedin empiricalmodeling,can

bepre-determinedby employing prior information(backgroundknowledge)or de-

terminedfrom data.Which of theseis appropriatedependson knowing (or assum-

ing) which helpsthe modelingmost. An orthogonalsystemthat is naturalin an

applicationis notnecessarilythebestto employ. In situationswhentherearemany

redundantvariables,thedata-drivenapproachbasedon thepartial-� testseemsto

beagoodchoice.

4.7 Updating signed �� � ?All proposedpaceproceduresconcernthe mixture of �� & ’s, or the estimationof�������� . An interestingquestionarises:is it possible,or indeedbetter, to estimate

andusethemixing distribution of themixtureof signed $%'& , wherethesignof each$%'& is determinedby thedirectionof eachorthogonalaxis? Henceeach $%'& is inde-

pendentandnormallydistributedwith mean% �& . Oncethemixing distribution,say,�� % � � , is obtainedfrom the given $%'& ’s throughan estimationprocedure,each $%'&could be updatedin a similar way to the proposedpaceestimators.Note that the

sign of $%'& hereis differentfrom that definedin Section3.3.1. Both $%'& and % �& are

signedvalueshere,while previously % �& is alwaysnonnegative.

81

Page 96: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

a*0

Figure4.1: Uni-componentmixture

Employingonly theunsignedvaluesimpliesthatany potentialinfluenceof thesigns

is shieldedout of the estimation,no matterwhetherit improves or worsensthe

estimate.Therefore,it is worth examiningsituationswheresignshelp andwhere

they hinder. At first glance,using ��� % ��� ratherthan ����+��� seemsto producebetter

estimates,sinceit employsmoreinformation—informationaboutthesignaswell as

themagnitude.Nevertheless,our answerto this questionis negative: it is possible

but generally notasgood.Considerthefollowing examples.

Start from a simple, ideal situationin which the magnitudeof all projectionsare

equal.Further, assumethatthedirectionsof theorthogonalaxesarechosensothat

thesignsof % � � 1 % �� 132�232�1 % �> arethesame,i.e., % � � - % �� - v v v*- % �> - % � ��Ñ^/ , without

lossof generality).Thenthemixture is a uni-componentnormaldistribution with

mean% � , asshown in Figure4.1.Thebestestimatorusing �� % �,� — # -asymptotically

equivalentto usingastronglyconsistentestimatorof ��� % ��� —canideallyupdateall$%'& to % � , resultingin zeroexpectedloss. In contrast,however, theestimatorbased

on ���� � � will only updatethe magnitude,but not the sign, i.e., the dataon the

negativehalf realline will keepthesamenegativesignwith theupdatedmagnitude.

Obviously in this casetheestimatorbasedon ��� % � � shouldbebetterthantheother

one. Note this is an idealsituation.Also, for finite sample,the improvementonly

takeseffect for thedatapointsaroundtheorigin.

The secondexampleinvolvesasymmetricmixtures. Sincein practicewe areun-

likely to beso fortunatethatall % �& have thesamesign,considera lessidealsitua-

82

Page 97: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

a*0-a*

Figure4.2: Asymmetricmixture

tion: therearemorepositive % �& thannegative % �& . In Figure4.2, a mixture of two

componentsis displayed,wheretwo-thirdsof the % �& ’s aresetto % � andthe resttop % � . In this asymmetriccase,theadvantageof using �� % �,� over using ��������� de-

creases,becausetheratio of datapointsthatchangesignafterupdatingdecreases.

Further, in thefinite samplesituation,thevalueof % � estimatedfrom theset W­$%'& _ is

lessaccuratethanfrom theset W��� & _ , sincethesamenumberof datapointsaremore

sparselydistributedwhenusing W�$%u& _ thanwhenusing W �� & _ ; seealsoSection4.2.

More importantly, theassumptionof anasymmetricmixture implies theexistence

of correlationbetweenthedeterminedaxialdirectionsandtheprojectionsof thetrue

responsevector. This is equivalentto employing prior information,whoseeffect,as

always,mayupgradeor deterioratetheestimate,dependingonhow correctit is. For

example,if we know beforehandthatall thetrueparametersarepositive,wemight

be ableto determinean orthogonalbasisthat takesthis informationinto account,

thusimproving estimation.However, moregenerally, we have no ideawhetherthe

directionsof thechosenbasisaresomehow correlatedwith thedirectionsof thetrue

projections.Without this kind of prior information,wecanoftenexpectto obtaina

symmetricmixture,asshown in Figure4.3. Choosingaxial directionswithout any

implicationabouttheprojectiondirectionsis equivalentin that theaxial directions

arerandomlychosen.

In the caseof symmetric �� % � � , updatinginvolvesonly updatingthe magnitude,

not thesign, just aswhenusing ���� � � . However, a betterestimateof ����� � � than

83

Page 98: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

a*0-a*

Figure4.3: Symmetricmixture

of ��� % ��� is obtainedfrom thesamenumberof datapoints,becausethey aremore

closelypacked together. Furthermore,randomassignmentof signsimplies extra,

imposeduncertainty, thusdeterioratingestimation.

In theabove,only onesinglemagnitudevalueis considered,but theideaextendsto

situationswith multipleor evencontinuousmagnitudevalues.Without usefulprior

informationto suggestthechoiceof axialdirections,onewouldexpectasymmetric

mixtureto beobtained.In this case,it is betterto employ �������� than ��� % ��� .

84

Page 99: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Chapter 5

Efficient, reliableand consistent

estimationof an arbitrary mixing

distrib ution

5.1 Intr oduction

A commonpracticalproblemis to fit anunderlyingstatisticaldistributionto asam-

ple. In someapplications,this involvesestimatingtheparametersof a singledistri-

bution function—e.g.themeanandvarianceof a normaldistribution. In others,an

appropriatemixtureof elementarydistributionsmustbefound—e.g.asetof normal

distributions,eachwith its own meanandvariance.Paceregression,aspointedout

in Section3.6,concernsmixturesof noncentral��� � distributions.

In many situations,thecumulative distribution function(CDF) of a mixture distri-

bution (or model) hastheform��w���{¤��- Ê Î ����{��,Â"��£*����Â"�i1 (5.1)

where Â8[(t , the parameterspace,and {^[4s , the samplespace.This givesthe

CDF of the mixture distribution ��w~��{¤� in termsof two moreelementarydistribu-

tions: the componentdistribution ����{��,Â"� andthe mixing distribution ����Â"� . The

85

Page 100: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

formerhasa singleunknown parameter ,1 while the lattergivesa CDF for  . For

example, ����{��,Â"� might be thenormaldistribution with mean  andunit variance,

where is a randomvariabledistributedaccordingto ����Â"� .The mixing distribution ����Â"� can be either continuousor discrete. In the latter

case,����Â"� is composedof a numberof masspoints,say,  � 1323232Á1, > with masses' � 1�23232�1~' > respectively, satisfying Þ >±³² � ' ± - 6 . Then(5.1)canbere-writtenas

��w~��{¤��- >° ±³² � ' ± ���{��, ± �i1 (5.2)

eachmasspoint providing a component,or cluster, in the mixture with the corre-

spondingweight, wherethesepointsarealsoknown asthe supportpointsof the

mixture. If thenumberof components# is finite andknown a priori , themixture

distributionis calledfinite; otherwiseit is treatedascountablyinfinite. Thequalifier

“countably” is necessaryto distinguishthiscasefrom thesituationwith continuous���Â�� , which is alsoinfinite.

Themaintopic in mixturemodels—whichis alsotheproblemthatwe will address

in this chapter—is theestimationof ����Â"� from sampleddatathatareindependent

andidentically distributedaccordingto the unknown distribution �xw~��{¤� . We will

focuson the estimationof an arbitrary mixing distribution, i.e., the true ����Â"� “is

treatedcompletelyunspecifiedas to whetherit is discrete,continuousor in any

particularfamily of distributions;this will becallednonparametricmixturemodel”

(Lindsay, 1995,p.8).

For tackling this problem, two main approachesexist in the literature(seeSec-

tion 5.2), themaximumlikelihoodandminimumdistanceapproaches, althoughthe

maximumlikelihoodapproachcanalsobe viewedasa specialminimum distance

approachthat usesthe Kullback–Leiblerdistance(Titteringtonet al., 1985). It is

generallybelievedthatthemaximumlikelihoodapproachshouldproduce(slightly)

betterestimationthroughsolvingcomplicatednonlinearequations,while the(usual)1In fact, � couldbea vector, but we only focuson thesimplest,univariatesituation,which is all

thatis neededfor paceregression.

86

Page 101: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

minimumdistanceapproach,whichonly requiressolvinglinearequationswith lin-

earconstraints,offers the advantageof computationalefficiency. This advantage

makestheminimumdistanceapproachto bethemaininterestin ourwork, because

fitting linearmodelscanbeembeddedin a largemodelingsystemandrequiredre-

peatedly.

Theminimum distanceapproachcanbe further categorizedbasedon thedistance

measureused,for example,CDF-based, pdf-based, etc.As weshow in moredetail

in Section5.5,existing minimumdistancemethodsall suffer from a seriousdraw-

backin finite-datasituations(theminority clusterproblem): smalloutlying groups

of datapointscanbecompletelyignoredin theclustersthatareproduced.To rectify

this,a new minimumdistancemethodusinga nonnegativemeasureandtheideaof

localfitting is proposed,whichsolvestheproblemwhile remainingascomputation-

ally efficient asotherminimum distancemethods.Beforeproposingthis method,

wegeneralizetheCDF-basedmethodandintroducetheprobability-measure-based

methodfor reasonsof completeness.Theoreticalresultsof strongconsistency of the

proposedestimatorswill alsobe established.Strongconsistencyheremeansthat,

almostsurely, the estimatorsequencesconverge weakly to any given ����Â"� asthe

samplesizeapproachesinfinity, i.e.,

� 53���t¸n¹� º¼» ���­��Â"��-9���Â"��1,Â�367��z���67"��¸�7" w¸¡�¢�z£è�"¸�7��T�6¤T���-�6�2 (5.3)

Theminority clusterphenomenonseemsto have beenoverlooked,presumablyfor

threereasons:small amountsof datamaybe assumedto representa small loss;a

few datapointscaneasilybe dismissedasoutliers; and in the limit the problem

evaporatesbecausethe estimatorsare strongly consistent. However, often these

reasonsareinappropriate:thelossfunctionmaybesensitiveto thedistancebetween

clusters;thesmallnumberof outlyingdatapointsmayactuallyrepresentsmall,but

important,clusters;andany practicalclusteringsituationwill necessarilyinvolve

finite data.Paceregressionis suchanexample(seeChapter3).

Note that the maximumlikelihoodapproach(reviewed in Subsection5.2.1)does

87

Page 102: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

not seemto have this problem. The overall likelihoodfunction is the productof

eachindividual likelihoodfunction,henceany singlepoint thatis not in the“neigh-

borhood”of the resultingclusterswill make the productzero,which is obviously

not themaximum.

Finally, it is worth mentioningthat althoughour applicationinvolvesestimating

the mixing distribution in an empirical modeling setting, there are many other

applications. Lindsay (1995) providesan extensive discussion,including known

componentdensities,the linear inverseproblem,randomeffectsmodels,repeated

measuresmodels,latentclassandlatenttrait models,missingcovariatesanddata,

randomcoefficient regressionmodels,empiricalandhierarchicalBayes,nuisance

parametermodels,measurementerror models,de-convolution problems,robust-

nessandcontaminationmodels,over-dispersionandheterogeneity, clustering,etc.

Many of thesesubjects“have [a] vastliteratureof their own.” Booksby Tittering-

ton et al. (1985),McLachlanandBasford(1988)alsoprovide extensive treatment

in considerabledepthaboutgeneralandspecifictopicsin mixturemodels.

In thischapterwefrequentlyusetheterm“clustering”for theestimationof amixing

distribution. In our context, both termshave a similar meaning,althoughin the

literaturethey haveslightly differentconnotations.

Thestructureof this chapteris asfollows. Existingestimationmethodsfor mixing

distributionsarebriefly reviewedin Section5.2. Section5.3 generalizestheCDF-

basedapproach,anddevelopsa generalproof of thestrongconsistency of themix-

ing distributionestimator. Thenthisproof is adaptedfor measure-basedestimators,

including the probability-measure-basedandnonnegative-measure-basedones,in

Section5.4. Section5.5describestheminority clusterproblemandillustrateswith

experimentshow the new methodovercomesit. Simulationstudieson predictive

accuracy of theseminimumdistanceproceduresin thecaseof overlappingclusters

aregivenin Section5.6.

88

Page 103: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

5.2 Existing estimationmethods

In this section,we briefly review theexisting methodsof thetwo mainapproaches

to theestimationof a mixing distribution: themaximumlikelihoodandminimum

distanceapproaches.Thesemethodswork generallyfor all typesof component

distribution.

TheBayesianapproachandits variationsarenot included,sincethis methodology

is usuallyusedto control thenumberof resultingsupportpoints,or to mitigatethe

effectof outliers,which in our context doesnot appearto bea problem.

5.2.1 Maximum lik elihoodmethods

The likelihood function for an independentsamplefrom the mixture distribution

hastheform xú�����k- �¥ ±È² � x ± �����i1 (5.4)

wherex ± ���� hastheintegralform Ì x ± ��Â"��£*����Â"� . Lindsay’s(1995)monographpro-

videsanexcellentcoverageof the theory, geometry, computationandapplications

of maximumlikelihood(ML) estimation;seealsoBohning(1995)andLindsayand

Lesperance(1995).

The ideaof finding thenonparametricmaximumlikelihoodestimator(NPMLE) of

a mixing distribution wasoriginatedby Robbins(1950)in anabstract.It waslater

substantiallydevelopedby Kiefer andWolfowitz (1956),who providedconditions

thatensureconsistency of theML estimator.

Algorithmic computationof the NPMLE, however, only begantwenty yearslater.

Laird (1978)provedthat the NPMLE is a stepfunctionwith no morethan E steps,

where E is the samplesize. Shealsosuggestedusing the EM algorithm(Demp-

steret al., 1977)to find the NPMLE. An implementationis providedby DerSimo-

89

Page 104: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

nian(1986,1990).SincetheEM algorithmoftenconvergesveryslowly, this is anot

popularmethodfor finding the NPMLE (althoughit hasapplicationsin estimating

finite mixtures).

Gradient-basedalgorithmsaremoreefficient thanEM-basedones.Well-known ex-

amplesincludethevertex directionmethod(VDM) andits variations(e.g.,Fedorov,

1972;Wu, 1978a,b;Bohning,1982;Lindsay, 1983), the vertex exchange method

(VEM) (Bohning, 1985,1986; Bohning et al., 1992), the intra-simplex direction

method(ISDM ) (LesperanceandKalbfleisch,1992),andthesemi-infiniteprogram-

mingmethod(SIP) (CoopeandWatson,1985;LesperanceandKalbfleisch,1992).

LesperanceandKalbfleisch(1992)point out that ISDM andSIP arestableandvery

fast—inoneof their experiments,bothneedonly 11 iterationsto converge,in con-

trastto 2177for VDM and143for VEM.

The computationof the NPMLE hasbeensignificantly improved within the last

twenty-five years.Nevertheless,dueto the essenceof nonlinearoptimization,the

computationalcostof all theseML methodsincreasesdramaticallyasthe number

of final supportpointsincreases,andin practicethis numbercanbeaslargeasthe

samplesize. Theminimumdistanceapproachis moreefficient, sinceoften it only

involvesa linearor quadraticmathematicalprogrammingproblem.

5.2.2 Minimum distancemethods

Theideaof theminimumdistancemethodis to definesomemeasureof thegood-

nessof theclusteringandoptimizethis by suitablechoiceof a mixing distribution���­��Â"� for a sampleof size E . We generallywant theestimatorto bestronglycon-

sistentas EÍ| } , in the sensedefinedin Section5.1, for an arbitrary mixing

distribution. We alsowantto take advantageof any specialstructureof mixturesto

comeup with anefficientalgorithmicsolution.

We begin with somenotation.Let { � 1�23232�1j{­� bea samplechosenaccordingto the

mixturedistribution, andsuppose(without lossof generality)that thesequenceis

90

Page 105: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

orderedso that { � ��{ � �zv v v}� {­� . Let ���­��Â"� be a discreteestimatorof the

underlyingmixing distribution with a setof supportpointsat WuÂ�� & �¢�-Ù6"1323232Á1�#�� _ .Each Â�� & providesa componentof thefinal clusteringwith weight '©� & ¾ / , whereÞ >j¦& ² � '©� & -Ý6 . Giventhesupportpoints,obtaining ���w��Â�� is equivalentto comput-

ing theweightvector §���- �L'©� � 1�'©� � 1�23232�1~'©� >j¦ � Ä . Denoteby ��w ¦ ��{¤� theestimated

mixtureCDFwith respectto ���­��Â"� .Two minimumdistanceestimatorswereproposedin thelate1960s.Choi andBul-

gren(1968)used 6E �° ±³² � ½Ç��w ¦ ��{ ± � p ª¢ yE À � (5.5)

as the distancemeasure. Minimizing this quantity with respectto ��� yields a

stronglyconsistentestimator. This estimatorcanbeslightly improveduponby us-

ing theCramer-von Misesstatistic6E �° ±³² � ½Ç��w ¦ ��{ ± � p ��ª p 6' "!"�j yE À � K 6' =��6'!yE � ��1 (5.6)

whichessentiallyreplacesªs yE in (5.5)with ��ª p �� �j yE withoutaffectingtheasymp-

totic result.As might beexpected,this reducesthebiasfor small-samplecases,as

wasdemonstratedempiricallybyMacdonald(1971)in anoteonChoiandBulgren’s

paper.

At aboutthesametime,DeelyandKruse(1968)usedthesup-normassociatedwith

theKolmogorov-Smirnov test.Theminimizationis over

¨  ©£�¢ª ± ª � W­l �xw ¦ ��{ ± � p ��ª p 6'�� yEkl51�l ��w ¦ ��{ ± � p ª¢ yEkl _ 1 (5.7)

andthis leadsto a linearprogrammingproblem.DeelyandKrusealsoestablished

thestrongconsistency of theirestimator��� .Ten yearslater, the above approachwasextendedby Blum andSusarla(1977)to

approximatedensityfunctionsby usingany function sequenceWyÏ'� _ that satisfies

91

Page 106: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

¨  ©£ZlÇÏ'� p ÏCwÕlk| / a.s.as E\| } , where Ï�w is the underlyingpdf of the mix-

ture. Each Ï'� canbe obtained,for example,by a kernel-baseddensityestimator.

Blum andSusarlaapproximatedthefunction Ï'� by theoverallmixturepdf Ï�w ¦ , and

establishedthestrongconsistency of theestimator��� underweakconditions.

For reasonof simplicity andgenerality, we will denotetheapproximationbetween

two mathematicalentitiesof thesametypeby ¶ , which implies theminimization

with respectto anestimatorof adistancemeasurebetweentheentitiesoneitherside.

Thetypesof entity involvedin this thesisincludevector, functionandmeasure,and

weusethesamesymbol ¶ for each.

In the work reviewed above, two kinds of estimatorare used: CDF-based(Choi

andBulgren,Macdonald,andDeelyandKruse)andpdf-based(Blum andSusarla).

CDF-basedestimatorsinvolve approximatinganempiricaldistribution with anes-

timatedone ��w ¦ . Wewrite thisas ��w ¦ ¶ ���=1 (5.8)

where��� is theKolmogorov empiricalCDF,���­��{¤��- 6E �° ±È² �¬«]­ ® J | » d ��{¤�i1 (5.9)

or indeedany empiricalCDF thatconvergesto it. Pdf-basedestimatorsinvolve the

approximationbetweenprobabilitydensityfunctions:ÏCw ¦ ¶ Ï �=1 (5.10)

whereÏCw ¦ is theestimatedmixturepdf and Ï � is theempiricalpdf describedabove.

Theentitiesin (5.8)and(5.10)arefunctions.Whentheapproximationis computed,

however, it is computedbetweenvectorsthat representthe functions. Thesevec-

torscontainthe functionvaluesat a particularsetof points,which we call “fitting

points.” In the work reviewed above, the fitting pointsarechosento be the data

92

Page 107: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

pointsthemselves.

5.3 Generalizing the CDF-basedapproach

In thissectionwegeneralizetheCDF-basedapproachusingtheform (5.8).A well-

definedCDF-basedestimatorneedsto specify(a) thesetof supportpoints,(b) the

setof fitting points,(c) the empiricalfunction,and(d) thedistancemeasure.The

following generalization,dueto theconsiderationsgiven in Subsection5.3.3,will

coverall of theabove four aspects.In Subsection5.3.1,conditionsfor establishing

the estimator’s strongconsistency are given. The proof of strongconsistency is

givenin Subsection5.3.2.

5.3.1 Conditions

Conditionsfor definingtheestimatorsandensuringstrongconsistency aregivenin

this section. They aredivided into two groups. The first includesthe continuity

condition (Condition5.1), the identifiability condition (Condition5.2), andCon-

dition 5.3. Their satisfactionneedsto be checked in practice. The secondgroup,

which canalwaysbe satisfied,is usedto preciselydefinethe estimator. They are

Condition5.4 for theselectionof (potential)supportpoints,Conditions5.5–5.6for

distancemeasure,Conditions5.7for empiricalfunction,andConditions5.8–5.9for

fitting points.

Condition 5.1 ���{��,�� is a continuousfunctionover s N\t .

Condition 5.2 Define v to betheclassof CDFson t . If ��w8-×��¯ for ��1, Ô[°v ,

then �×-9 .

Condition5.2 is known astheidentifiability condition.Theidentifiability problem

of mixturedistributionshasattractedmuchresearchattentionsincetheseminalpa-

93

Page 108: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

persby Teicher(1960,1961,1963). Referencesandsummariescanbe found in

PrakasaRao(1992).

Condition 5.3 Either t is a compactsubsetof ] , or �t¸n¹�y ºU{­»}| y~F Î ����{��,Â"� exist

for each {b[±s andarenot distribution functionson s .

Conditions5.1–5.3wereinitially consideredby Robbins(1964),andareessential

to establishthestrongconsistency of theestimatorof themixing distributiongiven

theuniformconvergenceof themixtureestimator.

Condition 5.4 Define vè� to be the classof discretedistributionson t with sup-

port at Â�� � 1�23232�1�Â�� & ¦ , where the WuÂ�� & _ are chosenso that for any � [rv there is a

sequenceWu� q� _ with � q� [�v¤� that convergesweaklyto G a.s.

Condition5.4concernstheselectionof supportpoints.In fact,aweaklyconvergent

sequenceWu� q� _ , insteadof thealmostsurelyweaklyconvergentsequence,is always

obtainable.This is becauseif t is compact,a weaklyconvergentsequenceWu� q� _for any � canbe obtainedby, say, equallyspacing�� & throughoutt ; if t is not

compact,somekind of mappingof the equallyspacedpoints in a compactspace

onto t canprovidesuchasequence.

Despitethis fact, we adopta weaker conditionherethat only requiresa sequence

which,almostsurely, convergesweaklyto any given � . We will discover that this

relaxationdoesnotchangetheconclusionaboutthestrongconsistency of theresult-

ing estimator. Theadvantageof usinga weaker conditionis that thesetof support

points Wu�� & _ canbeadaptedto thegivensample,oftenresultingin a data-oriented

andhenceprobablysmallersetof supportpoints,which implies lesscomputation

andpossiblyhigheraccuracy. For example,eachdatapoint could provide a sup-

port point, suggestedby the unicomponentmaximumlikelihoodestimators.This

setalwayscontainsasequenceWu� q� _ whichconvergesweaklyto any given � a.s.

Let £¤�L²k1´³¬� bea distancemeasurebetweentwo # -vectors² and ³ . This distanceis

notnecessarilyametric;thatis, thetriangleinequalitymaynotbesatisfied.Denote

94

Page 109: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

the x¶µ -norm by ltlwvTltl µ . Conditions5.5 and5.6 provide a wide classof distance

measures.

Condition 5.5 Either

(i) £è�L²k1~³¿�~-ãltl·² p ³últl »or

(ii) if �n¸n¹ >,º¼» �> lnl ² p ³¼lnl �¹¸-H/ , then �n¸n¹ >�ºú» £è�L²k1~³¿� ¸-H/ .Condition 5.6 £è��²�1~³¿���H����ltl·² p ³¼lnl » � , where ����º"� is a univariate, non-decreasing

functionand �n¸t¹z» º �è���Lº��.-0����/���-0/ .Examplesof distancesbetweentwo # -vectorssatisfyingconditions5.5and5.6arelnl ² p ³¼lnl µ' �¼� # ( /½�°�¾�^} ), lnl ² p ³¼lnl µµ "# ( /¿�±�À�í} ), and ltl·² p ³¼ltl » .

Let ��� be any empirical function obtainedfrom observations(not necessarilya

CDF) satisfying

Condition 5.7 �t¸n¹)� º¼» lnl ��w p ���¤ltl » -0/ a.s.

Accordingto the Glivenko-Cantellitheorem,the Kolmogorov empiricalCDF sat-

isfies this condition. Here «-Á ��{¤� is the indicator function: «ÂÁ ��{¤�f- 6 if {à[Äà ,

and0 if otherwise.Otherempiricalfunctionswhich (a.s.)uniformally convergeto

theKolmogorov empiricalCDF as E | } satisfythis conditionaswell. Empir-

ical functionsotherthantheKolmogorov empiricalCDF resultin thesamestrong

consistency conclusion,but canprovide flexibility for betterestimationin the fi-

nite datasituation.For example,anempiricalCDF obtainedfrom theCramer-von

Misesstatisticcanbe easilyconstructed,which is differentfrom the Kolmogorov

empiricalCDF but uniformallyconvergentto it.

Denotethechosensetof fitting pointson s by a vector Å=�Z-m� % � � 1 % � � 1323232Á1 % � ± ¦ �¢Ä .Thenumberof points ª�� needsto satisfy

Condition 5.8 �t¸n¹)� º¼» ª���-0} .

95

Page 110: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Let Æ be the setof all Borel subsetson s . Denoteby ª,�LÃ�� the numberof fitting

pointsthatarelocatedwithin thesubsetÃ�[ÇÆ , andby ��w~� Ã=� theprobabilitymea-

suredeterminedby ��w over à .

Condition 5.9 There existsa positiveconstant� such that, for any Ã�[\Æ ,�t¸n¹� º¼» ª,� Ã��ª�� ¾ � ��w¬�LÃ�� % 2;õ"2 (5.11)

Conditions5.8–5.9aresatisfiedwhenthe setof fitting points is an exact copy of

the set of datapoints,as in the minimum distancemethodsreviewed in Subsec-

tion 5.2.2. Using a larger setof fitting pointswhenthe numberof datapoints is

small makes the resultedestimatormoreaccuratewithin tolerablecomputational

cost.Usingfewer fitting pointswhentherearemany datapointsovera small inter-

val decreasesthecomputationalburdenwithout sacrificingmuchaccuracy. In the

finite samplesituation,Condition5.9 givesgreatfreedomin the choiceof fitting

points. For example,more points canbe placedaroundspecificareasfor better

estimation.All thesecanbedonewithout sacrificingstrongconsistency.

5.3.2 Estimators

For any discretedistribution ���­��Â"�~- Þ & ¦& ² � '©� & « ­ y ¦ � | » d ��Â�� on t , thecorresponding

mixturedistribution is�xw ¦ ��{¤�k- Ê ����{��,Â"�=Ë­���w��Â"��- & ¦° & ² � '©� & ����{��,Â�� & �i2 (5.12)

Let �� beavectorof elementsfrom s ; for example,thesetof fitting points.Denote

thevectorsof thefunctionvalues�xw ¦ ��{¤� and ���­��{¤� correspondingto theelements

of Å=� by ��w ¦ �LÅ=�"� and ���w�LÅ���� . The distancebetweentwo vectors ��w ¦ �LÅ=��� and���­� Å=��� is written Û �w�������~-9£è����w ¦ �LÅ=���i1,���=�LÅ=�����i2 (5.13)

96

Page 111: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Theestimator��� is definedastheoneminimizingÛ �­�����"� . Wehave

Theorem 5.1 UnderConditions5.1–5.9,�t¸n¹)� º¼» ����|rqÒ� a.s.

Proof. Theproof of thetheoremconsistsof threesteps.

(1). �n¸n¹�� º¼» Û �­�����"�~-9/ a.s.

Clearly, for any (discreteor continuous)distribution on t , the distribution ��¯on s is pointwisecontinuousunderCondition5.1. � q� |rq�� a.s.(Condition5.3)

implies �xw©È¦ | ��w pointwisea.s.accordingto theHelly-Braytheorem.Thisfurther

impliesuniformconvergence(Polya,1920),thatis, �n¸t¹)� º¼» lnl ��w Ȧ p �xwÕlnl » -0/ a.s.

Because�n¸n¹�� º¼» lnl �xw p ���èlnl » -9/ a.s.by Condition5.7,andlnl ��w Ȧ p ���¤ltl » �Ùlnl �xw Ȧ p ��wÕltl » K ltl ��w p ���¤ltl » 1 (5.14)

then �t¸n¹� º¼» ltl �xw Ȧ p ���èlnl » -H/ a.s.SinceÛ �w��� q� �;�H���,lnl ��w Ȧ � Å=��� p ���w�LÅ=���'ltl » �������ltl ��w©È¦ p ���èlnl » � , this meansthat �t¸n¹� º¼» Û �w��� q� �þ- / a.s.by Condition 5.6.

Becausetheestimator��� is obtainedby a minimizationprocedureoverÛ �w��vÈ� , for

every E ¾ 6 Û �­�����"��� Û �­��� q� �i2 (5.15)

Therefore,�n¸n¹)� º¼» Û �w��������-9/ a.s.

(2). �n¸n¹�� º¼» lnl �xw ¦ p �xw.lnl » -9/ a.s.

Let be an arbitrary elementin the set Wu���9Y�Eã| } 367wËÝ�n¸n¹)� º¼» lnl �xw ¦ p��wklnl » ¸-0/ _ , i.e. ltl ��¯ p ��w.ltl » ¸-9/ . Then É�{�� , ��¯ ��{��i� ¸-H��w~��{��i� .Because��¯ and ��w are both continuousand non-decreasing,and ��¯ � p }^�ñ-��w¬� p }^�Z-É/ and ��¯ ��}^�Z-ô��w~��}^�Z-U6 , theremustexist an interval Ã�� in the

rangeof ��w near��w���{���� suchthatthelength Ê�� of Ã�� is positiveand ¸87E¤¿l ��¯ ��{¤� p��w¬��{¤�'l ¾ L Ñ^/ for all {�[J� � �w �LÃÁ�i� , where� � �w �LÃ��i� denotestheinversemappingof��w over Ã�� . (Notethat � � �w � Ã���� maynot beunique,but theconclusionremainsthe

97

Page 112: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

same.)Becauselnl �xw p ���¤lnl » | / a.s.,when E is sufficiently large, ltl ��w p ���¤ltl » can

bemadearbitrarily smalla.s.Therefore,within � � �w � Ã���� , ¸87E¤¿l ��¯ ��{¤� p ���w��{¤�'l ¾ La.s. as E | } . By Condition5.9, �n¸t¹)� º¼» ª,��� � �w � Ã����j�� yª�� ¾ � �xw~��� � �w � Ã��i��� ¾� Ê�� a.s.By Conditions5.5, either

Û �­���˯.� ¾ L a.s.if £¤�L²k1´³¬��-zltl·² p ³¼ltl » , or�t¸n¹� ºú» �± ¦ Þ ± ¦±³² � l ��¯ �LÅ���� p ���w�LÅ=���'l ¾ � Ê�� a.s.if £è�L²k1~³¿� is definedotherwise.In

eithersituation,Û �w�� b� ¸-9/ a.s.

Fromthelastparagraph,thearbitrarinessof andthecontinuityof ��¯ in ametric

spacewith respectto , wehave

� 5'�­�n¸n¹� º¼» Û �w�����"� ¸-H/¤lR�t¸n¹� º¼» lnl ��w ¦ p ��w.ltl » ¸-0/���-�6�2 (5.16)

By step(1), immediately, ltl �xw ¦ p ��wklnl » | / a.s.

(3). ����|rq� a.s.

Robbins(1964)proved that lnl ��w ¦ p ��wklnl » | / a.s.implies that ���æ|rqg� a.s.

underConditions5.1–5.3. AClearly, thegeneralizedestimatorobtainedby minimizing (5.13)coversChoi and

Bulgren(1968)andCramer-vonMisesstatistic(Macdonald,1971).It canbefurther

generalizedas follows to cover Deely andKruse (1968),who usetwo empirical

CDFs.

Let Wu��� ± 1�ªk-ã6"132�232�1�¥ _ bea setof empiricalCDFs,whereeach��� ± satisfiesCon-

dition 5.7.Denote Û � ± �������~-9£è����w ¦ �LÅ=���i1,��� ± �LÅ=������2 (5.17)

Theorem 5.2 Let ��� be the estimatorby minimizing ¹c3]d �¢ª ± ª ) Û � ± ������� . Then

underConditions5.1–5.9,����|rq�� a.s.

Theproof is similar to thatof Theorem5.1andthusomitted.

98

Page 113: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Theproof of Theorem5.1 is inspiredby thework of Choi andBulgren(1968)and

DeelyandKruse(1968). Themaindifferencein our proof is thesecondstep.We

claim this proof is moregeneral:it coverstheir results,while their proof doesnot

cover ours. A similar proof will be usedin the next sectionfor measure-based

estimators.

5.3.3 Remarks

In theabove,we have generalizedtheestimationof a mixing distribution basedon

approximatingmixture CDFs. As will be shown in Section5.5, the CDF-based

approachdoesnotsolve theminority clusterproblem.However, thisgeneralization

is includedbecauseof thefollowing considerations:

1. It is of theoreticinterestto show that thereactually exists a large classof

minimumdistanceestimators,not justa few individualones,thatcanprovide

stronglyconsistentestimatesof anarbitrarymixing distribution.

2. We would like to formalizethedefinition of theestimatorsof a mixing dis-

tribution, includingtheselectionof supportpointsandfitting points.For ex-

ample,ChoiandBulgren(1968)did notdiscusshow thesetof supportpointsWu�� & _ shouldbe chosenfor their estimator. Clearly, the consistency of the

estimatorrelieson how this setis chosen.

3. A moregeneralclassof estimatorsmayprovideflexibility andadaptabilityin

practicewithout losingstrongconsistency. Condition5.4is suchanexample.

4. TheCDF-basedapproachis a specialandsimplercaseof themeasure-based

approach.It providesastartingpoint for themorecomplicatedsituation.

99

Page 114: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

5.4 The measure-basedapproach

This sectionextendstheCDF-basedapproachto themeasure-basedone. Theidea

is to approximatetheempiricalmeasurewith theestimatedmeasureover intervals,

which we call fitting intervals. The two measuresarerepresentedby two vectors

thatcontainvaluesthat themeasurestake over thefitting intervals. Thendistance

in thevectorspaceis minimizedwith respectto thecandidatemixing distributions.

Two approachesare consideredfurther: the probability-measure-based(or PM-

based)approachandthenonnegative-measure-based(NNM-based)approach.For

theformer, theempiricalmeasure��� is approximatedby theestimatedprobability

measure��w ¦ , whichcanbewritten ��w ¦ ¶ ���=1 (5.18)

where ��� is a discreteCDF on t . For the latter, we abandonthe normalization

constraintfor ��� sothat theestimateis only a nonnegativemeasure,thusnot nec-

essarilyaprobabilitymeasure.Wewrite this as��wE̦ ¶ ���=1 (5.19)

where� �� is adiscretefunctionwith nonnegativemassatsupportpoints. � �� canbe

normalizedafterwardsto becomeadistribution function,if necessary.

As with theCDF-basedestimators,weneedto specify(a) thesetof supportpoints,

(b) thesetof fitting intervals,(c) theempiricalmeasure,and(d) thedistancemea-

sure. Only (b) and(c) differ from CDF-basedestimation.They aregivenin Sub-

section5.4.1 for the PM-basedapproach,andmodified in Subsection5.4.3 to fit

theNNM-basedapproach.Strongconsistency resultsfor bothestimatorsareestab-

lishedin Subsections5.4.2and5.4.3.

100

Page 115: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

5.4.1 Conditions

For the PM-basedapproach,we needonly specifyconditionsfor fitting intervals

andtheempiricalmeasure—theotherconditionsremainidenticalto thosein Sub-

section5.3.1. We first discusshow to choosethe empirical measure,then give

conditions(Conditions5.11–5.14)for selectingfitting intervals.

Denotethe valueof a measure� over an interval à by ��L�� . Let ��� be an em-

pirical measure(i.e., a measureobtainedfrom observations,but not necessarilya

probability measure).Definethe correspondingnondecreasingempirical function��� (hencenot necessarilyaCDF) as���w��{¤��-9���­�j� p }91�{ À �i2 (5.20)

Clearly ��� and ��� areuniquelydeterminedby eachotheron Borel sets.We have

thefollowing conditionfor ��� .Condition 5.10 The ��� correspondingto ��� satisfiesCondition5.7.

If ��� is theKolmogorov empiricalCDF asdefinedin (5.9), thecorresponding���overasubsetÃ�[�Æ is ���w�LÃ��~- 6E �° ±³² � «-Á ��{ ± �i2 (5.21)

An immediateresultfollowing Condition5.10is �n¸t¹� º¼» lnl ��w¬� Ã�� p ���=�LÃ�� lnl » - /a.s.for all Ã�[\Æ .

We determinethe set of fitting intervals by the set of the right endpointsand a

function Í���{¤� which,givena right endpoint,generatesa left endpointfor thefitting

interval. Giventhesetof right endpointsW % � � 1323232Á1 % � ± ¦ _ , thesetof fitting intervals

is determinedas ÎR�J- W]Ãi� � 1�23232�1~ÃÁ� ± ¦ _ - W��LÍ�� % � � ��1 % � � ��1323232Á1'��Í�� % � ± ¦ �i1 % � ± ¦ � _ . Set-

ting eachinterval to be open,closedor semi-openwill not changethe asymptotic

conclusion,becauseof thecontinuitycondition(Condition5.1).

101

Page 116: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

In orderto establishstrongconsistency, we require Í���{¤� to satisfythefollowing.

Condition 5.11 For someconstant« ® Ñ^/ , Ïè{�[±s , { p Í���{¤� ¾ « ® , exceptfor those{ ’s such that Í���{¤� [�s ; for theexceptions,let Íj��{¤��-H¸�7Ф:s .

An exampleof sucha functionis Í���{¤�k-ß{ p % for { p % [°s , and Íj��{¤�Õ-9¸87E¤Ñs if

otherwise.Further, define

Í > ��{¤�k- <= > {�1 if #�-9/ ;Í���Í > � � ��{¤�j�i1 if #�Ñ^/ . (5.22)

Thefollowing conditionsgoverntheselectionof fitting intervals;in particular, they

determinethenumberandlocationsof right endpoints.

Condition 5.12 �t¸n¹� º¼» ª���-ß} .

We useanextra CDF � � ��{¤� to determinethedistribution of theright endpointsof

fitting intervals.If possible,wealwaysreplaceit with �xw~��{¤� sothatfitting intervals

canbe determineddynamicallybasedon the sample;otherwise,someadditional

techniquesareneeded.Wewill discussthis issuein Subsection5.4.4.

Condition 5.13 � � ��{¤� is a strictly increasingCDF throughouts .

Denoteby � � �LÃ�� its correspondingprobabilitymeasureoverasubsetÃ�[\Æ . Clearly� � � Ã�� ¸-ã/ unlesstheLebesguemeasureof à is zerofor every Ãñ[ÒÆ . Denotebyª��LÃ�� thenumberof pointsin theintersectionbetweenthesetof theright endpointsW % � ± ��ª�- 6"1�23232i1�ª�� _ andasubsetà on s .

Condition 5.14 There existsa constant� Ñ / such that, for all Ã�[\Æ ,�t¸n¹� º¼» ª��LÃ��ª�� ¾ � � � � Ã�� % 25õ"2 (5.23)

102

Page 117: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

If ����{��,Â"� is strictly increasingthroughout s for every Âí[Ót , ��w¬��{¤� is strictly

increasingthroughouts aswell. In this situationwe canuse ��w���{¤� to substitute

for � � ��{¤� . Using ��w¬��{¤� insteadof thevague� � ��{¤� to guidetheselectionof fitting

intervalssimplifiestheselectionandmakesit data-oriented.For example,set ª��Z-E , and % � ± -ä{ ± , ª�-�6�1323232i1�E . Thenwe have �n¸t¹)� º¼» ª,� Ã=�j yE^-ó�xw~�LÃ�� a.s.,thus

satisfyingCondition5.14,where � is a valuesatisfying/¿� � �\6 .5.4.2 Approximation with probability measures��LÃ�� denotesthevalueof a probabilitymeasure� over Ãb[ÔÆ . Further, let ��ÕÎq���denotethe vectorof the valuesof the probability measureover the set of fitting

intervals.Our taskis to find anestimator���Z[�v¤� (definedin Condition5.4)of the

mixing distributionCDFwhich minimizesÛ �­�����"��-0£è���xw ¦ ��ÎR�"�i1,���w��ÎR�"���i2 (5.24)

Theproof of thefollowing theoremfollows thatof Theorem5.1. Thedifferenceis

thatherewe dealwith probability measureandthe setof fitting intervalsof (pos-

sibly) finite lengthasdefinedby Conditions5.11–5.14,while Theorem5.1 tackles

CDFs,which useintervalsall startingfrom p } .

Theorem 5.3 UnderConditions5.1–5.6and5.10–5.14,�n¸n¹�� º¼» ����|rq�� a.s.

Theproofof this theoremneedstwo lemmas.

Lemma 2 UnderCondition5.1and5.11,if ltl �x¯ p ��w.ltl » ¸-0/ for f1,�Ý[\v , there

existsa point {�� such that �x¯+�LÍj��{��i�i1�{��i� ¸-H��w~�LÍj��{��i�i1�{��i� .Proof of Lemma2. Assume��¯ ��Í���{¤��1�{¤�©-×��w~�LÍ���{¤�i1j{¤� for every {æ[Çs . �x¯+��{¤�©-Þ »> ² � �x¯���Í > ��{¤��1�Í > � � ��{¤�j� and ��w���{¤�+- Þ »> ² � �xw~��Í > ��{¤�i1�Í > � � ��{¤��� for every { [Òs(notethatsince��¯ and ��w arecontinuous,theprobabilitymeasureoveracountable

103

Page 118: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

setof singletonsis zero). Thus ��¯ ��{¤�)-É��w~��{¤� for all {0[Ös . This contradictsltl ��¯ p ��w.ltl » ¸-9/ , thuscompletingtheproof. ALemma 3 If there is a point {���[�s such that ��¯ ��Í���{��i��1�{���� ¸-à��w~�LÍ���{��i�i1j{��i� for ¦1,�Ý[�v , thengivenconditions5.1,5.5,5.10-5.14,�n¸n¹�� º¼» Û �­�� J� ¸-9/ a.s.

Proof of Lemma3. Since �x¯���{¤� and ��w¬��{¤� arebothcontinuous,�x¯+�LÍj��{¤�i1�{¤� and�xw~��Í���{¤��1�{¤� arecontinuousfunctionsof { . Theremustbean interval à of nonzero

lengthnear {�� suchthat for every point { [Óà , ��¯��LÍ���{¤�i1j{¤� p ��w~�LÍj��{¤�i1�{¤� ¸-â/ .Accordingto therequirementfor � � ��{¤� , we canassume� � �LÃ��.- L~× ÑH/ . By Con-

ditions5.12and5.14,no fewer than ª�� � L~× pointsarechosenastheright endpoints

of fitting intervalsfrom à a.s.as Ef| } . Sincelnl ��¯ú�ÕÎR�"� p ���w�ÕÎq��� lnl » ¾ lnl ��¯ú�ÕÎq��� p �xw~�ÕÎR�"�'ltl » p lnl ��w¿�ÕÎR�"� p ���w�ÕÎR�"�'ltl » (5.25)

and �n¸t¹)� º¼» lnl ��w~��ÎR��� p ���­��ÎR��� lnl » -H/ , wehave �t¸n¹� º¼» ltl �x¯+��ÎR��� p ���w��ÎR�*� lnl » ¾�t¸n¹� ºú» ltl ��¯ú��ÎR��� p ��wk��ÎR��� lnl » Ñ / a.s.Also, �t¸n¹)� º¼» �± ¦ lnl ��¯ú�ÕÎR��� p ������ÎR��� lnl � Ñí/a.s.as Ef| } . This implies �t¸n¹)� ºú» Û �w�� b� ¸-9/ a.s.by Condition5.5. AProofof Theorem5.3. Theproof of thetheoremconsistsof threesteps.

(1). �n¸t¹� º¼» Û �­�����"�~-9/ a.s.

Following the first stepin the proof of Theorem5.1, we obtain �n¸t¹)� º¼» lnl ��w Ȧ p���¤ltl » -â/ a.s. Immediately, �n¸n¹)� º¼» lnl �xw©È¦ p ���èlnl » -U/ a.s.for all Borel sets

on s . SinceÛ �w��� q� �Ô� ���,lnl ��w©È¦ ��ÎR�"� p ���w�ÕÎR�"�'ltl » �Ç� ����ltl �xw©È¦ p ���èlnl » � , this

meansthat �n¸t¹� º¼» Û �­��� q� ��-9/ a.s.by Condition5.6.Becausetheestimator��� is

obtainedby aminimizationprocedureofÛ �w��v5� , for every E ¾ 6Û �­�������;� Û �w��� q� ��2 (5.26)

Hence�t¸n¹� º¼» Û �w�����"�k-H/ a.s.

(2). �n¸t¹� º¼» lnl ��w ¦ p �xw.lnl » -H/ a.s.

104

Page 119: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Let beanarbitraryelementin theset Wu���(��Ef| } Y"�n¸t¹)� º¼» lnl �xw ¦ p �xwÕlnl » ¸-0/ _ ,i.e. lnl ��¯ p ��wÕltl » ¸- / . According to Lemmas2 and 3, �n¸t¹)� º¼» Û �­�� J� ¸- /a.s. Therefore,dueto the arbitrarinessof , the continuity of ��¯ , andstep(1),lnl ��w ¦ p ��w.ltl » | / a.s.,as Ef| } .

(3). ���û|rq � a.s.

Theproof is thesameasthethird stepin theproof of Theorem5.1. AIn Subsection5.4.1,eachfitting interval à ± is uniquelydeterminedby theright end-

point % � ± —theleft endpointis givenby a function Í�� % � ± � . If we insteaddetermine

eachfitting interval in thereverseorder—i.e., giventhe left endpoint,computethe

right endpoint—thefollowing theoremis obtained.The function Í���{¤� for comput-

ing theright endpointcanbedefinedanalogouslyto Subsection5.4.1.Theproof is

similar andomitted.

Theorem 5.4 Let ��� betheestimatordescribedin thelastparagraph.Thenunder

thesameconditionsasin Theorem5.3, �n¸n¹�� º¼» ����|rqÒ� a.s.

Further, let theratio betweenthenumberof intervalsin asubsetof thesetof fitting

intervalsand ª�� begreaterthana positive constantas Eþ| } , andthe intervalsin

this subsetareeitherall in theform �LÍ�� % � ± ��1 % � ± � or all in theform � % � ± 1�Í�� % � ± ��� and

satisfyConditions5.13and5.14. We have the following theorem,whoseproof is

similar andomittedtoo.

Theorem 5.5 Let ��� betheestimatordescribedin thelastparagraph.Thenunder

thesameconditionsasin Theorem5.3, �n¸n¹�� º¼» ����|rqÒ� a.s.

Theorem5.5providesflexibility for practicalapplicationswith finite samples.This

issuewill bediscussedin Subsection5.4.4.

In fact, � � ��{¤� canalwaysbereplacedwith ��w���{¤� in Condition5.9 for establishing

thetheoremsin thissubsection,evenif ���{��,�� is not strictly increasingthroughout

105

Page 120: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

s . This is becausethe normalizationconstraintof the estimator��� ensuresthat�x¯ in Lemma3 is a distribution function on s , thus ��¯+� p }^�-ó��w~� p }^��-ó/and ��¯���}^�~-0�xw���}^�~-�6 . This impliesthatif atany point {�� , ��¯+��{��i� ¸-9�xw~��{��i� ,theremustexist an interval à near {�� , satisfying �xw~�LÃ�� ¸-�/ , suchthat �x¯+��{¤� p�xw~��{¤� ¸-0/ , Ïè{J[\à , which canbeusedto establishLemma3.

We will seethat theestimatorinvestigatedin thenext subsectioncouldnot always

replace� � ��{¤� with �xw���{¤� .Barbe(1998)investigatedthepropertiesof mixing distributionestimators,obtained

by approximatingthe empiricalprobability measurewith the estimatedprobabil-

ity measure.He establishedtheoreticalresultsbasedon certainpropertiesof the

estimator. However, hedid not developanoperationalestimationscheme.Thees-

timatorsdiscussedin this subsectionalsoform a wider classthantheoneheused.

For example,��� maynotbeaprobabilitymeasure.

5.4.3 Approximation with nonnegativemeasures

In the lastsubsection,we discussedtheapproximationof empiricalmeasureswith

estimatedprobabilitymeasures.Now weconsiderthrowing awaythenormalization

constraint.As in (5.19), � �� is adiscretefunctionwith nonnegativemassat support

points. Let � �� ��Â�� - Þ & ¦& ² � ' �� & « a � »Ø| y ¦ �Ù ��Â�� with every ' �� & ¾ / . The mixture

functionbecomes��w ̦ ��{¤��- Ê ���{��,Â�� d� �� ��Â���- & ¦° & ² � ' �� & ���{��,Â�� & ��2 (5.27)

Note that the '©� & ’s arenot assumedto sumto one,henceneither � �� nor �xw ̦ are

CDFs. Accordingly, the correspondingnonnegative measureis not a probability

measure.Denotethis measureby ��w ̦ �LÃ�� for any à [rÆ . Theestimator� �� is the

onethatminimizes Û �w��� �� ��-0£è����wE̦ �ÕÎR���i1,���=�ÕÎR�"����2 (5.28)

106

Page 121: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Of course,� �� canbenormalizedafterwards.

Wewill establishstrongconsistency of � �� . Basedonthework in thelastsubsection,

the result is almoststraightforward. We needto usethegeneralmeaningof weak

convergencefor functions,insteadof CDFs. In Condition5.2, v is changedto v � ,the classof finite, nonnegative, nondecreasingfunctionsdefinedon s suchthat� � � p }^�-ó/ for every � � [4v � . Also, in Condition5.4, vè� is replacedwith v �� .Theidentifiability is now aboutgeneralfunctions.In fact,wehave

Lemma 4 Theidentifiability of mixture ��w where �Ù[uv impliestheidentifiability

of mixture �xw Ì where � � [�v � .Proof. Assume �xw Ì - �x¯ Ì , where � � 1, � [Wv � . Then �xw Ì ��}^� - �x¯ Ì ��}^� ,i.e., Ì'Î � � £*Âb- Ì'Î � £* . Let � - � � Ì'Î � � £* and -ó � Ì Î � £* . Hence��wb-ß��¯ . Sinceboth � and areCDFson t , �Ý-ß , or equivalently, � � -ß � ,thuscompletingtheproof. AAlso, weneeda lemmawhich is similar to Theorem2 in Robbins(1964).

Lemma 5 UnderConditions5.1-5.3with changesdescribedin theparagraphbe-

foreLemma4, if �n¸t¹� º¼» lnl ��wE̦ p ��w.ltl » -0/ a.s.,then � �� |rqÒ� a.s.

The proof is intactly the sameas that of Theorem2 in Robbins(1964)and thus

omitted.

Lemma 6 Let ��� bea CDF obtainedby normalizing � �� and �n¸n¹)� º¼» � �� |rqþ�a.s.where � [uv . Then�n¸n¹�� º¼» ���û|rq � a.s.

Let � �� bethesolutionobtainedby minimizingÛ �w��� �� � in (5.28).We have

Theorem 5.6 Undersimilar conditionswith changesmentionedabove, theconclu-

sionsof Theorems5.3-5.5remaintrue if ��� is replacedwith � �� (or theestimator

obtainedbynormalizing � �� ).107

Page 122: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

Theproofof Theorem5.6followsexactlythatof Theorem5.3.For step3,Lemma5

shouldbeusedinstead.

It wasshown in the lastsubsectionthat ��w���{¤� canalwaysbe replacedwith � � ��{¤�in Condition5.9 to establishstrongconsistency for ��� obtainedfrom (5.18). This

is, however, notsofor � �� obtainedfrom (5.19).For example,if ���{��,�� is bounded

on s (e.g., uniform distribution), then ��w~��{¤� may have intervals with zero ��w -

measure.Mixing distributions,with or without componentsboundedwithin zero�xw -measureintervals,have thesameÛ �­�¨vÈ� valueas Ef| } .

5.4.4 Considerationsfor finite samples

Fromtheprevioussubsections,it becomesclearthatthereis alargeclassof strongly

consistentestimators.Thebehavior of theseestimators,however, differs radically

in thefinite samplesituation.In thediscussionof a few finite sampleissuesthatfol-

lows,we will seethat,without carefuldesign,someestimatorscanperformbadly.

We take an algorithmicviewpoint andensurethat the estimationis workable,fast

andaccuratein practice.Theconditionscarefullydefinedin Subsections5.3.1and

5.4.1providesuchflexibility andadaptability.

Of all minimumdistanceestimatorsproposedsofar, only thenonnegative-measure-

basedone is able to solve the minority clusterproblem(seeSection5.5), in the

sensethat it canalwaysreliably detectsmall clusterswhentheir locationsarefar

away from dominantdatapoints. The experimentsconductedin Sections5.5 and

5.6alsoseemto suggestthattheseestimatorsaregenerallymoreaccurateandstable

thanotherminimumdistanceestimators.We focuson theNNM-basedestimator.

The conditionsin Subsection5.3.1show that, without losing strongconsistency,

thesetof supportpointsandthesetof fitting intervalscanbe determinedwithout

taking the datapoints into account. Nevertheless,unlesssufficient prior knowl-

edgeis providedaboutthesupportpointsandthedistribution of datapoints,such

an independentprocedureis unlikely to provide satisfactoryresults.An algorithm

108

Page 123: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

thatdeterminesthe two setsbeforeseeingthedatawould not work for many gen-

eralcases,becausetheunderlyingmixing distribution changesfrom applicationto

application.Thesampleneedsto beconsideredwhendeterminingthetwo sets.

Thesupportpointscouldbesuggestedby thesampleandthecomponentdistribu-

tion. As mentionedearlier, eachdatapoint could generatea supportpoint using

theunicomponentmaximumlikelihoodestimator. For example,for themixtureof

normaldistributionswherethe meanis the mixing parameter, datapointscanbe

directly takenassupportpoints.

Eachdatapoint canbetakenasanendpointof a fitting interval, while thefunctionÍ���{¤� canbedecidedbasedon thecomponentdistribution. In theabove exampleof

a mixtureof normaldistributions,we canset,say, Í���{¤�Ð-\{ p NyQ . Notethatfitting

intervalsshouldnot be chosenso large thatdatapointsmay exert an influenceon

the approximationat a remoteplace,or so small that the empirical (probability)

measuresarenotaccurateenough.

Thenumberof supportpointsandthenumberof fitting intervalscanbeadjustedac-

cordingto Conditions5.4,5.12–5.14.For example,whentherearefew datapoints,

moresupportpointsandfitting intervalscanbeusedto getamoreaccurateestimate

within a tolerablecomputationalcost;whentherearea largenumberof datapoints

in asmallregion, fewersupportpointsandfitting intervalscanbeused,whichsub-

stantiallydecreasescomputationalcostwhile keepingtheestimateaccurate.

Basingtheestimateon solving(5.19)hascomputationaladvantagesin somecases.

Thesolutionof anequationlike thefollowing,ÚÛÖÜ � w Ì ¦ ÝÝ Ü � �w Ì ¦Þß ÚÛ §gà᧠à àá

Þß ¶ ÚÛ4â àáâ à·àáÞßäã

(5.29)

subjectto å àá�æèç and å à àáÒæèç , is the sameascombiningthe solutionsof two

subequationsé à êEëì å àáBí â àá subjectto å àá æ ç , and é à àêEëì å à àá(í â à·àá subjecttoå à àá æîç . Eachsub-equationcanbe further partitioned.This implies that clusters

separatingfrom oneanothercanbeestimatedseparately. Further, normalizingthe

109

Page 124: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

separatedclusters’weight vectorsbeforerecombiningthemcanproducea better

estimate.The computationalburdendecreasessignificantlyasthe extent of sepa-

ration increases.Thebestcaseis thateachdatapoint is isolatedfrom every other

one,suggestingthat eachpoint representsa clustercomprisinga singlepoint. In

this case,thecomputationalcomplexity is only linear.

Wewishto preservestrongconsistency, anddeterminethesupportpointsandfitting

intervalsfrom thedata.Underourconditions,however, anextraCDF ïxð�� ñmò , which

shouldbestrictly increasing,is neededto determinefitting intervals. It canbe re-

placedwith ï ê � ñmò when ï� ñ�ó�ô ò is strictly increasing,makingit data-oriented.But

when ï��fñ�ó�ô ò is notstrictly increasing,ï ê � ñmò cannotsubstitutefor ïxðy� ñmò . Sincethe

selectionof fitting intervalsshouldbettermakeuseof thesample,this is adilemma.

However, accordingto Theorem5.5 (and its counterpartfor õ àá , Theorem5.6),ï�ð��fñmò couldbea CDF usedto determineonly partof all fitting intervals,no matter

how small theratio is. This givescompleteflexibility for almostanything we want

to do in thefinite samplesituation.Theonly problemis thatif ö ê � ñmò is usedasthe

selectionguidefor fitting intervals,unwantedcomponentsmayappearat locations

wherenodatapointsarenearby. Thiscanbeovercomeby providing enoughfitting

intervals at locationsthat may producecomponents.This, in turn, suggeststhat

supportpointsshouldavoid locationswherethereareno datapointsaround.Also,

this reducesthenumberof fitting intervalsandhencecomputationalcost.

In fact,for finite samples,usingthestrictly increasingö ê �fñmò astheselectionguide

for fitting intervalssuffersfrom theproblemdescribedin thelastparagraph.Almost

all distributionshave dominantpartsanddwindlequickly away from thedominant

parts,hencethey areeffectively boundedin finite samplesituations.Thereforethe

strategy discussedin the last paragraphfor boundeddistributionsshouldbe used

for all distributions—i.e.,thedeterminationof fitting intervalsneedsto takesupport

points into account,so that all regionswithin thesesupportpoints’ sphereof in-

fluencearecoveredby a largeenoughnumberof fitting intervals. Theinformation

aboutthesupportpoints’sphereof influenceis availablefrom thecomponentdistri-

butionandthenumberof datapoints.For regionsonwhichsupportpointshavetiny

110

Page 125: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

influencein termsof probabilityvalue,it is notnecessaryto providefitting intervals

there,becausetheequation ÷ø é ê ë ìç

ùß å á í÷ø â áç

ùß(5.30)

hasthesamesolutionas é ê ë ì å á í â á .Finally, in termsof distancemeasures,notall of thosedeterminedby Conditions5.5

and5.6 canprovide straightforward algebraicsolutionsto (5.18)and(5.19). The

chosendistancemeasuredeterminesthe type of the mathematicalprogramming

problem. If a quadraticdistancemeasureis chosen,a leastsquaressolutionun-

der constraintswill be pursued.This approachcanbe viewed asan extensionof

Choi andBulgren(1968)from CDF-basedto measure-based.Elegantandefficient

methodsLSE andNNLS providedin LawsonandHanson(1974,1995)canbeem-

ployed. If the sup-normdistancemeasureis used,solutionscanbe found by the

efficient simplex methodin linear programming.This approachcanbe viewedas

anextensionof DeelyandKruse(1968).

5.5 The minority cluster problem

This sectiondiscussesthe minority clusterproblem,andillustratesit with simple

experiments.Theminimumdistanceapproachbasedonnonnegativemeasurespro-

videsa solution,while otherminimumdistanceonesfail. This problemis vital for

paceregression,sinceignoringevenoneisolateddatapoint in theestimatedmix-

ing distribution cancauseunboundedloss. Thediscussionbelow is intendedto be

intuitiveandpractical.

111

Page 126: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

5.5.1 The problem

Although they perform well asymptotically, the minimum distancemethodsde-

scribedabove suffer from the finite-sampleproblemdiscussedearlier: they can

neglect small groupsof outlying datapointsno matterhow far they lie from the

dominantdatapoints. The underlyingreasonis that the objective function to be

minimized is definedglobally ratherthan locally. A global approachmeansthat

thevalueof theestimatedprobabilitydensityfunctionat a particularplacewill be

influencedby all datapoints,nomatterhow farawaythey are.Thiscancausesmall

groupsof datapointsto be ignoredevenif they area long way from thedominant

part of the datasample. From a probabilisticpoint of view, however, thereis no

reasonto subsumedistantgroupswithin the major clustersjust becausethey are

relatively small.

The ultimate effect of suppressingdistantminority clustersdependson how the

clusteringis applied.If theapplication’s lossfunctiondependson thedistancebe-

tweenclusters,theresultmayprove disastrousbecausethereis no limit to how far

away theseoutlying groupsmay be. Onemight arguethat small groupsof points

caneasilybe explainedaway asoutliers,becausethe effect will becomelessim-

portantas the numberof datapoints increases—andit will disappearin the limit

of infinite data. However, in a finite-datasituation—andall practicalapplications

necessarilyinvolvefinite data—the“outliers” mayequallywell representsmallmi-

nority clusters.Furthermore,outlying datapointsarenot really treatedasoutliers

by thesemethods—whetheror not they arediscardedis merelyan artifact of the

global fitting calculation. When clustering,the final mixture distribution should

takeall datapointsinto account—includingoutlying clustersif any exist. If practi-

cal applicationsdemandthatsmalloutlying clustersaresuppressed,this shouldbe

donein aseparatestage.

In distance-basedclustering,eachdatapoint hasa far-reachingeffect becauseof

two global constraints.Oneis theuseof thecumulative distribution function; the

otheris thenormalizationconstraintúBû ìü~ýmþ@ÿ á ü�� �. Theseconstraintsmaysacrifice

112

Page 127: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

a smallnumberof datapoints—atany distance—fora betteroverall fit to thedata

asa whole. Choi andBulgren(1968),theCramer-von Misesstatistic(Macdonald,

1971),andDeelyandKruse(1968)all enforceboththeCDFandthenormalization

constraints.Blum andSusarla(1977)droptheCDF, but still enforcethenormaliza-

tion constraint.Theresultis thattheseclusteringmethodsareonly appropriatefor

finite mixtureswithout smallclusters,wheretherisk of suppressingclustersis low.

Hereweaddressthegeneralproblemfor arbitrarymixtures.Of course,theminority

clusterproblemexistsfor all typesof mixture—includingfinite mixtures.

5.5.2 The solution

Now thatthesourceof theproblemhasbeenidentified,thesolutionis clear, at least

in principle: drop both the approximationof CDFs,asBlum andSusarla(1977)

do, andthenormalizationconstraint—nomatterhow seductive it mayseem.This

considerationleadsto thenonnegative-measure-basedapproachdescribedin Sub-

section5.4.3.

To definetheestimationprocedurefully, weneedto determine(a)thesetof support

points, (b) the setof fitting intervals, (c) the empiricalmeasure,and(d) the dis-

tancemeasure.Theoreticalanalysisin theprevioussectionsguaranteesa strongly

consistentestimator:herewediscussthesein anintuitivemanner.

Support points. Thesupportpointsareusuallysuggestedby thedatapointsin the

sample.For example,if thecomponentdistribution ï��fñ�ó�ô ò is thenormaldistribu-

tion with meanô andunit variance,eachdatapoint canbetakenasasupportpoint.

In fact,thesupportpointsaremoreaccuratelydescribedaspotentialsupportpoints,

becausetheir associatedweightsmay becomezero after solving (5.19)—and,in

practice,many oftendo.

Fitting intervals. The fitting intervals arealsosuggestedby the datapoints. In

thenormaldistribution examplewith known standarddeviation � , eachdatapointñ�� canprovide oneinterval, suchas � ñ������ ã ñ�� � , or two, suchas �·ñ������� ã ñ�� � and

113

Page 128: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

� ñ�� ã ñ��������� , or more. Thereis no problemif the fitting intervals overlap. Their

lengthshouldnot besolargethatpointscanexert aninfluenceon theclusteringat

anundulyremoteplace,nor sosmall thattheempiricalmeasureis inaccurate.The

experimentsreportedbelow useintervalsof a few standarddeviationsaroundeach

datapoint,and,aswewill see,this workswell.

Empirical measure. The empiricalmeasurecanbe the probability measurede-

terminedby the Kolmogorov empiricalCDF, or any measurethat convergesto it.

Thefitting intervalsdiscussedabove canbeopen,closed,or semi-open.This will

affect theempiricalmeasureif datapointsareusedasinterval boundaries,although

it doesnot changethevaluesof theestimatedmeasurebecausethecorresponding

distribution is continuous.In small-samplesituations,biascanbereducedby care-

ful attentionto thisdetail—asMacdonald(1971)discusseswith respectto Choiand

Bulgren’s (1968)method.

Distancemeasure. Thechoiceof distancemeasuredetermineswhatkind of math-

ematicalprogrammingproblemmustbesolved. For example,a quadraticdistance

will give rise to a leastsquaresproblemunderlinearconstraints,whereasthesup-

normgivesriseto a linearprogrammingproblemthatcanbesolvedusingthesim-

plex method.Thesetwo measureshaveefficientsolutionsthataregloballyoptimal.

5.5.3 Experimental illustration

Experiments are conducted in the following to illustrate the failure of existing minimum distance methods to detect small outlying clusters, and the improvement achieved by the new scheme. The results also suggest that the new method is more accurate and stable than the others.

When comparing clustering methods, it is not always easy to evaluate the clusters obtained. To finesse this problem we consider simple artificial situations in which the proper outcome is clear. Some practical applications of clusters do provide objective evaluation functions—for example, in our setting of empirical modeling (see Section 5.6).

The methods used are Choi and Bulgren (1968) (denoted CHOI), Macdonald's application of the Cramér–von Mises statistic (CRAMER), the method with the probability measure (PM), and the method with the nonnegative measure (NNM). In each case, equations involving non-negativity and/or linear equality constraints are solved as quadratic programming problems using the elegant and efficient procedures NNLS and LSEI provided by Lawson and Hanson (1974, 1995). All four methods have the same computational time complexity.

Experiment 5.1 Reliability of clustering algorithms. We set the sample size $n$ to 100 throughout the experiments. The data points are artificially generated from a mixture of two clusters: $n_1$ points from $N(0, 1)$ and $n_2$ points from $N(100, 1)$. The values of $n_1$ and $n_2$ are in the ratios 99:1, 98:2, 95:5, 80:20 and 50:50.

Every data point is taken as a potential support point in all four methods: thus the number of potential components in the clustering is 100. For PM and NNM, fitting intervals need to be determined. In the experiments, each data point $x_i$ provides two fitting intervals, $[x_i - c, x_i]$ and $[x_i, x_i + c]$, where the width $c$ is a few standard deviations. Any data point located on the boundary of an interval is counted as half a point when determining the empirical measure over that interval.

These choices are admittedly crude, and further improvements in the accuracy and speed of PM and NNM are possible that take advantage of the flexibility provided by (5.18) and (5.19). For example, accuracy will likely increase with more—and more carefully chosen—support points and fitting intervals. The fact that the new method performs well even with crudely chosen support points and fitting intervals testifies to its robustness.
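As an illustration of the protocol, a single run of the experiment might look like the sketch below, reusing the hypothetical nnm_estimate() from the previous sketch; the centers 0 and 100, the unit variance, and the threshold of 50 follow the text.

```python
def run_trial(n1, n2, seed=0):
    """One run of Experiment 5.1, reusing nnm_estimate() from above."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([rng.normal(0, 1, n1),      # large cluster at 0
                        rng.normal(100, 1, n2)])   # small cluster at 100
    support, w = nnm_estimate(x, sigma=1.0)
    w = w / w.sum()                                # normalize the final weights
    n1_hat = 100 * w[support < 50].sum()           # group near 0
    n2_hat = 100 * w[support >= 50].sum()          # group near 100
    return n1_hat, n2_hat
```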

Our primary interest in this experiment is the weights of the clusters that are found. To cast the results in terms of the underlying models, we use the cluster weights to estimate values for $n_1$ and $n_2$. Of course, the results often do not contain exactly two clusters—but because the underlying cluster centers, 0 and 100, are well separated compared to their standard deviation of 1, it is highly unlikely that any data points from one cluster will fall anywhere near the other. Thus we use a threshold of 50 to divide the clusters into two groups: those near 0 and those near 100. The final cluster weights are normalized, and the weights for the first group are summed to obtain an estimate $\hat n_1$ of $n_1$, while those for the second group are summed to give an estimate $\hat n_2$ of $n_2$.

                    n1/n2 = 99/1   98/2       95/5       80/20      50/50
CHOI    Failures    86             42         4          0          0
        n̂1/n̂2       99.9/0.1       99.2/0.8   95.8/4.2   82.0/18.0  50.6/49.4
        SD(n̂1)      0.36           0.98       1.71       1.77       1.30
CRAMER  Failures    80             31         1          0          0
        n̂1/n̂2       99.8/0.2       98.6/1.4   95.1/4.9   81.6/18.4  49.7/50.3
        SD(n̂1)      0.50           1.13       1.89       1.80       1.31
PM      Failures    52             5          0          0          0
        n̂1/n̂2       99.8/0.2       98.2/1.8   94.1/5.9   80.8/19.2  50.1/49.9
        SD(n̂1)      0.32           0.83       0.87       0.78       0.55
NNM     Failures    0              0          0          0          0
        n̂1/n̂2       99.0/1.0       96.9/3.1   92.8/7.2   79.9/20.1  50.1/49.9
        SD(n̂1)      0.01           0.16       0.19       0.34       0.41

Table 5.1: Experimental results for detecting small clusters.

Table 5.1 shows results for each of the four methods. Each cell represents one hundred separate experimental runs. Three figures are recorded. At the top is the number of times the method failed to detect the smaller cluster, that is, the number of times $\hat n_2 = 0$. In the middle are the average values for $\hat n_1$ and $\hat n_2$. At the bottom is the standard deviation of $\hat n_1$ and $\hat n_2$ (which are equal). These three figures can be thought of as measures of reliability, accuracy and stability respectively.

The top figures in Table 5.1 show clearly that only NNM is always reliable, in the sense that it never fails to detect the smaller cluster. The other methods fail mostly when $n_2 = 1$; their failure rate gradually decreases as $n_2$ grows. The center figures show that, under all conditions, NNM gives a more accurate estimate of the correct values of $n_1$ and $n_2$ than the other methods. As expected, CRAMER improves on CHOI, but only slightly. The PM method has lower failure rates and produces estimates that are more accurate and far more stable (indicated by the bottom figures) than those for CHOI and CRAMER—presumably because it is less constrained. Of the four methods, NNM is clearly and consistently the winner in terms of all three measures: reliability, accuracy and stability.

The results of the NNM method can be further improved. If the decomposed form (5.29) is used instead of (5.19), and the solutions of the sub-equations are normalized before combining them—which is feasible because the two underlying clusters are so distant from each other—the correct values are obtained for $\hat n_1$ and $\hat n_2$ in virtually every trial.

5.6 Simulation studies of accuracy

In this section, we study the predictive accuracy of clustering procedures through simulations. Since the component distribution has a limited sphere of influence in finite-sample situations, data points should have little effect on the estimation of the probability density function at far-away points. Because NNM uses local fitting, it may also have some advantage in accuracy over procedures that adopt global fitting, even in the case of overlapping clusters. We describe the results of two experiments.

Unlike supervised learning, there seems to be no widely accepted criterion for evaluating clustering results (Hartigan, 1996). However, for the special case of estimating a mixing distribution, it is reasonable to employ the quadratic loss function (2.6) used in empirical Bayes or, equivalently, the loss function (3.4) in pace regression (when the variance is a known constant), and this is what we adopt. More details concerning their use are given later.

Experiment 5.2 investigates a mixture of normal distributions with the mean as the mixing parameter. The underlying mixing distribution is normal as well—information that is not used by the clustering procedures we tested. This type of mixture has many practical applications in empirical Bayes analysis, as well as in Bayesian analysis; see, for example, Berger (1985), Maritz and Lwin (1989), and Carlin and Louis (1996). The second experiment considers a mixture of noncentral $\chi^2_1$ distributions with the noncentrality parameter as the mixing one. The underlying mixing distribution follows Breiman (1995), where he uses it for subset selection. This experiment closely relates to pace regression.

Unlike Experiment 5.1, the data points in Experiments 5.2 and 5.3 are mixed so closely that there is no clear grouping. Therefore, reliability is not a concern for them. It is worth mentioning, however, that ignoring even one isolated point in other situations, as in Experiment 5.1, may dramatically increase the loss.

Experiment 5.2 Accuracy of clustering procedures for mixtures of normal distributions. We consider the mixture distribution

$$x_i \sim N(\mu_i, 1), \qquad \mu_i \sim N(0, \sigma_0^2),$$

where $i = 1, 2, \ldots, 100$. The simulation results given in Table 5.2 are losses averaged over twenty runs. For each run, 100 data points are generated from the mixture distribution. Given these data points, the mixing distribution is estimated by the four procedures CHOI, CRAMER, PM and NNM, and then each $x_i$ is updated to $x^{EB}_i$ using (2.8). The overall loss for all updated estimates is calculated by (2.6), i.e., $\sum_{i=1}^{100} \| x^{EB}_i - \mu_i \|^2$.

Eleven cases with different values of $\sigma_0^2$ are studied. Generally, as $\sigma_0^2$ increases, the prediction losses increase. This is due to the decreasing number of neighboring points available to support updating; see also Section 4.2. In almost every case, PM and NNM perform similarly, both outperforming CHOI and CRAMER by a clear margin.
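For reference, a sketch of the updating step, assuming (2.8) is the usual empirical Bayes posterior-mean rule for unit-variance normal components under the estimated discrete mixing distribution (support points and weights as returned by any of the four procedures):

```python
import numpy as np
from scipy.stats import norm

def eb_update(x, support, weights):
    """Posterior-mean update of each x_i under a discrete estimated
    mixing distribution, for unit-variance normal components."""
    x = np.asarray(x, dtype=float)
    lik = norm.pdf(x[:, None], loc=support) * weights   # shape (n, m)
    return (lik @ support) / lik.sum(axis=1)

# overall loss (2.6) for one run, given the true means mu:
# loss = np.sum((eb_update(x, support, weights) - mu) ** 2)
```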

Experiment 5.3 Accuracy of clustering procedures for mixtures of $\chi^2_1$ distributions. In this experiment, we consider a mixture of noncentral $\chi^2_1$ distributions using the noncentrality parameter as the mixing one, i.e., the mixture (3.19) or (3.22). The underlying mixing distribution is a modified version of a design by Breiman (1995); see also Experiment 6.6. Since this experiment relates to linear regression, we adopt the notation used for pace regression in Chapter 3.

σ0²    CHOI     CRAMER   PM       NNM
0      4.00     3.93     3.27     3.95
1      59.2     59.1     57.4     57.8
2      75.6     75.5     73.4     73.8
3      86.2     86.2     84.2     84.2
4      89.1     88.8     87.0     87.0
5      96.5     96.5     93.6     93.5
6      100.0    100.0    97.7     97.8
7      100.4    100.5    97.8     97.8
8      102.3    101.9    100.3    100.3
9      104.9    104.8    102.1    102.1
10     104.9    104.8    102.0    102.0

Table 5.2: Experiment 5.2. Average loss of clustering procedures over mixtures of normal distributions.

Breiman considers two clustering situations in terms of the $\beta_j$ ($j = 1, \ldots, k$) values. One, with $k = 20$, has two clusters around 5 and 15 and a cluster at zero. The other, with $k = 40$, has three clusters around 10, 20 and 30 and a cluster at zero. We determine $\beta_j$ in the same way as Breiman. Then we set $b^*_j = \rho\,\beta_j$ and $k^*_j = b^{*2}_j$, where $\rho$ is always 1/3 for both cases. The value of $\rho$ is so chosen that the clusters are neither too far away from nor too close to each other.

The sample of the mixture distribution is the set $\{k_j\}$, where each $k_j = b_j^2$ and $b_j \sim N(b^*_j, 1)$. Each clustering procedure is applied to the given sample, $k_1, k_2, \ldots, k_k$, to produce an estimated mixing distribution $G_k(k^*)$. Then each $k_j$ is updated by PACE6 to obtain $\tilde k_j$. The loss function (3.4) in pace regression, that is, the quadratic loss function (2.6), is used to obtain the true prediction loss for each estimate. The signs of $\tilde b_j$ and $b^*_j$ are taken into account when calculating the loss, where $b^*_j$ is always positive and $\tilde b_j$ has the same sign as $b_j$. Therefore, the overall loss is $\sum_{j=1}^{k} \|\tilde b_j - b^*_j\|^2$.

The simulation results are given in Table 5.3, each figure being an average over 100 runs. The variable $h$ (the same as used by Breiman) is the radius of each cluster in the true mixing distribution; larger $h$ implies more data points in each cluster. It can be observed from these results that NNM (and probably PM as well) is always a safe method to apply. For small $h$, both CHOI and CRAMER produce less accurate or sometimes very bad results—note that in this experiment the data points are not easily separable. For large $h$, however, CHOI and CRAMER do seem to produce slightly more accurate results than the other two. This suggests that CHOI and CRAMER may have a small advantage in predictive accuracy when clustering occurs within a relatively small area and there are no small clusters in this area. This slightly better performance is probably due to CHOI and CRAMER's use of larger intervals, when there is no risk of suppressing small numbers of data points.

h      CHOI    CRAMER   PM      NNM
k = 20
1      15.9    14.1     9.4     11.1
2      13.3    13.1     13.2    13.0
3      15.3    15.7     16.1    16.1
4      17.5    18.0     18.5    18.5
5      19.6    19.9     20.5    20.4
k = 40
1      55.0    48.8     17.4    17.4
2      29.7    29.9     26.5    26.7
3      32.3    32.5     33.9    33.8
4      36.7    37.0     38.4    38.4
5      40.7    40.8     41.4    41.4

Table 5.3: Experiment 5.3. Average losses of clustering procedures over mixtures of $\chi^2_1$ distributions.
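A sketch of the loss computation described above, with the updating rule itself (PACE6 in the text) left abstract rather than implemented:

```python
import numpy as np

def signed_loss(b_obs, b_star, update_fn):
    """Loss for Experiment 5.3; update_fn stands for the k-updating
    rule (PACE6 in the text) and is not implemented here."""
    k_obs = b_obs ** 2                          # noncentral chi-squared_1 sample
    k_upd = update_fn(k_obs)                    # updated effects k~_j
    b_upd = np.sign(b_obs) * np.sqrt(k_upd)     # b~_j keeps the sign of b_j
    return np.sum((b_upd - b_star) ** 2)        # b*_j are taken positive
```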

5.7 Summary

The idea exploited in this chapter is simple: a new minimum distance method that reliably detects small groups of data points, including isolated ones. Reliable detection in this case is important in empirical modeling. Existing minimum distance methods do not cope well with this situation.

The maximum likelihood approach does not seem to suffer from this problem. Nevertheless, it often incurs a higher computational cost than the minimum distance one. For many applications, fast estimation is important. When pace regression is embedded in a more sophisticated estimation task, it may be necessary to repeatedly fit linear models.

The predictive accuracy of minimum distance estimators is also empirically investigated in Section 5.6 for situations with overlapping clusters. Although further investigation is needed, NNM seems generally preferable.

Despite the simplicity of the underlying idea, most of this chapter is devoted to a theoretical aspect of the new estimators—strong consistency. This is a necessary condition for establishing asymptotic properties of pace regression, and gives a guarantee for large-sample behaviour.


Chapter 6

Simulation studies

6.1 Introduction

It is time for some experimental results to illustrate the idea of pace regression and give some indication of how it performs in practice. The results are very much in accordance with the theoretical analysis in Chapter 3. In addition, they illustrate some effects that have been discussed in previous chapters.

We will test AIC, BIC, RIC, CIC (in the form (2.5)), PACE2, PACE4, and PACE6. The OLS full model and the null model—which are actually generated by the procedures OLSC(0) and OLSC($\infty$)—are included for comparison. Procedures PACE1, PACE3, and PACE5 are not included because they involve numerical solution and can be approximated by PACE2, PACE4, and PACE6, respectively.

Two shrinkage methods, NN-GAROTTE and LASSO,¹ are used in Experiments 6.1, 6.2 and 6.7. Because the choice of shrinkage parameter in each case is somewhat controversial, we use the best estimate for this parameter obtained by data resampling—more precisely, five-fold cross-validation over an equally-spaced grid of twenty-one discrete values for the shrinkage ratio. This choice is computationally intensive, which is why results from these methods are not included in all experiments. The Stein estimation version of LASSO (Tibshirani, 1996) is used in Experiments 6.1 and 6.2, due to the heavy time consumption of the cross-validation version.² It is acknowledged that in these two experiments the cross-validation version may yield better results, as suggested in Experiment 6.7.

¹ The procedure LASSO is downloaded from StatLib (http://lib.stat.cmu.edu/S/) and NN-GAROTTE from Breiman's ftp site at UCB (ftp://ftp.stat.berkeley.edu/pub/users/breiman).

² In our simulations, NN-GAROTTE takes about two weeks on our computer to obtain the results for Experiments 6.1 and 6.2, while LASSO(CV) appears to be several times slower.
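For concreteness, the resampling scheme just described can be sketched as follows, with fit and predict standing in for the actual NN-GAROTTE or LASSO routines; the grid range [0, 1] for the shrinkage ratio is an assumption.

```python
import numpy as np

def cv_choose_shrinkage(X, y, fit, predict, n_folds=5, n_grid=21):
    """Pick the shrinkage ratio by five-fold cross-validation over an
    equally spaced grid of twenty-one values; fit(X, y, s) and
    predict(model, X) stand in for the actual routines."""
    grid = np.linspace(0.0, 1.0, n_grid)       # assumed range for the ratio
    idx = np.random.permutation(len(y))
    err = np.zeros(n_grid)
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        for i, s in enumerate(grid):
            model = fit(X[train], y[train], s)
            err[i] += np.sum((y[test] - predict(model, X[test])) ** 2)
    return grid[np.argmin(err)]
```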

We use the partial-$F$ test and backward elimination of variables to set up the orthogonalization, and employ the unbiased $\sigma^2$ estimate from the OLS full model. The noncentral $\chi^2$ distribution is used instead of the $F$-distribution for computational reasons. To estimate the mixing distribution $G(k^*)$, we adopt the minimum distance procedure based on the nonnegative measure (NNM) given by (5.19), or equivalently (5.28), in Chapter 5, along with the quadratic programming algorithm NNLS provided by Lawson and Hanson (1974, 1995). Support points are the set $\{\hat k_j\}$ plus zero (if there is any $\hat k_j$ near zero), except that points lying in the gap between zero and three are discarded in order to simplify the model. This gap helps the selection criteria PACE2 and PACE4 to properly eliminate dimensions in situations with redundant variables, because when the $h$-function's sign near the origin is negative—which is crucial for both to eliminate redundant dimensions—it could be incorrectly estimated to be positive (note that $h(0; k^*) > 0$ for small positive $k^*$). This choice is, of course, slightly biased towards situations with redundant variables, but does not sacrifice much prediction accuracy in any situation, and thus can be considered as an application of the simplicity principle (Section 1.2). For PACE6, values of $\tilde k_j$ smaller than a small cutoff are discarded, again to simplify the model (see also Section 4.5). These are rough-and-ready decisions, taken for practical expediency. For more detail, refer to the source code of the S-PLUS implementation in the Appendix.
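The support-point rule just described amounts to something like the following sketch; the "near zero" test is simplified here to falling inside the gap, and the function name is hypothetical.

```python
import numpy as np

def pace_support_points(k_hat, gap=3.0):
    """Support points: estimated effects outside the (0, gap) interval,
    plus a point at zero whenever some estimate falls inside it."""
    k_hat = np.asarray(k_hat, dtype=float)
    pts = k_hat[k_hat >= gap]
    if np.any(k_hat < gap):
        pts = np.append(pts, 0.0)
    return np.unique(pts)
```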

Recalling the modeling principles discussed in Section 1.2, our interest focuses mainly on predictive accuracy; the estimated model complexities appear as a natural by-product.

Both artificial and practical datasets are included in our experiments. One main advantage of using artificial datasets is that the experiments are completely under control. Since the true model is known, the prediction errors can be obtained accurately and the estimated model complexity can be compared with the true one. Furthermore, we can design experiments that cover any situations of interest.

Section 6.2 considers artificial datasets that contain only statistically independent variables. Hence the column vectors of the design matrix form a nearly, yet not completely, orthogonal basis.

Section 6.3 investigates two more realistic situations using artificial datasets. In one, studied by Miller (1990), the effects of variables are geometrically distributed. The other considers clusters of coefficients, following the experiment of Breiman (1995).

As any modeling procedure should finally be applied to solve real problems, Section 6.4 extends the investigation to real-world datasets. Twelve practical datasets are used to compare the modeling procedures using cross-validation.

6.2 Artificial datasets: An illustration

To illustrate the ideas of pace regression, in this section we give three experimental examples that compare pace regression with other procedures in terms of both prediction accuracy and model complexity, using artificial datasets with statistically independent variables. In the first, contributive variables have small effects; in the second they have large effects. The third example tests the influence of the number of candidate variables on pace regression. Results are given in the form of both tables and graphs.

Experiment 6.1 Variables with small effects. The underlying model is $y = \beta_0 + \sum_{j=1}^{k} \beta_j x_j + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, where $\beta_0 = 1$, $x_j \sim N(0, 1)$, $\mathrm{Cov}(x_i, x_j) = 0$ for $i \ne j$, and $\sigma^2 = 200$. Each parameter $\beta_j$ is either 1 or 0, depending on whether it has an effect or is redundant. For each sample, $n = 1000$ and $k = 100$ (excluding $\beta_0$, which is always included in the model). The number of non-redundant variables $k^*$ is set to $0, 10, 20, \ldots, 100$. For each value, the result is the average of twenty runs over twenty independent training samples, tested on a large independent set. This test set is generated from the same model structure with 10,000 observations and zero noise variance, and hence can be taken as the true model. The results are shown in Figure 6.1 and recorded in Table 6.1.

k*            0     10    20    30    40    50    60    70    80    90    100
Dim
NN-GAR.       0.6   12.8  27.2  40.9  50.5  64.7  75.2  87.6  96.7  98.5  99.6
LASSO(Stein)  47.4  47.2  51.0  56.0  60.0  64.2  67.6  72.8  77.4  78.6  81.5
AIC           14.8  21.4  28.1  34.5  40.6  46.9  53.4  59.9  65.2  71.2  77.6
BIC           1.1   4.2   7.4   11.1  14.5  16.7  20.1  22.9  26.6  30.1  32.5
RIC           0.2   1.6   3.7   6.0   7.9   10.4  12.2  14.6  17.9  19.6  21.8
CIC           0.0   0.2   1.3   2.6   28.4  72.0  100   100   100   100   100
PACE2         0.0   4.6   18.6  37.8  59.6  84.0  95.7  99.4  100   100   100
PACE4         0.0   4.6   18.8  37.9  59.6  84.2  95.8  99.4  100   100   100
PACE6         0.6   8.9   22.1  34.0  47.6  62.0  72.8  87.2  94.5  99.1  100
PE
NN-GAR.       0.4   7.3   11.3  14.4  17.6  20.1  21.9  23.0  23.1  23.0  22.9
LASSO(Stein)  5.69  7.46  10.1  12.5  15.7  18.9  22.6  25.6  29.0  35.8  42.6
full          21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6
AIC           11.2  13.1  15.6  18.2  20.9  23.8  26.4  28.8  31.1  33.5  35.7
BIC           1.7   9.1   16.9  24.5  31.8  41.0  49.4  57.3  65.6  73.3  82.1
RIC           0.5   9.3   18.2  27.5  36.6  45.6  55.4  64.6  72.6  83.5  92.5
CIC           0.0   10.1  19.7  29.3  33.9  29.3  21.6  21.6  21.6  21.6  21.6
PACE2         0.0   9.8   15.6  19.3  21.1  22.1  21.8  21.6  21.6  21.6  21.6
PACE4         0.0   9.8   15.6  19.3  21.3  22.2  22.0  22.0  21.8  21.8  21.7
PACE6         0.1   7.0   10.6  13.2  15.4  16.9  17.1  16.5  15.5  14.5  10.7

Table 6.1: Experiment 6.1. Average dimensionality and prediction error of the estimated models.
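A sketch of the data-generating process for this experiment follows; the intercept value and the seed handling are incidental choices of the sketch, not prescriptions from the thesis.

```python
import numpy as np

def make_sample(k_star, n=1000, k=100, sigma2=200.0, seed=0):
    """One training sample for Experiment 6.1: k_star coefficients
    are 1, the rest 0; x_j ~ N(0, 1) independently."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(k)
    beta[:k_star] = 1.0                         # the non-redundant variables
    X = rng.standard_normal((n, k))
    y = 1.0 + X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y, beta
```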

Figures 6.1(a) and 6.1(b) show the dimensionality (Dim) of the estimated models and their prediction errors (PE), respectively; the horizontal axis is $k^*$ in both cases. In Figure 6.1(a), the solid line gives the size of the underlying models. Since prediction accuracy rather than model complexity is used as the standard for modeling, the best estimated model does not necessarily have the same complexity as the underlying one—dimensionality reduction is merely a byproduct of parameter estimation. Models generated by PACE2, PACE4 and PACE6 find the underlying null hypothesis $H_0$ and the underlying full hypothesis $H_k$ correctly; their seeming inconsistency between these extremes is in fact necessary for the estimated models to produce optimal predictions. AIC overfits $H_0$ and underfits $H_k$.

[Figure 6.1: Experiment 6.1, $k = 100$, $\sigma^2 = 200$. (a) Dimensionality. (b) Prediction error.]

BIC and RIC both fit $H_0$ well, but dramatically underfit all the other hypotheses—including $H_k$. CIC successfully selects both $H_0$ and $H_k$ but either underfits or overfits models in between. NN-GAROTTE generally chooses slightly larger models than PACE6, but LASSO(Stein) significantly overfits for small $k^*$ and underfits for large $k^*$.

In Figure 6.1(b), the vertical axis represents the average mean squared error in predicting independent test sets. The models built by BIC and RIC have errors nearly as great as the null model. AIC is slightly better for $H_k$ than BIC and RIC, but fails on $H_0$. CIC eventually coincides with the full model as $k^*$ increases, and produces large errors for some model structures between $H_0$ and $H_k$. It is interesting to note that PACE2 always performs as well as the best of OLSC($\infty$) (the null model), RIC, BIC, AIC, CIC, and OLS (the full model): no OLS subset selection procedure produces models that are sensibly better. Recall that PACE2 selects the optimal subset model from the sequence, and in this respect resembles PACE1, which is the optimal threshold-based OLS subset selection procedure OLSC($\tau^*$). PACE4 performs similarly to PACE2. Remarkably, PACE6 outperforms PACE2 and PACE4 by a large margin, even when there are no redundant variables. The shrinkage methods NN-GAROTTE and LASSO(Stein) are competitive with PACE2 and PACE4—better for small $k^*$ and worse for large $k^*$—but can never outperform PACE6 in any practical sense, which is consistent with Corollary 3.10.2 and Theorem 3.11.

k*            0     10    20    30    40    50    60    70    80    90    100
Dim
NN-GAR.       0.6   21.0  32.6  45.7  57.1  67.2  78.0  87.0  94.3  98.5  100
LASSO(Stein)  47.4  46.8  52.8  59.0  64.2  71.0  76.6  82.6  88.3  93.7  92.3
AIC           14.8  23.4  32.0  40.5  49.2  58.2  66.4  75.3  83.0  91.4  99.7
BIC           1.1   10.6  20.1  29.4  39.0  48.4  57.8  67.2  75.9  85.4  94.0
RIC           0.2   9.6   18.8  27.4  36.6  45.3  54.0  62.5  70.3  78.0  86.0
CIC           0.0   9.6   21.1  72.4  100   100   100   100   100   100   100
PACE2         0.0   10.3  21.4  31.5  42.0  53.0  63.2  74.0  83.4  93.3  100
PACE4         0.0   10.3  21.4  31.5  42.1  53.0  63.4  74.0  83.5  93.3  100
PACE6         0.6   11.5  22.1  32.6  43.0  54.2  63.9  74.3  83.4  92.5  100
PE
NN-GAR.       0.10  1.35  2.12  2.83  3.61  4.17  4.72  5.27  5.56  5.70  5.71
LASSO(Stein)  1.42  1.91  2.65  3.43  4.43  5.27  6.15  7.00  7.94  9.61  35.60
full          5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40
AIC           2.80  3.00  3.27  3.53  3.85  4.16  4.40  4.70  4.95  5.32  5.56
BIC           0.43  1.00  1.59  2.71  3.44  4.50  5.62  6.35  7.88  8.76  10.5
RIC           0.12  1.00  2.03  3.82  5.04  6.81  8.59  10.3  13.1  15.8  18.3
CIC           0.00  0.99  1.83  4.31  5.40  5.40  5.40  5.40  5.40  5.40  5.40
PACE2         0.00  1.06  1.71  2.40  2.95  3.56  3.96  4.57  5.03  5.29  5.40
PACE4         0.00  1.06  1.71  2.40  2.92  3.58  3.98  4.57  5.04  5.29  5.40
PACE6         0.03  0.80  1.19  1.82  2.31  2.66  2.92  3.24  3.28  3.22  2.48

Table 6.2: Experiment 6.2. Average dimensionality and prediction error of the estimated models.

Experiment 6.2 Variables with large effects. In this experiment the underlying model is the same as in the last experiment except that $\sigma^2 = 50$; results are shown in Figure 6.2 and Table 6.2.

The results are similar to those in the previous example. As Figure 6.2(a) shows, models generated by PACE2, PACE4 and PACE6 lie closer to the line of underlying models than in the last example. AIC generally overfits by an amount that decreases as $k^*$ increases. RIC and BIC generally underfit by an amount that increases as $k^*$ increases. CIC settles on the full model earlier than in the last example. NN-GAROTTE and LASSO(Stein) generally choose models of larger dimensionality than PACE2, PACE4 and PACE6.

In terms of prediction error (Figure 6.2(b)), PACE2 and PACE4 are still the best of all OLS subset selection procedures, and are almost always better and never meaningfully worse than NN-GAROTTE and LASSO(Stein) (see also the discussion of shrinkage estimation in situations with different variable effects; Example 3.1), while PACE6 is significantly superior. When there are no redundant variables, PACE6 chooses a full-sized model but uses different coefficients from OLS, yielding a model with much smaller prediction error than the OLS full model. This defies conventional wisdom, which views the OLS full model as the best possible choice when all variables have large effects.

[Figure 6.2: Experiment 6.2, $k = 100$, $\sigma^2 = 50$. (a) Dimensionality. (b) Prediction error.]

Experiment 6.3 Rate of convergence. Our third example explores the influence of the number of candidate variables. The value of $k$ is chosen to be $10, 20, \ldots, 100$ respectively, and for each value $k^*$ is chosen as $0$, $k/2$ and $k$; otherwise the experimental conditions are as in Experiment 6.1. Note that variables with small effects are harder to distinguish in the presence of noisy variables. Results are shown in Figure 6.3 and recorded in Table 6.3.

The pace regression procedures are always among the best in terms of prediction error, a property enjoyed by none of the conventional procedures. None of the pace procedures suffers noticeably as the number of candidate variables $k$ decreases. Apparently, they are stable even when $k$ is as small as ten.

k              10    20    30    40    50    60    70    80    90    100
Dim (k* = 0)
AIC            1.7   2.5   5.0   5.8   7.8   8.2   12.3  10.9  13.2  14.8
BIC            0.0   0.2   0.3   0.2   0.2   0.5   0.8   0.5   0.9   1.1
RIC            0.2   0.3   0.3   0.2   0.2   0.1   0.2   0.2   0.3   0.2
CIC            0.0   0.1   0.0   0.0   0.0   0.0   0.1   0.0   0.0   0.0
PACE2          0.9   0.3   0.3   0.2   0.1   0.1   0.1   0.1   0.1   0.0
PACE4          0.9   0.3   0.3   0.2   0.1   0.1   0.1   0.1   0.1   0.0
PACE6          0.8   0.7   1.1   0.6   0.3   0.6   1.4   0.5   0.6   0.6
PE (k* = 0)
AIC            1.1   2.0   3.5   4.4   5.8   6.1   9.5   8.5   10.7  11.2
BIC            0.0   0.5   0.5   0.5   0.5   0.7   1.4   0.8   1.6   1.7
RIC            0.3   0.6   0.5   0.5   0.4   0.3   0.6   0.4   0.7   0.5
CIC            0.0   0.1   0.0   0.0   0.0   0.0   0.2   0.0   0.0   0.0
PACE2          0.5   0.5   0.4   0.4   0.1   0.2   0.2   0.1   0.1   0.0
PACE4          0.5   0.5   0.4   0.4   0.1   0.2   0.2   0.1   0.1   0.0
PACE6          0.2   0.2   0.3   0.2   0.2   0.1   0.4   0.1   0.2   0.1
Dim (k* = k/2)
AIC            4.8   8.9   14.6  18.8  23.4  27.2  33.2  37.0  40.2  46.9
BIC            2.0   3.5   6.2   6.5   7.9   10.6  12.0  13.3  13.9  16.7
RIC            3.0   4.2   6.2   5.8   6.7   8.0   8.3   9.7   8.7   10.4
CIC            0.0   0.1   0.0   0.0   0.0   0.0   0.1   0.0   0.0   0.0
PACE2          7.2   14.6  22.3  32.8  41.0  45.9  57.9  59.1  65.6  84.0
PACE4          7.2   14.6  22.3  32.8  41.0  46.0  58.0  59.1  65.8  84.2
PACE6          5.9   11.1  18.2  24.8  31.4  34.0  44.8  46.6  51.0  62.0
PE (k* = k/2)
AIC            2.3   4.7   7.2   9.0   11.4  13.8  17.1  18.1  22.0  23.8
BIC            3.5   8.0   11.0  15.8  20.2  22.9  27.8  32.2  37.3  41.0
RIC            2.8   7.5   10.9  16.5  20.9  24.7  30.4  35.3  41.8  45.6
CIC            2.8   5.4   8.0   11.3  12.5  16.9  19.7  21.6  29.4  29.3
PACE2          2.1   4.5   6.8   8.3   10.7  12.6  16.0  16.9  19.4  22.1
PACE4          2.1   4.5   6.8   8.3   10.8  12.7  16.2  16.9  19.5  22.2
PACE6          1.8   3.4   5.3   6.8   8.3   9.6   12.2  12.9  15.9  16.9
Dim (k* = k)
AIC            7.7   15.6  23.6  31.0  38.6  45.9  52.0  60.9  67.5  77.6
BIC            3.3   7.2   11.1  13.2  15.2  19.9  22.1  24.4  27.4  32.5
RIC            5.3   8.5   11.3  12.3  13.1  15.6  16.6  18.1  18.8  21.8
CIC            10.0  20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100
PACE2          9.4   20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100
PACE4          9.4   20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100
PACE6          9.2   20.0  29.9  39.6  49.9  60.0  70.0  80.0  90.0  100
PE (k* = k)
full           2.0   4.0   6.4   8.2   10.4  12.1  16.2  16.6  19.1  21.6
AIC            3.5   7.0   10.3  13.8  17.1  21.2  26.7  28.3  33.1  35.7
BIC            7.7   15.3  22.8  31.3  40.5  47.2  56.4  65.7  74.0  82.1
RIC            5.6   13.9  22.6  32.1  42.5  51.3  61.9  71.8  82.4  92.5
CIC            2.0   4.0   6.4   8.2   10.4  12.1  16.2  16.6  19.1  21.6
PACE2          2.3   4.0   6.4   8.2   10.4  12.1  16.2  16.6  19.1  21.6
PACE4          2.3   4.0   6.4   8.3   10.6  12.3  16.4  16.9  19.4  21.7
PACE6          1.2   1.0   2.6   2.7   4.5   5.2   7.8   8.0   10.4  10.7

Table 6.3: Experiment 6.3—for the true hypotheses, $k^* = 0$, $k/2$, and $k$, respectively.

[Figure 6.3: Experiment 6.3. (a), (c) and (e) show the dimensionalities of the estimated models when the underlying models are the null, half, and full hypotheses; (b), (d) and (f) show the corresponding prediction errors over large independent test sets.]


6.3 Artificial datasets: Towards reality

Miller (1990) considers an artificial situation for subset selection that assumes a "geometric progression of values of the true projections", i.e., a geometric distribution of $k^*$. Despite the fact that, as he points out (p.181), "there is no evidence that this pattern is realistic", it is probably closer to reality than the models in the last section, as evidenced by the distribution of eigenvalues of a few practical datasets given in Table 6.2 of Miller (1990). Hence this section tests how pace regression compares with other methods in this artificial situation. Real-world datasets will be tested in the next section.

We know that the extent of updating each $\hat k_j$ by pace regression basically depends on the number of $\hat k$'s around it. In particular, PACE2 and PACE4, both essentially selection procedures, rely more on the number around the true threshold value $\tau^*$. Experiment 6.4 uses a situation in which there are relatively many $k^*_j$ around $\tau^*$, so that PACE2 and PACE4 should perform similarly to how they would if $\tau^*$ were known. Experiment 6.5 roughly follows Miller's idea, in which case, however, few $k^*_j$ are around $\tau^*$, and hence pace regression will produce less satisfactory results.

Breiman (1995) investigates another artificial situation in which the nonzero true coefficient values are distributed in two or three groups, and the column vectors of $X$ are normally, but not independently, distributed. Experiment 6.6 follows this idea.

For the experiments in this section, we include results for both the $x$-fixed and $x$-random situations. From these results, it can be seen that all procedures perform almost identically in both situations. We should not conclude, of course, that this will remain true in other cases, in particular for small $n$.

Experiment 6.4 Geometrically distributed effects. We now investigate a simple geometric distribution of effects. Except where described below, all other options in this experiment are the same as used previously.

[Figure 6.4: $h(\hat k; G)$ in Experiment 6.4.]

The linear model with 100 explanatory variables is used, i.e.,

$$y = \sum_{j=1}^{100} \beta_j x_j + \epsilon, \qquad (6.1)$$

where the noise component $\epsilon \sim N(0, \sigma^2)$. The probability density function of $\hat k$ is determined by the following mixture model

$$f(\hat k \,|\, G) = .5 f(\hat k \,|\, 0) + .25 f(\hat k \,|\, 1) + .13 f(\hat k \,|\, 4) + .06 f(\hat k \,|\, 9) + .03 f(\hat k \,|\, 16) + .02 f(\hat k \,|\, 25) + .01 f(\hat k \,|\, 36), \qquad (6.2)$$

i.e., $k^* = 0, 1, 4, 9, 16, 25$ and $36$ with probabilities .5, .25, .13, .06, .03, .02 and .01 respectively.

The corresponding mixture $h$-function is shown in Figure 6.4. It can be seen that $\tau^* \approx 2.5$ for OLSC($\tau$). Hence this situation seems favorable to AIC, for ($n$-asymptotically) AIC = OLSC(2), which is nearly the optimal selection procedure here. Since here BIC = OLSC($\log n$) = OLSC(6.9) and RIC = OLSC($2\log k$) = OLSC(9.2), both should be inferior to AIC.

The effects $k^* = 0, 1, 4, 9, 16, 25, 36$ are converted to parameters through the expression

$$\beta_j = \sigma\sqrt{k^*_j / n}. \qquad (6.3)$$

We let $x_j \sim N(0, 1)$ independently for all $j$. For each training set, let $\sigma^2 = 10$ and the number of observations $n = 1000$. Then the true model becomes

$$y = 0\,(x_1 + \cdots + x_{50}) + 0.1(x_{51} + \cdots + x_{75}) + 0.2(x_{76} + \cdots + x_{88}) + 0.3(x_{89} + \cdots + x_{94}) + 0.4(x_{95} + x_{96} + x_{97}) + 0.5(x_{98} + x_{99}) + 0.6\,x_{100} + \epsilon. \qquad (6.4)$$
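A quick numeric check (NumPy only; variable names are incidental) that the mixture (6.2) and the conversion (6.3) reproduce the coefficients in (6.4):

```python
import numpy as np

n, sigma2 = 1000, 10.0
k_star = np.array([0., 1., 4., 9., 16., 25., 36.])
probs = np.array([.5, .25, .13, .06, .03, .02, .01])
counts = np.round(probs * 100).astype(int)   # 50, 25, 13, 6, 3, 2, 1 variables
beta = np.sqrt(sigma2 * k_star / n)          # (6.3)
print(np.repeat(beta, counts))               # 0 (x50), 0.1 (x25), ..., 0.6 (x1)
```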

To test the estimated model, two situations are considered: $x$-fixed and $x$-random. In the $x$-fixed situation, the training and test sets share the same $X$ matrix, but only the training set contains the noise component $\epsilon$. In the $x$-random situation, a large, independent, noise-free test set ($n = 10{,}000$, and different $X$ from the training set) is generated, and the coefficients in (6.4) are used to obtain the response vector.

Table 6.4 presents the experimental results. Each figure in the table is the average of 100 runs. The model dimensionality Dim is the number of remaining variables, excluding the intercept, in the trained model. PE is the mean squared prediction error of the trained model over the test set.

             x-fixed model    x-random model
Procedure    Dim     PE       Dim     PE
full (OLS)   100.0   1.03     100.0   1.11
null         0.0     2.73     0.0     2.68
AIC          37.8    0.96     37.2    1.03
BIC          15.4    1.09     14.6    1.16
RIC          11.3    1.21     10.9    1.30
CIC          13.7    1.15     12.4    1.26
PACE2        45.2    1.00     41.2    1.05
PACE4        45.3    1.03     41.4    1.08
PACE6        39.2    0.80     36.7    0.84

Table 6.4: Results of Experiment 6.4.

[Figure 6.5: The mixture $h$-function for $k = 10, 20, 30, 40$ predictors, and $k^*_j = (10\rho^{j-1})^2$ with $\rho = 0.8$.]

As shown in Table 6.4, the null model is clearly the worst in terms of prediction error. The full model is relatively good, even better than BIC, RIC and CIC—this is due to the high threshold values employed by the latter. AIC and the three pace estimators are better than the full model. AIC, PACE2 and PACE4 perform similarly; AIC is slightly better because the other two have to estimate the threshold value, while it uses a nearly optimal one. PACE6 beats all the others by a clear margin, although it chooses models of similar dimensionality to AIC, PACE2 and PACE4.

Experiment 6.5 Geometrically distributed effects, Miller's case. We now look at Miller's (1990, pp.177–181) case where effects are distributed geometrically. In our experiment, however, we examine the situation with independent explanatory variables, whereas Miller (pp.186–187) performs similarity transformations on the variance-covariance matrix. Correlated explanatory variables will be inspected in Experiments 6.6 and 6.7.

Miller's assignment of the effects is equivalent to this: for a model with $k$ predictors (excluding the constant term), the underlying $k^*_j = (10\rho^{j-1})^2$ for $j = 1, \ldots, k$. In particular, he considers the situations $k = 10, 20, 30$ or $40$ and $\rho = 0.8$ or $0.6$; i.e., for $\rho = 0.8$, $k^*_j = 100, 64, 41, 26, 17, 11, 6.9, 4.4, 2.8, 1.8, \ldots$, and for $\rho = 0.6$, $k^*_j = 100, 36, 13, 4.7, 1.7, 0.60, 0.22, \ldots$.

for I�� � ; % , K Hü � �����{z � E z E ��z &�� z=� � z<����z � ; � z E ; E z & ; % z<� ; % z ;<;=; , andfor I�� � ; � ,135

Page 150: A New Approach to Fitting Linear Models in High Dimensional Spaces · 2011. 3. 27. · This thesis presents a new approach to fitting linear models, called “pace regres-sion”,

-8

-7

-6

-5

-4

-3

-2

-1

0

1

0 2 4 6 8 10

k = 10

k = 20

k = 30

k = 40

Figure6.6: The mixture _ -function forD � ���~z & �~z �~z E � predictors,and KLHü �� ��� I üh£Ñþ ò � with I+� � ; � .

¤ ����¥§¦ ¤ ����¥§¨n �ª©h� n �U«<� n �U¬<� n ��­O� n �®©h� n �U«<� n �U¬<� n ��­O�¯ H/° 0 0.8 3.5 5 2 5.5 6.5 7.5

Table6.5: | H for I±� � ; % and� ; � respectively in Experiment6.5.

Figures 6.5 and 6.6 show the mixture $h$-functions for the $\rho = 0.8$ and $0.6$ situations respectively. From both figures, we can roughly estimate the value of $\tau^*$ for each situation, given in Table 6.5. When the threshold value of a model selection criterion is near $\tau^*$, that criterion should resemble the optimal selection criterion in performance. This observation helps to explain the experimental results.

As in Experiment 6.4, each $\beta_j$ is chosen according to (6.3), where $n = 1000$ and $\sigma^2 = 10$. Therefore the model for $\rho = 0.8$ is of the form

$$y = x_1 + 0.8 x_2 + 0.64 x_3 + 0.512 x_4 + 0.41 x_5 + 0.33 x_6 + 0.26 x_7 + 0.21 x_8 + 0.17 x_9 + 0.13 x_{10} + \cdots + 0.8^{k-1} x_k + \epsilon, \qquad (6.5)$$

and the model for $\rho = 0.6$ is of the form

$$y = x_1 + 0.6 x_2 + 0.36 x_3 + 0.22 x_4 + 0.13 x_5 + 0.078 x_6 + 0.047 x_7 + 0.028 x_8 + 0.017 x_9 + 0.010 x_{10} + \cdots + 0.6^{k-1} x_k + \epsilon. \qquad (6.6)$$

Training and test sets are generated as in Experiment 6.4.

The experimental results are given in Tables 6.6 and 6.7 for $\rho = 0.8$ and $0.6$ respectively. Among all these results, PACE6 is clearly the best: it never (sensibly) loses to any other procedure, and often wins by a clear margin. In some cases, the procedures PACE2 and PACE4 are inferior to the best selection criterion—despite the fact that, theoretically, they are $k$-asymptotically the best selection criteria. This is the price that pace regression pays for estimating $G(k^*)$ in the finite-$k$ situation. They can be slightly inferior to other selection procedures whose threshold values happen to lie near $\tau^*$; nevertheless, no individual one of the OLS, null, AIC, BIC = OLSC(6.9), RIC = OLSC($2\log k$) and CIC selection criteria can always be optimal.

The accuracy of the pace estimators relies on the accuracy of the estimated $G(k^*)$, which in turn depends on the number of predictors (i.e., the number of $k_j$'s)—more precisely, the number of $k_j$'s around a particular point of interest, for example, $\tau^*$ for PACE2 and PACE4. In this experiment, there are often too few data points close enough to $\tau^*$ to provide enough information for accurately estimating $G(k^*)$ around $\tau^*$. In the cases $\rho = 0.6$ and $k = 30$ and $40$, for example, $\tau^* \approx 6.5$ and $7.5$, but we only have $k^*_j = 100, 36, 13, 4.7, 1.7, 0.60, 0.22, \ldots$. For samples distributed like this, the clustering result around $\tau^*$ is unlikely to be accurate, implying higher expected loss for the pace estimators.

             x-fixed model    x-random model
Procedure    Dim     PE       Dim     PE
k = 10
full (OLS)   10      0.11     10      0.11
null         0       2.79     0       2.72
AIC          8.61    0.13     8.6     0.14
BIC          6.53    0.24     6.61    0.24
RIC          7.37    0.18     7.53    0.19
CIC          9.87    0.11     9.81    0.12
PACE2        9.45    0.12     9.44    0.12
PACE4        10      0.11     10      0.11
PACE6        10      0.11     10      0.11
k = 20
full (OLS)   20      0.20     20      0.22
null         0       2.80     0       2.74
AIC          10.8    0.20     11.0    0.22
BIC          6.83    0.27     6.71    0.29
RIC          7.10    0.25     7.23    0.26
CIC          10.6    0.21     10.6    0.23
PACE2        12.2    0.21     12.1    0.23
PACE4        12.2    0.21     12.1    0.23
PACE6        11.5    0.20     11.3    0.22
k = 30
full (OLS)   30      0.31     30      0.30
null         0       2.75     0       2.74
AIC          12.7    0.27     12.1    0.25
BIC          7.06    0.27     7.15    0.27
RIC          7.11    0.27     7.21    0.27
CIC          8.40    0.26     8.42    0.26
PACE2        13.4    0.28     11.6    0.26
PACE4        13.4    0.28     11.6    0.26
PACE6        12.4    0.23     11.3    0.23
k = 40
full (OLS)   40      0.41     40      0.43
null         0       2.78     0       2.79
AIC          14.1    0.33     13.8    0.34
BIC          7.17    0.28     6.96    0.31
RIC          6.99    0.29     6.73    0.31
CIC          7.52    0.29     7.26    0.31
PACE2        11.2    0.32     11.1    0.33
PACE4        11.3    0.31     11.2    0.33
PACE6        11.5    0.26     11.3    0.28

Table 6.6: Experiment 6.5. Results for $k = 10, 20, 30, 40$ predictors, and $k^*_j = (10\rho^{j-1})^2$ with $\rho = 0.8$.

             x-fixed model    x-random model
Procedure    Dim     PE       Dim     PE
k = 10
full (OLS)   10      0.11     10      0.11
null         0       1.57     0       1.61
AIC          4.92    0.10     5.13    0.11
BIC          3.23    0.13     3.44    0.13
RIC          3.77    0.12     3.96    0.11
CIC          5.12    0.11     5.3     0.11
PACE2        5.28    0.12     5.75    0.11
PACE4        10      0.11     10      0.11
PACE6        10      0.11     10      0.11
k = 20
full (OLS)   20      0.20     20      0.22
null         0       1.57     0       1.60
AIC          6.35    0.16     6.56    0.17
BIC          3.24    0.13     3.48    0.14
RIC          3.5     0.13     3.76    0.13
CIC          3.4     0.14     3.76    0.14
PACE2        4.54    0.15     5.37    0.16
PACE4        4.66    0.14     5.53    0.15
PACE6        4.72    0.13     5.56    0.13
k = 30
full (OLS)   30      0.31     30      0.30
null         0       1.55     0       1.54
AIC          8.5     0.22     7.92    0.20
BIC          3.59    0.14     3.37    0.15
RIC          3.61    0.14     3.41    0.15
CIC          3.28    0.15     3.12    0.16
PACE2        4.59    0.18     3.88    0.18
PACE4        4.73    0.16     4.05    0.17
PACE6        5.48    0.14     4.72    0.14
k = 40
full (OLS)   40      0.41     40      0.43
null         0       1.58     0       1.59
AIC          10.2    0.29     9.71    0.29
BIC          3.63    0.16     3.58    0.16
RIC          3.52    0.16     3.44    0.16
CIC          3.12    0.16     3.03    0.17
PACE2        4.33    0.23     3.84    0.23
PACE4        4.73    0.19     4.24    0.19
PACE6        5.47    0.16     5.09    0.16

Table 6.7: Experiment 6.5. Results for $k = 10, 20, 30, 40$ predictors, and $k^*_j = (10\rho^{j-1})^2$ with $\rho = 0.6$.

Experiment 6.6 Breiman's case. This experiment explores a situation considered by Breiman (1995, pp.379–382).

According to Breiman (1995, p.379), the "values of the coefficients were renormalized so that in the $x$-controlled case, the average $R^2$ was around .85, and in the $x$-random, about .75." Since details of the "re-normalization" are not given, the

normalization factor $\rho$ used here sets $\beta^*_j = \rho\,\beta_j$, for $j = 1, \ldots, k$, so that

$$R^2 \approx \frac{\rho^2 \sum_{j=1}^{k} \beta_j^2}{\rho^2 \sum_{j=1}^{k} \beta_j^2 + k}, \qquad (6.7)$$

that is,

$$\rho(n, k, R^2) \approx \sqrt{\frac{k R^2}{(1 - R^2) \sum_{j=1}^{k} \beta_j^2}}. \qquad (6.8)$$

Therefore, in the $x$-fixed case

$$\rho(40, 20, .85) \approx \sqrt{\frac{113}{\sum_{j=1}^{k} \beta_j^2}}, \qquad (6.9)$$

and in the $x$-random case

$$\rho(80, 40, .75) \approx \sqrt{\frac{120}{\sum_{j=1}^{k} \beta_j^2}}. \qquad (6.10)$$

Given this normalization, subset selection procedures seem to produce models of similar dimensionalities to those given in Breiman's paper. Each $\beta_j$ is obtained by Breiman's method, p.379.

Experimental results are given in Table 6.8. The performance of the pace estimators is as in Experiment 6.5; PACE6 is the best of all. However, PACE2 and PACE4 are not optimal selection criteria in all situations. They are intermediate in performance among the other selection criteria—not the best, not the worst. While further investigation is needed to fully understand the results, a plausible explanation is that, as in Experiment 6.5, few data points are available to accurately estimate $G(k^*)$ around $\tau^*$.

             h = 1          h = 2          h = 3          h = 4          h = 5
Procedure    Dim    PE      Dim    PE      Dim    PE      Dim    PE      Dim    PE
x-fixed (k = 20, n = 40)
full (OLS)   20     0.54    20     0.54    20     0.54    20     0.54    20     0.54
null         0      3.65    0      4.00    0      4.88    0      5.76    0      6.58
AIC          4.77   0.32    6.4    0.42    7.45   0.48    8.00   0.51    8.69   0.53
BIC          3.01   0.21    3.97   0.36    5.07   0.46    5.86   0.53    6.38   0.58
RIC          2.32   0.14    2.98   0.34    3.83   0.49    4.41   0.59    4.84   0.67
CIC          2.25   0.14    3.04   0.37    3.92   0.53    5.09   0.62    6.01   0.64
PACE2        3.47   0.20    5.46   0.41    6.89   0.51    8.19   0.57    9.41   0.58
PACE4        3.62   0.20    5.74   0.40    7.22   0.51    8.61   0.56    9.72   0.58
PACE6        3.76   0.16    5.46   0.34    6.99   0.46    8.17   0.51    8.99   0.54
x-random (k = 40, n = 80)
full (OLS)   40     1.06    40     1.06    40     1.06    40     1.06    40     1.06
null         0      0.41    0      2.60    0      3.22    0      3.85    0      4.43
AIC          8.46   0.48    10.2   0.55    12.1   0.67    13.1   0.75    13.9   0.80
BIC          3.94   0.30    5.51   0.38    6.74   0.52    7.55   0.64    8.06   0.71
RIC          2.27   0.26    3.64   0.31    4.85   0.50    5.25   0.64    5.60   0.76
CIC          1.22   0.30    3.38   0.31    4.47   0.53    5.17   0.67    6.03   0.76
PACE2        3.50   0.36    5.83   0.41    8.31   0.60    10.8   0.75    12.7   0.80
PACE4        4.55   0.34    6.45   0.40    9.08   0.59    11.7   0.74    13.4   0.80
PACE6        5.04   0.27    7.37   0.33    9.72   0.47    11.5   0.60    13.0   0.66

Table 6.8: Experiment 6.6, both the x-fixed and x-random situations.

6.4 Practical datasets

This section presents simulation studies on practical datasets.

Experiment 6.7 Practical datasets. The datasets used are chosen from several that are available. They are chosen for their relatively large number of continuous or binary variables, to highlight the need for subset selection and also in accordance with our $k$-asymptotic conclusions. To simplify the situation and focus on the basic idea, non-binary nominal variables originally contained in some datasets are discarded, and so are observations with missing values; future extensions of pace regression may take these issues into account.

Table 6.9 lists the names of these datasets, together with the number of observations $n$ and the number of variables $k$ in each dataset.

Dataset       n / k       Dataset      n / k
Autos         159 / 16    Horses       102 / 14
Bankbill      71 / 16     Housing      506 / 14
Bodyfat       252 / 15    Ozone        203 / 10
Cholesterol   297 / 14    Pollution    60 / 16
Cleveland     297 / 14    Prim9        500 / 9
Cpu           209 / 7     Uscrime      47 / 16

Table 6.9: Practical datasets used in Experiment 6.7.

The datasets Autos (Automobile), Cpu (Computer Hardware), and Cleveland (Heart Disease—Processed Cleveland)

are obtained from the Machine Learning Repository at UCI³ (Blake et al., 1998). Bankbill, Horses and Uscrime are from OzDASL⁴ (Australasian Data and Story Library). Bodyfat, Housing (Boston) and Pollution are from StatLib⁵. Ozone is from Leo Breiman's Web site⁶. Prim9 is available from the S-PLUS package. Cholesterol is used by Kilpatrick and Cameron-Jones (1998). Most of these datasets are widely used for testing numeric predictors.

³ http://www.ics.uci.edu/~mlearn/MLRepository.html
⁴ http://www.maths.uq.edu.au/~gks/data/index.html
⁵ http://lib.stat.cmu.edu/datasets/
⁶ ftp://ftp.stat.berkeley.edu/pub/users/breiman

Pace regression procedures, together with others, are applied to these datasets without much scrutiny. This is less than optimal, and often inappropriate. For example, some datasets are intrinsically nonlinear, and/or the noise component is not normally distributed.

Although the results obtained are generally supportive, careful interpretation is necessary. First, the finite situation may not be large enough to support our $k$-asymptotic conclusions, or the elements in the set $\{\hat k_j\}$ may be too separated to support meaningful updating (Section 4.2). In fact, it is well known that no estimator is uniformly optimal in all finite situations. Second, the choice of these datasets may not be extensive enough, or they could be biased in some unknown way. Third, since the true model is unknown and thus cross-validation results are used instead, there is no consensus on the validity of data resampling techniques (Section 2.6).

The experimental results are given in Table 6.10. Each figure is the average of twenty runs of 10-fold cross-validation, where the "Dim" column gives the average estimated dimensionality (excluding the intercept) and the "PE(%)" column the average squared prediction error relative to the sample variance of the response variable (i.e., the null model should have roughly 100% relative error).

             Autos           Bankbill        Bodyfat         Cholesterol
Procedure    Dim    PE(%)    Dim    PE(%)    Dim    PE(%)    Dim    PE(%)
NN-GAR.      10.6   21.5     10.39  7.45     8.10   2.76     10.38  101.1
LASSO(CV)    9.72   21.5     9.71   8.65     2.99   2.73     5.75   97.4
full (OLS)   15.00  21.4     15.00  7.86     14.00  2.75     13.00  97.1
null         0.00   100.5    0.00   101.01   0.00   100.48   0.00   100.4
AIC          7.82   23.6     9.64   8.76     3.38   2.71     4.31   97.4
BIC          3.87   23.9     8.73   8.99     2.24   2.67     2.54   97.6
RIC          3.57   23.6     8.02   9.78     2.24   2.67     2.65   97.0
CIC          5.00   24.5     12.54  8.56     2.23   2.68     2.23   99.9
PACE2        12.95  21.9     13.77  8.20     2.24   2.68     4.42   97.4
PACE4        12.99  21.8     13.80  8.20     2.25   2.68     4.50   97.3
PACE6        10.51  21.1     12.67  8.34     2.32   2.66     4.35   96.2

             Cleveland       Cpu             Horses          Housing
Procedure    Dim    PE(%)    Dim    PE(%)    Dim    PE(%)    Dim    PE(%)
NN-GAR.      9.60   47.6     5.10   25.5     4.90   109      7.50   28.0
LASSO(CV)    9.10   47.8     5.15   23.4     1.23   97       6.53   28.1
full (OLS)   13.00  48.2     6.00   18.4     13.00  106      13.0   28.3
null         0.00   100.3    0.00   100.7    0.00   102      0.0    100.3
AIC          8.24   50.2     5.06   18.0     3.87   108      11.0   28.0
BIC          5.31   51.3     4.94   17.9     2.50   106      10.7   28.8
RIC          5.41   51.3     5.00   17.7     2.14   108      10.8   28.5
CIC          11.46  49.1     6.00   18.4     0.29   107      11.9   28.2
PACE2        12.35  48.5     5.21   18.2     2.65   107      11.6   28.2
PACE4        12.35  48.6     5.21   18.2     2.68   108      11.6   28.2
PACE6        11.17  49.3     5.14   18.2     2.98   102      11.4   28.2

             Ozone           Pollution       Prim9           Uscrime
Procedure    Dim    PE(%)    Dim    PE(%)    Dim    PE(%)    Dim    PE(%)
NN-GAR.      5.78   32.2     7.69   69.0     6.90   58.3     8.81   60.9
LASSO(CV)    5.62   32.3     5.32   47.9     6.43   54.1     9.67   49.8
full (OLS)   9.00   31.6     15.00  60.8     7.00   53.5     15.00  52.1
null         0.00   100.6    0.00   102.1    0.00   100.2    0.00   102.9
AIC          5.54   31.6     7.24   58.3     7.00   53.5     6.91   49.3
BIC          3.60   34.3     5.04   65.3     4.13   54.6     5.26   49.3
RIC          4.24   33.5     4.51   66.3     5.87   54.8     3.81   52.7
CIC          5.92   31.7     5.64   67.1     7.00   53.5     5.00   53.4
PACE2        7.17   31.7     8.96   62.7     6.78   53.8     9.16   54.4
PACE4        7.17   31.7     9.06   62.9     6.85   53.8     9.21   54.9
PACE6        6.08   31.5     7.86   56.0     6.97   53.8     7.78   50.7

Table 6.10: Results for practical datasets in Experiment 6.7.

It can be seen from Table 6.10 that the pace estimators, especially PACE6, generally produce the best cross-validation results in terms of prediction error. This is consistent with the results on artificial datasets in the previous sections. LASSO(CV) also generally gives good results on these datasets, in particular Horses and Pollution, while NN-GAROTTE does not seem to perform so well.
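The evaluation protocol used for Table 6.10 can be sketched as follows; fit and predict are placeholders for any of the procedures compared.

```python
import numpy as np

def relative_pe(X, y, fit, predict, n_runs=20, n_folds=10, seed=0):
    """Average squared prediction error over twenty runs of 10-fold
    cross-validation, relative to the sample variance of the response
    (so the null model scores roughly 100)."""
    rng = np.random.default_rng(seed)
    n, total = len(y), 0.0
    for _ in range(n_runs):
        perm = rng.permutation(n)
        sse = 0.0
        for test in np.array_split(perm, n_folds):
            train = np.setdiff1d(perm, test)
            model = fit(X[train], y[train])
            sse += np.sum((y[test] - predict(model, X[test])) ** 2)
        total += sse / n
    return 100.0 * (total / n_runs) / np.var(y)
```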

6.5 Remarks

In this chapter, experiments have been carried out over many datasets, ranging from artificial to practical. Although these are all finite-$k$ situations, the pace estimators, especially PACE6, perform reasonably well in terms of squared error, in accordance with the $k$-asymptotic conclusions.

We also notice that the pace estimators, mainly PACE2 and PACE4, perform slightly worse than the best of the other selection estimators in some situations. This is the price that pace regression has to pay for estimating $G(k^*)$ in finite situations. In almost all cases, given the true $G(k^*)$, the experimental results are predictable.

Unlike the other modeling procedures used in the experiments, the pace estimators may have room for further improvement. This is because an estimator of the mixing distribution that is better than the currently used NNM may be available—e.g., the ML estimator. Even for the NNM estimator itself, the choices of potential support points, fitting intervals, empirical measure, and distance measure can all affect the results. The current choices are not necessarily optimal.

Chapter 7

Summary and final remarks

7.1 Summary

This thesis presents a new approach to fitting linear models, called "pace regression", which also overcomes the dimensionality determination problem introduced in Section 1.1. It endeavors to minimize the expected prediction loss, which leads to the $k$-asymptotic optimality of pace regression as established theoretically. In orthogonal spaces, it outperforms, in the sense of expected loss, many existing procedures for fitting linear models when the number of free parameters is infinitely large. Dimensionality determination turns out to be a natural by-product of the goal of minimizing expected loss. These theoretical conclusions are supported by the simulation studies in Chapter 6.

Pace regression consists of six procedures, denoted PACE1, ..., PACE6, which are optimal in their corresponding model spaces (Section 3.5). Among them, PACE5 is the best, but requires numerical integration and thus is computationally expensive. However, it can be approximated very well by PACE6, which provides consistently satisfactory results in our simulation studies.

The main idea of these procedures, explained in Chapter 3, is to decompose the initially estimated model—we always use the OLS full model—in an orthogonal space into dimensional models, which are statistically independent. These dimensional models are then employed to obtain an estimate of the distribution of the true effects in each individual dimension. This allows the initially estimated dimensional models to be upgraded by filtering out the estimated expected uncertainty, and hence overcomes the competition phenomena in model selection based on data explanation. The concept of "contribution", which is introduced in Section 3.3, can be used to indicate whether, and by how much, a model outperforms the null one. It is employed throughout the remainder of the thesis to explain the task of model selection and to obtain the pace regression procedures.

Chapter 4 discusses some related issues. Several major modeling principles are examined retrospectively in Section 4.5. Orthogonalization selection—a competition phenomenon analogous to model selection—is covered in Section 4.6. The possibility of updating signed projections is discussed in Section 4.7.

Although starting from the viewpoint of competing models, the idea of this work turns out to be similar to the empirical Bayes methodology pioneered by Robbins (1951, 1955, 1964), and has a close relation with Stein or shrinkage estimation (Stein, 1955; James and Stein, 1961). The shrinkage idea was explored in linear regression by Sclove (1968), Breiman (1995) and Tibshirani (1996), for example.

Through this thesis, we have gained a deep understanding of the problem of fitting linear models by minimizing expected quadratic loss. Key issues in fitting linear models are introduced in Chapter 2, and throughout the thesis major existing procedures for fitting linear models are reviewed and compared, both theoretically and empirically, with the proposed ones. They include OLS, AIC, BIC, RIC, CIC, CV($d$), BS($m$), RIDGE, NN-GAROTTE and LASSO.

Chapter 5 investigates the estimation of an arbitrary mixing distribution. Although this is an independent topic with its own value, it is an essential part of pace regression. Two minimum distance approaches based on the probability measure and on a nonnegative measure are introduced. While all the estimators are strongly consistent, the minimum distance approach based on a nonnegative measure is both computationally efficient and reliable. The maximum likelihood estimator can be reliable but not efficient; other minimum distance estimators can be efficient but not reliable. By “reliable” here we mean that no isolated data points are ignored in the estimated mixing distribution. This is vital for pace regression, because one single ignored point can cause unbounded loss. The predictive accuracy of these minimum distance procedures in the case of overlapping clusters is studied empirically in Section 5.6, giving results that generally favor the new methods.
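A minimal sketch of these estimators, using the mixing() function documented in the Appendix (assuming its sources are loaded; the data values are illustrative only):

     # A sample from a chi-squared(1) mixture, in two well-separated clusters.
     data <- c(5:15, 50:60)
     mixing(data, method = "nnm")    # nonnegative-measure-based: efficient and reliable
     mixing(data, method = "pm")     # probability-measure-based
     mixing(data, method = "choi")   # CDF-based (Choi and Bulgren, 1968)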

Finally, it is worth pointing out that the optimality of pace regression is only in the sense of D-asymptotics, and hence does not exclude the use of OLS, and perhaps many other procedures for fitting linear models, in situations of finite D, especially when D is small or there are few neighboring data points to support a reasonable updating.

7.2 Unsolved problems and possible extensions

In this section, we list some unsolved problems and possible extensions of pace regression—most have appeared earlier in the thesis. Since the issues covered in the thesis are general and have broadly-based potential, we have not attempted to generate an extensive list of extensions.

Orthogonalization selection  What is the effect of orthogonalization selection and of orthogonalization selection methods on pace regression (Section 4.6)?

Other noise distributions  Only normally distributed noise components are considered in the current implementation of pace regression. Extension to other types of distribution, even an empirical distribution, seems fairly straightforward.

Full model unavailable  How should the situation be handled when the full model is unavailable and the noise variance is estimated inaccurately (Section 4.4)?


The x-random model  It is known that the x-random model converges to the x-fixed model for large samples (Section 2.2). However, for small samples, the x-random assumption should in principle result in a different model, because it involves more uncertainty—namely sampling uncertainty.

Enumerated variables  Enumerated variables can be incorporated into linear regression, as well as into pace regression, using dummy variables. However, if the transformation into dummy variables relies on the response vector (so as to reduce dimensionality), it involves competition too. This time the competition occurs among the multiple labels of each enumerated variable. Note that binary variables do not have this problem and thus can be used straightforwardly in pace regression as numeric variables.

Missing values  Missing values are another kind of unknown: they could be filled in so as to minimize the expected loss.

Other loss functions  In many applications concerning gains in monetary units, minimizing the expected absolute deviation, rather than the quadratic loss considered here, is more appropriate.

Other model structures  The ideas of pace regression are applicable to many model structures, such as those mentioned in Section 1.1—although they may require some adaptation.

Classification  Classification differs from regression in that it uses a discrete, often unordered, prediction space, rather than a continuous one. Both share similar methodology: a method that is applicable in one paradigm is often applicable in the other too. We would like to extend the ideas of pace regression to solve classification problems too.

Statistically dependent competing models  Future research may extend into the broader arena concerning general issues of statistically dependent competing models (Section 7.3).


7.3 Modeling methodologies

In spite of rigorous mathematical analysis, empirical Bayes (including Stein estimation) is often criticized from a philosophical viewpoint. One seemingly obvious problem, which is still controversial today (see, e.g., discussions following Brown, 1990), is that completely irrelevant (or statistically unrelatedly distributed) events, given in arbitrary measurement units, can be employed to improve estimation of each other. Efron and Morris (1973a, p. 379) comment on this phenomenon as follows:

     Do you mean that if I want to estimate tea consumption in Taiwan I will do better to estimate simultaneously the speed of light and the weight of hogs in Montana?

The empirical Bayes methodology contradicts some well-known statistical principles, such as conditionality, sufficiency and invariance.

In a non-technical introduction to Stein estimation, Efron and Morris (1977, pp. 125-126), in the context of an example concerning rates of an endemic disease in cities, concluded that “the James-Stein method gives better estimates for a majority of cities, and it reduces the total error of estimation for the sum of all cities. It cannot be demonstrated, however, that Stein’s method is superior for any particular city; in fact, the James-Stein prediction can be substantially worse.” This clearly indicates that estimation should depend on the loss function employed in the specific application—a notion consistent with statistical decision theory.

Two different loss functions are of interest here. One is the loss concerning a particular player, and the other the overall loss concerning a group of players. Successful application of empirical Bayes implies that it is reasonable to employ overall loss in this application. The further assumption of statistical independence of the players’ performance is needed in order to apply the mathematical tools that are available.


The second type of loss concerns the performance of a team of players. Although employing information about all players’ past performance through empirical Bayes estimation may or may not improve the estimation of a particular player’s true ability, it is always (technically, almost surely) the correct approach—in other words, asymptotically optimal using general empirical Bayes estimation and finitely better using Stein estimation—to estimating a team’s true ability and hence to predicting its future performance.

A linear model—or any model—with multiple free parameters is such a team, each parameter being a player. Note that parameters could be completely irrelevant, but they are bound together into a team by the prediction task. This may seem slightly beyond the existing empirical Bayes methods, since the performance of these players could be, and often is, statistically dependent. The general question should be how to estimate the team’s true ability, given statistically dependent performance of players.

Pace regression handles statistical dependence using dummy players, namely the absolute dimensional distances of the estimated model in an orthogonal space. This approach provides a general way of eliminating statistical dependence. There are other possible ways of defining dummy players, and from the empirical Bayes viewpoint it is immaterial how they are defined as long as they are statistically independent of each other. The choice of A_1, ..., A_k in pace regression has these advantages: (1) they share the same scale of signal-to-noise ratio, and thus the estimation is invariant of the choice of measurement units; (2) the mixture distribution is well defined, with mathematical solutions available; (3) they are simply related to the loss function of prediction (in the x-fixed situation); (4) many applications have some zero, or almost zero, A*_j’s, and updating each corresponding A_j often improves estimation and reduces dimensionality (see Section 4.2).
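These quantities are directly accessible in the implementation: a fitted "pace" object stores the observed absolute distances and their updated values in the fields A and At (see the help files in the Appendix). A minimal sketch, assuming the Appendix sources are loaded:

     x <- matrix(rnorm(500), ncol = 10)
     y <- x %*% c(rep(0, 5), rep(0.5, 5)) + rnorm(50)
     fit <- pace(x, y)
     fit$A    # observed absolute distances, on the signal-to-noise scale
     fit$At   # the same distances after updating; many are shrunk to zero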

Nevertheless, using dummy players in this way is not a perfect solution. When the determination of the orthogonal basis relies on the observed response, it suffers from the orthogonalization selection problem (Section 4.6). Information about the statistical dependence of the original players might be available, or simply estimable from data. If so, it might be employable to improve the estimate of the team’s true ability.

Estimation of a particular player’s true ability using the loss function concerning this single player is actually estimation given the player’s name, which is different from estimation given the player’s performance, where the performance of all players in a team is mixed together with their names lost or ignored. This is because the latter addresses the team ability given that performance, thus sharing the same methodology as estimating team ability. For example, pre-selecting parameters based on performance (i.e., data explanation) affects the estimation of the team ability; see Section 4.4.

This point generalizes to situations with many competing models. When the final model is determined based on its performance—rather than on some prior belief—after seeing all the models’ performance, the estimation of its true ability should take into account all candidates’ performance. In fact, the final model, assumed optimal, does not have to be one of the candidate models—it can be a modification of one model or a weighted combination of all candidates.

Modern computing technology allows the investigation of larger and larger data sets and the exploration of increasingly wide model spaces. Data mining (see, e.g., Fayyad et al., 1996; Friedman, 1997; Witten and Frank, 1999), as a burgeoning new technology, often produces many candidate models from computationally intensive algorithms. While this can help to find more appropriate models, the chance effects increase. How to determine the optimal model given a number of competing ones is precisely our concern above.

The subjective prior used in Bayesian analysis concerns the specification of the model space in terms of probabilistic weights, thus relating to Fisher’s first problem (see Section 1.1). Without a prior, the model space would otherwise be specified implicitly with (for example) a uniform distribution. Both empirical Bayes and pace regression fully exploit data after all prior information—including specification of the model space—is given. Thus both belong to Fisher’s second problem type. The story told by empirical Bayes and pace regression is that the data contains all necessary information for the convergence of optimal predictive modeling, while the tips given by any prior information—the correctness of which is always subject to practical examination—play at most a role of speeding up this convergence.

“Torturing” data is deemed to be unacceptable modeling practice (see Section 1.3). In fact, however, empirical modeling always tortures the data to some extent—the larger the model space, the more the data is tortured. Since some model space, large or small, is indispensable, the correct approach is to take the effect of the necessary data torturing into account within the modeling procedure. This is what pace regression does. And it does so successfully.


Appendix

Help files and source code

The ideas presented in the thesis are formally implemented in the computer languages of S-PLUS/R and FORTRAN 77. The whole program runs under versions 3.4 and 5.1 of S-PLUS (MathSoft, Inc.[1]), and under version 1.1.1 of R.[2] Help files for the most important S-PLUS functions are included. The program is free software; it can be redistributed and/or modified under the terms of the GNU General Public License (version 2) as published by the Free Software Foundation.[3]

One extra software package is utilized in the implementation, namely the elegant NNLS algorithm by Lawson and Hanson (1974, 1995), which is public domain and obtainable from NetLib.[4]

[1] http://www.mathsoft.com/
[2] http://lib.stat.cmu.edu/R/CRAN/
[3] Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. http://www.gnu.org/
[4] http://www.netlib.org/lawson-hanson/all

The source code is organized in six files (five in S-PLUS/R and one in FORTRAN 77):

disctfun.q  contains functions that handle discrete functions. They are mainly employed in our work to represent discrete pdfs; in particular, the discrete, arbitrary mixing distributions.


ls.q  is basically an S-PLUS/R interface to functions in FORTRAN for solving linear system problems, including QR transformation, non-negative linear regression, non-negative linear regression with equality constraints, and upper-triangular equation solution.

mixing.q  includes functions for the estimation of arbitrary mixing distributions.

pace.q  has functions about pace regression and for handling objects of class “pace”.

util.q  gives a few small functions.

ls.f  has functions in FORTRAN for solving some linear system problems.
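As a rough sketch of how these files might be brought into an R session (the compilation step for the FORTRAN file is platform-dependent, and the shared-library name below is an assumption, not part of the distribution):

     # Hypothetical build and load sequence: `R CMD SHLIB ls.f` would
     # produce ls.so on Unix-like systems (ls.dll on Windows).
     dyn.load("ls.so")    # make the FORTRAN routines available to .Fortran()
     for (f in c("util.q", "disctfun.q", "ls.q", "mixing.q", "pace.q"))
         source(f)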


A.1 Help files

bmixing:

See mixing.

disctfun:

Constructor of a Discrete Function

DESCRIPTION:
     Returns a univariate discrete function which is always zero
     except over a discrete set of data points.

USAGE:
     disctfun(data=NULL, values=NULL, classname, normalize=F,
              sorting=F)

REQUIRED ARGUMENTS:
     None.

OPTIONAL ARGUMENTS:
     data: The set of data points over which the function takes
          non-zero values.
     values: The set of the function values over the specified
          data points.
     classname: Extra class names; e.g., c("sorted", "normed"),
          if data are sorted and values are L1-normalized on
          input.
     normalize: Will L1-normalize values (so as to make a
          discrete pdf function).
     sorting: Will sort data.

VALUE:
     Returns an object of class "disctfun".

DETAILS:
     Each discrete function is stored in an (n x 2) matrix, where
     the first column stores the data points over which the
     function takes non-zero values, and the second column stores
     the corresponding function values at these points.

SEE ALSO:
     disctfun.object, mixing.

EXAMPLES:
     disctfun()
     disctfun(c(1:5,10:6), 10:1, normalize=T, sorting=T)


disctfun.object:

Discrete Function Object

DESCRIPTION:
     These are objects of class "disctfun". They represent the
     discrete functions which take non-zero values over a
     discrete set of data points.

GENERATION:
     This class of objects can be generated by the constructor
     disctfun.

METHODS:
     The "disctfun" class of objects has methods for these
     generic functions: sort, print, is, normalize, unique, plot,
     "+", "*", etc.

STRUCTURE:
     Each discrete function is stored in an (n x 2) matrix, where
     the first column stores the data points over which the
     function takes non-zero values, and the second column stores
     the corresponding function values at these points.

     Since it is a special matrix, methods for matrices can be
     used.

SEE ALSO:
     disctfun, mixing.

fintervals:

Fitting Intervals

DESCRIPTION:
     Returns a set of fitting intervals given the set of fitting
     points.

USAGE:
     fintervals(fp, rb=0, cn=3, dist=c("chisq", "norm"),
                method=c("nnm", "pm", "choi", "cramer"),
                minb=0.5)

REQUIRED ARGUMENTS:
     fp: fitting points. May be unsorted.

OPTIONAL ARGUMENTS:
     rb: Only used for the Chi-squared distribution. No fitting
          intervals' endpoints are allowed to fall into the
          interval (0,rb). If the left endpoint does, it is
          replaced by 0; if the right endpoint does, it is
          replaced by rb. This is useful for the function
          mixing() when ne != 0.


     cn: Only for the normal distribution. The length of each
          fitting interval is 2*cn.
     dist: type of distribution; only implemented for non-central
          Chi-square ("chisq") with one degree of freedom and the
          normal distribution with variance 1 ("norm").
     method: One of "nnm", "pm", "choi" and "cramer"; see mixing.
     minb: Only for the Chi-squared distribution and method
          "nnm". No right endpoint is allowed between (0, minb);
          such intervals are deleted if generated.

VALUE:
     fitting intervals (stored in a two-column matrix).

SEE ALSO:
     mixing, spoints.

EXAMPLES:
     fintervals(1:10)
     fintervals(1:10, rb=2)

lsqr:

QR Transformation of the Least Squares Problem

DESCRIPTION:
     QR-transform the least squares problem

          A x = b
     into
          R x = t(Q) b
     where
          A = Q R, and
          R is an upper-triangular matrix (could be
          rank-deficient),
          Q is an orthogonal matrix.

USAGE:
     lsqr(a,b,pvt,ks=0,tol=1e-6,type=1)

REQUIRED ARGUMENTS:
     a: matrix A
     b: vector b

OPTIONAL ARGUMENTS:
     pvt: column pivoting vector; only columns specified in pvt
          are used in the QR transformation. If not provided, all
          columns are considered.
     ks: the first ks columns specified in pvt are QR-transformed
          in the same order. (But they may be deleted due to rank
          deficiency.)
     tol: tolerance for collinearity
     type: used for finding the most suitable pivoting column.
          1  based on explanation ratio.
          2  based on largest diagonal element.


VALUE:
     a: R (including the zeroed elements)
     b: t(Q) b
     pvt: pivoting index
     ks: pseudorank of R (determined by the value of tol)
     w: explanation matrix.
          w[1,]  sum of squares in a[,j] and b[,j]
          w[2,]  sum of unexplained squares

DETAILS:
     This is an S-Plus/R interface to the function LSQR in
     FORTRAN. Refer to the source code of this function for more
     detail.

REFERENCES:
     Lawson, C. L. and Hanson, R. J. (1974, 1995). "Solving Least
     Squares Problems", Prentice-Hall.

     Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart,
     G. W. (1979). "LINPACK Users' Guide." Philadelphia, PA: SIAM
     Publications.

     Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel,
     J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling,
     S., McKenney, A., Sorensen, D. (1999). "LAPACK Users'
     Guide." Third edition. Philadelphia, PA: SIAM Publications.

EXAMPLES:
     a <- matrix(rnorm(500),ncol=10)
     b <- 1:50 / 50
     lsqr(a,b)

mixing, bmixing:

Minimum Distance Estimator of an Arbitrary Mixing Distribution

DESCRIPTION:
     The minimum distance estimation of an arbitrary mixing
     distribution based on a sample from the mixture
     distribution, given the component distribution.

USAGE:
     mixing(data, ne=0, t=3, dist=c("chisq", "norm"),
            method=c("nnm", "pm", "choi", "cramer"))
     bmixing(data, ne = 0, t = 3, dist = c("chisq", "norm"),
             method = c("nnm", "pm", "choi", "cramer"),
             minb = 0.5)

REQUIRED ARGUMENTS:
     data: A sample from the mixture distribution.


OPTIONAL ARGUMENTS:
     ne: number of extra points with small values; only used for
          the Chi-squared distribution.
     t: threshold for support points; only used for the
          Chi-squared distribution.
     dist: type of distribution; only implemented for non-central
          Chi-square ("chisq") with one degree of freedom and the
          normal distribution with variance 1 ("norm").
     method: one of the four methods:
          "nnm": based on nonnegative measures (Wang, 2000)
          "pm": based on probability measures (Wang, 2000)
          "choi": based on CDFs (Choi and Bulgren, 1968)
          "cramer": Cramer-von Mises statistic, also CDF-based
               (MacDonald, 1971).
          The default is "nnm", which, unlike the other three,
          does not have the minority cluster problem.
     minb: Only for the Chi-squared distribution and method
          "nnm". No right endpoint is allowed between (0, minb);
          such intervals are deleted if any.

VALUE:
     the estimated mixing distribution as an object of class
     "disctfun".

DETAILS:
     The separability of data points is tested by the function
     mixing() using the function separable(). Each block is
     handled by the function bmixing().

REFERENCES:
     Choi, K., and Bulgren, W. B. (1968). An estimation procedure
     for mixtures of distributions. J. R. Statist. Soc. B, 30,
     444-460.

     MacDonald, P. D. M. (1971). Comment on a paper by Choi and
     Bulgren. J. R. Statist. Soc. B, 33, 326-329.

     Wang, Y. (2000). A new approach to fitting linear models in
     high dimensional spaces. PhD thesis, Department of Computer
     Science, University of Waikato, New Zealand.

SEE ALSO:
     pace, disctfun, separable, spoints, fintervals.

EXAMPLES:
     mixing(1:10)                   # mixing distribution of nchisq
     bmixing(1:10)
     mixing(c(5:15, 50:60))         # two main blocks
     bmixing(c(5:15, 50:60))
     mixing(c(5:15, 50:60), ne=20)  # ne affects the estimation
     bmixing(c(5:15, 50:60), ne=20)


nnls, nnlse:

Problems NNLS and NNLSE

DESCRIPTION:
     Problem NNLS (nonnegative least squares):

          Minimize    || A x - b ||
          subject to  x >= 0

     Problem NNLSE (nonnegative least squares with equality
     constraints):

          Minimize    || A x - b ||
          subject to  E x = f
          and         x >= 0

USAGE:
     nnls(a,b)
     nnlse(a, b, e, f)

REQUIRED ARGUMENTS:
     a: matrix A
     b: vector b
     e: matrix E
     f: vector f

VALUE:
     x: the solution vector
     rnorm: the Euclidean norm of the residual vector.
     index: defines the sets P and Z as follows:
          P: index[1:nsetp]
          Z: index[(nsetp+1):n]
     mode: the success-failure flag with the following meaning:
          1  The solution has been computed successfully.
          2  The dimensions of the problem are bad, either
             m <= 0 or n <= 0.
          3  Iteration count exceeded. More than 3*n iterations.

DETAILS:
     This is an S-Plus/R interface to the algorithm NNLS proposed
     by Lawson and Hanson (1974, 1995), and to the algorithm
     NNLSE by Haskell and Hanson (1981) and Hanson and Haskell
     (1982).

     Problem NNLSE converges to Problem NNLS by considering

          Minimize    || ( E  )     ( f  ) ||
                      || (    ) x - (    ) ||
                      || (e A )     (e b ) ||
          subject to  x >= 0

     as e --> 0+.

     Here the implemented nnlse only considers the situation that
     xj >= 0 for all j, while the original algorithm allows
     restriction on some xj only.


     The two functions nnls and nnlse are tested based on the
     source code of NNLS (in FORTRAN 77), which is public domain
     and obtained from http://www.netlib.org/lawson-hanson/all.

REFERENCES:
     Lawson, C. L. and Hanson, R. J. (1974, 1995). "Solving Least
     Squares Problems", Prentice-Hall.

     Haskell, K. H. and Hanson, R. J. (1981). An algorithm for
     linear least squares problems with equality and
     nonnegativity constraints. Math. Prog. 21, pp. 98-118.

     Hanson, R. J. and Haskell, K. H. (1982). Algorithm 587. Two
     algorithms for the linearly constrained least squares
     problem. ACM Trans. on Math. Software, Sept. 1982.

EXAMPLES:
     a <- matrix(rnorm(500),ncol=10)
     b <- 1:50 / 50
     nnls(a,b)
     e <- rep(1,10)
     f <- 1
     nnlse(a,b,e,f)

nnlse:

See nnls.

pace:

Pace Estimator of a Linear Regression Model

DESCRIPTION:
     Returns an object of class "pace" that represents a fit of a
     linear model.

USAGE:
     pace(x, y, va, yname=NULL, ycol=0, pvt=NULL, kp=0,
          intercept=T, ne=0, method=<<see below>>, tau=2,
          tol=1e-6)

REQUIRED ARGUMENTS:
     x: regression matrix. It is a matrix, or a data.frame, or
          anything else that can be coerced into a matrix through
          the function as.matrix().


OPTIONAL ARGUMENTS:
     y: response vector. If y is missing, the last column in x or
          the column specified by ycol is the response vector.
     va: variance of the noise component.
     yname: name of the response vector.
     ycol: column number in x for the response vector, if y is
          not provided.
     pvt: pivoting vector which stores the column numbers that
          will be considered in the estimation. When NULL
          (default), all columns are taken into account.
     kp: the ordering of the first kp (=0, by default) columns
          specified in pvt is pre-chosen. The estimation will not
          change their ordering, but they are subject to
          rank-deficiency examination.
     intercept: whether an intercept should be added to the
          regression matrix.
     ne: number of extra variables which are pre-excluded before
          calling this function due to small effects (i.e., A's).
          Only used for the pace methods and cic.
     method: one of the methods "pace6" (default), "pace2",
          "pace4", "olsc", "ols", "full", "null", "aic", "bic",
          "ric", "cic". See DETAILS and REFERENCES below.
     tau: threshold value for the method "olsc(tau)".
     tol: tolerance threshold for collinearity; = 1e-6, by
          default.

VALUE:
     An object of class "pace" is returned. See pace.object for
     details.

DETAILS:
     For methods pace2, pace4, pace6 and olsc, refer to Wang
     (2000) and Wang et al. (2000). Further, n-asymptotically,
     aic = olsc(2), bic = olsc(log(n)), and ric = olsc(2 log(k)),
     where n is the number of observations and k is the number of
     free parameters (or candidate variables).

     Method aic is proposed by Akaike (1973).

     Method bic is by Schwarz (1978).

     For method ric, refer to Donoho and Johnstone (1994) and
     Foster and George (1994).

     Method cic is due to Tibshirani and Knight (1997).

     Method ols is the ordinary least squares estimation
     including all parameters after eliminating
     (pseudo-)collinearity, while method full is the ols without
     eliminating collinearity. Method null returns the mean of
     the response, included for reasons of completeness.

REFERENCES:
     Akaike, H. (1973). Information theory and an extension of
     the maximum likelihood principle. Proc. 2nd Int. Symp.
     Inform. Theory, suppl. Problems of Control and Information
     Theory, pp. 267-281.


     Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial
     adaptation via wavelet shrinkage. Biometrika 81, 425-455.

     Foster, D. and George, E. (1994). The risk inflation
     criterion for multiple regression. Annals of Statistics, 22,
     1947-1975.

     Schwarz, G. (1978). Estimating the dimension of a model.
     Annals of Statistics, 6, 461-464.

     Tibshirani, R. and Knight, K. (1997). The covariance
     inflation criterion for model selection. Technical report,
     November 1997, Department of Statistics, University of
     Stanford.

     Wang, Y. (2000). A new approach to fitting linear models in
     high dimensional spaces. PhD thesis, Department of Computer
     Science, University of Waikato, New Zealand.

     Wang, Y., Witten, I. H. and Scott, A. (2000). Pace
     regression. (Submitted.)

SEE ALSO:
     pace.object, predict.pace

EXAMPLES:
     x <- matrix(rnorm(500), ncol=10)   # completely random
     pace(x)   # equivalently, pace(x, method="pace6")
     for(i in c("ols","null","aic","bic","ric","cic","pace2",
                "pace4","pace6")) {
         cat("---- METHOD", i, "----\n\n")
         print(pace(x, method=i))
     }
     y <- x %*% c(rep(0,5), rep(.5,5)) + rnorm(50)
                                        # mixed effect response
     for(i in c("ols","null","aic","bic","ric","cic","pace2",
                "pace4","pace6")) {
         cat("---- METHOD", i, "----\n\n")
         print(pace(x, y, method=i))
     }
     pace(x,y,method="olsc",tau=5)
     pace(x,y, ne=20)   # influence of extra variables

pace.object:

Pace Regression Model Object

DESCRIPTION:
     These are objects of class "pace". They represent the fit of
     a linear regression model by the pace estimator.

GENERATION:
     This class of objects is returned from the function pace to
     represent a fitted linear model.

METHODS:
     The "pace" class of objects (at the moment) has methods for
     these generic functions: print, predict.

STRUCTURE:
     coef: fitted coefficients of the linear model.
     pvt: pivoting columns in the training set.
     ycol: the column in the training set considered as the
          response vector.
     intercept: whether an intercept is used.
     va: given or estimated noise variance (i.e., the unbiased
          OLS estimate).
     ks: the final model dimensionality.
     kc: pseudo-rank of the design matrix (determined by the
          value of tol).
     ndims: total number of dimensions.
     nobs: number of observations.
     ne: number of extra variables with small effects.
     call: function call.
     A: observed absolute distances of the OLS full model.
     At: updated absolute distances.
     mixing: estimated mixing distribution from the observed
          absolute distances.

SEE ALSO:
     pace, predict.pace, mixing

predict.pace:

Predicts New Observations Using a Pace Linear Model Estimate

DESCRIPTION:
     Returns the predicted values for the response variable.

USAGE:
     predict.pace(p, x, y)

REQUIRED ARGUMENTS:
     p: an object of class "pace".
     x: test set (may or may not contain the response vector).

OPTIONAL ARGUMENTS:
     y: response vector (if not stored in x).

VALUE:
     pred: predicted response vector.
     resid: residual vector (if applicable).

SEE ALSO:
     pace, pace.object.

EXAMPLES:
     x <- matrix(rnorm(500), ncol=10)
     p <- pace(x)


     xtest <- matrix(rnorm(500), ncol=10)
     predict(p, xtest)

rsolve:

Solving an Upper-Triangular Equation

DESCRIPTION:
     Solve the upper-triangular equation

          X * b = y

     where X is an upper-triangular matrix (could be
     rank-deficient). Here the columns of X may be pre-indexed.

USAGE:
     rsolve(x,y,pvt,ks)

REQUIRED ARGUMENTS:
     x: matrix X
     y: vector y

OPTIONAL ARGUMENTS:
     pvt: column pivoting vector; only columns specified in pvt
          are used in solving the equation. If not provided, all
          columns are considered.
     ks: the first ks columns specified in pvt are QR-transformed
          in the same order.

VALUE:
     the solution vector b

DETAILS:
     This is an S-Plus/R interface to the function RSOLVE in
     FORTRAN. Refer to the source code of this function for more
     detail.

REFERENCES:
     Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart,
     G. W. (1979). "LINPACK Users' Guide." Philadelphia, PA: SIAM
     Publications.

     Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling,
     W. T. (1994). "Numerical Recipes in C." Cambridge University
     Press.

     Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel,
     J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling,
     S., McKenney, A., Sorensen, D. (1999). "LAPACK Users'
     Guide." Third edition. Philadelphia, PA: SIAM Publications.

EXAMPLES:
     a <- matrix(rnorm(500),ncol=10)


     b <- 1:50 / 50
     fit <- lsqr(a,b)
     rsolve(fit$a,fit$b,fit$pvt)

separable:

Separability of a Point from a Set of Points

DESCRIPTION:
     Tests if a point could be separable from a set of data
     points, based on probability, for the purpose of estimating
     a mixing distribution.

USAGE:
     separable(data, x, ne=0, thresh=0.1,
               dist=c("chisq", "norm"))

REQUIRED ARGUMENTS:
     data: incrementally sorted data points.
     x: a single point, satisfying that either x <= min(data) or
          x >= max(data).

OPTIONAL ARGUMENTS:
     ne: number of extra data points of small values (= 0, by
          default). Only used for the chisq distribution.
     thresh: a threshold value based on probability (= 0.1, by
          default). A larger value implies easier separability.
     dist: type of distribution; only implemented for non-central
          Chi-square ("chisq") with one degree of freedom and the
          normal distribution with variance 1 ("norm").

VALUE:
     T or F, implying whether x is separable from data.

DETAILS:
     Used by the estimation of a mixing distribution so that the
     whole set of data points can be separated into blocks to
     speed up estimation.

REFERENCES:
     Wang, Y. (2000). A new approach to fitting linear models in
     high dimensional spaces. PhD thesis, Department of Computer
     Science, University of Waikato, New Zealand.

SEE ALSO:
     mixing, pace

EXAMPLES:
     separable(5:10, 25)         # Returns T
     separable(5:10, 25, ne=20)  # Returns F, for extra small
                                 # values not included in data.


spoints:

Support Points

DESCRIPTION:
     Returns the set of candidate support points for the minimum
     distance estimation of an arbitrary mixing distribution.

USAGE:
     spoints(data, ne=0, dist=c("chisq", "norm"), ns=100, t=3)

REQUIRED ARGUMENTS:
     data: A sample of the mixture distribution.

OPTIONAL ARGUMENTS:
     ne: number of extra points with small values. Only used for
          the Chi-squared distribution.
     dist: type of distribution; only implemented for non-central
          Chi-square ("chisq") with one degree of freedom and the
          normal distribution with variance 1 ("norm").
     ns: Maximum number of support points. Only for the normal
          distribution.
     t: No support points inside (0, t).

VALUE:
     Support points (stored as a vector).

SEE ALSO:
     mixing, fintervals.

EXAMPLES:
     spoints(1:10)


A.2 Source code

disctfun.q:

# --------------------------------------------------------------
#
# Functions for manipulating discrete functions
#
# --------------------------------------------------------------

disctfun <- function(data=NULL, values=NULL, classname,
                     normalize = F, sorting = F)
{
  if( missing(data) ) n <- 0
  else n <- length(data)
  if( n == 0) {                        # empty disctfun
    if(is.null( version$language ) ) { # S language
      d <- NULL
      class(d) <- "disctfun"
      return (d)
    }
    else {                             # R language
      d <- matrix(nrow=0, ncol=2, dimnames =
                  list(NULL, c("Points", "Values")))
      class(d) <- "disctfun"
      return (d)
    }
  }
  else {
    if( missing(values) ) values <- rep(1/n, n)
    else if (length(values) != n)
      stop("Lengths of data and values must match.")
    else if( normalize ) values <- values / sum(values)
    d <- array(c(data, values), dim=c(n,2), dimnames =
               list(paste("[",1:n,",]",sep=""),
                    c("Points", "Values")))
    class(d) <- "disctfun"
  }
  if( ! missing(classname) )
    class(d) <- unique( c(class(d), classname) )
  if( missing(values) || normalize )
    class(d) <- unique( c(class(d), "normed") )
  if( is.sorted(data) )
    class(d) <- unique( c(class(d), "sorted") )
  if( sorting && !is.sorted(d) ) sort.disctfun(d)
  else d
}

sort.disctfun <- function(d)
{
  if (length(d) == 0) return (d)
  if( is.sorted(d) ) return (d)
  n <- dim(d)[1]
  if( n != 1) {
    i <- sort.list(d[,1])
    cl <- class(d)
    d <- array(d[i,], dim=c(n,2), dimnames =
               list(paste("[",1:n,",]",sep=""),
                    c("Points", "Values")))
    class(d) <- cl
  }
  class(d) <- unique( c(class(d), "sorted") )
  d
}

is.disctfun <- function(d)
{
  if( is.na( match("disctfun", class(d) ) ) ) F
  else T
}

normalize.disctfun <- function(d)
{
  d[,2] <- d[,2] / sum(d[,2])
  class(d) <- unique( c(class(d), "normed") )
  d
}

print.disctfun <- function(d)
{
  if( length(d) == 0 ) cat("disctfun()\n")
  else if(dim(d)[1] == 0) cat("disctfun()\n")
  else print(unclass(d))
  invisible(d)
}

unique.disctfun <- function(d)
# On input, d is sorted.
{
  count <- 1
  if(dim(d)[1] >= 2) {
    for ( i in 2:dim(d)[1] )
      if( d[count,1] != d[i,1] ) {
        count <- count + 1
        d[count,] <- d[i,]
      }
      else d[count,2] <- d[count,2] + d[i,2]
    disctfun(d[1:count,1], d[1:count,2], "sorted")
  }
  else {
    class(d) <- c(class(d), "sorted")
    d
  }
}

"+.disctfun" <- function(d1,d2)
{
  if(! is.disctfun(d1) | ! is.disctfun(d2) )
    stop("the argument(s) not disctfun.")
  if( length(d1) == 0) d2
  else if (length(d2) == 0) d1
  else {
    d <- disctfun( c(d1[,1],d2[,1]), c(d1[,2],d2[,2]), sort = T )
    # In R, unique is not a generic function
    unique.disctfun(d)
  }
}

"*.disctfun" <- function(x,d)
{
  if( is.disctfun(d) ) {
    if( length(x) == 1 )
      { if( length(d) != 0 ) d[,2] <- x * d[,2] }
    else stop("Inappropriate type of argument.")
    # this seems necessary for R
    class(d) <- unique(c("disctfun", class(d)))
    d
  }
  else if( is.disctfun(x) ) {
    if( length(d) == 1 )
      { if( length(x) != 0 ) x[,2] <- d * x[,2] }
    else stop("Inappropriate type of argument.")
    # this seems necessary for R
    class(x) <- unique(c("disctfun", class(x)))
    x
  }
}

plot.disctfun <- function(d, xlab="x", ylab="y", lty=1, ...)
{
  if( length(d) != 0)
    plot(d[,1], d[,2], xlab=xlab, ylab=ylab, type="h", ...)
  else cat("NULL (disctfun); nothing plotted.\n")
}

ls.q:

# --------------------------------------------------------------
#
# Functions for solving linear system problems.
#
# --------------------------------------------------------------

nnls <- function(a,b)
{
  m <- as.integer(dim(a)[1])
  n <- as.integer(dim(a)[2])
  storage.mode(a) <- "double"
  storage.mode(b) <- "double"
  x <- as.double(rep(0, n))        # only for output
  rnorm <- 0
  storage.mode(rnorm) <- "double"  # only for output
  w <- x                           # n-vector of working space
  zz <- b                          # m-vector of working space
  index <- as.integer(rep(0,n))
  mode <- as.integer(0)            # = 1, success
  .Fortran("nnls",a,m,m,n,b,x=x,rnorm=rnorm,w,zz,index=index,
           mode=mode)[c("x","rnorm","index","mode")]
}

nnls.disctfun <- function(s, sp)
{
  index <- s$index[s$x[s$index] != 0]
  disctfun(sp[index], s$x[index], normalize=T, sorting=T)
}

nnlse <- function(a, b, e, f)
{
  eps <- 1e-3   # eps is adjustable.
  nnls(rbind(e, a * eps), c(f, b * eps))
}

lsqr <- function(a,b,pvt,ks=0,tol=1e-6,type=1)
{
  if ( ! is.matrix(b) ) dim(b) <- c(length(b),1)
  if ( ! is.matrix(a) ) dim(a) <- c(1,length(a))
  m <- as.integer(dim(a)[1])
  n <- as.integer(dim(a)[2])
  if ( m != dim(b)[1] )
    stop("dim(a)[1] != dim(b)[1] in lsqr()")
  nb <- as.integer(dim(b)[2])
  if(missing(pvt) || is.null(pvt)) pvt <- 1:n
  else if( ! correct.pvt(pvt,n) ) stop("mis-specified pvt.")
  kp <- length(pvt)
  kn <- 0
  storage.mode(kn) <- "integer"
  storage.mode(kp) <- "integer"
  storage.mode(a) <- "double"
  storage.mode(b) <- "double"
  storage.mode(pvt) <- "integer"
  storage.mode(ks) <- "integer"
  storage.mode(tol) <- "double"
  storage.mode(type) <- "integer"
  if(ks < 0 || ks > kp) stop("ks out of range in lsqr()")
  w <- matrix(0, nrow = 2, ncol = n+nb)   # working space
  storage.mode(w) <- "double"
  .Fortran("lsqr", a=a,m,m,n,b=b,m,nb,pvt=pvt,kp,ks=ks,w=w,tol,
           type)[c("a","b","pvt","ks","w")]
}

correct.pvt <- function(pvt,n) {
  ! ( max(pvt) > n ||        # out of range
      min(pvt) < 1 ||        # out of range
      any(duplicated(pvt))   # has duplicates
    )
}

rsolve <- function(x,y,pvt,ks)
{
  if ( ! is.matrix(y) ) dim(y) <- c(length(y),1)
  if ( ! is.matrix(x) ) dim(x) <- c(1,length(x))
  m <- as.integer(dim(x)[1])
  n <- as.integer(dim(x)[2])
  mn <- min(m,n)
  ny <- as.integer(dim(y)[2])
  if(missing(pvt)) {
    pvt <- 1:mn
  }
  if ( missing(ks) ) ks <- length(pvt)
  if(ks < 0 || ks > mn) stop("ks out of range in rsolve()")
  storage.mode(x) <- "double"
  storage.mode(y) <- "double"
  storage.mode(pvt) <- "integer"
  storage.mode(ks) <- "integer"
  .Fortran("rsolve", x,m,y=y,m,ny,pvt,ks)$y[1:ks,]
}

mixing.q:

# --------------------------------------------------------------
#
# Functions for the estimation of a mixing distribution
#
# --------------------------------------------------------------

spoints <- function(data, ne = 0, dist = c("chisq", "norm"),
                    ns = 100, t = 3)
{
  if( ! is.sorted(data) ) data <- sort(data)
  dist <- match.arg(dist)
  n <- length(data)
  if(n < 2) stop("tiny size data.")
  if(dist == "chisq" && (data[1] < t || ne != 0) ) {
    sp <- 0
    ns <- ns - 1
    if(data[n] == t) ns <- 1
  }
  else sp <- NULL
  switch (dist,
    "norm" = data,   # seq(data[1], data[n], length=ns),
    c(sp, data[data >= t]) )
}

fintervals <- function( fp, rb = 0, cn = 3,
                        dist = c("chisq", "norm"),
                        method = c("nnm", "pm", "choi", "cramer"),
                        minb = 0.5)
{
  dist <- match.arg(dist)
  method <- match.arg(method)
  switch( method,
    "pm" = ,
    "nnm" = switch( dist,
      "chisq" = {
        f <- cbind(c(fp, invlfun(fp[(i <- fp >= minb)])),
                   c(lfun(fp), fp[i]) )
        if( rb >= 0 ) {
          f[f[,1] > 0 & f[,1] < rb, 1] <- 0
          f[f[,2] > 0 & f[,2] < rb, 2] <- rb
        }
        f
      },
      "norm" = cbind( c(fp, fp-cn), c(fp+cn, fp) )),
    "choi" = ,
    "cramer" = switch( dist,
      "chisq" = cbind(pmax(min(fp) - 50, 0), fp),
      "norm" = cbind(min(fp) - 50, fp) )
  )
}

lfun <- function( x, c1 = 5, c2 = 20 )
# --------------------------------------------------------------
# function l(x)
#
# Returns the right boundary points of fitting intervals
# given the left ones
# --------------------------------------------------------------
{
  if(length(x) == 0) NULL
  else {
    x[x<0] <- 0
    c1 + x + c2 * sqrt(x)
  }
}

invlfun <- function( x, c1 = 5, c2 = 20 )
# --------------------------------------------------------------
# inverse l-function
# --------------------------------------------------------------
{
  if(length(x) == 0) NULL
  else {
    x[x<=c1] <- c1
    (sqrt( x - c1 + c2 * c2 / 4 ) - c2 / 2 )^2
  }
}

epm <- function(data, intervals, ne = 0, rb = 0, bw = 0.5)
# --------------------------------------------------------------
# empirical probability measure over intervals
# --------------------------------------------------------------
{
  n <- length(data)
  pm <- rep(0, nrow(intervals))
  for ( i in 1:n )
    pm <- pm + (data[i]>intervals[,1] & data[i]<intervals[,2]) +
      ( (data[i] == intervals[,1]) +
        (data[i] == intervals[,2]) ) * bw
  i <- intervals[,1] < rb & intervals[,2] >= rb
  pm[i] <- pm[i] + ne
  pm / (n + ne)
}

probmatrix <- function( data, bins, dist = c("chisq", "norm") )
# --------------------------------------------------------------
# Probability matrix
# --------------------------------------------------------------
{
  dist <- match.arg(dist)
  n <- length(data)
  e <- matrix(nrow=nrow(bins), ncol=n)
  for (j in 1:n)
    e[,j] <- switch(dist,
      "chisq" = pchisq1(bins[,2], data[j]) -
                pchisq1(bins[,1], data[j]),
      "norm" = pnorm(bins[,2], data[j]) -
               pnorm(bins[,1], data[j]) )
  e
}

bmixing <- function(data, ne = 0, t = 3, dist = c("chisq","norm"),
                    method = c("nnm","pm","choi","cramer"),
                    minb = 0.5)
{
  dist <- match.arg(dist)
  method <- match.arg(method)

  if( !is.sorted(data) ) {
    data <- sort(data)
    class(data) <- unique( c(class(data), "sorted") )
  }

  n <- length(data)

  if( n == 0 ) return (disctfun())
  if( n == 1 ) {
    if( dist == "chisq" && data < t ) return ( disctfun(0) )
    else return ( disctfun(max( data )) )
  }

  # support points
  sp <- spoints(data, ne = ne, t = t, dist = dist)
  rb <- 0
  # fitting intervals
  if(ne != 0) {
    rb <- max(minb, data[min(3,n)])
    fi <- fintervals(data, rb = rb, dist = dist, method = method,
                     minb = minb)
  }
  else fi <- fintervals(data, dist = dist, method = method,
                        minb = minb)

  bw <- switch(method,
    "cramer" = ,
    "nnm" = ,
    "pm" = 0.5,
    "choi" = 1 )

  # empirical probability measure
  b <- epm(data, fi, ne=ne, rb=rb, bw=bw)
  a <- probmatrix(sp, fi, dist = dist)
  ns <- length(sp)
  nnls.disctfun(switch( method,
    "nnm" = nnls(a,b),
    nnlse(a,b,matrix(1,nrow=1,ncol=ns),1) )
    , sp)
}

mixing <- function(data, ne = 0, t = 3, dist = c("chisq","norm"),
                   method = c("nnm","pm","choi","cramer"))
{
  dist <- match.arg(dist)
  method <- match.arg(method)

  n <- length(data)
  if( n <= 1 )
    return ( bmixing(data,ne,t=t,dist=dist,method=method) )

  if( ! is.sorted(data) ) data <- sort(data)
  d <- disctfun()
  start <- 1
  for( i in 1:(n-1) ) {
    if( separable(data[start:i], data[i+1], ne, dist=dist) &&
        separable(data[(i+1):n], data[i], dist=dist) ) {
      x <- data[start:i]
      class(x) <- "sorted"
      d <- d + ( i - start + 1 + ne) *
           bmixing(x,ne,t=t,dist=dist,method=method)
      start <- i+1
      ne <- 0
    }
  }
  x <- data[start:n]
  class(x) <- "sorted"
  normalize( d + ( n - start + 1 + ne) *
             bmixing(x,ne,t=t,dist=dist,method=method) )
}

separable <- function(data, x, ne = 0, thresh = 0.10,
                      dist = c("chisq","norm") )
{
  if (length(x) != 1) stop("length(x) must be one.")
  dist <- match.arg(dist)
  n <- length(data)
  mi <- data[1]
  if (dist == "norm") {
    if( sum( 1 - pnorm( abs( x - data) ) ) <= thresh ) T
    else F
  }
  else {
    rx <- sqrt(x)
    p1 <- pnorm( abs(rx - sqrt(data)) )
    p2 <- pnorm( abs(rx - seq(0, sqrt(mi), len=3)) )
    if( sum( 1 - p1, ne/3 * (1 - p2) ) <= thresh ) T
    else F
  }
}

pmixture <- function(x, d, dist="chisq")
{
  n <- length(x)
  p <- rep(0, len=n)
  for( i in 1:length(d[,1]) )
    p <- p + pchisq1(x, d[i,1]) * d[i,2]
  p
}

plotmix <- function(d, xlim, ylim=c(0,1), dist="chisq", ...)
{
  if( missing(xlim) ) {
    mi <- min(d[,1])
    ma <- max(d[,1])
    low <- max(0, sqrt(mi)-3)^2
    high <- (sqrt(ma) + 3)^2
    xlim <- c(low, high)
  }
  x <- seq(xlim[1], xlim[2], len=100)
  y <- pmixture(x, d, dist=dist)
  plot(x, y, type="l", xlim=xlim, ylim=ylim, ...)
}

pace.q:

# --------------------------------------------------------------
#
# Functions for pace regression
#
# --------------------------------------------------------------

contrib <- function(a, as)
{
  as^2 - (as - a)^2
}

bhfun <- function(A, As) {
  bhrfun(sqrt(A), sqrt(As))
}

bhrfun <- function(a, as)   # hrfun without being divided by (2*a)
{
  contrib(a,as) * dnorm(a,as) + contrib(-a,as) * dnorm(-a,as)
}


hfun <- function(A, As) {
  hrfun(sqrt(A), sqrt(As))
}

hrfun <- function(a, as) {
  (contrib(a,as)*dnorm(a,as)+contrib(-a,as)*dnorm(-a,as))/(2*a)
}

hmixfun <- function(A, d)
{
  n <- length(A)
  h <- rep(0,n)
  rc <- sqrt(d[,1])
  for( i in 1:n )
    if( A[i] != 0 ) h[i] <- h[i] + hrfun(sqrt(A[i]),rc) %*% d[,2]
  h
}

bffun <- function(A, As)
{
  bfrfun(sqrt(A), sqrt(As))
}

bfrfun <- function(a, as)
# ffun() without being divided by (2*a)
{
  dnorm(a, as) + dnorm(-a, as)
}

ffun <- function(A, As)
{
  frfun(sqrt(A), sqrt(As))
}

frfun <- function(a, as)
{
  ( dnorm(a, as) + dnorm(-a, as) ) / (2*a)
}

fmixfun <- function(A, d)
{
  n <- length(A)
  f <- rep(0,n)
  rc <- sqrt(d[,1])
  for( i in 1:n )
    f[i] <- f[i] + frfun(sqrt(A[i]),rc) %*% d[,2]
  f
}

pace2 <- function(A, d)
{
  k <- length(A)
  con <- cumsum( hmixfun(A, d) / fmixfun(A, d) )
  ma <- max(0, con)
  ima <- match(ma, con, nomatch=0)   # number of positive a's
  if(ima < k) A[(ima+1):k] <- 0      # if not the last one
  A
}

pace4 <- function(A, d)
{
  con <- hmixfun(A, d) / fmixfun(A, d)
  A[con <= 0] <- 0
  A
}

pace6 <- function(A, d, t = 0.5)
{
  for( i in 1:length(A) ) {
    de <- sum( bffun(A[i], d[,1]) * d[,2] )
    if(de > 0)
      A[i] <- (sum(sqrt(d[,1]) * bffun(A[i], d[,1])*d[,2])/de)^2
  }
  A[A<=t] <- 0
  A
}

pace <- function(x,y,va,yname=NULL,ycol=0,pvt=NULL,kp=0,
                 intercept=T,ne=0,
                 method=c("pace6","pace2","pace4","olsc","ols",
                          "full","null","aic","bic","ric","cic"),
                 tau=2,tol=1e-6)
{
  method <- match.arg(method)
  if ( is.data.frame(x) ) x <- as.matrix(x)
  ncol <- dim(x)[2]
  if(is.null(dimnames(x)[[2]]))
    dimnames(x) <- list(dimnames(x)[[1]],
                        paste("X",1:ncol,sep=""))
  namelist <- dimnames(x)[[2]]
  if ( missing(y) ) {
    if(ycol == 0 || ycol == "last") ycol <- ncol
    y <- x[,ycol]
    if( is.null(yname) ) yname <- dimnames(x)[[2]][ycol]
    x <- x[,-ycol]
    if ( ! is.matrix(x) ) x <- as.matrix(x)
    dimnames(x) <- list(NULL, namelist[-ycol])
  }
  if ( ! is.matrix(y) ) y <- as.matrix(y)
  if (intercept) {
    x <- cbind(1,x)
    ki <- 1
    kp <- kp + 1
    if(! is.null(pvt)) pvt <- c(1, pvt+1)
    dimnames(x)[[2]][1] <- "(Intercept)"
  }
  else ki <- 0
  m <- as.integer(dim(x)[1])
  n <- as.integer(dim(x)[2])
  mn <- min(m,n)
  if ( m != dim(y)[1] )
    stop("dim(x)[1] != dim(y)[1]")
  # first kp columns in pvt[] will maintain the same order
  kp <- max(ki,kp)
  # QR transformation
  if(method == "full" ) ans <- lsqr(x,y,pvt,tol=0,ks=kp)
  else ans <- lsqr(x,y,pvt,ks=kp,tol=tol)
  kc <- ks <- ans$ks   # kc is pseudo-rank (unchangeable),
                       # ks is used as the model dimensionality
  b2 <- ans$b^2
  if(missing(va)) {
    if ( m - ks <= 0 ) va <- 0    # data wholly explained
    else va <- sum(b2[(ks+1):m])/(m-ks)  # OLS variance estimator
  }
  At <- A <- d <- NULL
  mi <- min(b2[ki:ks])
  if(va > max(1e-10, mi * 1e-10) ) {   # Is va tiny?
    A <- b2[1:ks] / va
    names(A) <- dimnames(x)[[2]][ans$pvt[1:ks]]
  }
  ne <- ne + (mn - ks)
  # Check (asymptotically) olsc-equivalent methods.
  switch( method,
    "full" = ,   # full is slightly different from ols in that
                 # collinearity is not excluded
    "ols" = {    # ols = olsc(0)
      method <- "olsc"
      tau <- 0
    },
    "aic" = {    # aic = olsc(2)
      method <- "olsc"
      tau <- 2
    },
    "bic" = {    # bic = olsc(log(n))
      method <- "olsc"
      tau <- log(m)
    },
    "ric" = {    # ric = olsc(2log(k))
      method <- "olsc"
      tau <- 2 * log(ks)
    },
    "null" = {   # null = olsc(+inf)
      method <- "olsc"
      tau <- 2*max(b2) + 1e20
      ks <- kp
    })

  switch( method,   # default method is pace2,4,6
    "olsc" = {      # olsc-equivalent methods
      if( ! is.null(A) ) {
        At <- A
        names(At) <- names(A)
        if(kp < ks)
          ksp <- match(max(0, cum <- cumsum(At[(kp+1):ks]-tau)),
                       cum, nomatch=0)
        else ksp <- 0
        if(kp+ksp+1 <= ks) At[(kp+ksp+1):ks] <- 0
        ks <- kp + ksp
        ans$b[1:ks] <- sign(ans$b[1:ks]) * sqrt(At[1:ks]*va)
      }
    },
    "cic" = {       # cic method
      if( ! is.null(A) ) {
        At <- A
        names(At) <- names(A)
        kl <- ks - kp
        if(kp < ks)
          ksp <- match(max(0, cum <-
                           cumsum(At[(kp+1):ks] -
                                  4*log((ne+kl)/1:kl))),
                       cum, nomatch=0)
        else ksp <- 0
        if(kp+ksp+1 <= ks) At[(kp+ksp+1):ks] <- 0
        ks <- kp + ksp
        ans$b[1:ks] <- sign(ans$b[1:ks]) *
                       sqrt(At[1:ks] * va)
      }
    },
    if( ! is.null(A) && ks > 1) {  # runs when va is not tiny
      At <- A
      names(At) <- names(A)
      d <- mixing(A[2:ks], ne=ne)  # mixing distribution
      At[2:ks] <- get(method)(A[2:ks], d)
      # get rid of zero coefs
      ks <- ks - match(F, rev(At[2:ks] < 0.001),
                       nomatch = ks) + 1
      ans$b[1:ks] <- sign(ans$b[1:ks]) * sqrt(At[1:ks] * va)
    }
    else {   # delete variables that explain little data
      ave <- mean(b2[1:ks])
      ks <- ks - match(T, rev( b2[1:ks] > 1e-5 * ave),
                       nomatch=0) + 1
    })

  coef <- rsolve(ans$a[1:ks,], ans$b[1:ks,], ans$pvt[1:ks], ks)
  i <- sort.list(ans$pvt[1:ks])
  pvt <- ans$pvt[i]
  coef <- coef[i]
  names(coef) <- dimnames(x)[[2]][pvt[1:ks]]
  if( ! is.null(yname) ) {
    coef <- as.matrix(coef)
    dimnames(coef)[[2]] <- yname
  }
  fit <- list(coef = coef, pvt = pvt, ycol = ycol,
              intercept = intercept, va = va, ks = ks, kc = kc,
              ndims = n, nobs = m, ne = ne, call = match.call() )
  if( !is.null(A)) fit$A <- A
  if( !is.null(At)) fit$At <- At
  if( !is.null(d)) fit$mixing <- unclass(d)
  class(fit) <- "pace"
  fit
}

print.pace <- function (p)
{
  cat("Call:\n")
  print(p$call)
  cat("\n")
  cat("Coefficients:\n")
  print(p$coef)
  cat("\n")
  cat("Number of observations:", p$nobs, "\n")
  cat("Number of dimensions: ", length(p$coef), " (out of ",
      p$kc, ", plus ", p$ndims - p$kc,
      " abandoned for collinearity)\n", sep="")
}

predict.pace <- function(p,x,y)
{
  if( ! is.matrix(x) ) x <- as.matrix(x)
  n <- dim(x)[1]
  k <- length(p$pvt)
  if( p$intercept ) {
    if(k == 1) pred <- rep(p$coef[1], n)
    else {
      pred <- p$coef[1] + x[,p$pvt[2:k]-1,drop=FALSE] %*%
              p$coef[2:k]
    }
  }
  else pred <- x[,p$pvt] %*% p$coef
  fit <- list(pred = pred)
  if(! missing(y)) fit$resid <- y - pred
  else if(p$ycol >= 1 && p$ycol <= ncol(x))
    fit$resid <- x[,p$ycol] - pred
  if( length(fit$resid) != 0 ) names(fit$resid) <- NULL
  fit
}

util.q:

# --------------------------------------------------------------
#
# Utility functions
#
# --------------------------------------------------------------

is.sorted <- function(data)
{
  if( is.na( match("sorted", class(data) ) ) ) F
  else T
}

is.normed <- function(data)
{
  if( is.na( match("normed", class(data) ) ) ) F
  else T
}

normalize <- function(x,...) UseMethod("normalize")


rchisq1 <- function(n, ncp = 0)
# rchisq() in S-Plus (up to version 5.1) does not work for the
# non-central Chi-square distribution.
{
  rnorm( n, mean = sqrt(ncp) )^2
}

pchisq1 <- function(x, ncp = 0)
# The S+ language (both versions 3.4 and 5.1) fails to return 0
# for, say, pchisq(0,1,1). The same function in R seems to work
# fine. In the following, the probability value is set to zero
# when the location is at zero.
# Here we only consider the situation df = 1, using a new
# function name pchisq1. Also, efficiency is taken into account,
# within the application's accuracy requirement.
{
  i <- x == 0 | sqrt(x) - sqrt(ncp) < -10
  j <- sqrt(x) - sqrt(ncp) > 10
  p <- rep(0, length(x))
  p[j] <- 1
  p[!i & !j] <- pchisq(x[!i & !j], 1, ncp)
  p
}


ls.f:

c ----------------------------------------------------------
      subroutine upw2(a,mda,j,k,m,s2)
      double precision a(mda,*), s2, e2
      integer j,k,m
      e2 = a(k,j) * a(k,j)
      if( e2 > .1 * s2 ) then
        s2 = sum2(a,mda,j,k+1,m)
      else
        s2 = s2 - e2
      endif
      end

c ----------------------------------------------------------
c Constructs single Householder orthogonal transformation
c Input:  a(mda,*) ; matrix A
c         mda      ; the first dimension of A
c         j        ; the j-th column
c         k        ; the k-th element
c         s2       ; s2 = d^2
c Output: a(k,j)   ; = a(k,j) - d
c         d        ; diagonal element of R
c         q        ; = - u'u/2, where u is the transformation
c                  ; vector. Hence should not be zero.
c ----------------------------------------------------------
      subroutine hh1(a,mda,j,k,d,q,s2)
      integer mda,j,k
      double precision a(mda,*),s2,d,q
      if (a(k,j) .ge. 0) then
        d = - dsqrt(s2)
      else
        d = dsqrt(s2)
      endif
      a(k,j) = a(k,j) - d
      q = a(k,j) * d
      end

c ----------------------------------------------------------
c Performs single Householder transformation on one column
c of a matrix
c
c Input:  a(mda,)  ;
c         m        ; a(k..m,j) stores the transformation
c                  ; vector u
c         j        ; the j-th column
c         k        ; the k-th element
c         q        ; q = -u'u/2; must be negative
c         b(mdb,)  ; stores the to-be-transformed vector
c         l        ; b(k..m,l) to be transformed
c Output: b(k..m,l); transformed part of b
c ----------------------------------------------------------
      subroutine hh2(a,mda,m,j,k,q,b,mdb,l)
      integer mda,m,k,j,mdb,l
      double precision a(mda,*),q,b(mdb,*),s,alpha
      s = 0.0
      do 10 i=k,m
        s = s + a(i,j) * b(i,l)
 10   continue
      alpha = s / q
      do 20 i=k,m
        b(i,l) = b(i,l) + alpha * a(i,j)
 20   continue
      end

c ----------------------------------------------------------
c Constructs Givens orthogonal rotation matrix
c
c Algorithm refers to P58, Chapter 10 of
c C. L. Lawson and R. J. Hanson (1974).
c "Solving Least Squares Problems". Prentice-Hall, Inc.
c ----------------------------------------------------------
      subroutine gg1(v1, v2, c, s)
      double precision v1, v2, c, s
      double precision ma, r
      ma = max(dabs(v1), dabs(v2))
      if (ma .eq. 0.0) then
        c = 1.0
        s = 0.0
      else
        r = ma * dsqrt((v1/ma)**2 + (v2/ma)**2)
        c = v1/r
        s = v2/r
      endif
      end

c ----------------------------------------------------------
c Performs Givens orthogonal transformation
c
c Algorithm referring to P59, Chapter 10 of
c C. L. Lawson and R. J. Hanson (1974).
c "Solving Least Squares Problems". Prentice-Hall, Inc.
c ----------------------------------------------------------
      subroutine gg2(c, s, z1, z2)
      double precision c, s, z1, z2
      double precision w
      w = c * z1 + s * z2
      z2 = -s * z1 + c * z2
      z1 = w
      end

c ------------------------------------------------------------
c Returns the sum of a(l..h,j)
c ------------------------------------------------------------
      double precision function sum(a,mda,j,l,h)
      integer mda,j,l,h,i
      double precision a(mda,*)

      sum = 0.0
      do 10 i=l,h
         sum = sum + a(i,j)
   10 continue
      end

c ------------------------------------------------------------
c Returns the sum of squares of a(l..h,j)
c ------------------------------------------------------------
      double precision function sum2(a,mda,j,l,h)
      integer mda,j,l,h,i
      double precision a(mda,*)

      sum2 = 0.0
      do 10 i=l,h
         sum2 = sum2 + a(i,j) * a(i,j)
   10 continue
      end

c ------------------------------------------------------------
c Stepwise least squares QR decomposition for the equation
c                    a x = b
c i.e., a is QR-decomposed and b is QR-transformed
c accordingly. Stepwise here means each calling of this
c function either adds or deletes a column vector, into or
c from the vector pvt(0..ks-1).
c
c It is the caller's responsibility that the added column is
c nonsingular
c
c Input:  a(mda,*) ; matrix a
c         m,n      ; (m x n) elements are occupied
c                  ; it is likely mda = m
c         b(mdb,*) ; vector or matrix b
c                  ; if it is a matrix, only the first column
c                  ; may help the transformation
c         nb       ; (m x nb) elements are occupied
c         pvt(kp)  ; pivoting column indices
c         kp       ; kp elements in pvt() are considered
c         ks       ; on input, the first ks columns indexed
c                  ; in pvt() are already QR-transformed.
c         j        ; the pvt(j)-th column of a will be added
c                  ; (if add = 1) or deleted (if add = -1)
c         w(2,*)   ; w(1,pvt) saves the total sum of squares
c                  ; of the used column vectors.
c                  ; w(2,pvt) saves the unexplained sum of
c                  ; squares.
c         add      ; = 1, adds a new column and does
c                  ; QR transformation.
c                  ; = -1, deletes the column
c Output: a(,)     ; QR transformed
c         b(,)     ; QR transformed
c         pvt()    ; re-ordered
c         ks       ; ks = ks + 1, if add = 1
c                  ; ks = ks - 1, if add = -1
c         w(2,*)   ; changed accordingly
c ------------------------------------------------------------

      subroutine steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,j,w,add)
      integer mda,m,n,mdb,nb,pvt(*),j,ks,kp
      double precision a(mda,*),b(mdb,*),w(2,*),q,d,c,s
      integer pj,k,add,pk,i,l

      if( add .eq. 1 ) then
c        -- add pvt(j)-th column
         ks = ks + 1
         pj = pvt(j)
         pvt(j) = pvt(ks)
         pvt(ks) = pj

         call hh1(a,mda,pj,ks,d,q,w(2,pj))

         do 20 k = ks+1,kp
            pk = pvt(k)
            call hh2( a,mda,m,pj,ks,q,a,mda,pk )
            call upw2( a,mda,pk,ks,m,w(2,pk) )
   20    continue
         do 30 k = 1,nb
            call hh2( a,mda,m,pj,ks,q,b,mdb,k )
            call upw2( b,mdb,k,ks,m,w(2,n+k) )
   30    continue
         w(2,pj) = 0.0
         a(ks,pj) = d
         do 40 k = ks+1, m
            a(k,pj) = 0.0
   40    continue
      else
c        -- delete pvt(j)-th column (swap with pvt(ks) and
c        -- ks decrements)
         ks = ks - 1
         pj = pvt(j)
         do 110 i = j, ks
            pvt(i) = pvt(i+1)
  110    continue
         pvt(ks+1) = pj
c        -- Givens rotations restore the triangular form --
         do 140 i=j, ks
            call gg1( a(i,pvt(i)), a(i+1,pvt(i)), c, s )
            do 120 l=i, kp
               call gg2( c, s, a(i,pvt(l)), a(i+1,pvt(l)) )
  120       continue
            do 130 l = 1, nb
               call gg2( c, s, b(i,l), b(i+1,l) )
  130       continue
  140    continue
         do 145 k = ks+1, kp
            pj = pvt(k)
            w(2,pj) = w(2,pj) + a(ks+1,pj) * a(ks+1,pj)
  145    continue
         do 150 l = 1, nb
            w(2,n+l) = w(2,n+l) + b(ks+1,l) * b(ks+1,l)
  150    continue
      endif
      end
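
The bookkeeping in w(2,.) follows from orthogonality (a worked identity, not extra code): once the first k columns have been transformed, the unexplained sum of squares of a column p is the tail sum a(k+1,p)^2 + ... + a(m,p)^2. Each hh2 call is therefore paired with upw2, which subtracts the newly explained term a(k,p)^2, or recomputes the tail sum outright whenever that term exceeds one tenth of s2 and plain subtraction would lose too much precision. The delete branch reverses the accounting by adding a(ks+1,p)^2 back after the Givens sweep.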


c ------------------------------------------------------------
c The pvt(j)-th column is chosen
c
c Chooses the column with the largest pivoting value
c
c type = 1 VIF-based, otherwise based on diagonal element
c
c Input:  pvt ; pivoting column indices
c         ks  ; number of columns already transformed
c         j   ; on output, position of the chosen column
c             ; (j = 0 if no column passes the tolerance)
c ------------------------------------------------------------

      subroutine colj(pvt,ks,j,ln,w,tol,type)
      integer pvt(*),ks,j,l,ln,type
      double precision w(2,*),ma,now,tol,colmax

      ma = tol * 0.1
      do 10 l=ks+1,ln
         if (ks .eq. 0) type = 2
         now = colmax( w, pvt(l), type )
         if (now .le. ma) goto 10
         ma = now
         j = l
   10 continue
      if(ma .lt. tol) j = 0
      end

c ------------------------------------------------------------
c Returns the pivoting value, which is subject to the
c examination < tol
c Input:  w(2,*) ; w(1,j), the total sum of squares
c                ; w(2,j), the unexplained sum of
c                ; squares
c         j      ; the j-th column vector
c         type   ; = 1, VIF-based
c                ; = 2, based on the diagonal element
c Output: colmax ; value
c ------------------------------------------------------------

      double precision function colmax(w,j,type)
      integer j, type
      double precision w(2,*)

      if (type .eq. 1) then
         if(w(1,j) .le. 0.0) then
            colmax = 0.0
         else
            colmax = w(2,j) / w(1,j)
         endif
      else
         if(w(2,j) .le. 0.0) then
            colmax = 0.0
         else
            colmax = dsqrt( w(2,j) )
         endif
      endif
      end
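
One reading of the type = 1 branch (an interpretation going slightly beyond the "VIF-based" remark in the comments): w(2,j)/w(1,j) is the fraction of column j's sum of squares left unexplained by the columns already chosen, an uncentered 1 - R_j^2, i.e. the reciprocal of column j's variance inflation factor. Testing this ratio against tol in colj therefore rejects columns that are nearly collinear with those already in the decomposition; with type = 2 the test falls back on the size of the would-be diagonal element, sqrt(w(2,j)).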


c ------------------------------------------------------------
c QR decomposition for the linear regression problem a x = b
c
c Important note: on entry, the vector pvt[1:np] is given (and
c supposedly correct); the first ks (could be zero) elements
c in pvt(*) will always be kept inside the final selection.
c This is useful when, for example, the constant term is
c always included in the final model.
c ------------------------------------------------------------

      subroutine lsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,w,tol,type)
      integer pvt(*),ld,j,k
      integer mda,m,n,mdb,nb,lp,kp,ks,type,pj
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),tol,ma
      double precision sum2,colmax

      if(tol .lt. 1e-20) tol = 1e-20
      ld = min(m,kp)

c     computes the squared sum for each column of a and b
      do 10 j=1,kp
         pj = pvt(j)
         w(1,pj) = sum2(a,mda,pj,1,m)
         w(2,pj) = w(1,pj)
   10 continue
      do 15 j=1,nb
         w(1,n+j) = sum2(b,mdb,j,1,m)
         w(2,n+j) = w(1,n+j)
   15 continue

c     The first ks columns defined in pvt(*) are pre-chosen.
c     However, they are subject to rank examination.
      lp = ks
      ks = 0
      do 17 j=1,lp
         pj = pvt(j)
         ma = colmax(w,pj,type)
         if(ma > tol) then
            call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,j,w,1)
            if ( ks .ge. ld) go to 18
         endif
   17 continue
   18 continue

c     lp is the number of pre-chosen column variables.
      lp = ks
      ld = min(ld,kp)

c     the rest of the columns defined in pvt(*) are subject to
c     rank examination
      do 50 k=ks+1,ld
c        -- choose the pivoting column j --
         call colj(pvt,ks,j,kp,w,tol,type)
         if(j .eq. 0) then
            kp = k - 1
            go to 60
         endif
         call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,j,w,1)
   50 continue
   60 continue

c     If ks is small, or less than n, forward is unnecessary.
c     Forward selection is only necessary when backward is
c     insufficient, such as when there are too many
c     (noncollinear) variables.
      if( kp .gt. n .or. kp .gt. 200) then
         call forward(a,mda,ks,n,b,mdb,nb,pvt,kp,ks,lp,w)
      endif

      call backward(a,mda,ks,n,b,mdb,nb,pvt,kp,ks,lp,w)

      end

c ------------------------------------------------------------
c forward : forward ordering based on data explanation
c                    a x = b
c
c Input:  a(mda,*)  ; matrix to be QR-transformed
c         m,n       ; (m x n) elements are occupied
c         b(mdb,nb) ; (m x nb) elements are occupied
c         pvt(kp)   ; indices of pivoting columns
c         kp        ; the first kp elements in pvt()
c                   ; are to be used
c         ks        ; pseudo-rank of a
c         lp        ; the first lp elements indexed in pvt()
c                   ; are already QR-transformed and will not
c                   ; change.
c         w(2,n+nb) ; w(1,) stores the total sum of each
c                   ; column of a and b.
c                   ; w(2,) stores the unexplained sum of
c                   ; squares
c Output: a(,)      ; Upper triangular matrix by
c                   ; QR-transformation
c         b(,)      ; QR-transformed
c         pvt()     ; re-ordered
c ------------------------------------------------------------

      subroutine forward(a,mda,m,n,b,mdb,nb,pvt,kp,ks,lp,w)
      integer pvt(*)
      integer mda,m,n,mdb,nb,kp,ks,lp,pj,jma,i,j,k
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),ma,sum,r,sum2

      do 1 j=lp+1,kp
         pj = pvt(j)
         w(2,pj) = sum2(a,mda,pj,lp+1,m)
    1 continue

      do 2 j=1,nb
         w(2,n+j) = sum2(b,mdb,j,lp+1,m)
    2 continue

      do 130 i=lp+1,ks
         ma = -1e20
         jma = 0
         do 20 j=i,kp
c           -- choose the pivoting column j based on data explanation --
            pj = pvt(j)
            sum = 0.0
            do 10 k = i, n
               sum = sum + a(k,pj) * b(k,1)
   10       continue
c           w(2,pj) should not be zero
            r = sum * sum / w(2,pj)
            if(r > ma) then
               ma = r
               jma = j
            endif
   20    continue
         if(jma .eq. 0) then
            ks = i
            goto 200
         else
            call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,i-1,jma,w,1)
         endif
  130 continue
  200 continue

      end

c ------------------------------------------------------------
c backward variable elimination
c ------------------------------------------------------------

      subroutine backward(a,mda,m,n,b,mdb,nb,pvt,kp,ks,lp,w)
      integer pvt(*),lp
      integer mda,m,n,mdb,nb,kp,ks,jmi,kscopy,pj,j
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),s2,sum2

c     backward elimination; the rank ks is not supposed to change
      kscopy = ks
      do 50 j = ks, lp+1, -1
         call coljd(a,mda,m,n,b,mdb,nb,pvt,kp,ks,lp+1,w,jmi)
         call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,jmi,w,-1)
   50 continue

c     restores the rank and data explanation
      ks = kscopy
      do 100 j=lp+1,kp
         pj = pvt(j)
         s2 = sum2(a,mda,pj,lp+1,ks)
         w(2,pj) = w(2,pj) - s2
  100 continue

      do 110 j=1,nb
         s2 = sum2(b,mdb,j,lp+1,ks)
         w(2,n+j) = w(2,n+j) - s2
  110 continue

      end

c ------------------------------------------------------------
c Chooses the column jmi (among positions jth..ks of pvt)
c whose deletion costs the least in explained variation of b
c ------------------------------------------------------------
      subroutine coljd(a,mda,m,n,b,mdb,nb,pvt,kp,ks,jth,w,jmi)
      integer pvt(*)
      integer mda,m,n,mdb,nb,kp,ks,jmi,jth,j
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),mi,val

      mi = dabs( b(ks,1) )
      jmi = ks
      do 50 j = jth, ks-1
         call coljdval(a,mda,b,mdb,pvt,ks,j,val)
         if (val .le. mi) then
            mi = val
            jmi = j
         endif
   50 continue
      end

c ------------------------------------------------------------
c Computes the value |val| that would end up in position
c b(ks,1) if the jth pivot column were deleted and the
c triangular form restored by Givens rotations; a and b are
c left unchanged (row jth is copied into the scratch
c vector xxx)
c ------------------------------------------------------------
      subroutine coljdval(a,mda,b,mdb,pvt,ks,jth,val)
      integer pvt(*)
      integer mda,mdb,ks,jth,j,l
      double precision a(mda,*), b(mdb,*)
      double precision val,c,s
      double precision xxx(ks)

      do 10 j = jth+1, ks
         xxx(j) = a(jth,pvt(j))
   10 continue
      val = b(jth,1)

      do 50 j = jth+1, ks
         call gg1( xxx(j), a(j,pvt(j)), c, s )
         do 30 l=j+1, ks
            xxx(l) = -s * xxx(l) + c * a(j,pvt(l))
   30    continue
         val = -s * val + c * b(j,1)
   50 continue

      val = dabs( val )
      end
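
As a worked interpretation (not part of the listing): val is the entry that would land in position b(ks,1) if the jth pivot column were deleted and the factor re-triangularized, so val^2 is the increase in the residual sum of squares that the deletion would cause. coljd accordingly keeps the candidate with the smallest |val|, and the update w(2,n+l) = w(2,n+l) + b(ks+1,l)^2 in the delete branch of steplsqr confirms the accounting once a deletion is actually performed.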

c ------------------------------------------------------------
c Back-substitution: solves the triangular system R x = b
c for each of the nb right-hand sides; the solution
c overwrites b
c ------------------------------------------------------------
      subroutine rsolve(a,mda,b,mdb,nb,pvt,ks)
      integer mda,mdb,nb,ks,i,j,k
      integer pvt(*)
      double precision a(mda,*),b(mdb,*),sum

      if(ks .le. 0) return
      do 50 k=1,nb
         b(ks,k) = b(ks,k) / a(ks,pvt(ks))
         do 40 i=ks-1,1,-1
            sum = 0.0
            do 30 j=i+1,ks
               sum = sum + a(i,pvt(j)) * b(j,k)
   30       continue
            b(i,k) = (b(i,k) - sum) / a(i,pvt(i))
   40    continue
   50 continue
      end


References

Aha, D. W., Kibler, D. and Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.

Akaike, H. (1969). Fitting autoregressive models for prediction. Ann. Inst. Statist. Math., 21, 243–247.

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22, 203–217.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (pp. 267–281). Akademiai Kiado, Budapest. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 610–624. Springer-Verlag.)

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Croz, J. D., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D. (1999). LAPACK Users' Guide (3rd Ed.). Philadelphia, PA: SIAM Publications.

Atkeson, C., Moore, A. and Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11–73.

Barbe, P. (1998). Statistical analysis of mixtures and the empirical probability measure. Acta Applicandae Mathematicae, 50(3), 253–340.

Barnard, G. A. (1949). Statistical inference (with discussion). J. Roy. Statist. Soc., Ser. B, 11, 115–139.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc., 53, 370–418. Reprinted in (1958) Biometrika, 45, 293–315.

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd Ed.). New York: Springer-Verlag.

Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). J. Amer. Statist. Assoc., 57, 269–326. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 478–518. Springer-Verlag.)

Blake, C., Keogh, E. and Merz, C. J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Department of Information and Computer Science, University of California, Irvine.

Blum, J. R. and Susarla, V. (1977). Estimation of a mixing distribution function. Ann. Probab., 5, 200–209.

Bohning, D. (1982). Convergence of Simar's algorithm for finding the maximum likelihood estimate of a compound Poisson process. Ann. Statist., 10, 1006–1008.

Bohning, D. (1985). Numerical estimation of a probability measure. Journal of Statistical Planning and Inference, 11, 57–69.

Bohning, D. (1986). A vertex-exchange-method in D-optimal design theory. Metrika, 33, 337–347.

Bohning, D. (1995). A review of reliable algorithms for the semi-parametric maximum likelihood estimator of a mixture distribution. Journal of Statistical Planning and Inference, 47, 5–28.

Bohning, D., Schlattmann, P. and Lindsay, B. (1992). C.A.MAN (computer assisted analysis of mixtures): Statistical algorithms. Biometrics, 48, 283–303.

Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association, 87, 738–754.

Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: the X-random case. International Statistical Review, 60, 291–319.

Brown, L. D. (1990). An ancillarity paradox which appears in multiple linear regression (with discussion). Ann. Statist., 18, 471–538.

Burnham, K. P. and Anderson, D. R. (1998). Model selection and inference: a practical information-theoretic approach. New York: Springer.

Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. London: Chapman & Hall.

Carlin, B. P. and Louis, T. A. (2000). Empirical Bayes: Past, present and future. J. Am. Statist. Assoc. (To appear).

Chatfield, C. (1995). Model uncertainty, data mining and statistical inference. J. Roy. Statist. Soc. Ser. A, 158, 419–466.

Choi, K. and Bulgren, W. B. (1968). An estimation procedure for mixtures of distributions. J. R. Statist. Soc. B, 30, 444–460.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.

Coope, I. D. and Watson, G. A. (1985). A projected Lagrangian algorithm for semi-infinite programming. Mathematical Programming, 32, 337–356.

Copas, J. B. (1969). Compound decisions and empirical Bayes. J. Roy. Statist. Soc. B, 31, 397–425.

Copas, J. B. (1983). Regression, prediction and shrinkage (with discussions). J. Roy. Statist. Soc. B, 45, 311–354.

Deely, J. J. and Kruse, R. L. (1968). Construction of sequences estimating the mixing distribution. Ann. Math. Statist., 39, 286–288.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., Ser. B, 39, 1–22.

Derksen, S. and Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282.

DerSimonian, R. (1986). Maximum likelihood estimation of a mixing distribution. J. Roy. Statist. Soc. Ser. C, 35, 302–309.

DerSimonian, R. (1990). Correction to algorithm AS 221: Maximum likelihood estimation of a mixing distribution. J. Roy. Statist. Soc. Ser. C, 39, 176.

Dongarra, J. J., Bunch, J., Moler, C. and Stewart, G. (1979). LINPACK Users' Guide. Philadelphia, PA: SIAM Publications.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika, 81, 425–455.

Efron, B. (1998). R. A. Fisher in the 21st century. Statistical Science, 13, 95–126.

Efron, B. and Morris, C. N. (1971). Limiting the risk of Bayes and empirical Bayes estimators - Part I: The Bayes case. J. Amer. Statist. Assoc., 66, 807–815.

Efron, B. and Morris, C. N. (1972a). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika, 59, 335–347.

Efron, B. and Morris, C. N. (1972b). Limiting the risk of Bayes and empirical Bayes estimators - Part II: The empirical Bayes case. J. Amer. Statist. Assoc., 67, 130–139.

Efron, B. and Morris, C. N. (1973a). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc., Ser. B, 35, 379–421.

Efron, B. and Morris, C. N. (1973b). Stein's estimation rule and its competitors - An empirical Bayes approach. J. Amer. Statist. Assoc., 68, 117–130.

Efron, B. and Morris, C. N. (1975). Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc., 70, 311–319.

Efron, B. and Morris, C. N. (1977). Stein's paradox in statistics. Scientific American, 236 (May), 119–127.

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. London: Chapman and Hall.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications. Chapman & Hall.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA.

Fedorov, V. V. (1972). Theory of Optimal Experiments. New York: Academic Press.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London, Ser. A, 222, 309–368. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 11–44. Springer-Verlag.)

Fisher, R. A. (1925). Theory of statistical estimation. Proc. Camb. Phil. Soc., 22, 700–725.

Foster, D. and George, E. (1994). The risk inflation criterion for multiple regression. Annals of Statistics, 22, 1947–1975.

Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussions). Technometrics, 35, 109–148.

Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist., 19, 1–141.

Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc., 76, 817–823.

Friedman, J. H. (1997). Data mining and statistics: What's the connection? In Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics. The Interface Foundation of North America. May, Houston, Texas.

Galambos, J. (1995). Advanced Probability Theory. A Series of Textbooks and Reference Books/10. Marcel Dekker, Inc.

Gauss, C. F. (1809). Theoria motus corporum coelestium. Werke, 7. (English transl.: C. H. Davis. Dover, New York, 1963.)

Hall, M. A. (1999). Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato.

Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. R. Statist. Soc. B, 41, 190–195.

Hartigan, J. A. (1996). Introduction. In P. Arabie, L. J. Hubert and G. D. Soete (Eds.), Clustering and Classification (pp. 1–3). World Scientific Publ., River Edge, NJ.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall.

Hoerl, R. W., Schuenemeyer, J. H. and Hoerl, A. E. (1986). A simulation of biased estimation and subset selection regression techniques. Technometrics, 28, 369–380.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. Fourth Berkeley Symposium on Math. Statist. Prob., 1 (pp. 311–319). University of California Press. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 443–460. Springer-Verlag.)

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27, 886–906.

Kilpatrick, D. and Cameron-Jones, M. (1998). Numeric prediction using instance-based learning with encoding length selection. In N. Kasabov, R. Kozma, K. Ko, R. O'Shea, G. Coghill and T. Gedeon (Eds.), Progress in Connectionist-Based Information Systems, Volume 2 (pp. 984–987). Springer-Verlag.

Kotz, S. and Johnson, N. L. (Eds.) (1992). Breakthroughs in Statistics, Volume 1. Springer-Verlag.

Kullback, S. (1968). Information Theory and Statistics (2nd Ed.). New York: Dover. Reprinted in 1978, Gloucester, MA: Peter Smith.

Kullback, S. and Leibler, R. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79–86.

Laird, N. M. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. J. Amer. Statist. Assoc., 73, 805–811.

Lawson, C. L. and Hanson, R. J. (1974, 1995). Solving Least Squares Problems. Prentice-Hall, Inc.

Legendre, A. M. (1805). Nouvelles methodes pour la determination des orbites des cometes. (Appendix: Sur la methode des moindres carres.)

Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation (2nd Ed.). New York: Springer-Verlag.

Lesperance, M. L. and Kalbfleisch, J. D. (1992). An algorithm for computing the nonparametric MLE of a mixing distribution. J. Amer. Statist. Assoc., 87, 120–126.

Li, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15, 958–975.

Lindsay, B. G. (1983). The geometry of mixture likelihoods: A general theory. Ann. Statist., 11, 86–94.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications, Volume 5 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics: Hayward, CA.

Lindsay, B. G. and Lesperance, M. L. (1995). A review of semiparametric mixture models. Journal of Statistical Planning & Inference, 47, 29–39.

Linhart, H. and Zucchini, W. (1989). Model Selection. New York: Wiley.

Liu, H. and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers.

Loader, C. (1999). Local Regression and Likelihood. Statistics and Computing Series. Springer.

Macdonald, P. D. M. (1971). Comment on a paper by Choi and Bulgren. J. R. Statist. Soc. B, 33, 326–329.

Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675.

Maritz, J. S. and Lwin, T. (1989). Empirical Bayes Methods (2nd Ed.). Chapman and Hall.

McLachlan, G. and Basford, K. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.

McQuarrie, A. D. and Tsai, C.-L. (1998). Regression and time series model selection. Singapore; River Edge, N.J.: World Scientific.

Miller, A. J. (1990). Subset Selection in Regression, Volume 40 of Monographs on Statistics and Applied Probability. Chapman & Hall.

Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications. J. Amer. Statist. Assoc., 78, 47–59.

Neyman, J. (1962). Two breakthroughs in the theory of statistical decision making. Rev. Inst. Internat. Statist., 30, 11–27.

Polya, G. (1920). Ueber den zentralen Grenzwertsatz der Wahrscheinlichkeitstheorie und das Momentenproblem. Math. Zeit., 8, 171.

Prakasa Rao, B. L. (1992). Identifiability in Stochastic Models: Characterization of Probability Distributions. Academic Press, New York.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann Publishers.

Rao, C. R. and Wu, Y. (1989). A strongly consistent procedure for model selection in a regression problem. Biometrika, 76(2), 369–374.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist., 11, 416–431.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, Volume 15 of Series in Computer Science. World Scientific Publishing Co.

Robbins, H. (1950). A generalization of the method of maximum likelihood: Estimating a mixing distribution (abstract). Ann. Math. Statist., 21, 314–315.

Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. In Proc. Second Berkeley Symposium on Math. Statist. and Prob., 1 (pp. 131–148). University of California Press.

Robbins, H. (1955). An empirical Bayes approach to statistics. In Proc. Third Berkeley Symposium on Math. Statist. and Prob., 1 (pp. 157–164). University of California Press.

Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. Ann. Math. Statist., 35, 1–20.

Robbins, H. (1983). Some thoughts on empirical Bayes estimation. Ann. Statist., 11, 713–723.

Roecker, E. B. (1991). Prediction error and its estimation for subset-selected models. Technometrics, 33, 459–468.

Schott, J. R. (1997). Matrix analysis for statistics. New York: Wiley.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.

Sclove, S. L. (1968). Improved estimators for coefficients in linear regression. J. Amer. Statist. Assoc., 63, 596–606.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423 and 623–656.

Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88, 486–494.

Shao, J. (1996). Bootstrap model selection. J. Amer. Statist. Assoc., 91, 655–665.

Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag.

Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika, 63, 117–126.

Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68, 45–54.

Shibata, R. (1986). Consistency of model selection and parameter estimation. In J. Gani and M. Priestley (Eds.), Essays in Time Series and Applied Processes (pp. 27–41). J. Appl. Prob.

Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. Third Berkeley Symp. on Math. Statist. and Prob., 1 (pp. 197–206). University of California Press.

Teicher, H. (1960). On the mixture of distributions. Ann. Math. Statist., 31, 55–57.

Teicher, H. (1961). Identifiability of mixtures. Ann. Math. Statist., 32, 244–248.

Teicher, H. (1963). Identifiability of finite mixtures. Ann. Math. Statist., 34, 1265–1269.

Thompson, M. L. (1978). Selection of variables in multiple regression, Part 1 and Part 2. International Statistical Review, 46, 1–21 and 129–146.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani, R. and Knight, K. (1997). The covariance inflation criterion for model selection. Technical report, Department of Statistics, Stanford University.

Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons.

von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Math. Annalen, 100, 295–320.

von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ.

Wald, A. (1939). Contributions to the theory of statistical estimation and hypothesis testing. Ann. Math. Statist., 10, 299–326.

Wald, A. (1950). Statistical Decision Functions. Wiley, New York.

Witten, I. H. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.

Wu, C. F. (1978a). Some algorithmic aspects of the theory of optimal designs. Ann. Statist., 6, 1286–1301.

Wu, C. F. (1978b). Some iterative procedures for generating nonsingular optimal designs. Comm. Statist., 7, 1399–1412.

Zhang, P. (1993). Model selection via multifold cross-validation. Ann. Statist., 21, 299–313.

Zhao, L. C., Krishnaiah, P. R. and Bai, Z. D. (1986). On the detection of the number of signals in the presence of white noise. Journal of Multivariate Analysis, 20, 1–25.