STATISTICAL METHODS FOR SEMICONDUCTOR MANUFACTURING
Duane S. Boning
Massachusetts Institute of Technology, Cambridge, MA

Jerry Stefani and Stephanie W. Butler
Texas Instruments, Dallas, TX
Semiconductor manufacturing increasingly depends on the use of statistical methods in order to develop and explore new technologies, characterize and optimize existing processes, make decisions about evolutionary changes to improve yields and product quality, and monitor and maintain well-functioning processes. Each of these activities revolves around data: Statistical methods are essential in the planning of experiments to efficiently gather relevant data, the construction and evaluation of models based on data, and decision-making using these models.
THE NATURE OF DATA IN SEMICONDUCTOR MANUFACTURING
An overview of the variety of data utilized in semiconductor manufacturing helps to set the context for application of the statistical methods to be discussed here. Most data in a typical semiconductor fabrication plant (or "fab") consists of in-line and end-of-line data used for process control and material disposition. In-line data includes measurements made on wafer characteristics throughout the process (performed on test or monitor wafers, as well as product wafers), such as thicknesses of thin films after deposition, etch, or polish; critical dimensions (such as polysilicon linewidths) after lithography and etch steps; particle counts; sheet resistivities; etc. In addition to in-line wafer measurements, a large amount of data is collected on process and equipment parameters. For example, in plasma etch processes, time series of the intensities of plasma emission are often gathered for use in detecting the "endpoint" (completion) of etch, and these endpoint traces (as well as numerous other parameters) are recorded. At the end of the line, volumes of test data are taken on completed wafers. This includes both electrical test (or "etest") data reporting parametric information about transistor and other device performance on die and scribe line structures, and sort or yield tests to identify which die on the wafer function or perform at various specifications (e.g., binning by clock speed). Such data, complemented by additional sources (e.g., fab facility environmental data, consumables or vendor data, offline equipment data), are all crucial not only for day-to-day monitoring and control of the fab, but also to investigate improvements, characterize the effects of uncontrolled changes (e.g., a change in wafer supply parameters), and to redesign or optimize process steps.
For correct application of statistical methods, one must recognize the idiosyncrasies of such measurements arising in semiconductor manufacturing. In particular, the usual assumptions in statistical modeling of independence and identicality of measurements often do not hold, or only hold under careful handling or aggregation of the data. Consider the measurement of channel length (polysilicon critical dimension, or CD) for a particular size transistor. First, the position of the transistor within the chip may affect the channel length: Nominally identical structures may indirectly be affected by pattern dependencies or other unintentional factors, giving rise to "within die" variation. If one wishes to compare CD across multiple die on the wafer (e.g., to measure "die to die" variation), one must thus be careful to measure the same structure on each die to be compared. This same theme of "blocking" against extraneous factors to study the variation of concern is repeated at multiple length and time scales. For example, we might erroneously assume that every die on the wafer is identical, so that any random sample of, say, five die across the wafer can be averaged to give an accurate value for CD. Because of systematic wafer-level variation (e.g., due to large-scale equipment nonuniformities), however, each measurement of CD is likely to also be a function of the position of the die on the wafer. The average of these five positions can still be used as an aggregate indicator for the process, but care must be taken that the same five positions are always measured, and this average value cannot be taken as indicative of any one measurement. Continuing with this example, one might also take five-die averages of CD for every wafer in a 25-wafer lot. Again, one must be careful to recognize that systematic wafer-to-wafer variation sources may exist. For example, in batch polysilicon deposition processes, gas flow and temperature patterns within the furnace tube will give rise to systematic wafer position dependencies; thus the 25 different five-die averages are not independent. Indeed, if one constructs a control chart under the assumption of independence, the estimate of five-die average variance will be erroneously inflated (it will also include the systematic within-lot variation), and the chart will have control limits set too far apart to detect important control excursions. If one is careful to consistently measure die averages on the same wafers (e.g., the 6th, 12th, and 18th wafers) in each lot and aggregate these as a "lot average", one can then construct a historical distribution of this aggregate "observation" (where an observation thus consists of several measurements) and correctly design a control chart. Great care must be taken to account for or guard against these and other systematic sources of variation which may give rise to lot-to-lot variation; these include tool dependencies, consumables or tooling (e.g., the specific reticle used in photolithography), as well as other unidentified time dependencies.
Encyclopedia of Electrical Engineering, J. G. Webster, Ed., Wiley, to appear, Feb. 1999.
Formal statistical assessment is an essential complement to engineering knowledge of known and suspected causal effects and systematic sources of variation. Engineering knowledge is crucial to specify data collection plans that ensure that statistical conclusions are defensible. If data is collected but care has not been taken to make "fair" comparisons, then the results will not be trusted no matter what statistical method is employed or amount of data collected. At the same time, correct application of statistical methods is also crucial to correct interpretation of the results. Consider a simple scenario: The deposition area in a production fab has been running a standard process recipe for several months, and has monitored defect counts on each wafer. An average of 12 defects per 8 in. wafer has been observed over that time. The deposition engineer believes that a change in the gasket (from a different supplier) on the deposition vacuum chamber can reduce the number of defects. The engineer makes the change and runs two lots of 25 wafers each with the change (observing an average of 10 defects per wafer), followed by two additional lots with the original gasket type (observing an average of 11 defects per wafer). Should the change be made permanently? A "difference" in output has been observed, but a key question remains: Is the change "statistically significant"? That is to say, considering the data collected and the system's characteristics, has a change really occurred, or might the same results be explained by chance?
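The significance question in this scenario can be given a quick numerical check. The following is only a rough sketch, not the article's method: it assumes defect counts are independent and Poisson-distributed (an assumption the surrounding text cautions must be examined), approximates the mean of 50 wafer counts as normal, and uses illustrative function names.

```python
import math

def norm_cdf(z):
    """Unit-normal cumulative density function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_mean_z(baseline_mean, sample_mean, n_wafers):
    """z-statistic for a shift in mean defect count, assuming independent
    Poisson counts (variance equals the mean under the baseline)."""
    std_err = math.sqrt(baseline_mean / n_wafers)
    return (sample_mean - baseline_mean) / std_err

# Two 25-wafer lots with the new gasket: observed mean 10 vs. historical 12
z = poisson_mean_z(12.0, 10.0, 50)
p_value = 2.0 * norm_cdf(-abs(z))   # two-sided
print(round(z, 2), p_value < 0.05)  # -4.08 True
```

Under these assumptions the two-lot shift from 12 to 10 defects would be very unlikely by chance; whether the wafers really are independent (and what the follow-up lots at 11 defects imply) is exactly the kind of question the hypothesis-testing discussion below addresses.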
Overview of Statistical Methods
Different statistical methods are used in semiconductor manufacturing to understand and answer questions such as that posed above and others, depending on the particular problem. Table 1 summarizes the most common methods and for what purpose they are used, and it serves as a brief outline of this article. The key engineering tasks involved in semiconductor manufacturing drawing on these methods are summarized in Table 2. In all cases, the issues of sampling plans and significance of the findings must be considered, and all sections will periodically address these issues. To highlight these concepts, note that in Table 1 the words "different," "same," "good," and "improve" are mentioned. These words tie together two critical issues: significance and engineering importance. When applying statistics to the data, one first determines if a statistically significant difference exists; otherwise, the data (or experimental effects) are assumed to be the same. Thus, what is really tested is whether there is a difference that is big enough to find statistically. Engineering needs and principles determine whether that difference is "good," whether the difference actually matters, or whether the cost of switching to a different process will be offset by the estimated improvements. The size of the difference that can be statistically seen is determined by the sampling plans and the statistical method used. Consequently, engineering needs must enter into the design of the experiments (sampling plan) so that the statistical test will be able to see a difference of the appropriate size. Statistical methods provide the means to determine sampling and significance to meet the needs of the engineer and manager.
In the first section, we focus on the basic underlying issues of statistical distributions, paying particular attention to those distributions typically used to model aspects of semiconductor manufacturing, including the indispensable Gaussian distribution as well as the binomial and Poisson distributions (which are key to modeling of defect- and yield-related effects). An example use of basic distributions is to estimate the interval of oxide thickness in which the engineer is confident that 99% of wafers will reside; based on this interval, the engineer could then decide if the process is meeting specifications and define limits or tolerances for chip design and performance modeling.
In the second section, we review the fundamental tool of statistical inference, the hypothesis test, as summarized in Table 1. The hypothesis test is crucial in detecting
Table 1: Summary of Statistical Methods Typically Used in Semiconductor Manufacturing
Topic | Statistical Method | Purpose
1 | Statistical distributions | Basic material for statistical tests. Used to characterize a population based upon a sample.
2 | Hypothesis testing | Decide whether data under investigation indicates that elements of concern are the "same" or "different."
3 | Experimental design and analysis of variance | Determine significance of factors and models; decompose observed variation into constituent elements.
4 | Response surface modeling | Understand relationships, determine process margin, and optimize process.
5 | Categorical modeling | Use when result or response is discrete (such as "very rough," "rough," or "smooth"). Understand relationships, determine process margin, and optimize process.
6 | Statistical process control | Determine if system is operating as expected.
differences in a process. Examples include: determining if the critical dimensions produced by two machines are different; deciding if adding a clean step will decrease the variance of the critical dimension etch bias; determining if no appreciable increase in particles will occur if the interval between machine cleans is extended by 10,000 wafers; or deciding if increasing the target doping level will improve a device's threshold voltage.
In the third section, we expand upon hypothesis testing to consider the fundamentals of experimental design and analysis of variance, including the issue of sampling required to achieve the desired degree of confidence in the existence of an effect or difference, as well as accounting for the risk in not detecting a difference. Extensions beyond single-factor experiments to the design of experiments which screen for effects due to several factors or their interactions are then discussed, and we describe the assessment of such experiments using formal analysis of variance methods. Examples of experimental design to enable decision-making abound in semiconductor manufacturing. For example, one might need to decide if adding an extra film or switching deposition methods will improve reliability, or decide which of three gas distribution plates (each with a different hole pattern) provides the most uniform etch process.
In the fourth section, we examine the construction of response surface or regression models of responses as a function of one or more continuous factors. Of particular importance are methods to assess the goodness of fit and error in the model, which are essential to appropriate use of regression models in optimization or decision-making. Examples here include determining the optimal values for temperature and pressure to produce wafers with no more than 2% nonuniformity in gate oxide thickness, or determining if typical variations in plasma power will cause out-of-specification materials to be produced.
Finally, we note that statistical process control (SPC) for monitoring the "normal" or expected behavior of a process is a critical statistical method (1,2). The fundamentals of statistical distributions and hypothesis testing discussed here bear directly on SPC; further details on statistical process monitoring and process optimization can be found in SEMICONDUCTOR FACTORY CONTROL AND OPTIMIZATION.
STATISTICAL DISTRIBUTIONS
Semiconductor technology development and manufacturing are often concerned with both continuous parameters (e.g., thin-film thicknesses, electrical performance parameters of transistors) and discrete parameters (e.g., defect counts and yield). In this section, we begin with a brief review of the fundamental probability distributions typically encountered in semiconductor manufacturing, as well as sampling distributions which arise when one calculates statistics based on multiple measurements (3). An understanding of these distributions is crucial to understanding hypothesis testing, analysis of variance, and other inferencing and statistical analysis methods discussed in later sections.
Descriptive Statistics
"Descriptive" statistics are often used to concisely present collections of data. Such descriptive statistics are based entirely (and only) on the available empirical data, and they do not assume any underlying probability model. Such descriptions include histograms (plots of relative frequency versus measured values or value ranges in some
Table 2: Engineering Tasks in Semiconductor Manufacturing and Statistical Methods Employed
Engineering Task | Statistical Method
Process and equipment control (e.g., CDs, thickness, deposition/removal rates, endpoint time, etest and sort data, yield) | Control chart for $\bar{x}$ (for wafer or lot mean) and $s$ (for within-wafer uniformity) based on the distribution of each parameter. Control limits and control rules are based on the distribution and time patterns in the parameter.
Equipment matching | ANOVA on $\bar{x}$ and/or $s$ over sets of equipment, fabs, or vendors.
Yield/performance modeling and prediction | Regression of end-of-line measurement as response against in-line measurements as factors.
Debug | ANOVA, categorical modeling, correlation to find most influential factors to explain the difference between good versus bad material.
Feasibility of process improvement | One-sample comparisons on limited data; single lot factorial experiment design; response surface models over a wider range of settings looking for opportunities for optimization, or to establish process window over a narrower range of settings.
Validation of process improvement | Two-sample comparison run over time to ensure a change is robust and works over multiple tools, and no subtle defects are incurred by change.
parameter), as well as calculation of the mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$

(where $n$ is the number of values observed), calculation of the median (the value in the "middle" of the data, with an equal number of observations below and above), and calculation of data percentiles. Descriptive statistics also include the sample variance and sample standard deviation:

$$s_x^2 = \mathrm{Sample\ Var}\{x\} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (2)$$

$$s_x = \mathrm{Sample\ std.\ dev.}\{x\} = \sqrt{s_x^2} \qquad (3)$$

A drawback to such descriptive statistics is that they involve the study of observed data only and enable us to draw conclusions which relate only to that specific data. Powerful statistical methods, on the other hand, have come to be based instead on probability theory; this allows us to relate observations to some underlying probability model and thus make inferences about the population (the theoretical set of all possible observations) as well as the sample (those observations we have in hand). It is the use of these models that gives computed statistics (such as the mean) explanatory power.
Probability Model
Perhaps the simplest probability model of relevance in semiconductor manufacturing is the Bernoulli distribution. A Bernoulli trial is an experiment with two discrete outcomes: success or failure. We can model the a priori probability (based on historical data or theoretical knowledge) of a success simply as $p$. For example, we may have aggregate historical data that tells us that line yield is 95% (i.e., that 95% of product wafers inserted in the fab successfully emerge at the end of the line intact). We make the leap from this descriptive information to an assumption of an underlying probability model: we suggest that the probability of any one wafer making it through the line is equal to 0.95. Based on that probability model, we can predict an outcome for a new wafer which has not yet been processed and was not part of the original set of observations. Of course, the use of such probability models involves assumptions -- for example, that the fab and all factors affecting line yield are essentially the same for the new wafer as for those used in constructing the probability model. Indeed, we often find that each wafer is not independent of other wafers, and such assumptions must be examined carefully.
Normal (Gaussian) Distribution.
In addition to discrete probability distributions, continuous distributions also play a crucial role in semiconductor manufacturing. Quite often, one is interested in the probability density function (or pdf) for some parametric value. The most important continuous distribution (in large part due to the central limit theorem) is the Gaussian or normal distribution. We can write that a random variable $x$ is "distributed as" a normal distribution with mean $\mu$ and variance $\sigma^2$ as $x \sim N(\mu, \sigma^2)$. The probability density function for $x$ is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad (4)$$

which is also often discussed in unit normal form through the normalization $z = (x - \mu)/\sigma$, so that $z \sim N(0, 1)$:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} \qquad (5)$$

Given a continuous probability density function, one can talk about the probability of finding a value in some range. Consider the oxide thickness at the center of a wafer after rapid thermal oxidation in a single-wafer system. If this oxide thickness is normally distributed with $\mu = 100$ Å and $\sigma^2 = 10$ Å$^2$ (or a standard deviation of 3.16 Å), then the probability of any one such measurement $x$ falling between 105 Å and 120 Å can be determined as

$$\Pr(105 \le x \le 120) = \int_{105}^{120} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \Phi(z_u) - \Phi(z_l) = \Phi(6.325) - \Phi(1.581) = 0.0569 \qquad (6)$$

where $z_l = (105 - \mu)/\sigma$, $z_u = (120 - \mu)/\sigma$, and $\Phi(z)$ is the cumulative density function for the unit normal, which is available via tables or statistical analysis packages.
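The probability in Eq. (6) can be reproduced numerically. A minimal sketch using only the Python standard library (the unit-normal cdf $\Phi$ is expressed through the error function):

```python
import math

def phi(z):
    """Cumulative density function of the unit normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 100.0, math.sqrt(10.0)   # oxide thickness ~ N(100, 10), in Angstroms
z_l = (105.0 - mu) / sigma           # 1.581
z_u = (120.0 - mu) / sigma           # 6.325
p = phi(z_u) - phi(z_l)
print(round(p, 3))                   # about 0.057, matching Eq. (6)
```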
We now briefly summarize other common discrete probability mass functions (pmf's) and continuous probability density functions (pdf's) that arise in semiconductor manufacturing, and then we turn to sampling distributions that are also important.
Binomial Distribution.
Very often we are interested in the number of successes in repeated Bernoulli trials (that is, repeated "succeed" or "fail" trials). If $x$ is the number of successes in $n$ trials, then $x$ is distributed as a binomial distribution, $x \sim B(n, p)$, where $p$ is the probability of each individual "success". The pmf is given by
$$f(x; p, n) = \binom{n}{x} p^x (1-p)^{n-x} \qquad (7)$$

where "$n$ choose $x$" is

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!} \qquad (8)$$
For example, if one is starting a 25-wafer lot in the fab above, one may wish to know the probability that some number $x$ ($x$ being between 0 and 25) of those wafers will survive. For the line yield model of $p = 95\%$, these probabilities are shown in Fig. 1. When $n$ is very large (much larger than 25), the binomial distribution is well approximated by a Gaussian distribution.
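The probabilities plotted in Fig. 1 follow directly from Eqs. (7) and (8); a short sketch (using Python's built-in binomial coefficient) computes a couple of them for the 95% line-yield model:

```python
import math

def binom_pmf(x, n, p):
    """Binomial pmf: probability of exactly x successes in n trials, Eq. (7)."""
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)

n, p = 25, 0.95                                   # 25-wafer lot, 95% line yield
p_all = binom_pmf(25, n, p)                       # every wafer survives
p_23_plus = sum(binom_pmf(x, n, p) for x in (23, 24, 25))
print(round(p_all, 3), round(p_23_plus, 3))       # 0.277 0.873
```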
A key assumption in using the binomial distribution is that each wafer is independent of the other wafers in the lot. One often finds that many lots complete processing with all wafers intact, while other lots suffer multiple damaged wafers. While an aggregate line yield of 95% may result, independence does not hold, and alternative approaches such as formation of empirical distributions should be used.
Poisson Distribution. A third discrete distribution is highly relevant to semiconductor manufacturing. An approximation to the binomial distribution that applies when n is large and p is small is the Poisson distribution:

$$f(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!} \qquad (9)$$

for integer $x = 0, 1, 2, \ldots$ and $\lambda = np$.
For example, one can examine the number of chips that fail on the first day of operation: The probability $p$ of failure for any one chip is (hopefully) exceedingly small, but one tests a very large number $n$ of chips, so that the observed mean number of failed chips $\lambda \cong np$ is Poisson distributed. An even more common application of the Poisson model is in defect modeling. For example, if defects are Poisson-distributed with a mean defect count of $\lambda = 3$ particles per 200 mm wafer, one can ask questions about the probability of observing $x$ [e.g., Eq. (9)] defects on a sample wafer. In this case, $f(9; 3) = e^{-3} 3^9 / 9! = 0.0027$, or less than 0.3% of the time would we expect to observe exactly 9 defects. Similarly, the probability that 9 or more defects are observed is $1 - \sum_{x=0}^{8} f(x; 3) = 0.0038$.
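The two Poisson probabilities just quoted can be verified against Eq. (9) directly; a minimal standard-library sketch:

```python
import math

def poisson_pmf(x, lam):
    """Poisson pmf, Eq. (9)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 3.0    # mean of 3 defects per 200 mm wafer
p_exactly_9 = poisson_pmf(9, lam)
p_9_or_more = 1.0 - sum(poisson_pmf(x, lam) for x in range(9))
print(round(p_exactly_9, 4), round(p_9_or_more, 4))   # 0.0027 0.0038
```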
In the case of defect modeling, several other distributions have historically been used, including the exponential, hypergeometric, modified Poisson, and negative binomial. Substantial additional work has been reported in yield modeling to account for clustering, and to understand the relationship between defect models and yield (e.g., see Ref. 4).
Population Versus Sample Statistics
We now have the beginnings of a statistical inference theory. Before proceeding with formal hypothesis testing in the next section, we first note that the earlier descriptive statistics of mean and variance take on new interpretations in the probabilistic framework. The mean is the expectation (or "first moment") over the distribution:

$$\mu_x = E\{x\} = \int_{-\infty}^{\infty} x f(x)\,dx = \sum_{i=1}^{n} x_i \cdot \Pr(x_i) \qquad (10)$$

for continuous pdf's and discrete pmf's, respectively. Similarly, the variance is the expectation of the squared deviation from the mean (or the "second central moment") over the distribution:

$$\sigma_x^2 = \mathrm{Var}\{x\} = \int_{-\infty}^{\infty} (x - E\{x\})^2 f(x)\,dx = \sum_{i=1}^{n} (x_i - E\{x_i\})^2 \cdot \Pr(x_i) \qquad (11)$$

Further definitions from probability theory are also highly useful, including the covariance:

$$\sigma_{xy}^2 = \mathrm{Cov}\{x, y\} = E\{(x - E\{x\})(y - E\{y\})\} = E\{xy\} - E\{x\}E\{y\} \qquad (12)$$

where $x$ and $y$ are each random variables with their own probability distributions, as well as the related correlation coefficient:

$$\rho_{xy} = \mathrm{Corr}\{x, y\} = \frac{\mathrm{Cov}\{x, y\}}{\sqrt{\mathrm{Var}\{x\}\,\mathrm{Var}\{y\}}} = \frac{\sigma_{xy}^2}{\sigma_x \sigma_y} \qquad (13)$$
The above definitions for the mean, variance, covariance, and correlation all relate to the underlying or
Figure 1. Probabilities for number of wafers surviving from a lot of 25 wafers, assuming a line yield of 90%, calculated using the binomial distribution.
assumed population. When one only has a sample (that is, a finite number of values drawn from some population), one calculates the corresponding sample statistics. These are no longer "descriptive" of only the sample we have; rather, these statistics are now estimates of parameters in a probability model. Corresponding to the population parameters above, the sample mean $\bar{x}$ is given by Eq. (1), the sample variance $s_x^2$ by Eq. (2), the sample standard deviation $s_x$ by Eq. (3), the sample covariance is given by

$$s_{xy}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \qquad (14)$$

and the sample correlation coefficient is given by

$$r_{xy} = \frac{s_{xy}^2}{s_x s_y} \qquad (15)$$
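These sample statistics are straightforward to compute; a sketch with hypothetical paired measurements (the data values here are invented for illustration):

```python
import math

def sample_stats(x, y):
    """Sample mean, variances, covariance, and correlation: Eqs. (1), (2), (14), (15)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    var_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    return xbar, var_x, cov, cov / math.sqrt(var_x * var_y)

# Hypothetical paired in-line data, e.g., linewidth (um) vs. a correlated response
x = [0.25, 0.27, 0.26, 0.28, 0.24]
y = [1.1, 1.3, 1.2, 1.4, 1.0]      # exactly linear in x, so correlation is 1
xbar, var_x, cov, corr = sample_stats(x, y)
print(round(xbar, 2), round(corr, 3))
```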
Sampling Distributions
Sampling is the act of making inferences about populations based on some number of observations. Random sampling is especially important and desirable, where each observation is independent and identically distributed. A statistic is a function of sample data which contains no further unknowns (e.g., the sample mean can be calculated from the observations and has no further unknowns). It is important to note that a sample statistic is itself a random variable and has a "sampling distribution" which is usually different than the underlying population distribution. In order to reason about the "likelihood" of observing a particular statistic (e.g., the mean of five measurements), one must be able to construct the underlying sampling distribution.
Sampling distributions are also intimately bound up with estimation of population distribution parameters. For example, suppose we know that the thickness of gate oxide (at the center of the wafer) is normally distributed: $T_i \sim N(\mu, \sigma^2) = N(100, 10)$. We sample five random wafers and compute the mean oxide thickness $\bar{T} = \frac{1}{5}(T_1 + T_2 + \cdots + T_5)$. We now have two key questions: (1) What is the distribution of $\bar{T}$? (2) What is the probability that $a \le \bar{T} \le b$?

In this case, given the expression for $\bar{T}$ above, we can use the fact that the variance of a scaled random variable $ax$ is simply $a^2\,\mathrm{Var}\{x\}$, and the variance of a sum of independent random variables is the sum of the variances:

$$\bar{T} \sim N(\mu_{\bar{T}}, \sigma^2/n) \qquad (16)$$

where $\mu_{\bar{T}} = \mu$ by the definition of the mean. Thus, when we want to reason about the likelihood of observing values of $\bar{T}$ (that is, averages of five sample measurements) lying within particular ranges, we must be sure to use the distribution for $\bar{T}$ rather than that for the underlying distribution $T$. Thus, in this case the probability of finding a value between 105 Å and 120 Å is

$$\Pr(105 \le \bar{T} \le 120) = \Pr\!\left(\frac{105 - \mu}{\sigma/\sqrt{n}} \le \frac{\bar{T} - \mu}{\sigma/\sqrt{n}} \le \frac{120 - \mu}{\sigma/\sqrt{n}}\right) = \Phi(14.142) - \Phi(3.536) = 0.0002 \qquad (17)$$

which is relatively unlikely. Compare this to the result from Eq. (6) for the probability 0.0569 of observing a single value (rather than a five-sample average) in the range 105 Å to 120 Å.
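The contrast between Eq. (6) and Eq. (17) can be reproduced numerically; a short sketch comparing a single measurement with a five-wafer average:

```python
import math

def phi(z):
    """Unit-normal cumulative density function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, var, n = 100.0, 10.0, 5
sd_single = math.sqrt(var)        # std. dev. of one measurement
sd_mean = math.sqrt(var / n)      # std. dev. of the five-wafer average, Eq. (16)
p_single = phi((120 - mu) / sd_single) - phi((105 - mu) / sd_single)
p_mean = phi((120 - mu) / sd_mean) - phi((105 - mu) / sd_mean)
print(round(p_single, 3), round(p_mean, 4))   # 0.057 0.0002
```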
Note that the same analysis would apply if $\bar{T}$ was itself formed as an average of underlying oxide thickness measurements (e.g., from a nine-point pattern across the wafer), if that average and each of the five wafers used in calculating the sample average are independent and also distributed as $N(100, 10)$.
Chi-Square Distribution and Variance Estimates. Several other vitally important distributions arise in sampling, and are essential to making statistical inferences in experimental design or regression. The first of these is the chi-square distribution. If $x_i \sim N(0, 1)$ for $i = 1, 2, \ldots, n$ and $y = x_1^2 + x_2^2 + \cdots + x_n^2$, then $y$ is distributed as chi-square with $n$ degrees of freedom, written as $y \sim \chi_n^2$. While formulas for the probability density function for $\chi^2$ exist, they are almost never used directly and are again instead tabulated or available via statistical packages. The typical use of the $\chi^2$ is for finding the distribution of the variance when the mean is known.

Suppose we know that $x_i \sim N(\mu, \sigma^2)$. As discussed previously, we know that the mean over our $n$ observations is distributed as $\bar{x} \sim N(\mu, \sigma^2/n)$. How is the sample variance $s^2$ over our $n$ observations distributed? We note that each $(x_i - \bar{x})$ is normally distributed; thus if we normalize our sample variance $s^2$ by $\sigma^2$ we have a chi-square distribution:
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi_{n-1}^2 \quad \text{or} \quad s^2 \sim \frac{\sigma^2}{n-1} \cdot \chi_{n-1}^2 \qquad (18)$$

where one degree of freedom is used in calculation of $s^2 = \sum_i (x_i - \bar{x})^2 / (n-1)$. Thus, the sample variance for $n$ observations drawn from $N(\mu, \sigma^2)$ is distributed as chi-square as shown in Eq. (18).
For our oxide thickness example where $T_i \sim N(100, 10)$ and we form five-wafer samples as above, then $s_T^2 \sim \frac{10}{4}\chi_4^2$. Only 5% of the time will the observed sample variance fall below $(10/4) \times 0.711 = 1.78$ Å$^2$, and 95% of the time it will fall below $(10/4) \times 9.49 = 23.7$ Å$^2$. A control chart could thus be constructed to detect when unusually large (or small) five-wafer sample variances occur.
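The variance limit implied by Eq. (18) can be computed from the tabulated 95th percentile of $\chi_4^2$; a sketch for the five-wafer oxide example:

```python
CHI2_4_95 = 9.488   # tabulated 95th percentile of chi-square with 4 degrees of freedom

sigma2, n = 10.0, 5                    # oxide thickness ~ N(100, 10), five-wafer samples
ucl = sigma2 / (n - 1) * CHI2_4_95     # 95% point for the sample variance, Eq. (18)
print(round(ucl, 2))                   # 23.72 square Angstroms
```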
Student t Distribution. The Student t distribution is another important sampling distribution. The typical use is when we want to find the distribution of the sample mean when the true standard deviation $\sigma$ is not known. Consider

$$\frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{(\bar{x} - \mu)/(\sigma/\sqrt{n})}{s/\sigma} = \frac{N(0, 1)}{\sqrt{\frac{1}{n-1}\chi_{n-1}^2}} \sim t_{n-1} \qquad (19)$$

In the above, we have used the definition of the Student t distribution: if $z$ is a random variable, $z \sim N(0, 1)$, then $z/\sqrt{y/k}$ is distributed as a Student t with $k$ degrees of freedom, or $z/\sqrt{y/k} \sim t_k$, if $y$ is a random variable distributed as $\chi_k^2$. As discussed previously, the normalized sample variance $s^2/\sigma^2$ is chi-square-distributed, so that our definition does indeed apply. We thus find that the normalized sample mean is distributed as a Student t with $n-1$ degrees of freedom when we do not know the true standard deviation and must estimate it based on the sample as well. We note that as $k \to \infty$, the Student t approaches a unit normal distribution, $t_k \to N(0, 1)$.
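The convergence of the Student t to the unit normal can be seen directly from the t density (written here with the gamma function; this closed form is standard, though, as the article notes, tables or software are normally used):

```python
import math

def t_pdf(x, k):
    """Density of the Student t distribution with k degrees of freedom."""
    c = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return c * (1.0 + x * x / k) ** (-(k + 1) / 2)

def norm_pdf(x):
    """Unit-normal density, Eq. (5)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# The gap between t_k and N(0, 1) at the origin shrinks as k grows
diffs = [abs(t_pdf(0.0, k) - norm_pdf(0.0)) for k in (2, 10, 100)]
print([round(d, 4) for d in diffs])
```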
F Distribution and Ratios of Variances. The last sampling distribution we wish to discuss here is the F distribution. We shall see that the F distribution is crucial in analysis of variance (ANOVA) and experimental design in determining the significance of effects, because the F distribution is concerned with the probability density function for the ratio of variances. If $y_1 \sim \chi_u^2$ (that is, $y_1$ is a random variable distributed as chi-square with $u$ degrees of freedom) and similarly $y_2 \sim \chi_v^2$, then the random variable $Y = \frac{y_1/u}{y_2/v} \sim F_{u,v}$ (that is, distributed as F with $u$ and $v$ degrees of freedom). The typical use of the F distribution is to compare the spread of two distributions. For example, suppose that we have two samples $x_1, x_2, \ldots, x_n$ and $w_1, w_2, \ldots, w_m$, where $x_i \sim N(\mu_x, \sigma_x^2)$ and $w_i \sim N(\mu_w, \sigma_w^2)$. Then

$$\frac{s_x^2/\sigma_x^2}{s_w^2/\sigma_w^2} \sim F_{n-1,\,m-1} \qquad (20)$$
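Eq. (20) is the basis of a simple two-sample variance comparison. A sketch with invented CD data from two hypothetical etchers, compared against the tabulated 95th percentile of $F_{4,4}$:

```python
def sample_var(data):
    """Sample variance, Eq. (2)."""
    n = len(data)
    m = sum(data) / n
    return sum((d - m) ** 2 for d in data) / (n - 1)

# Hypothetical CD measurements (nm), five wafers from each of two etchers
etcher_a = [250, 253, 249, 252, 251]
etcher_b = [250, 258, 243, 255, 247]

F = sample_var(etcher_b) / sample_var(etcher_a)   # ratio of sample variances
F_CRIT_4_4 = 6.39    # tabulated 95th percentile of F with (4, 4) degrees of freedom
print(round(F, 2), F > F_CRIT_4_4)                # 14.52 True
```

Under the invented data, the second etcher's spread is significantly larger at the 95% level, the kind of comparison the "Equipment matching" row of Table 2 refers to.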
Point and Interval Estimation
The above population and sampling distributions form the basis for the statistical inferences we wish to draw in many semiconductor manufacturing examples. One important use of the sampling distributions is to estimate population parameters based on some number of observations. A point estimate gives a single "best" estimated value for a population parameter. For example, the sample mean $\bar{x}$ is an estimate for the population mean $\mu$. Good point estimates are representative or unbiased (that is, the expected value of the estimate should be the true value), as well as minimum variance (that is, we desire the estimator with the smallest variance in that estimate). Often we restrict ourselves to linear estimators; for example, the best linear unbiased estimator (BLUE) for various parameters is typically used.
Many times we would like to determine a confidence interval for estimates of population parameters; that is, we want to know how likely it is that $\mu$ is within some particular range of $\bar{x}$. Asked another way, to a desired probability, where will $\mu$ actually lie given an estimate $\bar{x}$? Such interval estimation is considered next; these intervals are used in later sections to discuss hypothesis testing.
First, let us consider the confidence interval for estimation of the mean, when we know the true variance of the process. Given that we make $n$ independent and identically distributed samples from the population, we can calculate the sample mean as in Eq. (1). As discussed earlier, we know that the sample mean is normally distributed as shown in Eq. (16). We can thus determine the probability that an observed $\bar{x}$ is larger than $\mu$ by a given amount:

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}, \qquad \Pr(z > z_\alpha) = \alpha = 1 - \Phi(z_\alpha) \qquad (21)$$

where $z_\alpha$ is the alpha percentage point for the normalized variable $z$ (that is, $z_\alpha$ measures how many standard deviations greater than $\mu$ we must be in order for the
integrated probability density to the right of this value to equal α). As shown in Fig. 2, we are usually interested in asking the question the other way around: to a given probability [100(1 − α), e.g., 95%], in what range will the true mean μ lie given an observed x̄?

    x̄ − z_{α/2}·σ/√n ≤ μ ≤ x̄ + z_{α/2}·σ/√n,  that is,  μ = x̄ ± z_{α/2}·σ/√n    (22)
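Eq. (22) is straightforward to evaluate numerically. The sketch below uses only the Python standard library (the normal tail probability of Eq. (21) is obtained from the error function); the helper names are our own, and the numbers are those of the oxide example in the text.

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mean_ci(xbar, sigma, n, z_half_alpha=1.96):
    """Two-sided 95% confidence interval for mu, Eq. (22), known sigma."""
    half = z_half_alpha * sigma / math.sqrt(n)
    return xbar - half, xbar + half

# Oxide example: sigma^2 = 10 A^2, five-wafer average observed at 100 A
lo, hi = mean_ci(100.0, math.sqrt(10.0), 5)

# Tail probability of Eq. (21) for a sample mean observed at 102.772 A
p = 1.0 - phi((102.772 - 100.0) / (math.sqrt(10.0) / math.sqrt(5)))
```

With these numbers the interval reproduces the 97.228 Å to 102.772 Å limits used in the oxide example below, and the tail probability at the upper limit is α/2 ≈ 0.025.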
For our oxide thickness example, we can now answer an important question. If we calculate a five-wafer average, how far away from the average 100 Å must the value be for us to decide that the process has changed away from the mean (given a process variance of 10 Å²)? In this case, we must define the “confidence” in saying that the process has changed; this is equivalent to the probability of observing the deviation by chance. With 95% confidence, we can declare that the process is different than the mean of 100 Å if we observe T̄ < 97.228 Å or T̄ > 102.772 Å:

    α/2 = Pr((T̄ − μ)/(σ/√n) ≤ −z_{α/2}) = Pr((T̄ − μ)/(σ/√n) ≤ −1.95996),  so  |T̄ − μ| = 1.95996·σ/√n = 2.7718    (23)
A similar result occurs when the true variance is not known. The 100(1 − α)% confidence interval in this case is determined using the appropriate Student t sampling distribution for the sample mean:

    x̄ − t_{α/2,n−1}·s/√n ≤ μ ≤ x̄ + t_{α/2,n−1}·s/√n,  that is,  μ = x̄ ± t_{α/2,n−1}·s/√n    (24)
In some cases, we may also desire a confidence interval on the estimate of variance:

    (n − 1)s²/χ²_{α/2,n−1} ≤ σ² ≤ (n − 1)s²/χ²_{1−α/2,n−1}    (25)
For example, consider a new etch process which we want to characterize. We calculate the average etch rate (based on 21-point spatial pre- and post-thickness measurements across the wafer) for each of four wafers we process. The wafer averages are 502, 511, 515, and 504 Å/min. The sample mean is x̄ = 508 Å/min, and the sample standard deviation is s = 6.1 Å/min. The 95% confidence intervals for the true mean and standard deviation are μ = 508 ± 9.7 Å/min and 3.4 ≤ σ ≤ 22.6 Å/min. Increasing the number of observations (wafers) can improve our estimates and intervals for the mean and variance, as discussed in the next section.
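These intervals can be checked numerically. The sketch below sticks to the standard library, so the t and χ² critical values for n − 1 = 3 degrees of freedom are hard-coded from tables (a statistics package such as SciPy would normally supply them via its quantile functions):

```python
import math
import statistics

rates = [502, 511, 515, 504]        # wafer-average etch rates, A/min
n = len(rates)
xbar = statistics.mean(rates)       # sample mean, Eq. (1)
s = statistics.stdev(rates)         # sample standard deviation

# 95% critical values for 3 degrees of freedom (taken from tables)
T_975 = 3.182                       # t_{0.025,3}
CHI2_975, CHI2_025 = 9.348, 0.216   # chi^2_{0.025,3} and chi^2_{0.975,3}

half = T_975 * s / math.sqrt(n)                   # Eq. (24) half-width
sigma_lo = math.sqrt((n - 1) * s**2 / CHI2_975)   # Eq. (25) lower bound
sigma_hi = math.sqrt((n - 1) * s**2 / CHI2_025)   # Eq. (25) upper bound
```

Running this reproduces the intervals quoted in the example (mean within about ±9.7 Å/min, and 3.4 ≤ σ ≤ 22.6 Å/min).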
Many other cases (e.g., one-sided confidence intervals) can also be determined based on manipulation of the appropriate sampling distributions, or through consultation with more extensive texts (5).
HYPOTHESIS TESTING
Given an underlying probability distribution, it now becomes possible to answer some simple, but very important, questions about any particular observation. In this section, we formalize the decision-making earlier applied to our oxide thickness example. Suppose as before that we know that oxide thickness is normally distributed, with a mean of 100 Å and standard deviation of 3.162 Å. We may know this based on a very large number of previous historical measurements, so that we can well approximate the true population of oxide thicknesses out of a particular furnace with these two distribution parameters. We suspect something just changed in the equipment, and we want to determine if there has been an impact on oxide thickness. We take a new observation (i.e., run a new wafer and form our oxide thickness value as usual, perhaps as the average of nine measurements at fixed positions across the wafer). The key question is: What is the probability that we would get this observation if the process has not changed, versus the probability of getting this observation if the process has indeed changed?
We are conducting a hypothesis test. Based on the observation, we want to test the hypothesis (label this H1) that the underlying distribution mean has increased from μ0 by some amount δ to μ1. The “null hypothesis” H0 is that nothing has changed and the true mean is still μ0. We are looking for evidence to convince us that H1 is true.

We can plot the probability density function for each of the two hypotheses under the assumption that the variance has not changed, as shown in Fig. 3. Suppose now that we
Figure 2. Probability density function f(x̄) ~ N(μ, σ²/n) for the sample mean. The sample mean is unbiased (the expected value of the sample mean is the true mean μ). The shaded portion α/2 in the tail captures the probability that the sample mean is greater than a distance z_{α/2}·σ/√n from the true mean.
observe the value x_i. Intuitively, if the value of x_i is “closer” to μ0 than to μ1, we will more than likely believe that the value comes from the H0 distribution than the H1 distribution. Under a maximum likelihood approach, we can compare the probability density functions:

    reject H0 if f1(x_i) > f0(x_i)    (26)

that is, if f1 is greater than f0 for our observation, we reject the null hypothesis H0 (that is, we “accept” the alternative hypothesis H1). If we have prior belief (or other knowledge) affecting the a priori probabilities of H0 and H1, these can also be used to scale the distributions f0 and f1 to determine a posteriori probabilities for H0 and H1. Similarly, we can define the “acceptance region” as the set of values of x_i for which we accept each respective hypothesis. In Fig. 3, we have the rule: accept H1 if x_i > x*, and accept H0 if x_i < x*.
The maximum likelihood approach provides a powerful but somewhat complicated means for testing hypotheses. In typical situations, a simpler approach suffices. We select a confidence 1 − α with which we must detect a “difference,” and we pick a decision point x* based on that confidence. For two-sided detection with a unit normal distribution, for example, we select z > z_{α/2} and z < −z_{α/2} as the regions for declaring that unusual behavior (i.e., a shift) has occurred.
Consider our historical data on oxide thickness where T_i ~ N(100, 10). To be 95% confident that a new observation indicates a change in mean, we must observe T_i < 93.8 Å or T_i > 106.2 Å. On the other hand, if we form five-wafer averages, then we must observe T̄ < 97.2 Å or T̄ > 102.8 Å as in Eq. (23) to convince us (to 95% confidence) that our process has changed.
Alpha and Beta Risk (Type I and Type II Errors)
The hypothesis test gives a clear, unambiguous procedure for making a decision based on the distributions and assumptions outlined above. Unfortunately, there may be a substantial probability of making the wrong decision. In the maximum likelihood example of Fig. 3, for the single observation x_i as drawn we accept the alternative hypothesis H1. However, examining the distribution corresponding to H0, we see that a nonzero probability exists of such an x_i arising even when H0 is true. Two types of errors are of concern:

    α = Pr(Type I error) = Pr(reject H0 | H0 is true) = ∫_{x*}^{∞} f0(x) dx
    β = Pr(Type II error) = Pr(accept H0 | H1 is true) = ∫_{−∞}^{x*} f1(x) dx    (27)
We note that the Type I error α (or probability of a “false alarm”) is based entirely on our decision rule and does not depend on the size of the shift we are seeking to detect. The Type II error β (or probability of “missing” a real shift), on the other hand, depends strongly on the size of the shift. This Type II error can be evaluated for the distributions and decision rules above; for the normal distribution of Fig. 3, we find

    β = Φ(z_{α/2} − δ/σ) − Φ(−z_{α/2} − δ/σ)    (28)

where δ is the shift we wish to detect.
The “power” of a statistical test is defined as

    Power ≡ 1 − β = Pr(reject H0 | H0 is false)    (29)
that is, the power of the test is the probability of correctly rejecting H0. Thus the power depends on the shift δ we wish to detect as well as the level of “false alarms” (α or Type I error) we are willing to endure. Power curves are often plotted of β versus the normalized shift to detect, d = δ/σ, for a given fixed α (e.g., α = 0.05), and as a function of sampling size.
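Eq. (28) is easy to evaluate numerically. A minimal sketch, using only the standard library and hard-coding z_{α/2} = 1.96 for α = 0.05 (the function names are our own):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def beta_risk(d, z_half_alpha=1.96):
    """Type II error for a two-sided z test of a single observation,
    Eq. (28), with d = delta/sigma the normalized shift."""
    return phi(z_half_alpha - d) - phi(-z_half_alpha - d)

def power(d, z_half_alpha=1.96):
    """Eq. (29): probability of correctly rejecting H0."""
    return 1.0 - beta_risk(d, z_half_alpha)
```

At d = 0 (no shift) the rejection probability reduces to α itself, and β falls monotonically as the shift grows, which is exactly the shape of the power curves described above.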
Hypothesis Testing and Sampling Plans. In the previous discussion, we described a hypothesis test for detection of a shift in mean of a normal distribution, based on a single observation. In realistic situations, we can often make multiple observations and improve our ability to make a correct decision. If we take n observations, then the sample
Figure 3. Distributions of x under the null hypothesis H0, where f0(x) ~ N(μ0, σ²), and under the hypothesis H1 that a positive shift has occurred such that μ1 = μ0 + δ.
mean x̄ is normally distributed, but with a reduced variance σ²/n, as illustrated by our five-wafer average of oxide thickness. It now becomes possible to pick an α risk associated with a decision on the sampling distribution that is acceptable, and then determine a sample size n in order to achieve the desired level of risk. In the first step, we still select z_{α/2} based on the risk of false alarm α, but now this determines the actual unnormalized decision point x* using x* = μ + z_{α/2}·σ/√n. We finally pick n (which determines x* as just defined) based on the Type II error β, which also depends on the size of the normalized shift d = δ/σ to be detected and the sample size n. Graphs and tables are indispensable in selecting the sample size for a given d and α; for example, Fig. 4 shows the Type II error associated with sampling from a unit normal distribution, for a fixed Type I error of 0.05, and as a function of sampling size n and shift to be detected d.
Consider once more our wafer average oxide thickness T̄. Suppose that we want to detect a 1.58 Å (or 0.5σ) shift from the 100 Å mean. We are willing to accept a false alarm 5% of the time (α = 0.05) but can accept missing the shift 20% of the time (β = 0.20). From Fig. 4, we need to take n = 30. Our decision rule is to declare that a shift has occurred if our 30 sample average T̄ < 98.9 Å or T̄ > 101.1 Å. If it is even more important to detect the 0.5σ shift in mean (e.g., with probability β = 0.10 of missing the shift), then from Fig. 4 we now need n = 40 samples, and our decision rule is to declare a shift if the 40 sample average T̄ < 99.0 Å or T̄ > 101.0 Å.
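The sample sizes read off Fig. 4 can also be approximated in closed form with n ≈ ((z_{α/2} + z_β)/d)², a standard approximation not derived in the text. The sketch below hard-codes the normal percentage points from tables (a statistics package would supply them); it lands within rounding of the chart values of 30 and 40:

```python
import math

def sample_size(d, z_half_alpha, z_beta):
    """Approximate n so a normalized shift d = delta/sigma is detected
    with Type I risk alpha (two-sided) and Type II risk beta."""
    return math.ceil(((z_half_alpha + z_beta) / d) ** 2)

# alpha = 0.05 two-sided, beta = 0.20, detect a 0.5-sigma shift
n1 = sample_size(0.5, 1.960, 0.8416)   # the chart in the text reads ~30
# tighter beta = 0.10
n2 = sample_size(0.5, 1.960, 1.2816)   # the chart in the text reads ~40
```

The closed form is slightly conservative compared with reading the operating-characteristic curves directly, which accounts for the small difference from the chart.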
Control Chart Application. The concepts of hypothesis testing, together with issues of Type I and Type II error as well as sample size determination, have one of their most common applications in the design and use of control charts. For example, the x̄ control chart can be used to detect when a “significant” change from “normal” operation occurs. The assumption here is that when the underlying process is operating under control, the fundamental process population is distributed normally as x_i ~ N(μ, σ²). We periodically draw a sample of n observations and calculate the average of those observations (x̄). In the x̄ control chart, we essentially perform a continuous hypothesis test, where we set the upper and lower control limits (UCLs and LCLs) such that x̄ falls outside these control limits with probability α when the process is truly under control (e.g., we usually select ±3σ limits to give a 0.27% chance of false alarm). We would then choose the sample size n so as to have a particular power (that is, a probability of actually detecting a shift of a given size) as previously described. The control limits (CLs) would then be set at

    CL = μ ± z_{α/2}·σ/√n    (30)
Note that care must be taken to check key assumptions in setting sample sizes and control limits. In particular, one should verify that each of the n observations to be utilized in forming the sample average x̄ (or the sample standard deviation in an s chart) are independent and identically distributed. If, for example, we aggregate or take the average of n successive wafers, we must be sure that there are no systematic within-lot sources of variation. If such additional variation sources do exist, then the estimate of variance formed during process characterization (e.g., in our etch example) will be inflated by accidental inclusion of this systematic contribution, and any control limits one sets will be wider than appropriate to detect real changes in the underlying random distribution. If systematic variation does indeed exist, then a new statistic should be formed that blocks against that variation (e.g., by specifying which wafers should be measured out of each lot), and the control chart based on the distribution of that aggregate statistic.
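For instance, the usual 3σ limits of Eq. (30) for the five-wafer oxide average can be sketched as follows (standard library only; the function name is our own):

```python
import math

def xbar_control_limits(mu, sigma, n, z=3.0):
    """LCL/UCL for an x-bar chart, Eq. (30); z = 3 gives the usual
    3-sigma limits (about a 0.27% false-alarm rate per sample)."""
    half = z * sigma / math.sqrt(n)
    return mu - half, mu + half

# Oxide example: mu = 100 A, sigma^2 = 10 A^2, five-wafer averages
lcl, ucl = xbar_control_limits(100.0, math.sqrt(10.0), 5)
```

Note how the limits tighten as n grows: averaging shrinks the sampling variance to σ²/n, which is exactly the effect exploited in the sampling-plan discussion above.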
EXPERIMENTAL DESIGN AND ANOVA
In this section, we consider application and further development of the basic statistical methods already discussed to the problem of designing experiments to investigate particular effects, and to aid in the construction of models for use in process understanding and optimization. First, we should recognize that the hypothesis testing methods are precisely those needed to determine if a new treatment induces a significant effect in comparison to a process with a known distribution. These are known as one-sample tests. In this section, we next consider two-sample
Figure 4. Probability of Type II error (β) for a unit normal process with Type I error fixed at α = 0.05, as a function of sample size n (curves for n = 2, 3, 5, 10, 20, 30, and 40) and shift delta (equal to the normalized shift δ/σ) to be detected.
tests to compare two treatments in an effort to detect a treatment effect. We will then extend to the analysis of experiments in which many treatments are to be compared (k-sample tests), and we present the classic tool for studying the results --- ANOVA (6).
Comparison of Treatments: Two-Sample Tests
Consider an example where a new process B is to be compared against the process of record (POR), process A. In the simplest case, we have enough historical information on process A that we assume values for the yield mean μ_A and standard deviation σ. If we gather a sample of 10 wafers for the new process, we can perform a simple one-sample hypothesis test using the 10-wafer sampling distribution and methods already discussed, assuming that both process A and B share the same variance.
Consider now the situation where we want to compare two new processes. We will fabricate 10 wafers with process A and another 10 wafers with process B, and then we will measure the yield for each wafer after processing. In order to block against possible time trends, we alternate between process A and process B on a wafer by wafer basis. We are seeking to test the hypothesis that process B is better than process A, that is H1: μ_B > μ_A, as opposed to the null hypothesis H0: μ_B = μ_A. Several approaches can be used (5).
In the first approach, we assume that in each process A and B we are random sampling from an underlying normal distribution, with (1) an unknown mean for each process and (2) a constant known value for the population standard deviation σ. Then n_A = 10, n_B = 10, Var{ȳ_A} = σ²/n_A, and Var{ȳ_B} = σ²/n_B. We can then construct the sampling distribution for the difference in means as

    Var{ȳ_B − ȳ_A} = σ²/n_A + σ²/n_B = σ²(1/n_A + 1/n_B)    (31)

    s_{B−A} = σ·√(1/n_A + 1/n_B)    (32)
Even if the original process is moderately non-normal, the distribution of the difference in sample means will be approximately normal by the central limit theorem, so we can normalize as

    z_0 = [(ȳ_B − ȳ_A) − (μ_B − μ_A)] / [σ·√(1/n_A + 1/n_B)]    (33)

allowing us to examine the probability of observing the mean difference ȳ_B − ȳ_A based on the unit normal distribution, Pr(z > z_0).
The disadvantage of the above method is that it depends on knowing the population standard deviation σ. If such information is indeed available, using it will certainly improve the ability to detect a difference. In the second approach, we assume that our 10 wafer samples are again drawn by random sampling on an underlying normal population, but in this case we do not assume that we know a priori what the population variance is. In this case, we must also build an internal estimate of the variance. First, we estimate σ from the individual variances:

    s_A² = [1/(n_A − 1)] Σ_{i=1}^{n_A} (y_{Ai} − ȳ_A)²    (34)

and similarly for s_B². The pooled variance is then

    s² = [(n_A − 1)s_A² + (n_B − 1)s_B²] / (n_A + n_B − 2)    (35)

Once we have an estimate for the population variance, we can perform our t test using this pooled estimate:

    t_0 = [(ȳ_B − ȳ_A) − (μ_B − μ_A)] / [s·√(1/n_A + 1/n_B)]    (36)
One must be careful in assuming that process A and B share a common variance; in many cases this is not true and more sophisticated analysis is needed (5).
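Eqs. (34)-(36) translate directly into code. The sketch below uses only the standard library; the yield numbers are hypothetical, and under H0 the μ_B − μ_A term drops out:

```python
import math
import statistics

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance, Eqs. (34)-(36),
    under H0: mu_B = mu_A (assumes A and B share a common variance)."""
    na, nb = len(a), len(b)
    sa2, sb2 = statistics.variance(a), statistics.variance(b)
    s2 = ((na - 1) * sa2 + (nb - 1) * sb2) / (na + nb - 2)   # Eq. (35)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(s2 * (1/na + 1/nb))

# Hypothetical per-wafer yields (%) for processes A and B
t0 = pooled_t([88, 90, 87, 91, 89], [92, 94, 91, 95, 93])
```

The statistic t_0 would then be referred to a t distribution with n_A + n_B − 2 degrees of freedom to judge significance.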
Comparing Several Treatment Means via ANOVA
In many cases, we are interested in comparing several treatments simultaneously. We can generalize the approach discussed above to examine if the observed differences in treatment means are indeed significant, or could have occurred by chance (through random sampling of the same underlying population).
A picture helps explain what we are seeking to accomplish. As shown in Fig. 5, the population distribution for each treatment is shown; the mean can differ because the treatment is shifted, while we assume that the population variance is fixed. The sampling distribution for each treatment, on the other hand, may also be different if a different number of samples is drawn from each treatment (an “unbalanced” experiment), or because the treatment is in fact shifted. In most of the analyses that follow, we will assume balanced experiments; analysis of the unbalanced case can also be performed (7). It remains important to recognize that particular sample values from the sampling distributions are what we measure. In essence, we must compare the variance between two groups (a measure of the potential “shift” between treatments) with the variance within each group (a measure of the sampling variance). Only if the shift is “large” compared to the sampling
variance are we confident that a true effect is in place. An appropriate sample size must therefore be chosen using methods previously described such that the experiment is powerful enough to detect differences between the treatments that are of engineering importance. In the following, we discuss the basic methods used to analyze the results of such an experiment (8).
First, we need an estimate of the within-group variation. Here we again assume that each group is normally distributed and shares a common variance σ². Then we can form the “sum of squares” deviations within the tth group as

    SS_t = Σ_{j=1}^{n_t} (y_{tj} − ȳ_t)²    (37)

where n_t is the number of observations or samples in group t. The estimate of the sample variance for each group, also referred to as the “mean square,” is thus

    s_t² = SS_t/ν_t = SS_t/(n_t − 1)    (38)
where ν_t is the degrees of freedom in treatment t. We can generalize the pooling procedure used earlier to estimate a common shared variance across k treatments as

    s_R² = (ν_1 s_1² + ν_2 s_2² + … + ν_k s_k²)/(ν_1 + ν_2 + … + ν_k) = SS_R/(N − k) = SS_R/ν_R    (39)

where s_R² = SS_R/ν_R is defined as the within-treatment or within-group mean square and N is the total number of measurements, N = n_1 + n_2 + … + n_k, or simply N = nk if all treatments consist of the same number of samples, n.
In the second step, we want an estimate of between-group variation. We will ultimately be testing the hypothesis μ_1 = μ_2 = … = μ_k. The estimate of the between-group variance is

    s_T² = Σ_{t=1}^{k} n_t (ȳ_t − ȳ)² / ν_T = SS_T/ν_T    (40)

where s_T² = SS_T/ν_T is defined as the between-treatment mean square, ν_T = k − 1, and ȳ is the overall mean.
We are now in position to ask our key question: Are the treatments different? If they are indeed different, then the between-group variance will be larger than the within-group variance. If the treatments are the same, then the between-group variance should be the same as the within-group variance. If in fact the treatments are different, we thus find that

    s_T²  estimates  σ² + Σ_{t=1}^{k} n_t τ_t² / (k − 1)    (41)

where τ_t = μ_t − μ is treatment t’s effect. That is, s_T² is inflated by some factor related to the difference between treatments. We can perform a formal statistical test for treatment significance. Specifically, we should consider the evidence to be strong for the treatments being different if the ratio s_T²/s_R² is significantly larger than 1. Under our assumptions, this should be evaluated using the F distribution, since s_T²/s_R² ~ F_{k−1, N−k}.
We can also express the total variation (total deviation sum of squares SS_D from the grand mean ȳ) observed in the data as

    SS_D = Σ_{t=1}^{k} Σ_{i=1}^{n_t} (y_{ti} − ȳ)²,    s_D² = SS_D/ν_D = SS_D/(N − 1)    (42)

where s_D² is recognized as the variance in the data.
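The quantities of Eqs. (37)-(41) can be sketched compactly (standard library only; the three treatment groups below are hypothetical, balanced data):

```python
import statistics

def one_way_anova(groups):
    """Within/between mean squares and F ratio, Eqs. (37)-(41).
    `groups` is a list of lists, one list of observations per treatment."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    # within-treatment (residual) sum of squares, Eq. (37) pooled as in (39)
    ss_r = sum(sum((y - statistics.mean(g))**2 for y in g) for g in groups)
    # between-treatment sum of squares, Eq. (40)
    ss_t = sum(len(g) * (statistics.mean(g) - grand)**2 for g in groups)
    s2_r = ss_r / (N - k)   # within-treatment mean square
    s2_t = ss_t / (k - 1)   # between-treatment mean square
    return s2_t / s2_r, ss_t, ss_r

F, sst, ssr = one_way_anova([[10, 12, 11], [13, 15, 14], [10, 11, 12]])
```

The returned F ratio would be compared against F_{k−1, N−k} as described above; note also that SS_T + SS_R reproduces SS_D of Eq. (42) for these data.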
Analysis of variance results are usually expressed in tabular form, such as shown in Table 3. In addition to compactly summarizing the sum of squares due to various components and the degrees of freedom, the appropriate F ratio is shown. The last column of the table usually contains the probability of observing the stated F ratio under the null hypothesis. The alternative hypothesis --- that one or more
Figure 5. Pictorial representation of multiple treatment experimental analysis. The underlying population for each treatment may be different, while the variance for each treatment population is assumed to be constant. The dark circles indicate the sample values drawn from these populations (note that the number of samples may not be the same in each treatment).
treatments have a mean different than the others --- can be accepted with 100(1 − p)% confidence.
As summarized in Table 3, analysis of variance can also be pictured as a decomposition of the variance observed in the data. That is, we can express the total sum of squared deviations from the grand mean as SS_D = SS_T + SS_R, or the between-group sum of squares added to the within-group sum of squares. We have presented the results here around the grand mean. In some cases, an alternative presentation is used where a “total” sum of squares SS = y_1² + y_2² + … + y_N² is used. In this case the total sum of squares also includes variation due to the average, SS_A = Nȳ², or SS = SS_A + SS_D = SS_A + SS_T + SS_R. In Table 3, k is the number of treatments and N is the total number of observations. The F ratio value is s_T²/s_R², which is compared against the F ratio for the degrees of freedom ν_T and ν_R. The p value shown in such a table is Pr(F_{ν_T, ν_R} > s_T²/s_R²).
While often not explicitly stated, the above ANOVA assumes a mathematical model:

    y_ti = μ_t + ε_ti = μ + τ_t + ε_ti    (43)

where μ_t are the treatment means, and ε_ti are the residuals:

    ε_ti = y_ti − ŷ_ti ~ N(0, σ²)    (44)

where ŷ_ti = ȳ_t = μ̂ + τ̂_t is the estimated treatment mean. It is critical that one check the resulting ANOVA model. First, the residuals ε_ti should be plotted against the time order in which the experiments were performed in an attempt to distinguish any time trends. While it is possible to randomize against such trends, we lose resolving power if the trend is large. Second, one should examine the distribution of the residuals. This is to check the assumption that the residuals are “random” [that is, independent and identically distributed (IID) and normally distributed with zero mean] and look for gross non-normality. This check should also include an examination of the residuals for each treatment group. Third, one should plot the residuals versus the estimates and be especially alert to dependencies on the size of the estimate (e.g., proportional versus absolute errors). Finally, one should also plot the residuals against any other variables of interest, such as environmental factors that may have been recorded. If unusual behavior is noted in any of these steps, additional measures should be taken to stabilize the variance (e.g., by considering transformations of the variables or by reexamining the experiment for other factors that may need to be either blocked against or otherwise included in the experiment).
Two-Way Analysis of Variance
Suppose we are seeking to determine if various treatments are important in determining an output effect, but we must conduct our experiment in such a way that another variable (which may also impact the output) must also vary. For example, suppose we want to study two treatments A and B but must conduct the experiments on five different process tools (tools 1-5). In this case, we must carefully design the experiment to block against the influence of the process tool factor. We now have an assumed model:

    y_ti = μ + τ_t + β_i + ε_ti    (45)

where β_i are the block effects. The total sum of squares SS can now be decomposed as

    SS = SS_A + SS_B + SS_T + SS_R    (46)

with degrees of freedom

    bk = 1 + (b − 1) + (k − 1) + (b − 1)(k − 1)    (47)

where b is the number of blocking groups, k is the number of treatment groups, and SS_B = k Σ_{i=1}^{b} (ȳ_i − ȳ)². As before, if the blocks or treatments do in fact include any mean shifts, then the corresponding mean sum of squares (estimates of the corresponding variances) will again be inflated beyond the population variance (assuming the number of samples at each treatment is equal):

    s_B²  estimates  σ² + k Σ_{i=1}^{b} β_i² / (b − 1),    s_T²  estimates  σ² + Σ_{t=1}^{k} n_t τ_t² / (k − 1)    (48)
Table 3: Structure of the Analysis of Variance Table, for Single Factor (Treatment) Case

  Source of Variation         Sum of Squares       Degrees of Freedom         Mean Square   F ratio     Pr(F)
  Between treatments          SS_T                 ν_T = k − 1                s_T²          s_T²/s_R²   p
  Within treatments           SS_R                 ν_R = N − k                s_R²
  Total about the grand mean  SS_D = SS_T + SS_R   ν_D = ν_T + ν_R = N − 1    s_D²
where we assume that k samples are observed for each of the b blocks. So again, we can now test the significance of these potentially “inflated” variances against the pooled estimate of the variance s_R² with the appropriate F test as summarized in Table 4. Note that in Table 4 we have shown the ANOVA table in the non-mean-centered or “uncorrected” format, so that SS_A due to the average is shown. The table can be converted to the mean corrected form as in Table 3 by removing the SS_A row, and replacing the total SS with the mean corrected total SS_D with ν = N − 1 degrees of freedom.
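The decomposition of Eqs. (45)-(48) for a balanced layout can be sketched as follows (standard library only; the small data array is hypothetical, with treatments in rows and blocks in columns):

```python
def block_anova(y):
    """Sum-of-squares decomposition for one treatment factor (rows)
    with a blocking factor (columns), Eqs. (45)-(48); y[t][i]."""
    k, b = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (k * b)
    t_means = [sum(row) / b for row in y]
    b_means = [sum(y[t][i] for t in range(k)) / k for i in range(b)]
    ss_t = b * sum((m - grand)**2 for m in t_means)       # between treatments
    ss_b = k * sum((m - grand)**2 for m in b_means)       # between blocks
    ss_d = sum((y[t][i] - grand)**2 for t in range(k) for i in range(b))
    ss_r = ss_d - ss_t - ss_b      # residual, (b - 1)(k - 1) dof
    return ss_t, ss_b, ss_r

sst, ssb, ssr = block_anova([[10, 12], [14, 16]])
```

For this perfectly additive example the residual sum of squares is zero; with real data, s_T²/s_R² and s_B²/s_R² would be referred to the appropriate F distributions as in Table 4.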
Two-Way Factorial Designs
While the above is expressed with the terminology of the second factor being considered a “blocking” factor, precisely the same analysis pertains if two factors are simultaneously considered in the experiment. In this case, the blocking groups are the different levels of one factor, and the treatment groups are the levels of the other factor. The assumed analysis of variance above is with the simple additive model (that is, assuming that there are no interactions between the blocks and treatments, or between the two factors). In the blocked experiment, the intent of the blocking factor was to isolate a known (or suspected) source of “contamination” in the data, so that the precision of the experiment can be improved.
We can remove two of these assumptions in our experiment if we so desire. First, we can treat both variables as equally legitimate factors whose effects we wish to identify or explore. Second, we can explicitly design the experiment and perform the analysis to investigate interactions between the two factors. In this case, the model becomes

    y_tij = μ_ti + ε_tij    (49)

where μ_ti is the effect that depends on both factors simultaneously. The output can also be expressed as

    μ_ti = μ + τ_t + β_i + ω_ti    (50)

where μ is the overall grand mean, τ_t and β_i are the main effects, and ω_ti are the interaction effects; the corresponding sample decomposition is y = ȳ + (ȳ_t − ȳ) + (ȳ_i − ȳ) + (ȳ_ti − ȳ_t − ȳ_i + ȳ). In this case, the subscripts are t = 1, 2, …, k, where k is the number of levels of the first factor; i = 1, 2, …, b, where b is the number of levels of the second factor; and j = 1, 2, …, m, where m is the number of replicates at the t, i factor combination.
The resulting ANOVA table will be familiar; the one key addition is explicit consideration of the interaction sum of squares and mean square. The variance captured in this component can be compared to the within-group variance as before, and be used as a measure of significance for the interactions, as shown in Table 5.
Another metric often used to assess “goodness of fit” of a model is the R². The fundamental question answered by R² is how much better the model does than simply using the grand average:

    R² = SS_M/SS_D = (SS_T + SS_B + SS_I)/SS_D    (51)

where SS_D is the variance or sum of square deviations around the grand average, and SS_M is the variance captured by the model consisting of experimental factors (SS_T and SS_B) and interactions (SS_I). An R² value near zero indicates that most of the variance is explained by residuals (SS_R = SS_D − SS_M) rather than by the model terms (SS_M), while an R² value near 1 indicates that the model sum of square terms capture nearly all of the observed variation in
Table 4: Structure of the Analysis of Variance Table, for the Case of a Treatment with a Blocking Factor

  Source of Variation          Sum of Squares                     Degrees of Freedom     Mean Square   F ratio     Pr(F)
  Average (correction factor)  SS_A = Nȳ² = bkȳ²                  1
  Between blocks               SS_B = k Σ_{i=1}^{b} (ȳ_i − ȳ)²    ν_B = b − 1            s_B²          s_B²/s_R²   p_B
  Between treatments           SS_T = b Σ_{t=1}^{k} (ȳ_t − ȳ)²    ν_T = k − 1            s_T²          s_T²/s_R²   p_T
  Residuals                    SS_R                               ν_R = (b − 1)(k − 1)   s_R²
  Total                        SS                                 ν = N = bk             s_D²
the data. It is clear that a more sophisticated model with additional model terms will increase SS_M, and thus an “apparent” improvement in explanatory power may result from adding model terms. An alternative metric is the adjusted R², where a penalty is added for the use of degrees of freedom in the model:

    Adj. R² = 1 − (SS_R/ν_R)/(SS_D/ν_D) = 1 − s_R²/s_D² = 1 − (Mean Sq. of Residual)/(Mean Sq. of Total)    (52)

which is more easily interpreted as the fraction of the variance that is not explained by the residuals (s_R²/s_D²). An important issue, however, is that variance may appear to be explained when the model in fact does not “fit” the population. One should formally test for lack of fit (as described in the regression modeling section to follow) before reporting R², since R² is only a meaningful measure if there is no lack of fit.
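The sums of squares behind Eqs. (49)-(51) can be sketched as follows (standard library only; the data are hypothetical, with m = 2 replicates per cell, and the function name is our own):

```python
def two_factor_anova(y):
    """Two-factor sums of squares with interaction, Eqs. (49)-(51);
    y[t][i] is the list of m replicates at level t of factor 1, i of factor 2."""
    k, b, m = len(y), len(y[0]), len(y[0][0])
    allv = [v for row in y for cell in row for v in cell]
    grand = sum(allv) / len(allv)
    cell = [[sum(c) / m for c in row] for row in y]           # cell means
    t_m = [sum(row) / b for row in cell]                      # factor-1 means
    i_m = [sum(cell[t][i] for t in range(k)) / k for i in range(b)]
    ss_t = b * m * sum((v - grand)**2 for v in t_m)
    ss_b = k * m * sum((v - grand)**2 for v in i_m)
    ss_i = m * sum((cell[t][i] - t_m[t] - i_m[i] + grand)**2
                   for t in range(k) for i in range(b))
    ss_e = sum((v - cell[t][i])**2
               for t in range(k) for i in range(b) for v in y[t][i])
    ss_d = sum((v - grand)**2 for v in allv)
    r2 = (ss_t + ss_b + ss_i) / ss_d                          # Eq. (51)
    return ss_t, ss_b, ss_i, ss_e, r2

sst, ssb, ssi, sse, r2 = two_factor_anova([[[10, 12], [14, 16]],
                                           [[20, 22], [24, 26]]])
```

For these data the components add back to SS_D exactly, and the replicate scatter shows up only in SS_E, illustrating why R² of Eq. (51) stays below 1 even for a fully saturated factorial model.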
Several mnemonics within the factorial design of experiments methodology facilitate the rapid or manual estimation of main effects, as well as interaction effects (8). Elements of the methodology include (a) assignment of “high” (+) and “low” (−) values for the variables, (b) coding of the experimental combinations in terms of these high and low levels, (c) randomization of the experimental runs (as always!), and (d) estimation of effects by attention to the experimental design table combinations.
For example, one might perform a 2³ full factorial design [where the superscript indicates the number of factors (three in this case), and the base indicates the number of levels for each factor (two in this case)], where the factors are labeled A, B, and C. The unique eight combinations of these factors can be summarized as in Table 6, with the resulting measured results added during the course of the experiment. The main effect of a factor can be estimated by taking the difference between the average of the + levels for that factor and the average of the − levels for that factor --- for example, Effect_A = ȳ_{A+} − ȳ_{A−}, and similarly for the other main effects. Two-level interactions can be found in an analogous fashion, Interaction_AB = ½(ȳ_{A(B+)} − ȳ_{A(B−)}), where one takes the difference between one factor’s averages at the high and low values of the second factor. Simple methods are also available in the full factorial case for estimation of factor effect sampling variances, when replicate runs have been performed. In the simple case above where only a single run is performed at each experimental condition, there are no simple estimates of the underlying process or measurement variance, and so assessment of significance is not possible. If, however, one performs m_i replicates at the ith experimental condition, one can pool the individual estimates of variance s_i² at each of the experimental conditions to gain an overall variance estimate (8):

    s² = (ν_1 s_1² + ν_2 s_2² + … + ν_g s_g²)/(ν_1 + ν_2 + … + ν_g)    (53)

where ν_i = m_i − 1 are the degrees of freedom at condition i, and g is the total number of experimental conditions examined.
The sampling variance for an effect estimate can then becalculated; in our previous example we might perform tworuns at each of the eight experimental points, so that
and
(54)
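As a concrete, purely illustrative sketch (the response values below are hypothetical, chosen only to exercise the formulas), the main effect of Eq. (54)'s example and the pooled variance of Eq. (53) can be computed directly from a replicated 2³ design:

```python
# Sketch (illustrative, not from the article): main effects and their sampling
# variance from a replicated 2^3 full factorial, following Eqs. (53)-(54).
from itertools import product

def main_effect(design, responses, factor):
    """Effect = mean(y at factor +1) - mean(y at factor -1)."""
    hi = [y for row, ys in zip(design, responses) for y in ys if row[factor] == +1]
    lo = [y for row, ys in zip(design, responses) for y in ys if row[factor] == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

def pooled_variance(responses):
    """Eq. (53): pool per-condition variances weighted by degrees of freedom."""
    num = den = 0.0
    for ys in responses:
        m = len(ys)
        if m < 2:
            continue  # a single run contributes no variance information
        ybar = sum(ys) / m
        s2_i = sum((y - ybar) ** 2 for y in ys) / (m - 1)
        num += (m - 1) * s2_i
        den += m - 1
    return num / den

# Coded 2^3 design: all +/-1 combinations of factors A, B, C (as in Table 6).
design = list(product([-1, +1], repeat=3))
# Two hypothetical replicate measurements per experimental condition.
responses = [[10.1, 10.3], [12.0, 11.8], [10.4, 10.6], [12.2, 12.4],
             [13.9, 14.1], [16.0, 15.8], [14.2, 14.4], [16.1, 15.9]]

effect_A = main_effect(design, responses, factor=0)  # factor A is the first column
s2 = pooled_variance(responses)
# Eq. (54): with 2 runs at each of 8 points, Var{Effect} = s^2/8 + s^2/8 = s^2/4.
var_effect = s2 / 8 + s2 / 8
```

With these numbers, each condition's pair differs by 0.2, so the pooled s² is small relative to the factor A effect, mirroring the significance logic discussed in the text.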
Table 5: Structure of the Analysis of Variance Table, for Two-Factor Case with Interaction Between Factors

Source of Variation          Sum of Squares                  Degrees of Freedom     Mean Square   F ratio      Pr(F)
Between levels of factor 1   SST = b Σ_{t=1..k} (ȳ_t - ȳ)²   ν_T = k - 1            s_T²          s_T²/s_E²    p_T
Between levels of factor 2   SSB = k Σ_{i=1..b} (ȳ_i - ȳ)²   ν_B = b - 1            s_B²          s_B²/s_E²    p_B
Interaction                  SSI                             ν_I = (k - 1)(b - 1)   s_I²          s_I²/s_E²    p_I
Within groups (error)        SSE                             ν_E = bk(m - 1)        s_E²
Total (mean corrected)       SSD                             ν_D = bkm - 1          s_D²

In terms of these mean squares, the adjusted R² referred to earlier is computed as

Adj. R² = 1 - (SSR/ν_R)/(SSD/ν_D) = 1 - s_R²/s_D² = 1 - (Mean Square of Residual)/(Mean Square of Total)
These methods can be helpful for rapid estimation of experimental results and for building intuition about contrasts in experimental designs; however, statistical packages provide the added benefit of assisting not only in quantifying factor effects and interactions, but also in examining the significance of these effects and creating confidence intervals on estimates of factor effects and interactions. Qualitative tools for examining effects include Pareto charts to compare and identify the most important effects, and normal or half-normal probability plots to determine which effects are significant.
In the discussion so far, the experimental factor levels are treated as either nominal or continuous parameters. An important issue is the estimation of the effect of a particular factor, and determination of the significance of any observed effect. Such results are often pictured graphically in a succinct fashion, as illustrated in Fig. 6 for a two-factor experiment, where two levels for each of factor A and factor B are examined.

In the full factorial case, interactions can also be explored, and the effects plots modified to show the effect of each factor on the output parameter of concern (yield in this case), but at different levels of the other factor. Various cases may result; as shown in Fig. 7, no interaction may be observed, or a synergistic (or anti-synergistic) interaction may be present. These analyses are applicable, for example, when the factor levels are discrete or nominal decisions to be made; perhaps level (+) for factor A is to perform a clean step while (-) is to omit the step, and level (+) for factor B is to use one chemical in the step while level (-) is to use a different chemical.
Nested Variance Structures
While blocking factors may seem somewhat esoteric or indicative of an imprecise experiment, it is important to realize that blocking factors do in fact arise extremely frequently in semiconductor manufacturing. Indeed, most experiments or sets of data will actually be taken under situations where great care must be taken in the analysis of variance. In effect, such blocking factors arise due to nested variance structures in typical semiconductor manufacturing which restrict the full randomization of an experimental design (9). For example, if one samples die from multiple wafers, it must be recognized that those die reside within different wafers; thus the wafer is itself a blocking factor and must be accounted for.

For example, consider multiple measurements of oxide film thickness across a wafer following oxide deposition. One might expect that measurements from the same wafer would be more similar to each other than those across multiple wafers; this would correspond to the case where the within-wafer uniformity is better than the wafer-to-wafer uniformity. On the other hand, one might also find that the measurements from the corresponding sites on each wafer (e.g., near the lower left edge of the wafer) are more similar to each other than are the different sites across the same wafer; this would correspond to the case where wafer-to-
Table 6: Full Factorial 2³ Experimental Design (with Coded Factor Levels)

Experiment Condition Number   Factor A   Factor B   Factor C   Measured Result
1 - - -
2 - - +
3 - + -
4 - + +
5 + - -
6 + - +
7 + + -
8 + + +
Figure 6. Main effects plot for a two-factor, two-level experiment, showing yield versus the (-) and (+) levels of factor A and of factor B. The influence of Factor A on yield is larger than that of Factor B. Analysis of variance is required in order to determine if the observed results are significant (and not the result of chance variation).
Figure 7. Interaction plot, showing yield versus the levels of factor A at each level (B-, B+) of factor B. In case 1, no clear interaction is observed: Factor B does not appear to change the effect of factor A on the output. Rather, the effects from factor A and factor B appear to be additive. In case 2, the level of factor B does strongly influence the response of yield to the high level of factor A.
wafer repeatability is very good, but within-wafer uniformity may be poor. In order to model the important aspects of the process and take the correct improvement actions, it will be important to be able to distinguish between such cases and clearly identify where the components of variation are coming from.

In this section, we consider the situation where we believe that multiple site measurements "within" the wafer can be treated as independent and identical samples. This is almost never the case in reality, and the values of "within-wafer" variance that result are not true measures of wafer variance, but rather only of the variation across those (typically fixed or preprogrammed) sites measured. In our analysis, we are most concerned that the wafer is acting as a blocking factor, as shown in Fig. 8. That is, we first consider the case where we find that the five measurements we take on the wafer are relatively similar, but the wafer-to-wafer average of these values varies dramatically. This two-level variance structure can be described as (9):
y_ij = μ + W_i + M_j(i)
W_i ~ N(0, σ_W²) for i = 1, …, n_W                                  (55)
M_j(i) ~ N(0, σ_M²) for j = 1, …, n_M

where n_M is the number of measurements taken on each of n_W wafers, and W_i and M_j(i) are independent normal random variables drawn from the distributions of wafer-to-wafer variations and of measurements taken within the ith wafer, respectively. In this case, the total variation in oxide thickness is composed of both within-wafer variance σ_M² and wafer-to-wafer variance σ_W²:

σ_T² = σ_W² + σ_M²                                                  (56)

Note, however, that in many cases (e.g., for control charting), one is not interested in the total range of all individual measurements; rather, one may desire to understand how the set of wafer means itself varies. That is, one is seeking to estimate

σ_W̄² = σ_W² + σ_M²/n_M                                             (57)

where W̄ indicates averages over the measurements within any one wafer.

Substantial care must be taken in estimating these variances; in particular, one can directly estimate the measurement variance σ_M² and the wafer average variance σ_W̄², but must infer the wafer-level variance using

σ_W² = σ_W̄² - σ_M²/n_M                                             (58)

The within-wafer variance is most clearly understood as the average over the n_W available wafers of the variance s_i² within each of those wafers:

s_M² = (1/n_W) Σ_{i=1..n_W} s_i² = (1/n_W) Σ_{i=1..n_W} Σ_{j=1..n_M} (Y_ij - Ȳ_i•)² / (n_M - 1)          (59)

where Ȳ_i• indicates an average over the j index (that is, a within-wafer average):

Ȳ_i• = (1/n_M) Σ_{j=1..n_M} Y_ij                                    (60)

The overall variance in wafer averages can be estimated simply as

s_W̄² = [1/(n_W - 1)] Σ_{i=1..n_W} (Ȳ_i• - Ȳ••)²                    (61)

where Ȳ•• is the grand mean over all measurements:

Ȳ•• = [1/(n_W n_M)] Σ_{i=1..n_W} Σ_{j=1..n_M} Y_ij                  (62)

Thus, the wafer-level variance s_W² can finally be estimated as

s_W² = s_W̄² - s_M²/n_M                                             (63)

The same approach can be used for more deeply nested variance structures (9,10). For example, a common structure occurring in semiconductor manufacturing is measurements within wafers, and wafers within lots. Confidence limits can also be established for these estimates of variance (11). The computation of such estimates becomes substantially complicated, however (especially if the data are unbalanced and have different numbers of measurements per sample at each nested level), and statistical software packages are the best option.

Figure 8. Nested variance structure. Oxide thickness variation (thickness in Å, plotted against wafer number) consists of both within-wafer and wafer-to-wafer components.
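A minimal sketch of the estimation procedure of Eqs. (59)-(63), assuming a balanced sample (the wafer data below are hypothetical oxide thicknesses, not values from the article):

```python
# Sketch (illustrative): estimating the nested variance components of
# Eqs. (59)-(63) from n_W wafers with n_M sites measured on each wafer.

def nested_variance_components(wafers):
    """wafers: list of per-wafer measurement lists, each of equal length n_M.
    Returns (s2_M, s2_Wbar, s2_W): within-wafer variance, variance of wafer
    averages, and the inferred wafer-level variance."""
    n_W = len(wafers)
    n_M = len(wafers[0])
    means = [sum(w) / n_M for w in wafers]                        # Eq. (60)
    # Eq. (59): average the per-wafer sample variances.
    s2_M = sum(sum((y - m) ** 2 for y in w) / (n_M - 1)
               for w, m in zip(wafers, means)) / n_W
    grand = sum(means) / n_W                                      # Eq. (62)
    s2_Wbar = sum((m - grand) ** 2 for m in means) / (n_W - 1)    # Eq. (61)
    s2_W = s2_Wbar - s2_M / n_M                                   # Eq. (63)
    return s2_M, s2_Wbar, s2_W

# Hypothetical oxide thicknesses (Angstroms), five sites on each of three wafers.
wafers = [[1010, 1012, 1014, 1012, 1012],
          [1030, 1032, 1034, 1032, 1032],
          [1050, 1052, 1054, 1052, 1052]]
s2_M, s2_Wbar, s2_W = nested_variance_components(wafers)
```

Here the wafer-to-wafer component dominates the within-wafer component, the situation sketched in Fig. 8.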
Several assumptions are made in the analysis of variance components for the nested structures above. Perhaps the most important is an assumption of random sampling within each level of nesting. For example, we assume that each measurement (within each wafer) is IID and a random sample from within the wafer. If the same measurement points are taken on each wafer, however, one is not in fact truly estimating the within-wafer variation, but rather the fixed-effect variance between these measurement points. Systematic spatial patterns, e.g., "bowl," "donut," or slanted-plane patterns across the wafer, often occur due to gas flow, pressure, or other equipment nonuniformities. For this reason, it is common practice to use a spatially consistent five-point (or 21-point or 49-point) sampling scheme when making measurements within a wafer. An option which adds complexity but also adds precision is to model each of these sites separately (e.g., maintain left, right, top, bottom, and center points) and consider how these compare with other points within the wafer, as well as from wafer to wafer. Great care is required in such site modeling approaches, however, because one must account for the respective variances at multiple levels appropriately in order to avoid biased estimates (12-14).
Experimental designs that include nested variance sampling plans are also sometimes referred to as split-plot designs, in which a factorial design in fact has restrictions on randomization (7,15). Among the most common restrictions are those due to spatial factors, and spatial modeling likewise requires great care (16). Other constraints of the "real" world, such as hardware factors, may make complete randomization infeasible due to the time and cost of installing/removing hardware (e.g., in studying alternative gas distribution plates in a plasma reactor). Methods exist for handling such constraints (split-plot analysis), but the analysis cannot be done if the experimental sampling plan does not follow an appropriate split-plot design. Using split-plot analyses, however, we can resolve components of variation (in the case of sites within wafers) into residual, site, wafer, and wafer-site interactions, as well as the effects of the treatment under consideration. For these reasons, it can be expected that nested variance structures and split-plot designs will receive even greater future attention and application in semiconductor manufacturing.
Progression of Experimental Designs
It is worth considering when and how various experimental design approaches might best be used. When confronted with a new problem which lacks thorough preexisting knowledge, the first step should be screening experiments which seek to identify what the important variables are. At this stage, only crude predictions of experimental effects as discussed above are needed, but often a large number of candidate factors (often six or more) are of potential interest. By sacrificing accuracy and certainty in interpretation of the results (primarily by allowing interactions to confound with other interactions or even with main effects), one can often gain a great deal of initial knowledge at reasonable cost. In these cases, fractional factorial, Plackett-Burman, and other designs may be used.
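For instance, a half-fraction of the 2³ design can be generated by deliberately aliasing one factor with an interaction; the generator C = AB below is a standard textbook choice for a 2^(3-1) design (a sketch, not a procedure prescribed by the article):

```python
# Sketch (illustrative): a 2^(3-1) fractional factorial built by aliasing
# factor C with the AB interaction (generator C = AB). This halves the run
# count at the cost of confounding the C main effect with the AB interaction.
from itertools import product

runs = [(a, b, a * b) for a, b in product([-1, +1], repeat=2)]
# Four runs instead of eight; in every run the coded level of C equals A*B,
# so their effect estimates cannot be separated.
```

This confounding is exactly the "sacrifice of certainty in interpretation" described above: any estimated C effect could equally be an AB interaction.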
Once a smaller number of effects have been identified, full factorial or fractional factorial designs are often utilized, together with simplified linear model construction and analysis of variance. In such cases, the sampling plan must again be carefully considered in order to ensure that sufficient data is taken to draw valid conclusions. It is often possible to test for model lack of fit, which may indicate that more thorough experiments are needed or that additional experimental design points should be added to the existing experimental data (e.g., to complete half fractions). The third phase is then undertaken, which involves experimental design with small numbers of factors (e.g., two to six) to support linear effects, interactions, and second-order (quadratic) model terms. These regression models will be considered in the next section.
A variety of sophisticated experimental design methods are available and applicable to particular problems. In addition to factorial and "optimal" design methods (8,10), robust design approaches (as popularized by Taguchi) are helpful, particularly when the goal is to aid in the optimization of the process (17,18).
RESPONSE SURFACE METHODS
In the previous section, we considered the analysis of variance, first in the case of single treatments, and then in the case when blocking factors must also be considered. These were generalized to consideration of two-factor experiments, where the interaction between these factors can also be considered. If the factors can take on continuous values, the above analysis is still applicable. However, in these cases, it is often more convenient and useful to consider or seek to model (within the range of factor levels considered) an entire response surface for the parameter of interest. Specifically, we wish to move from our factorial experiments with an assumed model of the form

y_ij = μ + A_i + B_j + ε_ij                                         (64)

where we can only predict results at discrete prescribed i, j levels of factors A and B, toward a new model of the process of the form

y = μ + β₁x^(1) + β₂x^(2) + ε                                       (65)

where each x^(j) ∈ [x_min^(j), x_max^(j)] ⊂ ℝ is a particular factor of interest.

In this section, we briefly summarize the methods for estimation of the factor response coefficients β_j, as well as
for analysis of the significance of such effects based on experimental design data. We begin with a simple one-parameter model and build complexity and capability from there.
Single Variable Least-squares Regression
The standard approach used here is least-squares regression to estimate the coefficients in regression models. In the simple one-parameter case considered here, our actual response is modeled as

y_i = βx_i + ε_i                                                    (66)

where y_i indicates the ith measurement, taken at value x_i of the explanatory variable x. The estimate for the output is thus simply ŷ_i = β̂x_i = bx_i, where we expect some residual error ε_i. Least-squares regression finds the best fit of our model to the data, where "best" is that b which minimizes the sum of squared errors between the prediction and the n observed data values:

SS_min = SSR = Σ_{i=1..n} (y_i - ŷ_i)²                              (67)

It can easily be shown that for linear problems, a direct solution for b which gives SS_min is possible and will occur when the vector of residuals ε_i is normal to the vector of x_i values:

b = (Σ x_i y_i) / (Σ x_i²)                                          (68)

An important issue is the estimate of the experimental error. If we assume that the model structure is adequate, we can form an estimate s² of σ² simply as

s² = SSR / (n - 1)                                                  (69)

One may also be interested in the precision of the estimate b; that is, the variance in b:

Var{b} = s² / Σ_i x_i²                                              (70)

assuming that the residuals are independent and identically normally distributed with mean zero. More commonly, one refers to the standard error s.e.(b) = sqrt(Var{b}), and writes b ± s.e.(b). In a similar fashion, a confidence interval for our estimate of β can be defined by noting that the standardized value for b should be t-distributed:

t = (b - β′) / s.e.(b)                                              (71)

where β′ is the true value for β, so that

β = b ± [t_{α/2} · s.e.(b)]                                         (72)
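These formulas are simple enough to compute directly; the following sketch (with hypothetical data, not from the article) implements Eqs. (66)-(70) for the no-intercept model:

```python
# Sketch (illustrative): no-intercept least squares per Eqs. (66)-(70):
# b = sum(x*y)/sum(x^2), s^2 = SSR/(n-1), Var{b} = s^2/sum(x^2).
import math

def fit_through_origin(x, y):
    n = len(x)
    sxx = sum(xi ** 2 for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sxx                   # Eq. (68)
    ssr = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))            # Eq. (67)
    s2 = ssr / (n - 1)                                               # Eq. (69)
    se_b = math.sqrt(s2 / sxx)                                       # Eq. (70)
    return b, s2, se_b

# Hypothetical data lying near y = 2x, perturbed at the last point.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.3]
b, s2, se_b = fit_through_origin(x, y)
# A confidence interval would then be b +/- t_{alpha/2} * se_b, per Eq. (72).
```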
Regression results should also be framed in an analysis of variance framework. In the simple one-factor case, a simple ANOVA table might be as shown in Table 7. In this case, SSM is the sum of squared values of the estimates, and s_M² is an estimate of the variance "explained" by the model, where our model is purely linear (no intercept term) as given in Eq. (66). In order to test significance, we must compare the ratio of this value to the residual variance s_R² using the appropriate F test. In the case of a single variable, we note that the F test degenerates into the t test, F_{1,n} = t_n², and a t test can be used to evaluate the significance of the model coefficient.

In the analysis above, we have assumed that the values for x_i have been selected at random and are thus unlikely to be replicated. In many cases, it may be possible to repeat the experiment at particular values, and doing so gives us the opportunity to decompose the residual error into two contributions. The residual sum of squares SSR can be broken into a component SSL due to "lack of fit" and a component SSE due to "pure error" or "replication error":

SSR = SSL + SSE                                                     (73)

This enables a further test for "lack of fit" in our model by comparing the ratio of the corresponding variances: that is, we compare s_L²/s_E² with F_{ν_L,ν_E}, where ν_E = m - 1 is the degrees of freedom corresponding to the pure error for m replicates, and ν_L = ν_R - ν_E is the degrees of freedom corresponding to the lack-of-fit variance estimate. It is highly recommended that at least some points (if not the entire experiment) be replicated, so that the lack of fit and pure error can be assessed. Similarly, careful residual analysis must be performed to look for any signs of non-normality, or dependencies on time order or other factors, as previously discussed. Without lack-of-fit and residual analysis, some question will remain as to the validity of the model.

These tests can also be summarized as part of the ANOVA table, as shown in Table 8. In this case, we assume a true response of the form y = β₀ + β₁x, which we estimate or fit with a two-parameter model y = b₀ + b₁x + ε, or ŷ = b₀ + b₁x, to also capture the mean (or intercept β₀). We assume that we have made n total measurements, of which m are replicates. In this table, one should first check for lack of fit. If no evidence of lack of fit exists, then there is no reason to reject the assumed model structure, and one can assess the significance of the overall model or individual model coefficients. Note that the test of significance for s_M² compared to s_E², and the probability p_M of observing the corresponding ratio, is then equivalent to testing whether the adjusted R² = 0. If evidence of lack of fit does indeed exist, however, then one must seek alternative model forms, either through transformations of the data or by seeking a higher-order (e.g., polynomial) model structure. As always, one should also examine the residuals.
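The decomposition of Eq. (73) can be sketched as follows for a replicated single-factor experiment; the data and the fitted line y = 1 + 2x are hypothetical, chosen only to illustrate the bookkeeping:

```python
# Sketch (illustrative): splitting the residual sum of squares into lack-of-fit
# and pure-error components per Eq. (73), for a fitted line y_hat = b0 + b1*x.

def lack_of_fit(points, predict):
    """points: dict mapping each distinct x value to its list of replicate y's.
    Returns (SSL, SSE, df_L, df_E), where SSR = SSL + SSE."""
    sse = ssr = 0.0
    n = n_distinct = 0
    for xv, ys in points.items():
        ybar = sum(ys) / len(ys)
        sse += sum((y - ybar) ** 2 for y in ys)          # pure (replication) error
        ssr += sum((y - predict(xv)) ** 2 for y in ys)   # total residual
        n += len(ys)
        n_distinct += 1
    ssl = ssr - sse                    # lack of fit, by Eq. (73)
    df_e = n - n_distinct              # pure-error degrees of freedom
    df_l = n_distinct - 2              # distinct x's minus the 2 fitted coefficients
    return ssl, sse, df_l, df_e

# Hypothetical replicated data and a hypothetical fitted line y = 1 + 2x.
points = {0.0: [1.1, 0.9], 1.0: [3.2, 2.8], 2.0: [5.1, 4.9], 3.0: [7.3, 6.9]}
ssl, sse, df_l, df_e = lack_of_fit(points, lambda x: 1 + 2 * x)
F = (ssl / df_l) / (sse / df_e)   # compare against the F distribution F_{df_l, df_e}
```

A small F ratio, as here, gives no evidence of lack of fit, so the linear model structure would not be rejected.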
Just as one can assess significance and formulate confidence intervals in the single-coefficient case, so too can one find interval estimates for multiple model coefficients. Typically, statistical packages can be utilized to assist in the formulation of such estimates, but care must be taken to understand the above framework in order to correctly interpret the output of such packages.
Response Surface Modeling - Experimental Designs
The first part of this section focused on regression modeling and analysis for single-factor experiments, concluding with polynomial models of the response. In many cases, one is interested in modeling the response as a function of multiple factors, with linear or quadratic models. Here we briefly review aspects of popular experimental designs, and we interpret the results of analysis of variance in this multiple-factor context. While a number of experimental designs with different properties (and indeed an entire arena of design methods that are "optimal" in various senses) exist, two of the more popular designs will be summarized here. The central composite design, as pictured in the two-factor case in Fig. 9, is especially useful as a complement or addition to existing factorial design data. In this case, the addition of center points and axial points completes the central composite, and it supports quadratic modeling of the responses (if found to be necessary). This two-stage approach can potentially save substantial experimental resources: if the original factorial also includes replicated center points, then one can determine if the response is indeed linear and potentially avoid the need for the rest of the central composite experiment.
A second popular option is the Box-Behnken design, as illustrated for the two-factor case in Fig. 9. In this case, the center point is complemented by experimental points at the midpoint of each segment of the n-dimensional "bounding box" around that center point. Alternatively, this can be viewed as the center point augmented by the aggregate of n full factorial designs in n - 1 experimental factors while holding each remaining factor at its center value. The Box-Behnken design is generally used when the expense or time of the experiment is influenced by the number of levels, because the Box-Behnken only requires three different levels
Table 7: Structure of the Analysis of Variance Table, for Single-Factor Response Surface Regression

Source of Variation   Sum of Squares   Degrees of Freedom                         Mean Square   F ratio     Pr(F)
Model                 SSM              ν_M = 1 (number of model coefficients)*    s_M²          s_M²/s_R²   p
Residual              SSR              ν_R = n - ν_M                              s_R²
Total                 SS               ν = n                                      s_T²

* The degrees of freedom in the model are shown for the case when only one model coefficient is used (strictly linear response).

Table 8: Structure of the analysis of variance table for a single factor response surface regression, in the case of replication of experimental design points.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square       F ratio     Pr(F)
Model                 SSM              ν_M = 2              s_M² = SSM/ν_M    s_M²/s_E²   p_M
  b₀                  SS0              ν₀ = 1               s₀² = SS0         s₀²/s_E²    p₀
  b₁                  SS1              ν₁ = 1               s₁² = SS1         s₁²/s_E²    p₁
Residual              SSR              ν_R = n - ν_M        s_R²
  lack of fit         SSL              ν_L = ν_R - ν_E      s_L²              s_L²/s_E²   Pr(lack of fit)
  pure error          SSE              ν_E = m              s_E²
Total                 SS               ν = n                s²
for each factor, while the central composite design requires five. In both of these designs, it should again be emphasized that replicates at one or more experimental points (typically the center point) are highly recommended so that lack of fit can be assessed, and so that a measure of pure or experimental error can be established. In addition, if the center point replicates are spaced in time throughout the experiment, they provide a means to check for unwanted time dependencies in the experiment.
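The central composite geometry described above is easy to enumerate; the sketch below lists the design points for two factors (the axial distance alpha and the number of center replicates are illustrative choices, not values from the article):

```python
# Sketch (illustrative): design points of a central composite design as in
# Fig. 9(b): 2^k factorial corners, axial points at distance alpha on each
# factor axis, and replicated center points.
from itertools import product

def central_composite(n_factors=2, alpha=1.414, n_center=3):
    corners = list(product([-1.0, +1.0], repeat=n_factors))   # factorial portion
    axial = []
    for i in range(n_factors):
        for sign in (-alpha, +alpha):
            pt = [0.0] * n_factors
            pt[i] = sign                                      # one axis at +/- alpha
            axial.append(tuple(pt))
    center = [(0.0,) * n_factors] * n_center                  # replicated center
    return corners + axial + center

design = central_composite()
# For two factors: 4 corners + 4 axial + 3 center replicates = 11 runs,
# with each factor appearing at five distinct levels (-alpha, -1, 0, +1, +alpha).
```

The five distinct levels per factor visible in the enumeration are exactly why the central composite is costlier per factor level than the three-level Box-Behnken.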
The response surface models for each case are found using a least-squares fit to a specified model structure (typically quadratic or polynomial) as previously discussed. An analysis-of-variance examination is required to check for model lack of fit, examine factor and model coefficient significance, and establish confidence intervals on model coefficients. Careful examination of residuals is crucial to ensure the validity of the modeling assumptions -- namely, that the residuals are IID and normally distributed.
In the case of multiple model coefficients, one often desires the most parsimonious or simple model possible. Analysis of variance can indicate those coefficients which appear to be insignificant. In stepwise regression, model coefficients are dropped or added one at a time, and the reduction (or improvement) in the model is evaluated until some stopping criterion is met.
CATEGORICAL MODELING
Sometimes the measurement of the result is discrete. One example is an engineer's evaluation of a photograph, usually from a scanning electron microscope (SEM). The photograph might be a top-down image of a metal or polysilicon patterned line. The engineer decides whether the line is too rough, a little rough, or smooth. While many of these evaluations are being replaced with automatic defect detection and classification tools which provide continuous numbers, especially in production, early development still relies heavily on manual inspection and evaluation. The engineer would like to perform an experiment whereby he or she can determine what value of the bake temperature will give the best chance of smooth lines. In addition, he or she would like to predict how often the process will produce rough lines. Another example is the profile or sidewall slope of a line. Except perhaps for atomic force microscopy (AFM), no measurement method exists to obtain a line profile quickly and easily. However, a cross-sectional SEM can be used to roughly estimate the profile -- that is, to note if the slope in degrees is >88, 85-88, or <85. No definite constant scale exists, but a relative scale does (i.e., >88 is bigger than 85-88, which is bigger than <85). The engineer would like to know if performing a clean (and the length of the clean step) will result in a sharper profile (>88). Ranking by "experts" as the measured response is common in other industries (e.g., to assess the quality of food), and such approaches may also be applicable in semiconductor manufacturing. Categorical methods are statistical methods aimed at these kinds of questions (19). While the mathematics are too complicated to introduce here, many statistical packages (such as SAS, JMP, and Statgraphics) provide these methods, and they can be applied to practical problems. The methods can be shown to be similar in nature to fuzzy logic (20).
SUMMARY
In this article, we have focused on the fundamental issues in modeling important statistical elements of semiconductor manufacturing. In many cases, we have only begun to touch on the issues of statistical distribution modeling, hypothesis testing, experimental design and analysis of variance, and response surface modeling. The intent here has been to assist in the proper interpretation of results that are now readily available by way of statistical software packages; further consultation with the statistical modeling literature and with statisticians is highly recommended for those seeking to get the most value out of experimental resources and data. An excellent source for further reading is the collection of case studies of statistical methods applied to semiconductor manufacturing contained in (21).
BIBLIOGRAPHY
1. C. J. Spanos, Statistical process control in semiconductor manufacturing, Proc. IEEE, 80 (6): 819-830, 1992.
2. J. B. Keats and D. C. Montgomery (eds.), Statistical Applications in Process Control, New York: Marcel Dekker, 1996.
3. A. Madansky, Prescriptions for Working Statisticians, Berlin: Springer-Verlag, 1988.
4. D. M. H. Walker, Yield Simulation for Integrated Circuits, Norwell, MA: Kluwer, 1987.
5. D. C. Montgomery, Introduction to Statistical Quality Control, New York: Wiley, 1985.
Figure 9. Experimental designs often used in response surface modeling. (a) The two-level factorial design (augmented with center points) can be extended into (b) the central composite design by adding axial design points. (c) The Box-Behnken design features design points at midpoints of bounding box segments around the center point.
6. R. G. Miller, Jr., Beyond ANOVA: Basics of Applied Statistics, New York: Chapman & Hall, 1997.
7. G. A. Milliken and D. E. Johnson, Analysis of Messy Data, Volume I: Designed Experiments, New York: Chapman & Hall, 1992.
8. G. E. P. Box, W. G. Hunter, and J. S. Hunter, Statistics for Experimenters, New York: Wiley, 1978.
9. D. Drain, Statistical Methods for Industrial Process Control, New York: Chapman & Hall, 1997.
10. D. Drain, Handbook of Experimental Methods for Process Improvement, New York: Chapman & Hall, 1997.
11. R. K. Burdick and F. A. Graybill, Confidence Intervals on Variance Components, New York: Marcel Dekker, 1992.
12. T. H. Smith et al., Bias and variance in single and multiple response surface modeling, Third Int. Workshop Stat. Met.: 60-63, Honolulu, HI, 1998.
13. R. Guo and E. Sachs, Modeling, optimization, and con-trol of spatial uniformity in manufacturing processes, IEEE Trans. Semi. Manuf., 6 (1): 41-57, 1993.
14. P. K. Mozumder and L. M. Lowenstein, Method for semiconductor process optimization using functional representations of spatial variations and selectivity, IEEE Trans. Comp. Hybrids, Manuf. Tech., 15 (3): 311-316, 1992.
15. R. L. Mason, R. F. Gunst, and J. L. Hess, Statistical Design and Analysis of Experiments with Applications to Engineering and Science, New York: Wiley, 1989.
16. B. D. Ripley, Spatial Statistics, New York: Wiley, 1981.
17. P. J. Ross, Taguchi Techniques for Quality Engineering, 2nd ed., New York: McGraw-Hill, 1996.
18. M. S. Phadke, Quality Engineering Using Robust Design, Englewood Cliffs, NJ: Prentice Hall, 1989.
19. A. Agresti, Categorical Data Analysis, New York: Wiley, 1990.
20. C. Spanos and R. Chen, Using Qualitative Observations for Process Tuning and Control, IEEE Trans. Semi. Manuf., 10 (2): 307-316, 1997.
21. V. Czitrom and P. D. Spagon (eds.), Statistical Case Studies for Industrial Process Improvement, Philadelphia: ASA-SIAM, 1997.