
Probability Distributions: A Basic Reference

PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Tue, 18 Sep 2012 20:05:06 UTC

Contents

Probability distribution

Discrete distributions: Bernoulli distribution, Binomial distribution, Uniform distribution (discrete), Poisson distribution, Beta-binomial distribution, Negative binomial distribution, Geometric distribution, Multinomial distribution, Categorical distribution, Dirichlet distribution

Continuous distributions on [a, b]: Uniform distribution (continuous), Beta distribution

Continuous distributions on (-Inf, Inf): Normal distribution, Student's t-distribution

Continuous distributions on [0, Inf): Gamma distribution, Pareto distribution, Inverse-gamma distribution, Chi-squared distribution, F-distribution, Log-normal distribution, Exponential distribution

Multivariate continuous distributions: Multivariate normal distribution, Wishart distribution

References: Article Sources and Contributors; Image Sources, Licenses and Contributors

Article Licenses: License

Probability distribution


In probability and statistics, a probability distribution assigns a probability to each of the possible outcomes of a random experiment. Examples are found in experiments whose sample space is non-numerical, where the distribution would be a categorical distribution; experiments whose sample space is encoded by discrete random variables, where the distribution is given by a probability mass function; and experiments with sample spaces encoded by continuous random variables, where the distribution is given by a probability density function. More complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures.

In applied probability, a probability distribution can be specified in a number of different ways, often chosen for mathematical convenience:

by supplying a valid probability mass function or probability density function
by supplying a valid cumulative distribution function or survival function
by supplying a valid hazard function
by supplying a valid characteristic function
by supplying a rule for constructing a new random variable from other random variables whose joint probability distribution is known.

Important and commonly encountered probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution.

Introduction

To define probability distributions for the simplest cases, one needs to distinguish between discrete and continuous random variables. In the discrete case, one can easily assign a probability to each possible value: when throwing a die, each of the six values 1 to 6 has the probability 1/6. In contrast, when a random variable takes values from a continuum, probabilities are nonzero only if they refer to finite intervals: in quality control one might demand that the probability of a "500 g" package containing between 490 g and 510 g should be no less than 98%.

Figure: Discrete probability distribution for the sum of two dice.


If the random variable is real-valued (or more generally, if a total order is defined for its possible values), the cumulative distribution function gives the probability that the random variable is no larger than a given value; in the real-valued case it is the integral of the density.

Terminology

Figure: Normal distribution, also called Gaussian or "bell curve", the most important continuous probability distribution.

As probability theory is used in quite diverse applications, terminology is not uniform and sometimes confusing. The following terms are used for non-cumulative probability distribution functions:

Probability mass, probability mass function, p.m.f.: for discrete random variables.
Categorical distribution: for discrete random variables with a finite set of values.
Probability density, probability density function, p.d.f.: most often reserved for continuous random variables.

The following terms are somewhat ambiguous as they can refer to non-cumulative or cumulative distributions, depending on authors' preferences:

Probability distribution function: continuous or discrete, non-cumulative or cumulative.
Probability function: even more ambiguous; can mean any of the above, or anything else.

Finally,

Probability distribution: either the same as probability distribution function, or understood as something more fundamental underlying an actual mass or density function.

Basic terms

Mode: the most frequently occurring value in a distribution.
Tail: the region of least frequently occurring values in a distribution.
Support: the smallest closed interval/set whose complement has probability zero. It may be understood as the points or elements that are actual members of the distribution.

Discrete probability distribution

A discrete probability distribution shall be understood as a probability distribution characterized by a probability mass function. Thus, the distribution of a random variable X is discrete, and X is then called a discrete random variable, if

Σ_u Pr(X = u) = 1

as u runs through the set of all possible values of X. It follows that such a random variable can assume only a finite or countably infinite number of values.

In cases more frequently considered, this set of possible values is a topologically discrete set in the sense that all its points are isolated points. But there are discrete random variables for which this countable set is dense on the real line (for example, a distribution over rational numbers).

Figure: The probability mass function of a discrete probability distribution. The probabilities of the singletons {1}, {3}, and {7} are respectively 0.2, 0.5, and 0.3. A set not containing any of these points has probability zero.
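The pmf in the figure can be written down directly; the short sketch below (plain Python, with the values taken from the caption) checks that the masses sum to one and evaluates the probability of a few sets.

# Discrete pmf from the figure: mass 0.2, 0.5 and 0.3 at the points 1, 3 and 7.
pmf = {1: 0.2, 3: 0.5, 7: 0.3}

def prob(event, pmf):
    # Probability of a set of outcomes under a discrete pmf.
    return sum(p for x, p in pmf.items() if x in event)

assert abs(sum(pmf.values()) - 1.0) < 1e-12   # a valid pmf sums to 1
print(prob({1, 3}, pmf))     # 0.7
print(prob({2, 4, 6}, pmf))  # 0.0: a set containing none of the support points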


Among the most well-known discrete probability distributions that are used for statistical modeling are the Poisson distribution, the Bernoulli distribution, the binomial distribution, the geometric distribution, and the negative binomial distribution. In addition, the discrete uniform distribution is commonly used in computer programs that make equal-probability random selections between a number of choices.

Figure: The cdf of a discrete probability distribution, of a continuous probability distribution, and of a distribution which has both a continuous part and a discrete part.

Cumulative density

Equivalently to the above, a discrete random variable can be defined as a random variable whose cumulative distribution function (cdf) increases only by jump discontinuities; that is, its cdf increases only where it "jumps" to a higher value, and is constant between those jumps. The points where jumps occur are precisely the values which the random variable may take. The number of such jumps may be finite or countably infinite. The set of locations of such jumps need not be topologically discrete; for example, the cdf might jump at each rational number.

Delta-function representation

Consequently, a discrete probability distribution is often represented as a generalized probability density function involving Dirac delta functions, which substantially unifies the treatment of continuous and discrete distributions. This is especially useful when dealing with probability distributions involving both a continuous and a discrete part.

Indicator-function representation

For a discrete random variable X, let u0, u1, ... be the values it can take with non-zero probability. Denote

Ωi = X⁻¹(ui) = {ω : X(ω) = ui},  i = 0, 1, 2, ...

These are disjoint sets, and by the formula above

Pr(∪i Ωi) = Σi Pr(Ωi) = Σi Pr(X = ui) = 1.

It follows that the probability that X takes any value except for u0, u1, ... is zero, and thus one can write X as

X(ω) = Σi ui 1_{Ωi}(ω)

except on a set of probability zero, where 1_A is the indicator function of A. This may serve as an alternative definition of discrete random variables.

Continuous probability distribution

A continuous probability distribution is a probability distribution that has a probability density function. Mathematicians also call such a distribution absolutely continuous, since its cumulative distribution function is absolutely continuous with respect to the Lebesgue measure. If the distribution of X is continuous, then X is called a continuous random variable. There are many examples of continuous probability distributions: normal, uniform, chi-squared, and others.

Intuitively, a continuous random variable is one that can take a continuous range of values, as opposed to a discrete distribution, where the set of possible values for the random variable is at most countable. While for a discrete distribution an event with probability zero is impossible (e.g. rolling 3.5 on a standard die is impossible, and has probability zero), this is not so in the case of a continuous random variable. For example, if one measures the width of an oak leaf, the result of 3.5 cm is possible; however, it has probability zero because there are uncountably many other potential values even between 3 cm and 4 cm. Each of these individual outcomes has probability zero, yet the probability that the outcome will fall into the interval (3 cm, 4 cm) is nonzero. This apparent paradox is resolved by the fact that the probability that X attains some value within an infinite set, such as an interval, cannot be found by naively adding the probabilities for individual values. Formally, each value has an infinitesimally small probability, which statistically is equivalent to zero.

Formally, if X is a continuous random variable, then it has a probability density function f(x), and therefore its probability of falling into a given interval, say [a, b], is given by the integral

Pr[a ≤ X ≤ b] = ∫_a^b f(x) dx.

In particular, the probability for X to take any single value a (that is, a ≤ X ≤ a) is zero, because an integral with coinciding upper and lower limits is always equal to zero.

The definition states that a continuous probability distribution must possess a density, or equivalently, its cumulative distribution function must be absolutely continuous. This requirement is stronger than simple continuity of the cdf, and there is a special class of distributions, singular distributions, which are neither continuous nor discrete nor a mixture of those. An example is given by the Cantor distribution. Such singular distributions, however, are never encountered in practice.

Note on terminology: some authors use the term "continuous distribution" to denote the distribution with continuous cdf. Thus, their definition includes both the (absolutely) continuous and singular distributions. By one convention, a probability distribution is called continuous if its cumulative distribution function F(x) = Pr[X ≤ x] is continuous and, therefore, the probability measure of singletons Pr[X = x] = 0 for all x.

Another convention reserves the term continuous probability distribution for absolutely continuous distributions. These distributions can be characterized by a probability density function: a non-negative Lebesgue integrable function f defined on the real numbers such that

F(x) = Pr[X ≤ x] = ∫_{-∞}^{x} f(t) dt.
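As a numerical sketch of this characterization (not from the original text), the snippet below checks that a candidate density integrates to 1 and computes Pr[a ≤ X ≤ b] by integration; the standard normal density is used purely as an illustrative choice.

import math

def density(x):
    # Standard normal density, used here only as an example of a valid pdf.
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(f, lo, hi, n=100000):
    # Simple midpoint rule; adequate for a smooth density on a finite interval.
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

print(round(integrate(density, -10, 10), 6))   # ~1.0, as required of a density
print(integrate(density, -1.96, 1.96))         # Pr[-1.96 <= X <= 1.96], roughly 0.95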

Discrete distributions and some continuous distributions (like the Cantor distribution) do not admit such a density.

To better understand continuous distributions, consider an example of probability that Associate Professor Kevin Gue used in his lectures on stochastic operations at Auburn University: Assume you are playing golf. What is the probability that you can hit the golf ball exactly, on the dot, 200 yards? Answer: 0. This is not directly intuitive; your intuition suggests that there must be some small probability of making the ball stop at exactly 200 yards. But because distance is evaluated continuously, there are infinitely many points at which the ball could stop (for example, 199.2304930234930 yards). Since the possibilities are infinite, the probability of hitting exactly 200 yards is zero.


Probability distributions of scalar random variables

The following applies to all types of scalar random variables. Because a probability distribution Pr on the real line is determined by the probability of a scalar random variable X being in a half-open interval (-∞, x], the probability distribution is completely characterized by its cumulative distribution function:

F(x) = Pr[X ≤ x] for all real x.

Some properties

The probability distribution of the sum of two independent random variables is the convolution of each of their distributions.
Probability distributions are not a vector space, since they are not closed under linear combinations (these do not preserve non-negativity or total integral 1), but they are closed under convex combination, thus forming a convex subset of the space of functions (or measures).
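The convolution property can be made concrete with the two-dice example from the introduction; the sketch below (an illustration, not part of the original text) convolves two fair-die pmfs to obtain the distribution of their sum.

from collections import defaultdict

def convolve(pmf_x, pmf_y):
    # Distribution of X + Y for independent discrete X and Y.
    pmf_sum = defaultdict(float)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_sum[x + y] += px * py
    return dict(pmf_sum)

die = {k: 1 / 6 for k in range(1, 7)}     # fair six-sided die
two_dice = convolve(die, die)
print(two_dice[7])                        # 6/36, the most likely sum
print(round(sum(two_dice.values()), 12))  # 1.0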

Kolmogorov definition

In the measure-theoretic formalization of probability theory, a random variable is defined as a measurable function X from a probability space (Ω, F, Pr) to a measurable space (X, A). A probability distribution is the pushforward measure X∗Pr = Pr ∘ X⁻¹ on (X, A).

Random number generationA frequent problem in statistical simulations (the Monte Carlo method) is the generation of pseudo-random numbers that are distributed in a given way. Most algorithms are based on a pseudorandom number generator that produces numbers X that are uniformly distributed in the interval [0,1). These random variates X are then transformed via some algorithm to create a new random variate having the required probability distribution.
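A standard concrete instance of this transformation step is inverse-transform sampling: pass a uniform variate through the inverse cdf of the target distribution. The sketch below does this for an exponential distribution with a given rate; the choice of target distribution is only illustrative.

import math
import random

def sample_exponential(rate):
    # Inverse transform: X = F^{-1}(U) with F(x) = 1 - exp(-rate * x).
    u = random.random()            # U uniform on [0, 1)
    return -math.log(1.0 - u) / rate

samples = [sample_exponential(2.0) for _ in range(100000)]
print(sum(samples) / len(samples))   # close to the theoretical mean 1/rate = 0.5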

Applications

The concept of the probability distribution, and of the random variables it describes, underlies the mathematical discipline of probability theory and the science of statistics. There is spread or variability in almost any value that can be measured in a population (e.g. height of people, durability of a metal, sales growth, traffic flow, etc.); almost all measurements are made with some intrinsic error; in physics many processes are described probabilistically, from the kinetic properties of gases to the quantum mechanical description of fundamental particles. For these and many other reasons, simple numbers are often inadequate for describing a quantity, while probability distributions are often more appropriate. As a more specific example of an application, the cache language models and other statistical language models used in natural language processing assign probabilities to the occurrence of particular words and word sequences by means of probability distributions.


Common probability distributions

The following is a list of some of the most common probability distributions, grouped by the type of process that they are related to. For a more complete list, see list of probability distributions, which groups by the nature of the outcome being considered (discrete, continuous, multivariate, etc.). Note also that all of the univariate distributions below are singly peaked; that is, it is assumed that the values cluster around a single point. In practice, actually observed quantities may cluster around multiple values. Such quantities can be modeled using a mixture distribution.

Related to real-valued quantities that grow linearly (e.g. errors, offsets)

Normal distribution (Gaussian distribution), for a single such quantity; the most common continuous distribution

Related to positive real-valued quantities that grow exponentially (e.g. prices, incomes, populations)

Log-normal distribution, for a single such quantity whose log is normally distributed
Pareto distribution, for a single such quantity whose log is exponentially distributed; the prototypical power law distribution

Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region

Discrete uniform distribution, for a finite set of values (e.g. the outcome of a fair die)
Continuous uniform distribution, for continuously distributed values

Related to Bernoulli trials (yes/no events, with a given probability)

Basic distributions:
Bernoulli distribution, for the outcome of a single Bernoulli trial (e.g. success/failure, yes/no)
Binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of independent occurrences
Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
Geometric distribution, for binomial-type observations but where the quantity of interest is the number of failures before the first success; a special case of the negative binomial distribution

Related to sampling schemes over a finite population:
Hypergeometric distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, using sampling without replacement
Beta-binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, sampling using a Polya urn scheme (in some sense, the "opposite" of sampling without replacement)


Related to categorical outcomes (events with K possible outcomes, with a given probability for each outcome)

Categorical distribution, for a single categorical outcome (e.g. yes/no/maybe in a survey); a generalization of the Bernoulli distribution
Multinomial distribution, for the number of each type of categorical outcome, given a fixed number of total outcomes; a generalization of the binomial distribution
Multivariate hypergeometric distribution, similar to the multinomial distribution, but using sampling without replacement; a generalization of the hypergeometric distribution

Related to events in a Poisson process (events that occur independently with a given rate)

Poisson distribution, for the number of occurrences of a Poisson-type event in a given period of time
Exponential distribution, for the time before the next Poisson-type event occurs

Useful for hypothesis testing related to normally distributed outcomes

Chi-squared distribution, the distribution of a sum of squared standard normal variables; useful e.g. for inference regarding the sample variance of normally distributed samples (see chi-squared test)
Student's t distribution, the distribution of the ratio of a standard normal variable and the square root of a scaled chi-squared variable; useful for inference regarding the mean of normally distributed samples with unknown variance (see Student's t-test)
F-distribution, the distribution of the ratio of two scaled chi-squared variables; useful e.g. for inferences that involve comparing variances or involving R-squared (the squared correlation coefficient)

Useful as conjugate prior distributions in Bayesian inference

Beta distribution, for a single probability (real number between 0 and 1); conjugate to the Bernoulli distribution and binomial distribution
Gamma distribution, for a non-negative scaling parameter; conjugate to the rate parameter of a Poisson distribution or exponential distribution, the precision (inverse variance) of a normal distribution, etc.
Dirichlet distribution, for a vector of probabilities that must sum to 1; conjugate to the categorical distribution and multinomial distribution; generalization of the beta distribution
Wishart distribution, for a symmetric non-negative definite matrix; conjugate to the inverse of the covariance matrix of a multivariate normal distribution; generalization of the gamma distribution

References B. S. Everitt: The Cambridge Dictionary of Statistics, Cambridge University Press, Cambridge (3rd edition, 2006). ISBN 0-521-69027-7 Bishop: Pattern Recognition and Machine Learning, Springer, ISBN 0-387-31073-8

External links Hazewinkel, Michiel, ed. (2001), "Probability distribution" (http://www.encyclopediaofmath.org/index.php?title=p/p074900), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4


Discrete distributions

Bernoulli distribution

Bernoulli
Parameters: 0 ≤ p ≤ 1 (success probability); q = 1 - p
Support: k ∈ {0, 1}
PMF: q for k = 0; p for k = 1
CDF: 0 for k < 0; q for 0 ≤ k < 1; 1 for k ≥ 1
Mean: p
Median: 0 if q > p; 1/2 if q = p; 1 if q < p
Mode: 0 if q > p; 0 and 1 if q = p; 1 if q < p
Variance: pq = p(1 - p)
Skewness: (1 - 2p)/√(pq)
Ex. kurtosis: (1 - 6pq)/(pq)
Entropy: -q ln q - p ln p
MGF: q + p e^t
CF: q + p e^{it}
PGF: q + pz

In probability theory and statistics, the Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is a discrete probability distribution, which takes value 1 with success probability p and value 0 with failure probability q = 1 - p. So if X is a random variable with this distribution, we have:

Pr(X = 1) = p,  Pr(X = 0) = 1 - p = q.

A classical example of a Bernoulli experiment is a single toss of a coin. The coin might come up heads with probability p and tails with probability 1 - p. The experiment is called fair if p = 0.5, indicating the origin of the terminology in betting (the bet is fair if both possible outcomes have the same probability). The probability mass function f of this distribution is

f(k; p) = p if k = 1; 1 - p if k = 0.

This can also be expressed as

f(k; p) = p^k (1 - p)^(1 - k) for k ∈ {0, 1}.

The expected value of a Bernoulli random variable X is E(X) = p, and its variance is

Var(X) = p(1 - p) = pq.

The above can be derived from the Bernoulli distribution as a special case of the binomial distribution.[1] The kurtosis goes to infinity for high and low values of p, but for p = 1/2 the Bernoulli distribution has a lower excess kurtosis than any other probability distribution, namely -2. The Bernoulli distribution is a member of the exponential family. The maximum likelihood estimator of p based on a random sample is the sample mean.
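A short numerical sketch of these facts (illustrative only), with the pmf written as p^k (1 - p)^(1 - k):

import random

def bernoulli_pmf(k, p):
    # f(k; p) = p^k (1 - p)^(1 - k) for k in {0, 1}.
    return p**k * (1 - p)**(1 - k)

p = 0.3
mean = sum(k * bernoulli_pmf(k, p) for k in (0, 1))              # equals p
var = sum((k - mean)**2 * bernoulli_pmf(k, p) for k in (0, 1))   # equals p(1 - p)
print(mean, var)   # 0.3 0.21

# The maximum likelihood estimate of p is the sample mean.
sample = [1 if random.random() < p else 0 for _ in range(100000)]
print(sum(sample) / len(sample))   # close to 0.3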

Related distributions

If X1, ..., Xn are independent, identically distributed (i.i.d.) random variables, all Bernoulli distributed with success probability p, then

Y = X1 + ... + Xn ~ B(n, p) (binomial distribution).

The Bernoulli distribution is simply B(1, p). The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant number of discrete values. The Beta distribution is the conjugate prior of the Bernoulli distribution. The geometric distribution is the number of Bernoulli trials needed to get one success.

Notes[1] McCullagh and Nelder (1989), Section 4.2.2.

References McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and Hall/CRC. ISBN0-412-31760-5. Johnson, N.L., Kotz, S., Kemp A. (1993) Univariate Discrete Distributions (2nd Edition). Wiley. ISBN 0-471-54897-9

External links Hazewinkel, Michiel, ed. (2001), "Binomial distribution" (http://www.encyclopediaofmath.org/index.php?title=p/b016420), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4 Weisstein, Eric W., "Bernoulli Distribution" (http://mathworld.wolfram.com/BernoulliDistribution.html) from MathWorld.

Binomial distribution


Figure: Probability mass function and cumulative distribution function of the binomial distribution.

Notation: B(n, p)
Parameters: n ∈ N0 (number of trials); p ∈ [0, 1] (success probability in each trial)
Support: k ∈ {0, ..., n} (number of successes)
PMF: C(n, k) p^k (1 - p)^(n - k)
CDF: I_{1-p}(n - k, k + 1) (regularized incomplete beta function)
Mean: np
Median: ⌊np⌋ or ⌈np⌉
Mode: ⌊(n + 1)p⌋ or ⌈(n + 1)p⌉ - 1
Variance: np(1 - p)
Skewness: (1 - 2p)/√(np(1 - p))
Ex. kurtosis: (1 - 6p(1 - p))/(np(1 - p))
Entropy: (1/2) ln(2πe np(1 - p)) + O(1/n)
MGF: (1 - p + p e^t)^n
CF: (1 - p + p e^{it})^n
PGF: (1 - p + pz)^n
Fisher information: n/(p(1 - p))

In probability theory and statistics, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one. However, for N much larger than n, the binomial distribution is a good approximation, and widely used.

Specification

Probability mass function

In general, if the random variable K follows the binomial distribution with parameters n and p, we write K ~ B(n, p). The probability of getting exactly k successes in n trials is given by the probability mass function:

f(k; n, p) = Pr(K = k) = C(n, k) p^k (1 - p)^(n - k)

for k = 0, 1, 2, ..., n, where

C(n, k) = n! / (k! (n - k)!)

is the binomial coefficient (hence the name of the distribution) "n choose k", also denoted C(n, k) or nCk. The formula can be understood as follows: we want k successes (p^k) and n - k failures ((1 - p)^(n - k)). However, the k successes can occur anywhere among the n trials, and there are C(n, k) different ways of distributing k successes in a sequence of n trials.

Figure: The probability that a ball in a Galton box with 8 layers (n = 8) ends up in the central bin (k = 4) is 70/256, with n and k as in Pascal's triangle.

In creating reference tables for binomial distribution probability, usually the table is filled in up to n/2 values. This is because for k > n/2, the probability can be calculated by its complement as

f(k; n, p) = f(n - k; n, 1 - p).
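A small sketch of the pmf and of the complement identity used when building tables (illustrative, plain Python):

from math import comb

def binom_pmf(k, n, p):
    # f(k; n, p) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
print(binom_pmf(3, n, p))                                  # probability of exactly 3 successes
print(abs(binom_pmf(7, n, p) - binom_pmf(3, n, 1 - p)))    # ~0: f(k; n, p) = f(n - k; n, 1 - p)
print(sum(binom_pmf(k, n, p) for k in range(n + 1)))       # 1.0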


Looking at the expression f(k; n, p) as a function of k, there is a k value that maximizes it. This k value can be found by calculating

f(k + 1; n, p) / f(k; n, p) = (n - k)p / ((k + 1)(1 - p))

and comparing it to 1. There is always an integer M that satisfies

(n + 1)p - 1 ≤ M < (n + 1)p.

f(k; n, p) is monotone increasing for k < M and monotone decreasing for k > M, with the exception of the case where (n + 1)p is an integer. In this case, there are two values for which f is maximal: (n + 1)p and (n + 1)p - 1. M is the most probable (most likely) outcome of the Bernoulli trials and is called the mode. Note that the probability of it occurring can be fairly small.

Cumulative distribution function

The cumulative distribution function can be expressed as:

F(k; n, p) = Pr(K ≤ k) = Σ_{i=0}^{⌊k⌋} C(n, i) p^i (1 - p)^(n - i),

where ⌊k⌋ is the "floor" under k, i.e. the greatest integer less than or equal to k.

It can also be represented in terms of the regularized incomplete beta function, as follows:

F(k; n, p) = I_{1-p}(n - k, k + 1).
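The beta-function representation can be checked numerically; the sketch below assumes SciPy is available and compares the direct sum with scipy.special.betainc, which computes the regularized incomplete beta function I_x(a, b).

from math import comb
from scipy.special import betainc   # regularized incomplete beta I_x(a, b)

def binom_cdf_sum(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def binom_cdf_beta(k, n, p):
    # F(k; n, p) = I_{1-p}(n - k, k + 1)
    return betainc(n - k, k + 1, 1 - p)

n, p = 20, 0.4
for k in (0, 5, 10, 19):
    print(k, binom_cdf_sum(k, n, p), binom_cdf_beta(k, n, p))   # the two columns agree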

For k ≤ np, upper bounds for the lower tail of the distribution function can be derived. In particular, Hoeffding's inequality yields the bound

F(k; n, p) ≤ exp(-2n (p - k/n)²)

and Chernoff's inequality can be used to derive the bound

F(k; n, p) ≤ exp(-n D(k/n ‖ p)),

where D(a ‖ p) = a ln(a/p) + (1 - a) ln((1 - a)/(1 - p)) is the relative entropy between a Bernoulli(a) and a Bernoulli(p) variable.

Moreover, these bounds are reasonably tight when p = 1/2, since the following expression holds for all k ≥ 3n/8:[1]

F(k; n, 1/2) ≥ (1/15) exp(-16n (1/2 - k/n)²).


Mean and variance

If X ~ B(n, p) (that is, X is a binomially distributed random variable), then the expected value of X is

E(X) = np

and the variance is

Var(X) = np(1 - p).

Mode and median

Usually the mode of a binomial B(n, p) distribution is equal to ⌊(n + 1)p⌋, where ⌊·⌋ is the floor function. However, when (n + 1)p is an integer and p is neither 0 nor 1, then the distribution has two modes: (n + 1)p and (n + 1)p - 1. When p is equal to 0 or 1, the mode will be 0 and n correspondingly. These cases can be summarized as follows: the mode is ⌊(n + 1)p⌋ if (n + 1)p is 0 or a noninteger; both (n + 1)p and (n + 1)p - 1 if (n + 1)p ∈ {1, ..., n}; and n if (n + 1)p = n + 1.

In general, there is no single formula to find the median for a binomial distribution, and it may even be non-unique. However, several special results have been established:

If np is an integer, then the mean, median, and mode coincide and equal np.[2][3]
Any median m must lie within the interval ⌊np⌋ ≤ m ≤ ⌈np⌉.[4]
A median m cannot lie too far away from the mean: |m - np| ≤ min{ln 2, max{p, 1 - p}}.[5]
The median is unique and equal to m = round(np) in cases when either p ≤ 1 - ln 2 or p ≥ ln 2 or |m - np| ≤ min{p, 1 - p} (except for the case when p = 1/2 and n is odd).[4][5]
When p = 1/2 and n is odd, any number m in the interval (n - 1)/2 ≤ m ≤ (n + 1)/2 is a median of the binomial distribution. If p = 1/2 and n is even, then m = n/2 is the unique median.

Covariance between two binomials

If two binomially distributed random variables X and Y are observed together, estimating their covariance can be useful. Using the definition of covariance, in the case n = 1 (thus being Bernoulli trials) we have

Cov(X, Y) = E(XY) - μX μY.

The first term is non-zero only when both X and Y are one, and μX and μY are equal to the two probabilities. Defining pB as the probability of both happening at the same time, this gives

Cov(X, Y) = pB - pX pY,

and for n such trials, again due to independence,

Cov(X, Y) = n(pB - pX pY).

If X and Y are the same variable, this reduces to the variance formula given above.


Relationship to other distributions

Sums of binomials

If X ~ B(n, p) and Y ~ B(m, p) are independent binomial variables with the same probability p, then X + Y is again a binomial variable; its distribution is

X + Y ~ B(n + m, p).

Conditional binomials

If X ~ B(n, p) and, conditional on X, Y ~ B(X, q), then Y is a simple binomial variable with distribution

Y ~ B(n, pq).

Bernoulli distributionThe Bernoulli distribution is a special case of the binomial distribution, where n=1. Symbolically, X~B(1,p) has the same meaning as X~Bern(p). Conversely, any binomial distribution, B(n,p), is the sum of n independent Bernoulli trials, Bern(p), each with the same probability p.

Poisson binomial distribution

The binomial distribution is a special case of the Poisson binomial distribution, which is a sum of n independent non-identical Bernoulli trials Bern(pi). If X has the Poisson binomial distribution with p1 = ... = pn = p, then X ~ B(n, p).

Normal approximation

If n is large enough, then the skew of the distribution is not too great. In this case a reasonable approximation to B(n, p) is given by the normal distribution

N(np, np(1 - p)),

and this basic approximation can be improved in a simple way by using a suitable continuity correction. The basic approximation generally improves as n increases (at least 20) and is better when p is not near to 0 or 1.[6]

Figure: Binomial PDF and normal approximation for n = 6 and p = 0.5.

Various rules of thumb may be used to decide whether n is large enough, and p is far enough from the extremes of zero or one:

One rule is that both np and n(1 - p) must be greater than 5. However, the specific number varies from source to source, and depends on how good an approximation one wants; some sources give 10, which gives virtually the same results as the following rule for large n until n is very large (ex: x = 11, n = 7752).

A second rule[6] is that for n > 5 the normal approximation is adequate if

|(1/√n)(√((1 - p)/p) - √(p/(1 - p)))| < 0.3.

Another commonly used rule holds that the normal approximation is appropriate only if everything within 3 standard deviations of its mean is within the range of possible values, that is if

0 < np - 3√(np(1 - p)) and np + 3√(np(1 - p)) < n.


The following is an example of applying a continuity correction. Suppose one wishes to calculate Pr(X ≤ 8) for a binomial random variable X. If Y has a distribution given by the normal approximation, then Pr(X ≤ 8) is approximated by Pr(Y ≤ 8.5). The addition of 0.5 is the continuity correction; the uncorrected normal approximation gives considerably less accurate results. This approximation, known as the de Moivre-Laplace theorem, is a huge time-saver when undertaking calculations by hand (exact calculations with large n are very onerous); historically, it was the first use of the normal distribution, introduced in Abraham de Moivre's book The Doctrine of Chances in 1738. Nowadays, it can be seen as a consequence of the central limit theorem since B(n, p) is a sum of n independent, identically distributed Bernoulli variables with parameter p. This fact is the basis of a hypothesis test, a "proportion z-test", for the value of p using x/n, the sample proportion and estimator of p, in a common test statistic.[7]

For example, suppose one randomly samples n people out of a large population and asks them whether they agree with a certain statement. The proportion of people who agree will of course depend on the sample. If groups of n people were sampled repeatedly and truly randomly, the proportions would follow an approximate normal distribution with mean equal to the true proportion p of agreement in the population and with standard deviation σ = (p(1 - p)/n)^(1/2). Large sample sizes n are good because the standard deviation, as a proportion of the expected value, gets smaller, which allows a more precise estimate of the unknown parameter p.
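The effect of the continuity correction can be checked numerically. The sketch below assumes SciPy and uses n = 100, p = 0.1 purely as illustrative values (the text above leaves n and p unspecified).

from scipy.stats import binom, norm

n, p = 100, 0.1                 # illustrative values only
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5

exact = binom.cdf(8, n, p)                        # Pr(X <= 8)
uncorrected = norm.cdf(8, loc=mu, scale=sigma)
corrected = norm.cdf(8.5, loc=mu, scale=sigma)    # continuity correction: Pr(Y <= 8.5)
print(exact, uncorrected, corrected)              # the corrected value is the closer approximation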

Poisson approximation

The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product np remains fixed. Therefore the Poisson distribution with parameter λ = np can be used as an approximation to B(n, p) if n is sufficiently large and p is sufficiently small. According to two rules of thumb, this approximation is good if n ≥ 20 and p ≤ 0.05, or if n ≥ 100 and np ≤ 10.[8]
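A quick numerical check of this rule of thumb (again assuming SciPy):

from scipy.stats import binom, poisson

n, p = 100, 0.05               # within the "n >= 100 and np <= 10" rule of thumb
lam = n * p
for k in (0, 2, 5, 10):
    print(k, binom.pmf(k, n, p), poisson.pmf(k, lam))   # the two pmfs are close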

Limiting distributions

Poisson limit theorem: As n approaches ∞ and p approaches 0 while np remains fixed at λ > 0, or at least np approaches λ > 0, the Binomial(n, p) distribution approaches the Poisson distribution with expected value λ.

de Moivre-Laplace theorem: As n approaches ∞ while p remains fixed, the distribution of

(X - np) / √(np(1 - p))

approaches the normal distribution with expected value 0 and variance 1. This result is sometimes loosely stated by saying that the distribution of X is asymptotically normal with expected value np and variance np(1 - p). This result is a specific case of the central limit theorem.

Confidence intervals

Even for quite large values of n, the actual distribution of the mean is significantly non-normal.[9] Because of this problem several methods to estimate confidence intervals have been proposed. Let n1 be the number of successes out of n, the total number of trials, and let

p̂ = n1/n

be the proportion of successes. Let z(α/2) be the 100(1 - α/2)th percentile of the standard normal distribution.

Wald method:

p̂ ± z(α/2) √(p̂(1 - p̂)/n).

A continuity correction of 0.5/n may be added.

Agresti-Coull method:[10]

p̃ ± z(α/2) √(p̃(1 - p̃)/ñ),

where ñ = n + z(α/2)² and the estimate of p is modified to

p̃ = (n1 + z(α/2)²/2) / ñ.

ArcSine method:[11]

sin²(arcsin(√p̂) ± z(α/2)/(2√n)).

Wilson (score) method:[12]

[p̂ + z²/(2n) ± z √(p̂(1 - p̂)/n + z²/(4n²))] / (1 + z²/n), with z = z(α/2).

The exact (Clopper-Pearson) method is the most conservative.[9] The Wald method, although commonly recommended in the textbooks, is the most biased.
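A sketch of the Wald, Agresti-Coull and Wilson intervals as written above, assuming SciPy for the normal quantile; this is illustrative code, not a substitute for a vetted statistics library.

from math import sqrt
from scipy.stats import norm

def binom_intervals(n1, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    phat = n1 / n

    wald = (phat - z * sqrt(phat * (1 - phat) / n),
            phat + z * sqrt(phat * (1 - phat) / n))

    n_ac = n + z**2                               # Agresti-Coull adjusted count
    p_ac = (n1 + z**2 / 2) / n_ac
    agresti_coull = (p_ac - z * sqrt(p_ac * (1 - p_ac) / n_ac),
                     p_ac + z * sqrt(p_ac * (1 - p_ac) / n_ac))

    centre = (phat + z**2 / (2 * n)) / (1 + z**2 / n)     # Wilson score interval
    half = (z / (1 + z**2 / n)) * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    wilson = (centre - half, centre + half)

    return wald, agresti_coull, wilson

print(binom_intervals(n1=18, n=50))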

Generating binomial random variatesMethods for random number generation where the marginal distribution is a binomial distribution are well-established. [13][14]

References

[1] Matoušek, J., Vondrák, J.: The Probabilistic Method (lecture notes) (http://kam.mff.cuni.cz/~matousek/prob-ln.ps.gz).
[2] Neumann, P. (1966). "Über den Median der Binomial- und Poissonverteilung" (in German). Wissenschaftliche Zeitschrift der Technischen Universität Dresden 19: 29-33.
[3] Lord, Nick. (July 2010). "Binomial averages when the mean is an integer", The Mathematical Gazette 94, 331-332.
[4] Kaas, R.; Buhrman, J.M. (1980). "Mean, Median and Mode in Binomial Distributions". Statistica Neerlandica 34 (1): 13-18. doi:10.1111/j.1467-9574.1980.tb00681.x.
[5] Hamza, K. (1995). "The smallest uniform upper bound on the distance between the mean and the median of the binomial and Poisson distributions". Statistics & Probability Letters 23: 21-25. doi:10.1016/0167-7152(94)00090-U.
[6] Box, Hunter and Hunter (1978). Statistics for experimenters. Wiley. p. 130.
[7] NIST/SEMATECH, "7.2.4. Does the proportion of defectives meet requirements?" (http://www.itl.nist.gov/div898/handbook/prc/section2/prc24.htm), e-Handbook of Statistical Methods.
[8] NIST/SEMATECH, "6.3.3.1. Counts Control Charts" (http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc331.htm), e-Handbook of Statistical Methods.
[9] Brown LD, Cai T. and DasGupta A (2001). Interval estimation for a binomial proportion (with discussion). Statist Sci 16: 101-133.
[10] Agresti A, Coull BA (1998) "Approximate is better than 'exact' for interval estimation of binomial proportions". The American Statistician 52: 119-126.
[11] Pires MA () Confidence intervals for a binomial proportion: comparison of methods and software evaluation.
[12] Wilson EB (1927) "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association 22: 209-212.
[13] Devroye, Luc (1986) Non-Uniform Random Variate Generation, New York: Springer-Verlag. (See especially Chapter X, Discrete Univariate Distributions (http://cg.scs.carleton.ca/~luc/chapter_ten.pdf))
[14] Kachitvichyanukul, V.; Schmeiser, B. W. (1988). "Binomial random variate generation". Communications of the ACM 31 (2): 216-222. doi:10.1145/42372.42381.

Uniform distribution (discrete)


discrete uniform

Figure: Probability mass function (shown for n = 5, where n = b - a + 1) and cumulative distribution function.

Parameters: integers a, b with b ≥ a; n = b - a + 1
Support: k ∈ {a, a + 1, ..., b}
PMF: 1/n
CDF: (⌊k⌋ - a + 1)/n
Mean: (a + b)/2
Median: (a + b)/2
Mode: N/A
Variance: (n² - 1)/12
Skewness: 0
Ex. kurtosis: -6(n² + 1)/(5(n² - 1))
Entropy: ln(n)
MGF: (e^{at} - e^{(b+1)t}) / (n(1 - e^t))
CF: (e^{iat} - e^{i(b+1)t}) / (n(1 - e^{it}))[1]

In probability theory and statistics, the discrete uniform distribution is a probability distribution whereby a finite number of equally spaced values are equally likely to be observed; every one of n values has equal probability 1/n. Another way of saying "discrete uniform distribution" would be "a known, finite number of equally spaced outcomes equally likely to happen."

If a random variable has any of n possible values k1, k2, ..., kn that are equally spaced and equally probable, then it has a discrete uniform distribution. The probability of any outcome ki is 1/n. A simple example of the discrete uniform distribution is throwing a fair die. The possible values of k are 1, 2, 3, 4, 5, 6; and each time the die is thrown, the probability of a given score is 1/6. If two dice are thrown and their values added, the uniform distribution no longer fits since the values from 2 to 12 do not have equal probabilities.

The cumulative distribution function (CDF) of the discrete uniform distribution can be expressed in terms of a degenerate distribution as

F(k) = (1/n) Σ_{i=1}^{n} H(k - ki),

where the Heaviside step function H(x - x0) is the CDF of the degenerate distribution centered at x0, using the convention that H(0) = 1.

Estimation of maximum

This example is described by saying that a sample of k observations is obtained from a uniform distribution on the integers 1, 2, ..., N, with the problem being to estimate the unknown maximum N. This problem is commonly known as the German tank problem, following the application of maximum estimation to estimates of German tank production during World War II.

The UMVU estimator for the maximum is given by

N̂ = m + m/k - 1 = (k + 1)m/k - 1,

where m is the sample maximum and k is the sample size, sampling without replacement.[2][3] This can be seen as a very simple case of maximum spacing estimation. The formula may be understood intuitively as "the sample maximum plus the average gap between observations in the sample", the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population maximum.[4] This has a variance of[2]

(1/k) (N - k)(N + 1)/(k + 2) ≈ N²/k² for small samples k << N,

so a standard deviation of approximately N/k, the (population) average size of a gap between samples; compare m/k above.

The sample maximum is the maximum likelihood estimator for the population maximum, but, as discussed above, it is biased. If samples are not numbered but are recognizable or markable, one can instead estimate population size via the capture-recapture method.
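A short sketch of the estimator m + m/k - 1 against the naive sample maximum (illustrative values; N is the "unknown" population maximum being estimated):

import random

def umvu_max_estimate(sample):
    # m + m/k - 1, where m is the sample maximum and k the sample size.
    m, k = max(sample), len(sample)
    return m + m / k - 1

N = 250                                         # true population maximum, unknown to the estimator
random.seed(1)
sample = random.sample(range(1, N + 1), 5)      # five serial numbers drawn without replacement
print(max(sample), umvu_max_estimate(sample))   # the corrected estimate is typically closer to N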


Random permutationSee rencontres numbers for an account of the probability distribution of the number of fixed points of a uniformly distributed random permutation.

Notes

[1] http://adorio-research.org/wordpress/?p=519
[2] Johnson, Roger (1994), "Estimating the Size of a Population", Teaching Statistics (http://www.rsscse.org.uk/ts/index.htm) 16 (2 (Summer)), doi:10.1111/j.1467-9639.1994.tb00688.x
[3] Johnson, Roger (2006), "Estimating the Size of a Population" (http://www.rsscse.org.uk/ts/gtb/johnson.pdf), Getting the Best from Teaching Statistics (http://www.rsscse.org.uk/ts/gtb/contents.html)
[4] The sample maximum is never more than the population maximum, but can be less, hence it is a biased estimator: it will tend to underestimate the population maximum.


Poisson distribution


Poisson

Figure: Probability mass function. The horizontal axis is the index k, the number of occurrences. The function is only defined at integer values of k; the connecting lines are only guides for the eye.

Figure: Cumulative distribution function. The horizontal axis is the index k, the number of occurrences. The CDF is discontinuous at the integers of k and flat everywhere else because a variable that is Poisson distributed takes on only integer values.

Notation: Pois(λ)
Parameters: λ > 0 (real)
Support: k ∈ {0, 1, 2, 3, ...}
PMF: λ^k e^{-λ} / k!
CDF: e^{-λ} Σ_{i=0}^{⌊k⌋} λ^i / i!, or Γ(⌊k + 1⌋, λ)/⌊k⌋! (for k ≥ 0, where Γ is the upper incomplete gamma function and ⌊·⌋ is the floor function)
Mean: λ
Median: ≈ ⌊λ + 1/3 - 0.02/λ⌋
Mode: ⌈λ⌉ - 1, ⌊λ⌋
Variance: λ
Skewness: λ^(-1/2)
Ex. kurtosis: λ^(-1)
Entropy: ≈ (1/2) ln(2πeλ) (for large λ)
MGF: exp(λ(e^t - 1))
CF: exp(λ(e^{it} - 1))
PGF: exp(λ(z - 1))

In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]) is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.[1] (The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.) Suppose someone typically gets on average 4 pieces of mail per day. There will, however, be a certain spread: sometimes a little more, sometimes a little less, once in a while nothing at all.[2] Given only the average rate, for a certain period of observation (pieces of mail per day, phone calls per hour, etc.), and assuming that the process, or mix of processes, that produces the event flow is essentially random, the Poisson distribution specifies how likely it is that the count will be 3, or 5, or 11, or any other number, during one period of observation. That is, it predicts the degree of spread around a known average rate of occurrence.[2] The distribution's practical usefulness has been explained by the Poisson law of small numbers.[3]

History

The distribution was first introduced by Siméon Denis Poisson (1781-1840) and published, together with his probability theory, in 1837 in his work Recherches sur la probabilité des jugements en matière criminelle et en matière civile (Research on the Probability of Judgments in Criminal and Civil Matters).[4] The work focused on certain random variables N that count, among other things, the number of discrete occurrences (sometimes called arrivals) that take place during a time-interval of given length. A practical application of this distribution was made by Ladislaus Bortkiewicz in 1898 when he was given the task of investigating the number of soldiers in the Prussian army killed accidentally by horse kick; this experiment introduced the Poisson distribution to the field of reliability engineering.[5]

Definition

A discrete stochastic variable X is said to have a Poisson distribution with parameter λ > 0, if for k = 0, 1, 2, ... the probability mass function of X is given by:

f(k; λ) = Pr(X = k) = λ^k e^{-λ} / k!,

where e is the base of the natural logarithm (e = 2.71828...) and k! is the factorial of k. The positive real number λ is equal to the expected value of X, but also to the variance:

λ = E(X) = Var(X).

The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. The Poisson distribution is sometimes called a Poissonian.
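A short numerical sketch of the pmf, checking that the mean and the variance both equal λ (the truncation at k = 60 is safe for λ = 4 because the omitted tail mass is negligible):

from math import exp, factorial

def poisson_pmf(k, lam):
    # f(k; lambda) = lambda^k e^{-lambda} / k!
    return lam**k * exp(-lam) / factorial(k)

lam = 4.0                    # e.g. an average of 4 pieces of mail per day
ks = range(0, 60)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean)**2 * poisson_pmf(k, lam) for k in ks)
print(mean, var)             # both approximately 4.0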


Properties

Mean

The expected value of a Poisson-distributed random variable is equal to λ, and so is its variance. The coefficient of variation is λ^(-1/2), while the index of dispersion is 1.[6] The mean deviation about the mean is[6]

E|X - λ| = 2 λ^(⌊λ⌋ + 1) e^{-λ} / ⌊λ⌋!.

The mode of a Poisson-distributed random variable with non-integer λ is equal to ⌊λ⌋, which is the largest integer less than or equal to λ. This is also written as floor(λ). When λ is a positive integer, the modes are λ and λ - 1.

All of the cumulants of the Poisson distribution are equal to the expected value λ. The nth factorial moment of the Poisson distribution is λ^n.

Median

Bounds for the median (ν) of the distribution are known and are sharp:[7]

λ - ln 2 ≤ ν < λ + 1/3.

Higher moments

The higher moments mk of the Poisson distribution about the origin are Touchard polynomials in λ:

mk = Σ_{i=0}^{k} λ^i S(k, i),

where the S(k, i) are Stirling numbers of the second kind.[8] The coefficients of the polynomials have a combinatorial meaning. In fact, when the expected value of the Poisson distribution is 1, then Dobinski's formula says that the nth moment equals the number of partitions of a set of size n.

Sums of Poisson-distributed random variables: If Xi ~ Pois(λi), i = 1, ..., n, are independent and λ = Σ λi, then Y = Σ Xi ~ Pois(λ).[9]

A converse is Raikov's theorem, which says that if the sum of two independent random variables is Poisson-distributed, then so is each of those two independent random variables.[10]


Other properties

The Poisson distributions are infinitely divisible probability distributions.[11][12]

The directed Kullback-Leibler divergence between Pois(λ) and Pois(λ0) is given by

D_KL(λ ‖ λ0) = λ0 - λ + λ ln(λ/λ0).

Bounds for the tail probabilities of a Poisson random variable X ~ Pois(λ) can be derived using a Chernoff bound argument.[13]

Related distributions

If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the difference Y = X1 - X2 follows a Skellam distribution.

If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the distribution of X1 conditional on X1 + X2 is a binomial distribution. Specifically, X1 | (X1 + X2 = k) ~ Binomial(k, λ1/(λ1 + λ2)). More generally, if X1, X2, ..., Xn are independent Poisson random variables with parameters λ1, λ2, ..., λn then, given Σ_j Xj = k,

Xi | (Σ_j Xj = k) ~ Binomial(k, λi / Σ_j λj).

In fact, (X1, ..., Xn) | (Σ_j Xj = k) ~ Multinomial(k; λ1/Σ_j λj, ..., λn/Σ_j λj).

The Poisson distribution can be derived as a limiting case of the binomial distribution as the number of trials goes to infinity and the expected number of successes remains fixed (see law of rare events below). Therefore it can be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is at least 20 and p is smaller than or equal to 0.05, and an excellent approximation if n ≥ 100 and np ≤ 10.[14]

The Poisson distribution is a special case of the generalized stuttering Poisson distribution (or stuttering Poisson distribution) with only a single parameter.[15] The stuttering Poisson distribution can be deduced from the limiting distribution of the multinomial distribution.

For sufficiently large values of λ (say λ > 1000), the normal distribution with mean λ and variance λ (standard deviation √λ) is an excellent approximation to the Poisson distribution. If λ is greater than about 10, then the normal distribution is a good approximation if an appropriate continuity correction is performed, i.e., P(X ≤ x), where x is a non-negative integer, is replaced by P(X ≤ x + 0.5).

Variance-stabilizing transformation: When a variable is Poisson distributed, its square root is approximately normally distributed with expected value of about √λ and variance of about 1/4.[16][17] Under this transformation, the convergence to normality (as λ increases) is far faster than for the untransformed variable. Other, slightly more complicated, variance stabilizing transformations are available,[17] one of which is the Anscombe transform. See Data transformation (statistics) for more general uses of transformations.

If for every t > 0 the number of arrivals in the time interval [0, t] follows the Poisson distribution with mean λt, then the sequence of inter-arrival times are independent and identically distributed exponential random variables having mean 1/λ.[18]

The cumulative distribution functions of the Poisson and chi-squared distributions are related in the following way:[19][20] if X ~ Pois(λ) and k is a non-negative integer, then

Pr(X ≤ k) = Pr(χ²(2(k + 1)) > 2λ).


Occurrence

Applications of the Poisson distribution can be found in many fields related to counting:

Electrical system example: telephone calls arriving in a system.
Astronomy example: photons arriving at a telescope.
Biology example: the number of mutations on a strand of DNA per unit time.
Management example: customers arriving at a counter or call centre.
Civil engineering example: cars arriving at a traffic light.
Finance and insurance example: number of losses/claims occurring in a given period of time.
Earthquake seismology example: an asymptotic Poisson model of seismic risk for large earthquakes (Lomnitz, 1994).

The Poisson distribution arises in connection with Poisson processes. It applies to various phenomena of discrete properties (that is, those that may happen 0, 1, 2, 3, ... times during a given period of time or in a given area) whenever the probability of the phenomenon happening is constant in time or space. Examples of events that may be modelled as a Poisson distribution include:

The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was made famous by a book of Ladislaus Josephovich Bortkiewicz (1868-1931).
The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy Gosset (1876-1937).[21]
The number of phone calls arriving at a call centre per minute.
The number of goals in sports involving two competing teams.
The number of deaths per year in a given age group.
The number of jumps in a stock price in a given time interval.
Under an assumption of homogeneity, the number of times a web server is accessed per minute.
The number of mutations in a given stretch of DNA after a certain amount of radiation.
The proportion of cells that will be infected at a given multiplicity of infection.


How does this distribution arise? The law of rare events

In several of the above examples (such as the number of mutations in a given sequence of DNA) the events being counted are actually the outcomes of discrete trials, and would more precisely be modelled using the binomial distribution, that is

X ~ B(n, p).

In such cases n is very large and p is very small (and so the expectation np is of intermediate magnitude). Then the distribution may be approximated by the less cumbersome Poisson distribution

X ~ Pois(np).

This is sometimes known as the law of rare events, since each of the n individual Bernoulli events rarely occurs. The name may be misleading because the total count of success events in a Poisson process need not be rare if the parameter np is not small. For example, the number of telephone calls to a busy switchboard in one hour follows a Poisson distribution with the events appearing frequent to the operator, but they are rare from the point of view of the average member of the population, who is very unlikely to make a call to that switchboard in that hour.

Figure: Comparison of the Poisson distribution (black lines) and the binomial distribution with n = 10 (red circles), n = 20 (blue circles), n = 1000 (green circles). All distributions have a mean of 5. The horizontal axis shows the number of events k. Notice that as n gets larger, the Poisson distribution becomes an increasingly better approximation for the binomial distribution with the same mean.

The word law is sometimes used as a synonym of probability distribution, and convergence in law means convergence in distribution. Accordingly, the Poisson distribution is sometimes called the law of small numbers because it is the probability distribution of the number of occurrences of an event that happens rarely but has very many opportunities to happen. The Law of Small Numbers is a book by Ladislaus Bortkiewicz about the Poisson distribution, published in 1898. Some have suggested that the Poisson distribution should have been called the Bortkiewicz distribution.[22]

Multi-dimensional Poisson process

The Poisson distribution arises as the distribution of counts of occurrences of events in (multidimensional) intervals in multidimensional Poisson processes in a directly equivalent way to the result for unidimensional processes. That is, if D is any region of the multidimensional space for which |D|, the area or volume of the region, is finite, and if N(D) is the count of the number of events in D, then

Pr(N(D) = k) = (λ|D|)^k e^{-λ|D|} / k!.


Other applications in science

In a Poisson process, the number of observed occurrences fluctuates about its mean λ with a standard deviation σk = √λ. These fluctuations are denoted as Poisson noise or (particularly in electronics) as shot noise.

The correlation of the mean and standard deviation in counting independent discrete occurrences is useful scientifically. By monitoring how the fluctuations vary with the mean signal, one can estimate the contribution of a single occurrence, even if that contribution is too small to be detected directly. For example, the charge e on an electron can be estimated by correlating the magnitude of an electric current with its shot noise. If N electrons pass a point in a given time t on the average, the mean current is I = eN/t; since the current fluctuations should be of the order σI = e√N/t (i.e., the standard deviation of the Poisson process), the charge e can be estimated from the ratio t σI²/I.

An everyday example is the graininess that appears as photographs are enlarged; the graininess is due to Poisson fluctuations in the number of reduced silver grains, not to the individual grains themselves. By correlating the graininess with the degree of enlargement, one can estimate the contribution of an individual grain (which is otherwise too small to be seen unaided). Many other molecular applications of Poisson noise have been developed, e.g., estimating the number density of receptor molecules in a cell membrane.

Generating Poisson-distributed random variables

A simple algorithm to generate random Poisson-distributed numbers (pseudo-random number sampling) has been given by Knuth (see References below):

algorithm poisson random number (Knuth):
    init: Let L ← e^{-λ}, k ← 0 and p ← 1.
    do:
        k ← k + 1.
        Generate uniform random number u in [0, 1] and let p ← p × u.
    while p > L.
    return k - 1.

While simple, the complexity is linear in λ. There are many other algorithms to overcome this. Some are given in Ahrens & Dieter, see References below. Also, for large values of λ, there may be numerical stability issues because of the term e^{-λ}. One solution for large values of λ is rejection sampling, another is to use a Gaussian approximation to the Poisson.

Inverse transform sampling is simple and efficient for small values of λ, and requires only one uniform random number u per sample. Cumulative probabilities are examined in turn until one exceeds u.
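A direct transcription of Knuth's algorithm above into Python (a sketch: fine for small λ, and subject to the same numerical caveat for large λ noted in the text):

import math
import random

def poisson_knuth(lam):
    # Multiply uniform variates until the running product drops below e^{-lambda}.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

random.seed(0)
draws = [poisson_knuth(4.0) for _ in range(100000)]
print(sum(draws) / len(draws))   # close to lambda = 4.0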


Parameter estimation

Maximum likelihood

Given a sample of n measured values ki we wish to estimate the value of the parameter λ of the Poisson population from which the sample was drawn. The maximum likelihood estimate is

λ̂_MLE = (1/n) Σ_{i=1}^{n} ki.

Since each observation has expectation λ, so does this sample mean. Therefore the maximum likelihood estimate is an unbiased estimator of λ. It is also an efficient estimator, i.e. its estimation variance achieves the Cramér-Rao lower bound (CRLB). Hence it is MVUE. Also it can be proved that the sample mean is a complete and sufficient statistic for λ.

Confidence interval

The confidence interval for the mean of a Poisson distribution is calculated using the relationship between the Poisson and chi-squared distributions, and can be written as:

(1/2) χ²(α/2; 2k) ≤ μ ≤ (1/2) χ²(1 - α/2; 2k + 2),

where k is the number of event occurrences in a given interval and χ²(p; n) is the chi-square deviate with lower tail area p and degrees of freedom n.[19][23] This interval is 'exact' in the sense that its coverage probability is never less than the nominal 1 - α.

When quantiles of the chi-squared distribution are not available, an accurate approximation to this exact interval was proposed by DP Byar (based on the Wilson-Hilferty transformation),[24] where z(α/2) denotes the standard normal deviate with upper tail area α/2.

For application of these formulae in the same context as above (given a sample of n measured values ki), one would set

k = Σ_{i=1}^{n} ki,

calculate an interval for μ = nλ, and then derive the interval for λ.
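A sketch of the exact chi-square-based interval given above, assuming SciPy for the chi-square quantiles:

from scipy.stats import chi2

def poisson_exact_ci(k, alpha=0.05):
    # Exact confidence interval for a Poisson mean, given k observed events in one interval.
    lower = 0.0 if k == 0 else chi2.ppf(alpha / 2, 2 * k) / 2
    upper = chi2.ppf(1 - alpha / 2, 2 * (k + 1)) / 2
    return lower, upper

print(poisson_exact_ci(10))   # roughly (4.8, 18.4) for k = 10 events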

Bayesian inference

In Bayesian inference, the conjugate prior for the rate parameter λ of the Poisson distribution is the gamma distribution. Let

λ ~ Gamma(α, β)

denote that λ is distributed according to the gamma density g parameterized in terms of a shape parameter α and an inverse scale parameter β:

g(λ; α, β) = (β^α / Γ(α)) λ^(α - 1) e^{-βλ},  for λ > 0.

Then, given the same sample of n measured values ki as before, and a prior of Gamma(α, β), the posterior distribution is

λ ~ Gamma(α + Σ_{i=1}^{n} ki, β + n).

The posterior mean E[λ] approaches the maximum likelihood estimate λ̂_MLE in the limit as α → 0, β → 0.


The posterior predictive distribution for a single additional observation is a negative binomial distribution, sometimes called a Gamma-Poisson distribution.
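A sketch of the conjugate update described above: with a Gamma(α, β) prior (shape and inverse scale, as parameterized in the text) and observed counts k_i, the posterior is Gamma(α + Σ k_i, β + n). The prior values and counts below are purely illustrative.

def poisson_gamma_posterior(alpha, beta, counts):
    # Conjugate update for a Poisson rate with a Gamma(alpha, beta) prior.
    alpha_post = alpha + sum(counts)
    beta_post = beta + len(counts)
    return alpha_post, beta_post

counts = [3, 5, 4, 6, 2]
a, b = poisson_gamma_posterior(alpha=2.0, beta=1.0, counts=counts)
print(a, b, a / b)   # posterior mean a/b; approaches the sample mean as the prior weakens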

Bivariate Poisson distribution

This distribution has been extended to the bivariate case.[25] The generating function for this distribution is

g(u, v) = exp[(θ1 - θ12)(u - 1) + (θ2 - θ12)(v - 1) + θ12(uv - 1)]

with

θ1, θ2 > θ12 > 0.

The marginal distributions are Poisson(θ1) and Poisson(θ2), and the correlation coefficient is limited to the range

0 ≤ ρ ≤ min{θ1, θ2} / √(θ1 θ2).

The Skellam distribution is a particular case of this distribution.


Beta-binomial distribution


[Plots: probability mass function; cumulative distribution function]

Parameters: n ∈ N₀ — number of trials; α > 0 (real); β > 0 (real)
Support: k ∈ {0, 1, …, n}
PMF: C(n, k) B(k + α, n − k + β) / B(α, β)
CDF: expressed via the generalized hypergeometric function ₃F₂(a, b; k) = ₃F₂(1, α + k + 1, −n + k + 1; k + 2, −β − n + k + 2; 1)
Mean: nα / (α + β)
Variance: nαβ (α + β + n) / [(α + β)² (α + β + 1)]
Ex. kurtosis: see text

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics as an overdispersed binomial distribution.

It reduces to the Bernoulli distribution as a special case when n = 1. For α = β = 1, it is the discrete uniform distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β. The beta-binomial is a one-dimensional version of the Dirichlet-multinomial distribution, as the binomial and beta distributions are special cases of the multinomial and Dirichlet distributions, respectively.


Motivation and derivation
Beta-binomial distribution as a compound distribution
The Beta distribution is a conjugate distribution of the binomial distribution. This fact leads to an analytically tractable compound distribution where one can think of the parameter p in the binomial distribution as being randomly drawn from a beta distribution. Namely, if

X | p ~ Bin(n, p) is the binomial distribution, where p is a random variable with a beta distribution,

p ~ Beta(α, β),

then the compound distribution is given by

f(k | n, α, β) = ∫₀¹ Bin(k | n, p) Beta(p | α, β) dp = C(n, k) B(k + α, n − k + β) / B(α, β).

Using the properties of the beta function, this can alternatively be written

f(k | n, α, β) = [Γ(n + 1) / (Γ(k + 1) Γ(n − k + 1))] · [Γ(k + α) Γ(n − k + β) / Γ(n + α + β)] · [Γ(α + β) / (Γ(α) Γ(β))].

It is within this context that the beta-binomial distribution appears often in Bayesian statistics: the beta-binomial is the predictive distribution of a binomial random variable with a beta distribution prior on the success probability.
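As a quick numerical check of this compound construction, the following sketch (assuming Python with SciPy ≥ 1.4, which provides scipy.stats.betabinom; the values n = 10, α = 2, β = 3 are illustrative and not taken from the text) integrates the binomial PMF against the beta density and compares the result with the beta-binomial PMF.

```python
from scipy import stats
from scipy.integrate import quad

n, a, b = 10, 2.0, 3.0    # illustrative values only, not taken from the text

for k in range(n + 1):
    # integrate Bin(k | n, p) * Beta(p | a, b) over p in [0, 1]
    mixed, _ = quad(lambda p: stats.binom.pmf(k, n, p) * stats.beta.pdf(p, a, b), 0, 1)
    direct = stats.betabinom.pmf(k, n, a, b)   # requires SciPy >= 1.4
    print(k, round(mixed, 6), round(direct, 6))
```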

Beta-binomial as an urn model
The beta-binomial distribution can also be motivated via an urn model for positive integer values of α and β. Specifically, imagine an urn containing α red balls and β black balls, where random draws are made. If a red ball is observed, then two red balls are returned to the urn. Likewise, if a black ball is drawn, it is replaced and another black ball is added to the urn. If this is repeated n times, then the probability of observing k red balls follows a beta-binomial distribution with parameters n, α and β. Note that if the random draws are with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a binomial distribution, and if the random draws are made without replacement, the distribution follows a hypergeometric distribution.
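The urn scheme described above is straightforward to simulate; the sketch below (a minimal illustration assuming NumPy and SciPy, with the arbitrary choices n = 8, α = 2, β = 3) adds one extra ball of the observed colour after each draw and compares the empirical distribution of red counts with the beta-binomial PMF.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, a, b = 8, 2, 3            # draws per experiment; initial red (a) and black (b) balls
trials = 50_000

counts = np.zeros(n + 1)
for _ in range(trials):
    red, black = a, b
    k = 0
    for _ in range(n):
        if rng.random() < red / (red + black):   # a red ball is drawn
            red += 1                             # returned together with one extra red
            k += 1
        else:
            black += 1                           # returned together with one extra black
    counts[k] += 1

empirical = counts / trials
theoretical = stats.betabinom.pmf(np.arange(n + 1), n, a, b)
print(np.round(empirical, 3))
print(np.round(theoretical, 3))
```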


Moments and properties
The first three raw moments are

and the kurtosis is

Letting π = α / (α + β) and ρ = 1 / (α + β + 1), we note, suggestively, that the mean can be written as

μ = n π

and the variance as

σ² = n π (1 − π) [1 + (n − 1) ρ],

where ρ is the pairwise correlation between the n Bernoulli draws and is called the over-dispersion parameter.

Point estimates
Method of moments
The method of moments estimates can be gained by noting the first and second moments of the beta-binomial, namely

μ₁ = nα / (α + β)
μ₂ = nα [n(1 + α) + β] / [(α + β)(1 + α + β)],

and setting these raw moments equal to the sample moments

m₁ = μ̂₁ and m₂ = μ̂₂,

and solving for α and β we get

α̂ = (n m₁ − m₂) / (n (m₂/m₁ − m₁ − 1) + m₁),
β̂ = (n − m₁)(n − m₂/m₁) / (n (m₂/m₁ − m₁ − 1) + m₁).

Note that these estimates can be nonsensically negative, which is evidence that the data are either equidispersed or underdispersed relative to the binomial distribution. In this case, the binomial distribution and the hypergeometric distribution are alternative candidates, respectively.
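A small sketch of these method-of-moments estimates in Python (assuming NumPy/SciPy; the function name beta_binom_mom and the simulated sample are illustrative, not from the source):

```python
import numpy as np
from scipy import stats

def beta_binom_mom(ks, n):
    """Method-of-moments estimates of (alpha, beta) from counts ks out of n trials."""
    ks = np.asarray(ks, dtype=float)
    m1, m2 = ks.mean(), (ks ** 2).mean()          # first two sample raw moments
    denom = n * (m2 / m1 - m1 - 1) + m1
    alpha = (n * m1 - m2) / denom
    beta = (n - m1) * (n - m2 / m1) / denom
    return alpha, beta

# tiny synthetic demonstration (parameter values are illustrative only)
sample = stats.betabinom.rvs(12, 34.0, 31.0, size=5000, random_state=1)
print(beta_binom_mom(sample, 12))
```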


Maximum likelihood estimation
While closed-form maximum likelihood estimates are impractical, given that the pdf consists of common functions (gamma function and/or beta functions) they can be easily found via direct numerical optimization. Maximum likelihood estimates from empirical data can be computed using general methods for fitting multinomial Pólya distributions, methods for which are described in (Minka 2003). The R package VGAM, through the function vglm, facilitates the fitting of glm-type models with responses distributed according to the beta-binomial distribution via maximum likelihood. Note also that there is no requirement that n is fixed throughout the observations.
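One way to carry out the direct numerical optimization mentioned above, sketched in Python with SciPy rather than the R/VGAM route named in the text (the function name and starting values are arbitrary):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def fit_beta_binom_mle(ks, n):
    """Maximum-likelihood estimates of (alpha, beta) for counts ks out of n trials."""
    ks = np.asarray(ks)

    def nll(params):
        a, b = params
        return -np.sum(stats.betabinom.logpmf(ks, n, a, b))

    res = minimize(nll, x0=[1.0, 1.0], method="L-BFGS-B",
                   bounds=[(1e-6, None), (1e-6, None)])
    return res.x, -res.fun        # (alpha_hat, beta_hat), maximized log-likelihood
```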

Example
The following data gives the number of male children among the first 12 children of family size 13 in 6115 families taken from hospital records in 19th century Saxony (Sokal and Rohlf, p. 59 from Lindsey). The 13th child is ignored to assuage the effect of families non-randomly stopping when a desired gender is reached.

Males:     0    1    2    3    4     5     6     7    8    9   10   11   12
Families:  3   24  104  286  670  1033  1343  1112  829  478  181   45    7

We note the first two sample moments are

and therefore the method of moments estimates are

The maximum likelihood estimates can be found numerically

and the maximized log-likelihood is

from which we find the AIC

The AIC for the competing binomial model is AIC = 25070.34, and thus we see that the beta-binomial model provides a superior fit to the data, i.e. there is evidence for overdispersion. Trivers and Willard posit a theoretical justification for heterogeneity in gender-proneness among families (i.e. overdispersion). The superior fit is evident especially among the tails:


Males                             0     1      2      3      4       5       6       7      8      9     10    11   12
Observed families                 3    24    104    286    670    1033    1343    1112    829    478    181    45    7
Predicted (Beta-Binomial)       2.3  22.6  104.8  310.9  655.7  1036.2  1257.9  1182.1  853.6  461.9  177.9  43.8  5.2
Predicted (Binomial p=0.519215) 0.9  12.1   71.8  258.5  628.1  1085.2  1367.3  1265.6  854.2  410.0  132.8  26.1  2.3
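The comparison reported in this example can be reproduced from the tabulated counts alone. The following sketch (assuming SciPy; it is not the source's code) maximizes the weighted log-likelihood of the beta-binomial model and computes the AICs of both models:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

males = np.arange(13)                          # 0..12 boys among the first 12 children
families = np.array([3, 24, 104, 286, 670, 1033, 1343,
                     1112, 829, 478, 181, 45, 7])
n = 12

def nll_betabinom(params):
    a, b = params
    return -np.sum(families * stats.betabinom.logpmf(males, n, a, b))

res = minimize(nll_betabinom, x0=[10.0, 10.0], method="L-BFGS-B",
               bounds=[(1e-6, None), (1e-6, None)])
aic_bb = 2 * 2 + 2 * res.fun                   # two fitted parameters

p_hat = np.sum(families * males) / (n * families.sum())    # binomial MLE of p
loglik_bin = np.sum(families * stats.binom.logpmf(males, n, p_hat))
aic_bin = 2 * 1 - 2 * loglik_bin               # one fitted parameter

print("beta-binomial: alpha, beta =", res.x.round(3), " AIC =", round(aic_bb, 2))
print("binomial:      p =", round(p_hat, 6), " AIC =", round(aic_bin, 2))
```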

Further Bayesian considerations
It is convenient to reparameterize the distributions so that the expected mean of the prior is a single parameter: Let

π(θ | μ, M) = Beta(Mμ, M(1 − μ)),

where

μ = α / (α + β) is the expected mean and M = α + β,

so that

E(θ | μ, M) = μ and Var(θ | μ, M) = μ(1 − μ) / (M + 1).

The posterior distribution ρ(θ | k) is also a beta distribution:

ρ(θ | k) ∝ Bin(k | n, θ) Beta(θ | Mμ, M(1 − μ)) = Beta(Mμ + k, M(1 − μ) + n − k),

and

E(θ | k) = (Mμ + k) / (M + n),

while the marginal distribution m(k | μ, M) is given by

m(k | μ, M) = ∫₀¹ Bin(k | n, θ) Beta(θ | Mμ, M(1 − μ)) dθ = BetaBin(k | n, Mμ, M(1 − μ)).

Because the marginal is a complex, non-linear function of Gamma and Digamma functions, it is quite difficult to obtain a marginal maximum likelihood estimate (MMLE) for the mean and variance. Instead, we use the method of iterated expectations to find the expected value of the marginal moments. Let us write our model as a two-stage compound sampling model. Let k_i be the number of successes out of n_i trials for event i:

k_i ~ Bin(n_i, θ_i),   θ_i ~ Beta(Mμ, M(1 − μ)), independently for each i.

We can find iterated moment estimates for the mean and variance using the moments for the distributions in the two-stage model:


(Here we have used the law of total expectation and the law of total variance.) We want point estimates for μ and M. The estimated mean μ̂ is calculated from the sample

The estimate of the hyperparameter M is obtained using the moment estimates for the variance of the two-stage model:

Solving:

where

Since we now have parameter point estimates, μ̂ and M̂, for the underlying distribution, we would like to find a point estimate θ̃_i for the probability of success for event i. This is the weighted average of the event estimate k_i / n_i and μ̂. Given our point estimates for the prior, we may now plug in these values to find a point estimate for the posterior.

Shrinkage factors
We may write the posterior estimate as a weighted average:

θ̃_i = B̂_i μ̂ + (1 − B̂_i) (k_i / n_i),

where

B̂_i = M̂ / (M̂ + n_i)

is called the shrinkage factor.
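A minimal sketch of the resulting empirical-Bayes point estimates, assuming Python with NumPy and hypothetical per-event data k_i, n_i together with previously obtained estimates mu_hat and M_hat; the weights below are the posterior-mean form consistent with the weighted average described above:

```python
import numpy as np

def shrinkage_estimates(k, n, mu_hat, M_hat):
    """Per-event point estimates pulled toward the prior mean mu_hat.

    B_i = M_hat / (M_hat + n_i) is the shrinkage factor: the smaller the
    sample n_i, the more the raw proportion k_i / n_i is shrunk toward mu_hat.
    """
    k, n = np.asarray(k, float), np.asarray(n, float)
    B = M_hat / (M_hat + n)
    return B * mu_hat + (1.0 - B) * (k / n)

# hypothetical data: three events with very different sample sizes
print(shrinkage_estimates(k=[2, 30, 300], n=[4, 60, 600], mu_hat=0.4, M_hat=20.0))
```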


Related distributions
BetaBin(n, 1, 1) ~ U(0, n), where U(0, n) is the discrete uniform distribution.

References
Minka, Thomas P. (2003). Estimating a Dirichlet distribution [1]. Microsoft Technical Report.

External links
Using the Beta-binomial distribution to assess performance of a biometric identification device [2]
Fastfit [3] contains Matlab code for fitting Beta-Binomial distributions (in the form of two-dimensional Pólya distributions) to data.

References
[1] http://research.microsoft.com/~minka/papers/dirichlet/
[2] http://it.stlawu.edu/~msch/biometrics/papers.htm
[3] http://research.microsoft.com/~minka/software/fastfit/

Negative binomial distribution


Different texts adopt slightly different definitions for the negative binomial distribution. They can be distinguished by whether the support starts at k = 0 or at k = r, and whether p denotes the probability of a success or of a failure.

[Plots of the probability mass function for several parameter choices. The orange line represents the mean, which is equal to 10 in each of these plots; the green line shows the standard deviation.]

Notation: NB(r, p)
Parameters: r > 0 — number of failures until the experiment is stopped (integer, but the definition can also be extended to reals); p ∈ (0, 1) — success probability in each experiment (real)
Support: k ∈ {0, 1, 2, 3, …} — number of successes
PMF: C(k + r − 1, k) (1 − p)^r p^k, involving a binomial coefficient
CDF: 1 − I_p(k + 1, r), with I_x(a, b) the regularized incomplete beta function
Mean: p r / (1 − p)
Variance: p r / (1 − p)²
MGF: [(1 − p) / (1 − p e^t)]^r, for t < −ln p
PGF: [(1 − p) / (1 − p z)]^r, for |z| < 1/p

In probability theory and statistics, the negative binomial distribution is a discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number of failures (denoted r) occurs. For example, if one throws a die repeatedly until the third time 1 appears, then the probability distribution of the number of non-1s that had appeared will be negative binomial. The Pascal distribution (after Blaise Pascal) and Pólya distribution (for George Pólya) are special cases of the negative binomial. There is a convention among engineers, climatologists, and others to reserve "negative binomial" in a strict sense or "Pascal" for the case of an integer-valued stopping-time parameter r, and use "Pólya" for the real-valued case. The Pólya distribution more accurately models occurrences of contagious discrete events, like

tornado outbreaks, than the Poisson distribution by allowing the mean and variance to be different, unlike the Poisson. Contagious events have positively correlated occurrences causing a larger variance than if the occurrences were independent, due to a positive covariance term.


Definition
Suppose there is a sequence of independent Bernoulli trials, each trial having two potential outcomes called "success" and "failure". In each trial the probability of success is p and of failure is (1 − p). We are observing this sequence until a predefined number r of failures has occurred. Then the random number of successes we have seen, X, will have the negative binomial (or Pascal) distribution:

X ~ NB(r, p).

When applied to real-world problems, outcomes of success and failure may or may not be outcomes we ordinarily view as good and bad, respectively. Suppose we used the negative binomial distribution to model the number of days a certain machine works before it breaks down. In this case "success" would be the result on a day when the machine worked properly, whereas a breakdown would be a "failure". If we used the negative binomial distribution to model the number of goal attempts a sportsman makes before scoring a goal, though, then each unsuccessful attempt would be a "success", and scoring a goal would be "failure". If we are tossing a coin, then the negative binomial distribution can give the number of heads ("success") we are likely to encounter before we encounter a certain number of tails ("failure"). The probability mass function of the negative binomial distribution is

f(k; r, p) = Pr(X = k) = C(k + r − 1, k) (1 − p)^r p^k,  for k = 0, 1, 2, …

Here the quantity in parentheses is the binomial coefficient, and is equal to

C(k + r − 1, k) = (k + r − 1)! / [k! (r − 1)!] = (k + r − 1)(k + r − 2) ⋯ r / k!.

This quantity can alternatively be written in the following manner, explaining the name "negative binomial":

(k + r − 1)(k + r − 2) ⋯ r / k! = (−1)^k (−r)(−r − 1)(−r − 2) ⋯ (−r − k + 1) / k! = (−1)^k C(−r, k).

To understand the above definition of the probability mass function, note that the probability for every specific sequence of k successes and r failures is (1 − p)^r p^k, because the outcomes of the k + r trials are supposed to happen independently. Since the r-th failure comes last, it remains to choose the k trials with successes out of the remaining k + r − 1 trials. The above binomial coefficient, due to its combinatorial interpretation, gives precisely the number of all these sequences of length k + r − 1.
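A quick numerical sanity check of this mass function (a sketch assuming SciPy): note that scipy.stats.nbinom is parameterized by the number of stopping events and their probability, so the article's NB(r, p) — k successes of probability p before the r-th failure — corresponds to nbinom(r, 1 − p).

```python
import numpy as np
from scipy import stats
from scipy.special import binom as choose

# article convention: k successes (probability p each) before the r-th failure
r, p = 3, 5 / 6        # e.g. counting non-1s when a fair die is thrown until the third 1

k = np.arange(0, 40)
pmf_direct = choose(k + r - 1, k) * (1 - p) ** r * p ** k
pmf_scipy = stats.nbinom.pmf(k, r, 1 - p)   # SciPy's second argument is the stopping-event probability
assert np.allclose(pmf_direct, pmf_scipy)
print(pmf_direct[:5].round(4))
```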

Extension to real-valued r
It is possible to extend the definition of the negative binomial distribution to the case of a positive real parameter r. Although it is impossible to visualize a non-integer number of failures, we can still formally define the distribution through its probability mass function. As before, we say that X has a negative binomial (or Pólya) distribution if it has a probability mass function:

f(k; r, p) = Pr(X = k) = C(k + r − 1, k) (1 − p)^r p^k,  for k = 0, 1, 2, …   (*)

Here r is a real, positive number. The binomial coefficient is then defined by the multiplicative formula and can also be rewritten using the gamma function:

C(k + r − 1, k) = (k + r − 1)(k + r − 2) ⋯ r / k! = Γ(k + r) / [k! Γ(r)].


Note that by the binomial series and (*) above, for every 0 ≤ p < 1,

∑_{k=0}^{∞} C(k + r − 1, k) (1 − p)^r p^k = (1 − p)^r (1 − p)^{−r} = 1,

hence the terms of the probability mass function indeed add up to one.

Alternative formulations
Some textbooks may define the negative binomial distribution slightly differently than it is done here. The most common variations are:

The definition where X is the total number of trials needed to get r failures, not simply the number of successes. Since the total number of trials is equal to the number of successes plus the number of failures, this definition differs from ours by adding constant r. In order to convert formulas written with this definition into the one used in the article, replace everywhere k with k − r, and also subtract r from the mean, the median, and the mode. In order to convert formulas of this article into this alternative definition, replace k with k + r and add r to the mean, the median and the mode. Effectively, this implies using the probability mass function

Pr(X = k) = C(k − 1, r − 1) (1 − p)^r p^{k − r},  for k = r, r + 1, r + 2, …,

which perhaps resembles the binomial distribution more closely than the version above. Note that the arguments of the binomial coefficient are decremented due to order: the last "failure" must occur last, and so the other events have one fewer position available when counting possible orderings. Note that this definition of the negative binomial distribution does not easily generalize to a positive, real parameter r.

The definition where p denotes the probability of a failure, not of a success. In order to convert formulas between this definition and the one used in the article, replace p with 1 − p everywhere.

The definition where the support X is defined as the number of failures, rather than the number of successes. This definition, where X counts failures but p is the probability of success, has exactly the same formulas as in the previous case where X counts successes but p is the probability of failure; however, the corresponding text will have the words failure and success swapped compared with the previous case.

The two alterations above may be applied simultaneously, i.e. X counts total trials, and p is the probability of failure.

Occurrence
Waiting time in a Bernoulli process
For the special case where r is an integer, the negative binomial distribution is known as the Pascal distribution. It is the probability distribution of a certain number of failures and successes in a series of independent and identically distributed Bernoulli trials. For k + r Bernoulli trials with success probability p, the negative binomial gives the probability of k successes and r failures, with a failure on the last trial. In other words, the negative binomial distribution is the probability distribution of the number of successes before the rth failure in a Bernoulli process, with probability p of successes on each trial. A Bernoulli process is a discrete time process, and so the number of trials, failures, and successes are integers. Consider the following example. Suppose we repeatedly throw a die, and consider a 1 to be a failure. The probability of failure on each trial is 1/6. The number of successes before the third failure belongs to the infinite set { 0, 1, 2, 3, ... }. That number of successes is a negative-binomially distributed random variable.

When r = 1 we get the probability distribution of the number of successes before the first failure (i.e. the probability of the first failure occurring on the (k + 1)st trial), which is a geometric distribution:

Pr(X = k) = (1 − p) p^k.


Overdispersed Poisson
The negative binomial distribution, especially in its alternative parameterization described above, can be used as an alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean. See Cumulants of some discrete probability distributions. An application of this is to annual counts of tropical cyclones in the North Atlantic or to monthly to 6-monthly counts of wintertime extratropical cyclones over Europe, for which the variance is greater than the mean.[1][2][3] In the case of modest overdispersion, this may produce substantially similar results to an overdispersed Poisson distribution.[4][5]

Related distributions
The geometric distribution (on {0, 1, 2, 3, ...}) is a special case of the negative binomial distribution, with

Geom(p) = NB(1, 1 − p).

The negative binomial distribution is a special case of the discrete phase-type distribution. The negative binomial distribution is a special case of the stuttering Poisson distribution.[6]

Poisson distribution
Consider a sequence of negative binomial distributions where the stopping parameter r goes to infinity, whereas the probability of success in each trial, p, goes to zero in such a way as to keep the mean of the distribution constant. Denoting this mean λ, the parameter p will have to be

p = λ / (r + λ).

Under this parametrization the probability mass function will be

f(k; r, p) = [Γ(k + r) / (k! Γ(r))] (1 − p)^r p^k = (λ^k / k!) · [Γ(r + k) / (Γ(r) (r + λ)^k)] · 1 / (1 + λ/r)^r.

Now if we consider the limit as r → ∞, the second factor will converge to one, and the third to the exponential function:

lim_{r→∞} f(k; r, p) = (λ^k / k!) · 1 · e^{−λ},

which is the mass function of a Poisson-distributed random variable with expected value λ. In other words, the alternatively parameterized negative binomial distribution converges to the Poisson distribution and r controls the deviation from the Poisson. This makes the negative binomial distribution suitable as a robust alternative to the Poisson, which approaches the Poisson for large r, but which has larger variance than the Poisson for small r.
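The convergence can be illustrated numerically. The sketch below (assuming SciPy; λ = 4 is an arbitrary choice) keeps the mean fixed via p = λ/(r + λ) and prints the largest gap between the negative binomial and Poisson mass functions as r grows.

```python
import numpy as np
from scipy import stats

lam = 4.0
k = np.arange(0, 40)
for r in (5, 50, 500, 5000):
    p = lam / (r + lam)                      # keeps the mean fixed at lam
    nb = stats.nbinom.pmf(k, r, 1 - p)       # the article's NB(r, p) in SciPy's convention
    po = stats.poisson.pmf(k, lam)
    print(r, np.abs(nb - po).max())
```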


Gamma–Poisson mixture
The negative binomial distribution also arises as a continuous mixture of Poisson distributions (i.e. a compound probability distribution) where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can view the negative binomial as a Poisson(λ) distribution, where λ is itself a random variable, distributed according to Gamma(r, p/(1 − p)). Formally, this means that the mass function of the negative binomial distribution can be written as

f(k; r, p) = ∫₀^∞ Poisson(k; λ) · Gamma(λ; r, p/(1 − p)) dλ = C(k + r − 1, k) (1 − p)^r p^k.

Because of this, the negative binomial distribution is also known as the gamma–Poisson (mixture) distribution.
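This mixture statement is easy to verify by simulation (a sketch assuming NumPy/SciPy, reading the second gamma parameter as a scale and using the arbitrary values r = 3, p = 0.4): draw a rate from the gamma distribution, then a Poisson count given that rate, and compare the frequencies with the negative binomial PMF.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
r, p = 3.0, 0.4
size = 200_000

lam = rng.gamma(shape=r, scale=p / (1 - p), size=size)   # Poisson rate drawn from a gamma
k = rng.poisson(lam)                                     # Poisson count given that rate

ks = np.arange(0, 15)
empirical = np.array([(k == x).mean() for x in ks])
theoretical = stats.nbinom.pmf(ks, r, 1 - p)             # the article's NB(r, p)
print(np.round(empirical, 3))
print(np.round(theoretical, 3))
```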

Sum of geometric distributions
If Y_r is a random variable following the negative binomial distribution with parameters r and p, and support {0, 1, 2, ...}, then Y_r is a sum of r independent variables following the geometric distribution (on {0, 1, 2, 3, ...}) with parameter 1 − p. As a result of the central limit theorem, Y_r (properly scaled and shifted) is therefore approximately normal for sufficiently large r. Furthermore, if B_{s+r} is a random variable following the binomial distribution with parameters s + r and 1 − p, then

Pr(Y_r ≤ s) = Pr(B_{s+r} ≥ r).

In this sense, the negative binomial distribution is the "inverse" of the binomial distribution. The sum of independent negative-binomially distributed random variables r1 and r2 with the same value for parameter p is negative-binomially distributed with the same p but with "r-value" r1 + r2. The negative binomial distribution is infinitely divisible, i.e., if Y has a negative binomial distribution, then for any positive integer n, there exist independent identically distributed random variables Y1, ..., Yn whose sum has the same distribution that Y has.


Representation as compound Poisson distribution
The negative binomial distribution NB(r, p) can be represented as a compound Poisson distribution: Let {Y_n, n ≥ 0} denote a sequence of independent and identically distributed random variables, each one having the logarithmic distribution Log(p), with probability mass function

f(k; p) = −p^k / (k ln(1 − p)),  for k = 1, 2, 3, …

Let N be a random variable, independent of the sequence, and suppose that N has a Poisson distribution with parameter λ = −r ln(1 − p). Then the random sum

X = ∑_{n=1}^{N} Y_n

is NB(r, p)-distributed. To prove this, we calculate the probability generating function G_X of X, which is the composition of the probability generating functions G_N and G_{Y_1}. Using

G_N(z) = exp(λ(z − 1))   and   G_{Y_1}(z) = ln(1 − p z) / ln(1 − p),

we obtain

G_X(z) = G_N(G_{Y_1}(z)) = exp(λ [ln(1 − p z)/ln(1 − p) − 1]) = exp(−r [ln(1 − p z) − ln(1 − p)]) = ((1 − p) / (1 − p z))^r,

which is the probability generating function of the NB(r,p) distribution.
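The representation can likewise be checked by simulation; the sketch below assumes SciPy, whose scipy.stats.logser implements the logarithmic distribution, and uses the arbitrary values r = 2.5, p = 0.3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
r, p = 2.5, 0.3
lam = -r * np.log(1 - p)            # Poisson parameter from the text
size = 20_000

N = rng.poisson(lam, size=size)
X = np.array([stats.logser.rvs(p, size=m).sum() if m > 0 else 0 for m in N])

ks = np.arange(0, 10)
empirical = np.array([(X == x).mean() for x in ks])
theoretical = stats.nbinom.pmf(ks, r, 1 - p)
print(np.round(empirical, 3))
print(np.round(theoretical, 3))
```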

Properties
Cumulative distribution function
The cumulative distribution function can be expressed in terms of the regularized incomplete beta function:

F(k; r, p) = Pr(X ≤ k) = 1 − I_p(k + 1, r).

Sampling and point estimation of p
Suppose p is unknown and an experiment is conducted where it is decided ahead of time that sampling will continue until r successes are found. A sufficient statistic for the experiment is k, the number of failures. In estimating p, the minimum variance unbiased estimator is

p̂ = (r − 1) / (r + k − 1).

The maximum likelihood estimate of p is

p̃ = r / (r + k),

but this is a biased estimate. Its inverse, (r + k)/r, is an unbiased estimate of 1/p, however.[7]


Relation to the binomial theorem
Suppose Y is a random variable with a binomial distribution with parameters n and p. Assume p + q = 1, with p, q ≥ 0. Then the binomial theorem implies that

Using Newton's binomial theorem, this can equally be written as:

in which the upper bound of summation is infinite. In this case, the binomial coefficient

is defined when n is a real number, instead of just a positive integer. But in our case of the binomial distribution it is zero when k > n. We can then say, for example

Now suppose r > 0 and we use a negative exponent:

Then all of the terms are positive, and the term

is just the probability that the number of failures before the rth success is equal to k, provided r is an integer. (If r is a negative non-integer, so that the exponent is a positive non-integer, then some of the terms in the sum above are negative, so we do not have a probability distribution on the set of all nonnegative integers.) Now we also allow non-integer values of r. Then we have a proper negative binomial distribution, which is a generalization of the Pascal distribution, which coincides with the Pascal distribution when r happens to be a positive integer. Recall from above that the sum of independent negative-binomially distributed random variables r1 and r2 with the same value for parameter p is negative-binomially distributed with the same p but with "r-value" r1 + r2. This property persists when the definition is thus generalized, and affords a quick way to see that the negative binomial distribution is infinitely divisible.


Parameter estimation
Maximum likelihood estimation
The likelihood function for N iid observations (k1, ..., kN) is

L(r, p) = ∏_{i=1}^{N} f(k_i; r, p),

from which we calculate the log-likelihood function

ℓ(r, p) = ∑_{i=1}^{N} [ln Γ(k_i + r) − ln(k_i!) − ln Γ(r)] + (∑_{i=1}^{N} k_i) ln p + N r ln(1 − p).

To find the maximum we take the partial derivatives with respect to r and p and set them equal to zero:

∂ℓ/∂p = ∑_{i=1}^{N} k_i / p − N r / (1 − p) = 0   and
∂ℓ/∂r = ∑_{i=1}^{N} ψ(k_i + r) − N ψ(r) + N ln(1 − p) = 0,

where ψ(k) is the digamma function. Solving the first equation for p gives:

p = ∑ k_i / (N r + ∑ k_i).

Substituting this in the second equation gives:

∑_{i=1}^{N} ψ(k_i + r) − N ψ(r) + N ln[ r / (r + ∑ k_i / N) ] = 0.

This equation cannot be solved in closed form. If a numerical solution is desired, an iterative technique such as Newton's method can be used.
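A sketch of such a numerical fit in Python (assuming SciPy; the gamma-function form of the PMF is used so that r may be non-integer, a general-purpose optimizer stands in for Newton's method, and the data are simulated purely for illustration):

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln
from scipy.optimize import minimize

def nb_nll(params, ks):
    """Negative log-likelihood of NB(r, p) in the article's parameterization."""
    r, p = params
    ks = np.asarray(ks)
    logpmf = (gammaln(ks + r) - gammaln(r) - gammaln(ks + 1)
              + r * np.log(1 - p) + ks * np.log(p))
    return -logpmf.sum()

data = stats.nbinom.rvs(4, 1 - 0.6, size=2000, random_state=2)   # simulated: r = 4, p = 0.6

res = minimize(nb_nll, x0=[1.0, 0.5], args=(data,), method="L-BFGS-B",
               bounds=[(1e-6, None), (1e-6, 1 - 1e-6)])
print("r_hat, p_hat =", res.x.round(3))
```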

Examples
Selling candy
Pat is required to sell candy bars to raise money for the 6th grade field trip. There are thirty houses in the neighborhood, and Pat is not supposed to return home until five candy bars have been sold. So the child goes door to door, selling candy bars. At each house, there is a 0.4 probability of selling one candy bar and a 0.6 probability of selling nothing. What's the probability of selling the last candy bar at the nth house? Recall that the NegBin(r, p) distribution describes the probability of k failures and r successes in k + r Bernoulli(p) trials with success on the last trial. Selling five candy bars means getting five successes. The number of trials (i.e. houses) this takes is therefore k + 5 = n. The random variable we are interested in is the number of houses, so we substitute k = n − 5 into a NegBin(5, 0.4) mass function and obtain the following mass function of the distribution of houses (for n ≥ 5):

f(n) = C(n − 1, 4) (2/5)^5 (3/5)^{n−5}.

What's the probability that Pat finishes on the tenth house?

f(10) = C(9, 4) (2/5)^5 (3/5)^5 ≈ 0.1003.


What's the probability that Pat finishes on or before reaching the eighth house? To finish on or before the eighth house, Pat must finish at the fifth, sixth, seventh, or eighth house. Sum those probabilities:

f(5) + f(6) + f(7) + f(8) = 0.01024 + 0.03072 + 0.055296 + 0.0774144 ≈ 0.1737.

What's the probability that Pat exhausts all 30 houses in the neighborhood? This can be expressed as the probability that Pat does not finish on the fifth through the thirtieth house:

1 − ∑_{n=5}^{30} f(n).
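The three questions can be answered mechanically from the house mass function above; a short sketch using only the Python standard library:

```python
from math import comb

def f(n):
    """Probability that the fifth sale happens exactly at house n (n >= 5)."""
    return comb(n - 1, 4) * 0.4 ** 5 * 0.6 ** (n - 5)

print(f(10))                                # finishes exactly at the tenth house
print(sum(f(n) for n in range(5, 9)))       # finishes on or before the eighth house
print(1 - sum(f(n) for n in range(5, 31)))  # still unfinished after all thirty houses
```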

Polygyny in African societies
Data on polygyny among a wide range of traditional African societies suggest that the distribution of wives follows a range of binomial profiles. The majority of these are negative binomial, indicating the degree of competition for wives. However some tend towards a Poisson distribution and even bey