binary variables (1) coin flipping: heads=1, tails=0 bernoulli distribution
TRANSCRIPT
![Page 1: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/1.jpg)
![Page 2: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/2.jpg)
Binary Variables (1)
Coin flipping: heads=1, tails=0
Bernoulli Distribution
![Page 3: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/3.jpg)
Binary Variables (2)
N coin flips:
Binomial Distribution
![Page 4: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/4.jpg)
Binomial Distribution
![Page 5: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/5.jpg)
The Multinomial DistributionMultinomial distribution is a generalization of the binominal distribution. Differentfrom the binominal distribution, where the RV assumes two outcomes, the RV formulti-nominal distribution can assume k (k>2) possible outcomes.
Let N be the total number of independent trials, mi, i=1,2, ..k, be the number of timesoutcome i appears. Then, performing N independent trials, the probability that outcome 1 appears m1, outcome 2, appears m2, …,outcome k appears mk times is
![Page 6: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/6.jpg)
The Gaussian Distribution
![Page 7: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/7.jpg)
Moments of the Multivariate Gaussian (1)
thanks to anti-symmetry of z
![Page 8: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/8.jpg)
Moments of the Multivariate Gaussian (2)
![Page 9: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/9.jpg)
Central Limit Theorem
The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows.Example: N uniform [0,1] random variables.
![Page 10: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/10.jpg)
Beta DistributionBeta is a continuous distribution defined on the interval of 0 and 1, i.e.,
parameterized by two positive parameters a and b.
where T(*) is gamma function. beta is conjugate to the binomial and Bernoulli distributions
![Page 11: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/11.jpg)
Beta Distribution
![Page 12: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/12.jpg)
The Dirichlet Distribution
Conjugate prior for the multinomial distribution.
The Dirichlet distribution is a continuous multivariate probability distributions parametrized by a vector of positive reals a. It is the multivariate generalization of the beta distribution.
![Page 13: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/13.jpg)
Mixtures of Gaussians (1)
Old Faithful data set
Single Gaussian Mixture of two Gaussians
![Page 14: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/14.jpg)
Mixtures of Gaussians (2)
Combine simple models into a complex model:
Component
Mixing coefficientK=3
![Page 15: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/15.jpg)
Mixtures of Gaussians (3)
![Page 16: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/16.jpg)
The Exponential Family (1)
where ´ is the natural parameter and
so g(´) can be interpreted as a normalization coefficient.
![Page 17: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/17.jpg)
The Exponential Family (2.1)
The Bernoulli Distribution
Comparing with the general form we see that
and so
Logistic sigmoid
![Page 18: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/18.jpg)
The Exponential Family (2.2)
The Bernoulli distribution can hence be written as
where
![Page 19: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/19.jpg)
The Exponential Family (3.1)
The Multinomial Distribution
where, , and
NOTE: The ´ k parameters are not independent since the corresponding ¹k must satisfy
![Page 20: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/20.jpg)
The Exponential Family (3.2)
Let . This leads to
and
Here the ´ k parameters are independent. Note that
and
Softmax
![Page 21: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/21.jpg)
The Exponential Family (3.3)
The Multinomial distribution can then be written as
where
![Page 22: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/22.jpg)
The Exponential Family (4)
The Gaussian Distribution
where
![Page 23: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/23.jpg)
Conjugate priors
For any member of the exponential family, there exists a prior
Combining with the likelihood function, we get posterior
The likelihood and the prior are conjugate if the prior and posterior have the same distribution.
![Page 24: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/24.jpg)
Conjugate priors (cont’d)
• Beta prior is conjugate to the binomial and Bernoulli distributions
• Dirichlet prior is conjugate to the multinomial distribution.
• Gaussian prior is conjugate to the Gaussian distribution
![Page 25: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/25.jpg)
Noninformative Priors (1)
With little or no information available a-priori, we might choose uniform prior.• ¸ discrete, K-nomial :• ¸2[a,b] real and bounded: • ¸ real and unbounded: improper!
![Page 26: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/26.jpg)
Nonparametric Methods (1)
Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model.
Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.
![Page 27: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/27.jpg)
Nonparametric Methods (2)
Histogram methods partition the data space into distinct bins with widths ¢i and count the number of observations, ni, in each bin.
• Often, the same width is used for all bins, ¢i = ¢.
• ¢ acts as a smoothing parameter.
•In a D-dimensional space, using M bins in each dimen-sion will require MD bins!
![Page 28: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/28.jpg)
Nonparametric Methods (3)
Let (x1, x2, …, xn) be an iid sample drawn from some distribution with an unknown density p(x) (Parzen window)
It follows that
Kernel Density Estimation: is a non-parametric way of estimating the probability density function of a random variable
k() is the kernel function and h is bandwith, serving as a smoothing parameter. The onlyparameter is h.
2/1||1
,0)( h
xx
elsen
n
h
xxk
N
n
n
h
xxk
Nhxp
1
)(1
)(
![Page 29: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/29.jpg)
Nonparametric Methods (4)
To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian
Any kernel such that
will work.h acts as a smoother.
![Page 30: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/30.jpg)
Nonparametric Methods (5)
Nonparametric models (not histograms) requires storing and computing with the entire data set.
Parametric models, once fitted, are much more efficient in terms of storage and computation.
![Page 31: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/31.jpg)
K-Nearest-Neighbours for Classification
K = 1K = 3
The k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on closest training examples in the feature space.
![Page 32: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/32.jpg)
K-Nearest-Neighbours for Classification
The best choice of k depends upon the data; larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by cross-validation.
![Page 33: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/33.jpg)
K-Nearest-Neighbours for Classification (3)
• K acts as a smother• For , the error rate of the 1-nearest-neighbour classifier is never more
than twice the optimal error (obtained from the true conditional class distributions).
![Page 34: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/34.jpg)
Parametric Estimation
Basic building blocks:Need to determine given
Maximum Likelihood (ML)
Maximum Posterior Probability (MAP))|,...,,(maxarg* 21
Nxxxp
),...,,|(maxarg* 21 Nxxxp
![Page 35: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/35.jpg)
ML Parameter Estimation
)|()|,...,,()( 21 n
nN xpxxxpL
)|()( n
nxpLogL
Since samples x1, x2, ..,xn are IID, we have
The log likelihood can be obtained as
q can be obtained by taking the derivative of the Log likelihood with respect to q and setting it to zero
0)|()(
n
nxpLogL
![Page 36: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/36.jpg)
Parameter Estimation (1)
ML for BernoulliGiven:
![Page 37: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/37.jpg)
Maximum Likelihood for the Gaussian
Given i.i.d. data , the log likeli-hood function is given by
Sufficient statistics
![Page 38: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/38.jpg)
Maximum Likelihood for the Gaussian
Set the derivative of the log likelihood function to zero,
and solve to obtain
Similarly
![Page 39: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/39.jpg)
MAP Parameter Estimation
)|()(
)|,...,,()(
),...,,|(
21
21
nn
N
N
xpP
xxxpP
xxxp
Since samples x1, x2, ..,xn are IID, we have
)|(log)(
),...,,|( 21
n
n
N
xpLogPLog
xxxpLog
Taking the log yields posterior
q can be solved by maximizing the log posterior. P(q) is typically chosen to be the conjugate of the likelihood.
![Page 40: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/40.jpg)
Bayesian Inference for the Gaussian (1)
Assume ¾2 is known. Given i.i.d. data , the likelihood function for¹ is given by
This has a Gaussian shape as a function of ¹ (but it is not a distribution over ¹).
![Page 41: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/41.jpg)
Bayesian Inference for the Gaussian (2)
Combined with a Gaussian prior over ¹,
this gives the posterior
Completing the square over ¹, we see that
![Page 42: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/42.jpg)
Bayesian Inference for the Gaussian (3)
… where
Note:
![Page 43: Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution](https://reader037.vdocuments.mx/reader037/viewer/2022102808/56649ddb5503460f94ad232c/html5/thumbnails/43.jpg)
Bayesian Inference for the Gaussian (4)
Example: for N = 0, 1, 2 and 10.