
Data, Uncertainty and Inference

Michael P. McLaughlin


Data, Uncertainty and Inference
Copyright © 2019 by Michael P. McLaughlin. All rights reserved.

Any brand names and product names included in this work are trademarks, registered trademarks or tradenames of their respective holders.

Content on referenced websites is not guaranteed to be accurate or relevant for any purpose. Likewise, the persistence of referenced URLs is not guaranteed.

This document was produced using LaTeX and TeXShop. Original figures were created using Mathematica (with SetAxes) except where noted.


For deeds do die, however nobly done,
And thoughts of men do as themselves decay,
But wise words taught in numbers for to run,
Recorded by the Muses, live for ay.

E. Spenser, 1591

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science.

Lord Kelvin, 1891


Contents

Preface

1 The Uraniborg Legacy
  1.1 Data vs. Information
  1.2 Uncertainty
  1.3 Models
  1.4 Inference in Action

I Data

2 Information is Real!

3 Discrete Data
  3.1 Ordinal Data vs. Nominal Data
  3.2 Count Data
  3.3 Summary Statistics
  3.4 Indicator Variables
  3.5 Missing, Censored and Truncated Data
  3.6 Errors

4 Continuous Data
  4.1 Univariate Data
  4.2 Multivariate Data
  4.3 Pseudo-continuous Data

II Uncertainty

5 Errors and Ignorance
  5.1 Errors
  5.2 Ignorance
  5.3 Accuracy vs. Precision
  5.4 Significant Figures


  5.5 Summary

6 Probability
  6.1 Odds
  6.2 Defining Probability
  6.3 Describing Probability
  6.4 Some Practical Theory
  6.5 Basic Continuous Distributions
  6.6 Basic Discrete Distributions
  6.7 Mixture Distributions
  6.8 Truncated and Censored Distributions
  6.9 An Experiment in Probability

7 Rules of Probability
  7.1 Pop Quiz
  7.2 Symbology
  7.3 Numerical Examples
  7.4 The Two Rules
  7.5 Extensions
  7.6 Pop Quiz: Solution

III Inference

8 Data Analysis
  8.1 Reducing Uncertainty
  8.2 Desiderata of Valid Inference
  8.3 Summary

9 Making an Inference
  9.1 Bayes' Rule
  9.2 Traditional Rules
  9.3 Bayes' Rule: Numerical Example
  9.4 Complete Bayesian Models
  9.5 Computing the Posterior

10 Computational Details I
  10.1 Data and Prior Information
  10.2 The Unknowns (I)
  10.3 Markov-chain Monte Carlo (MCMC)
  10.4 The Unknowns (II)
  10.5 Model Comparison


11 Computational Details II
  11.1 Bad Model
  11.2 Bad Traversal
  11.3 Alternate MCMC Algorithms
  11.4 MCMC Software

12 Modeling
  12.1 Melting Point of n-Octadecane
  12.2 Outliers
  12.3 Something Incredible
  12.4 Pumps
  12.5 Draws in Chess
  12.6 Model Comparison
  12.7 PGA Tour Championship, 2013
  12.8 A Triple Point
  12.9 Height vs. Weight Correlation
  12.10 Models and Information

13 Goodness-of-fit
  13.1 Discrepancy Measure
  13.2 Posterior-predictive Checking
  13.3 Checking the Chess-draws Model
  13.4 Continuous Mixture
  13.5 Summary

14 Hierarchical Models
  14.1 Triple-point Temperature Redux
  14.2 Daytime
  14.3 Hale-Bopp CN
  14.4 Carbon-14 Dating
  14.5 Atmospheric CO2
  14.6 Horned Lizards
  14.7 Hockey Goals
  14.8 About Predictions
  14.9 Pumps (Hierarchical)
  14.10 SAT and Horsepower
  14.11 Predicting a Draw
  14.12 Logit: Another Perspective
  14.13 Modeling Several Discrete Choices
  14.14 Summary

15 Discrete Mixtures
  15.1 Heterogeneous Mixtures


  15.2 Homogeneous Mixtures
  15.3 Going Further

IV Case Study

16 When Data Don't Exist
  16.1 The Problem and the Data
  16.2 Bayesian Thoughts

V Epilogue

VI Appendices

A Gibbs Sampling Code

B Other Model Syntax
  B.1 BUGS (WINBUGS, OpenBUGS)
  B.2 JAGS
  B.3 MCSim
  B.4 Stan


List of Figures

1.1 Mural Quadrant of Tycho Brahe (Tichonicus, 1598)
1.2 Tycho's Star Catalog
1.3 Horned Lizard (Phrynosoma mcallii)
1.4 Atmospheric Carbon Dioxide Concentration (1 – 2016 CE)
1.5 Atmospheric Carbon Dioxide Concentration + Model
1.6 Start of Exponential Growth in CO2 Concentration
1.7 Draws in Chess (M1)
1.8 Draws in Chess (M2)
1.9 Excess of QP Draws in Chess

2.1 Parameter A Uncertainty vs. Information

3.1 Russia's Conquest of Napoleon
3.2 Dice Totals: 10 Tosses
3.3 More Dice Totals

4.1 Current Stellar Longitudes for Aries
4.2 Boxplot for Longitudes
4.3 Historical Boxplots for Aries Longitudes
4.4 Correlation (or not)
4.5 Magnitudes of Relative Brightness
4.6 Magnitudes of Relative Brightness (log plot)
4.7 Standard 3-D Plot
4.8 A Contour Plot
4.9 A Color-coded Density Plot

5.1 Observed vs. Standard Ozone Concentrations
5.2 Residuals from Ozone Predictions
5.3 "Probability Plot" of Ozone Residuals
5.4 Accuracy vs. Precision

6.1 1,000 Min of Five Uniform(0, 1)
6.2 10^7 Min of Five Uniform(0, 1)
6.3 Uniform PDF
6.4 Jeffreys Distribution: Linear Space vs. Log Space


6.5 Gamma PDF
6.6 Exponential PDF
6.7 Normal PDF
6.8 HalfNormal PDF
6.9 LogNormal PDF
6.10 Laplace PDF
6.11 Triangular PDF
6.12 Beta PDF
6.13 Weibull PDF
6.14 BivariateNormal PDF
6.15 DiscreteUniform PDF
6.16 Binomial PDF
6.17 Geometric PDF
6.18 Poisson PDF
6.19 StudentsT PDF
6.20 BetaBinomial PDF
6.21 Right-censored Exponential PDF
6.22 A Quarter Disk
6.23 Estimating π via a Monte Carlo Procedure

7.1 Basic Rules of Probability
7.2 Venn Diagram for P(A+B)

10.1 Contour Plot for logPosterior
10.2 Marginal for Parameter p
10.3 Marginal for Parameter λ
10.4 NRMC Convergence

11.1 Good Trace
11.2 Incomplete Burn-in
11.3 Bad Trace
11.4 Marginal for µ via Gibbs Sampling

12.1 Marginal for True MP of n-Octadecane
12.2 Marginal for True MP of n-Octadecane (including "outlier")
12.3 Marginal for True Methane MP
12.4 Marginals for Stroke Probabilities
12.5 Related Triple-point Temperatures
12.6 A Truncated Prior
12.7 trueTp: Posterior Marginal Histogram
12.8 Marginal for sigma
12.9 Height versus Weight for 247 Adult Men
12.10 BivariateNormal Contour Plot


13.1 Draws in Chess: Data
13.2 Draws in Chess: Predictions vs. Data
13.3 Calculator Usage: Data
13.4 Marginals for p and n
13.5 Online-calculator Usage: Model vs. Data

14.1 Triple-point Temperatures (old and new)
14.2 DAG: New Tp Model
14.3 Literature vs. Student Sigmas
14.4 Linear Trend Goodness-of-fit (mean estimates)
14.5 Daytime Data
14.6 Daytime (MAP Result)
14.7 Daytime (marginal for sigma)
14.8 Hale-Bopp CN Release vs. Distance
14.9 Hale-Bopp CN Fit (MAP and Mean)
14.10 Carbon-14 Dating Fit
14.11 Horned Lizards (marginal for N)
14.12 Horned Lizards (marginal for p)
14.13 NHL Goals from Poisson Regression
14.14 DAG: Free-throw Model
14.15 Beta Hyperprior (marginals for mu and sz)
14.16 Beta PDF ({mu, sz} → {a, b})
14.17 Gamma Hyperprior (marginals for a and b)
14.18 SAT: ANOVA From Spreadsheet
14.19 SAT: Fraction Variance-between
14.20 SAT: Variance Comparison
14.21 HP: Fraction Variance-between
14.22 Probability of Draw Given Bin[j]
14.23 Draw Test: Cumulative Errors
14.24 Logistic(µi, 1) with Draw Threshold, t
14.25 Success of Model m123 With Training Set
14.26 Logistic(µi, 1) with Two Outcome Thresholds, t[]

15.1 Fish Caught: Raw Data
15.2 Marginals for ZIP Model
15.3 Salaries: Raw Data
15.4 Marginals for mean[] and sigma[] from Model 1
15.5 Marginals for mean[] and sigma[] from Model 2
15.6 Salaries: Raw Data vs. Model 2 (mean estimates)
15.7 Marginals for mean[] and sigma[] from Model 3 after Relabeling
15.8 Iris Data vs. First Two Principal Components
15.9 Gamma(3, 0.5) Distribution
15.10 Iris Solution: Mean Contours


15.11 Iris Marginals (color-coded)

16.1 Aircraft Positions at a Runway Threshold
16.2 Raw Runway Data as NL1
16.3 Within-bin Distribution
16.4 Generic Bin Probability
16.5 Model for Runway-threshold Positions
16.6 Unquantized NL1 Marginals
16.7 NL1 Model With and Without Quantization


List of Tables

1.1 Horned Lizard Recapture Data
1.2 Horned Lizards (solution)
1.3 Atmospheric Carbon Dioxide (partial solution)
1.4 Official Chess Results (2014, Expert and above)
1.5 Prob(Draw) in Chess (solution)
1.6 Prob(QPDraw) − Prob(KPDraw) (solution)

2.1 Estimated Values for Parameter A

3.1 More Dice Totals
3.2 Some Summary Statistics for 1000 Dice Throws

4.1 Aries Data Over the Ages
4.2 Summary Statistics for Current Aries Longitudes

5.1 Aries: Observations and Errors
5.2 Ozone Data

6.1 Cell Probabilities
6.2 Estimating π Using a Monte Carlo Loop

7.1 Possible Chip Sequences

8.1 Desiderata for Valid Inference

9.1 Prior Information for Test

10.1 A Synthetic Dataset: k Successes in N Trials
10.2 Synthetic Dataset (solution)

11.1 Normal Body Temperatures (−98 F)
11.2 Normal Temperature (solution)

12.1 C18H38 MP Data
12.2 Partial MCMC Trace
12.3 Melting Point (solution)
12.4 Melting Point (solution #2)


12.5 Methane MP Solution From One "Impossible" Point
12.6 Pumps Data
12.7 Pumps (solution)
12.8 Draws in Chess: log(marginal likelihoods)
12.9 Golf Strokes Over Four Rounds
12.10 Strokes vs. Par (solution)
12.11 Pars vs. Others
12.12 Triple-point Temperature (solution)
12.13 Height vs. Weight (solution)

13.1 Online-calculator Usage (solution)

14.1 Triple-point Temperature (new solution)
14.2 Length of Daytime (Boston, MA, USA)
14.3 Daytime (solution)
14.4 Hale-Bopp CN Data
14.5 Hale-Bopp CN (solution)
14.6 Carbon-14 Test Data
14.7 Carbon-14 Dating (solution)
14.8 Atmospheric CO2 (solution)
14.9 Horned Lizards: Grouped Data
14.10 Goals by the Winning Team (NHL, 2017)
14.11 NHL Goals Model m01 (solution)
14.12 NHL Goals: Model Comparison
14.13 Goals by the Winning Team (NHL, 2017)
14.14 Free Throw Success: Games 1–10
14.15 Free Throws (solution)
14.16 Free Throws: Prediction Comparison
14.17 Gamma Hyperparameters (solution)
14.18 Pumps (solution #2)
14.19 Pumps (model comparison)
14.20 Average HS SAT Scores by County
14.21 SAT ANOVA (solution)
14.22 Automobile Horsepower Summary
14.23 HP ANOVA (solution)
14.24 Chess Draw (m0123 solution)
14.25 Chess Draw: Model Comparison
14.26 Chess Draw: Test Results
14.27 Chess Draw (m123 solution)
14.28 Chess Draw (m0123 solution #2)
14.29 Chess Score for White: Model Comparison
14.30 Chess Outcome Tallies


15.1 Fish Catch (solution)
15.2 Salaries (model1): Trace Segment
15.3 Salaries (model2 solution)
15.4 Salaries (model3 solution)
15.5 Partial Iris Data
15.6 Fisher's Irises (solution)
15.7 Iris: Partial Class Assignments

16.1 Runway Data: Bin Frequencies
16.2 Runway NL1 Parameters


List of Models

12.1 Melting Point (real-valued scalar)
12.2 Melting Point (one "impossible" datapoint)
12.3 Pumps (Poisson data)
12.4 Draws in Chess (binomial)
12.5 Draws in Chess #2 (binomial)
12.6 Strokes vs. Par (multinomial)
12.7 Triple-point Temperature (truncated prior)
12.8 Height vs. Weight (bivariate normal)

13.1 Online-calculator Usage (continuous mixture)

14.1 Triple-point Temperature (linear trend)
14.2 Information in Data
14.3 Daytime (informative hyperpriors)
14.4 Hale-Bopp CN (weighted regression)
14.5 Carbon-14 Dating (errors in variables)
14.6 Atmospheric CO2 (piecewise regression)
14.7 Horned Lizards (M0)
14.8 NHL Goals Model m01 (Poisson regression)
14.9 Free-throw Model (making predictions)
14.10 Hierarchical Pumps Model (related Poisson counts)
14.11 SAT Model (ANOVA)
14.12 Predicting a Draw: m0123 (logit)
14.13 Predicting a Draw: m123 (logit2)
14.14 Outcome of a Chess Game (ordered logit)

15.1 Fish Catch (zero-inflated Poisson)
15.2 Salaries: Model 1 (naive mixture)
15.3 Iris Mixture (3 bivariate components)

16.1 Runway Model (latent prior)

B.1 A BUGS Model
B.2 A JAGS Model
B.3 An MCSim Model


B.4 A Stan Model


Preface

The purpose of this book is to answer two questions:

What does it mean to analyze data?
How should one go about it?

Data analysis is a universal activity. It happens every time we observe the world around us and interpret what our senses (or instruments) perceive. In a more formal context, it is considered a technical speciality almost always associated with the discipline of statistics. All of this is well known and well understood. Less well known is the fact that, since the 1990s, traditional statistics has become increasingly outdated to the point where, in many areas, it is seldom used any longer, having been supplanted by something much better. It is appropriate, therefore, to step back and reconsider the whole concept of data analysis, to return to first principles and see where we stand in this new millennium.

This book is not intended to be a textbook. For one thing, its content is very much introductory, not encyclopedic, and it does not contain the exercises for self-testing that textbooks require. For another, it is informal in style and omits most of the mathematics needed for any reasonably complete description. Its targeted audience comprises students and practitioners, in whatever discipline, who need to analyze data seriously and have a basic mathematical background, including fundamental ideas from calculus.

A great deal of the "explaining" in these chapters is done using examples, a lot of examples. Hence, there is the temptation to see the latter as a repertoire of recipes from which one should choose. That would be a mistake. The examples discussed are meant to serve rather as a buffet, sampling a small portion of the space of possibilities. It is expected that any real-world analysis would begin with a search of the literature to see whether the analysis had ever been done before, whether it succeeded and, if not, why not. At that point, the expertise of the analyst(s) will come into play, for better or worse.

The state of the art in data analysis is very computer-intensive and familiarity with simple computer pseudocode is part of the necessary expertise since no realistic problem can be done by hand. The examples herein will probably serve to supply most of what is needed for those encountering this sort of code for the first time.

Readers are strongly encouraged to do more than just read about these examples; they should try them out using whatever software they have available. Or write their own! One version of the requisite algorithms is described in sufficient detail for programmers to get up and running.


When all is said and done, the analysis of any dataset must combine domain expertise with analytical expertise. Until recently, the former could not be incorporated explicitly and any analysis was limited to only rudimentary and approximate formulations. Given the theoretical and software advances of recent years, those limitations have now been eliminated but the results take a lot of explaining.

This book is meant to provide an informal overview of current practice, one that should get those interested started on the right path.

To all of you: Happy reading!

MICHAEL P. MCLAUGHLIN

MCLEAN, VA, USA
MAY, 2019

MPMCL ‘AT’ CAUSASCIENTIA.ORG


Chapter 1

The Uraniborg Legacy

On a clear day you can see forever, at least according to the Broadway musical, but whenever I did the experiment I got very different results. My flat in Aburi, in the Akuapem Hills, had a window that faced south and, on a clear day, I could see about 46 miles. Much less than forever but more than one-ninth the distance to the Equator. Actually, there was nothing much to see, mostly the Atlantic Ocean and, beyond the horizon, there was nothing at all until you reached Antarctica. Looking around was not terribly impressive, except on two days a year. Two amazing days!

When you live between the Equator and the Tropic of Cancer, there are two days a year when the Sun is directly overhead at noon and all shadows disappear. A bright, sunny day on which nothing, not even a telephone pole, casts a shadow is really something to see and I have often wondered why, historically, such days did not get the same recognition accorded the solstices and equinoxes. Granted, the vernal equinox marked the beginning of the year but it was not at all obvious when it occurred, especially if it occurred at night. Missing shadows are very obvious and easy to observe. I suspect that part of the reason for the lack of notoriety is that, in ancient times, most advanced civilizations lay north of the Tropic of Cancer so few people with the leisure to ponder matters calendrical would have had the opportunity to witness this phenomenon.

Thousands of years ago, there were no calendars worth talking about. Still, such knowledge was vital to farmers and others who had a serious need to know. They had to satisfy this need by looking at the sky to see when certain stars rose and set compared to the Sun. [46][12][58] From that, they could tell when to plant crops, when tides would be best for fishing, and so on. Even in our current era, it took more than a thousand years to come up with the calendar we use today.

Thinking back on those times and trying to imagine what it was like is not easy for us. We live in a very different world, one that is immersed in sophisticated technology. It is almost impossible even to begin to list all of the things that we take for granted that would have been unimaginable to people even a few centuries ago. This technology and the engineering expertise that brought it about are the result of progress in science, the continuing quest to understand the laws of Nature.


We do not know who it was who first wondered about stars rising and setting or the movement of the tides. The origins of ancient beliefs about the world are largely forgotten but there is no doubt whatever when and where our current understanding of Nature first began in a truly modern fashion. It began in the last quarter of the sixteenth century on the island of Hven. There, a Danish nobleman named Tycho Brahe built a castle, Uraniborg, and turned it into a world-class astronomical observatory.

Tycho Brahe was a man obsessed with the idea that, to discover the truth about Nature, all observations had to be as accurate as humanly possible. To this end, he fabricated instruments of unparalleled size and quality such as his great mural quadrant (Fig. 1.1).

He took equal care when using these instruments, with many elaborate corrections for experimental error. But the effort was amply rewarded. When Brahe died in 1601, his notebooks contained results to a level of perfection that far surpassed anything ever seen before. None of his famous predecessors (Hipparchus, Ptolemy, Copernicus) even came close. Following classic Greek tradition, they all started with ideas; Brahe started with observations.

His star catalog (see Fig. 1.2) is one very good example. The positions of the stars shown in this table provided a background sufficiently precise that the motions of the planets could be plotted against it with unprecedented accuracy (cf. Table 5.1). They were so precise that, a few years later, Johannes Kepler could use them to prove that the planets moved in ellipses around the Sun not circles about the Earth as nearly everyone had always believed. Kepler's three laws, along with the discoveries of Galileo, were the basis of the theory of gravitation developed by Isaac Newton and Newton's work laid the analytical foundation for all of physics and, eventually, for our technological civilization. And it all started with good observations.

1.1 Data vs. Information

Understanding always begins with observations (data). This is the initial step in traditional descriptions of the scientific method but it is also true in general. Less well appreciated is the distinction between data and information.

In colloquial English, data and information are often taken to be synonymous but they are not the same at all. Data refer literally to what we, or our instruments, observe and record. Information, on the other hand, can never be observed, merely inferred. Given information, we can then use that information to extend our knowledge of the world, draw conclusions, make decisions, etc. We analyze data in order to extract information because it is information, not data, that we ultimately desire. It gives us something to think about.

This does not mean that information is an abstract, mental construct. Not at all. It is a genuine physical quantity, as physical as mass or temperature. [37][8] One proof lies in the fact that the transmission of information from one place to another is subject to the constraints of Special Relativity, a set of physical laws that describe physical things. Those laws apply to information, stating (correctly) that the latter can be transmitted through space no faster than the speed of light.


Figure 1.1: Mural Quadrant of Tycho Brahe (Tichonicus, 1598) [67]


Information is a real quantity, usually measured in bits, that is contained in data much as metal is contained in ore and separating it out can be complicated, requiring a process of inductive reasoning, that is, an inference.

The metal-in-ore analogy is a good one. For example, aluminum ore, bauxite, does not contain pure aluminum, Al, but rather aluminum oxide, Al2O3, and extracting the former from its oxide involves a chemical reaction and a whole lot of electricity. This is difficult, so much so that, in the nineteenth century, aluminum was as valuable as silver which is one reason why, when the Washington Monument was built, it was crowned with aluminum.

A similar difficulty applies to information contained in data. The information is not present in a "pure" form; it is there only implicitly and finding it takes some effort. How much effort depends upon how deeply "buried" the information is. Sometimes the desired inference is straightforward but, often, it can be very challenging. In fact, it is possible that the data do not contain enough information to draw useful conclusions, appearances notwithstanding.

1.2 Uncertainty

One huge problem with extracting information from data is that everything we observe and all that we know is uncertain to some extent. Our senses, instruments and analytical procedures are far from perfect. Our ability to reason, inductively or deductively, is also imperfect. These uncertainties contaminate everything associated with data analysis and they have to be taken into account. That means that they have to be quantified sufficiently so that their effects on any subsequent conclusions are readily apparent.

We never want to make mistakes but, if we do, we want to have a good idea of how big a mistake we might be making. Tycho Brahe was proud of the fact that his stellar positions were accurate to one minute of arc, just one thirtieth the width of a full moon. How could he be so sure? He could not look in the back of the book for the correct answer. In real life, there is no "back of the book". Nevertheless, Tycho was right about being wrong. Even without a telescope,¹ his results were, overall, as accurate as he claimed. [66]

Describing uncertainty and quantifying it properly are major topics in data analysis. They will be discussed in depth.

1.3 Models

We cannot go from data to information in one jump. An inference is always mediated by a model. The model is the link connecting data to information and it cannot be avoided. It does four things that are absolutely essential. First, a model provides a place to put the data (typically numbers). To analyze the data, we need more than just a list of numbers; we need a mathematical structure showing that some numbers are related to, or more common than, others.

¹ It had not been invented yet.


Figure 1.2: Tycho Brahe's canonical catalog of a thousand fixed stars, as many as are readily visible, derived from the most accurate observations carried out over the preceding twenty years with the greatest care and subject to no errors of even one minute in celestial longitude or latitude, adapted for the end of year 1600 A.D. [10]


Second, the model encapsulates our domain expertise. It allows us to explain relationships implicit in our data. In other words, it describes our knowledge of the real world as it applies to the problem being addressed. Third, the model handles the uncertainties that are always present. For information to be useful, it must be accompanied by a valid estimate of its accuracy. Finally, the model structures the information available in a way that facilitates obtaining valid answers to our questions.

All of this presupposes analytical expertise. Models cannot create themselves. They have to be built and customized for the given task. As we shall see, this is easier said than done. Most of the time, it can be done in more than one way. The result is a set of alternate, perhaps competing, models raising a new issue, "Is one model better than another?" Again, we need a quantitative answer that everyone can understand.

We are not done yet. No analysis is complete until we prove that the model really does describe what is observed. Otherwise, it is not a model at all even if it is the best model we have. "Best" does not imply "good". The necessary proof requires a quantitative measure of goodness-of-fit with a value that is sufficient for the purpose intended.

1.4 Inference in Action

There is no better way to appreciate the ideas outlined above than to see them working successfully on real problems. The goal, in all cases, is to enhance our state of knowledge by analyzing data that we obtain through observation. The general approach is to begin with whatever we already know about a problem, often negligible, and incorporate any information that we can infer from some new data. If we succeed, our state of knowledge will increase. We will understand more about the world.

To illustrate this, we present three examples of inference in action using real data from past experiments. For now, this is just simple show-and-tell with details omitted. All of these examples, plus many more, will be elaborated in the chapters to follow.

1.4.1 How Many Lizards?

When studying or managing wild animal populations, a common task is to estimate how many animals there are in a given region. There is a very large literature on this subject that goes back many years. Methodologies have gotten more sophisticated over that period with simple point estimates evolving into complex algorithms. The usual experimental procedure is to capture and label some animals, release them back into the wild, then capture some again after a waiting period. [43] Usually, this loop is repeated as often as time and funding permit.

There are different variations of capture-recapture depending on factors such as whether the population is closed (constant) or open (emigration, immigration, birth and death, . . . ), whether the animals learn to avoid recapture, etc. Here we consider a small dataset from an experiment to estimate the number of horned lizards in a closed region. [56] These lizards are hard to count (or even find) because they are well camouflaged (see Fig. 1.3).



Figure 1.3: Horned Lizard (Phrynosoma mcallii) [67]

In this experiment, lizards were captured on 14 occasions. On each occasion, each lizard (re)captured was given a small, indelible mark. Eventually, n = 68 different lizards were found, 34 of them just once and an unlucky two six times. The final dataset is summarized in Table 1.1.

We want to answer the question, "How many horned lizards were there in this region?" Clearly, the answer is unlikely to be 68. Since they are hard to find, there were almost certainly some that avoided capture even after 14 attempts. How can you count animals that you never saw? Is that even possible?

If you insist on a perfect census then, no, it is probably not possible. However, the data in Table 1.1 contain information and that information should say something about the size, N, of this lizard population. The trick, of course, is to design a model for extracting the value of N from the data available. Any such model will be, to a large extent, problem-specific and require a fair amount of modeling (and lizard) expertise. We shall discuss modeling in subsequent chapters. You are on your own with regard to lizards!

Actually, there is a second unknown in this problem. Not only do we not know N, we do not know the probability, p, that a horned lizard will be caught at any sampling time.² If we did know p, then we could estimate N immediately since the data show that 68 lizards were caught.

² We shall discuss the meaning of "probability" more fully in Part II.


Table 1.1: Horned Lizard Recapture Data

Lizard  # Captures   Lizard  # Captures   Lizard  # Captures   Lizard  # Captures
   1        1           18       1           35       2           52       3
   2        1           19       1           36       4           53       6
   3        1           20       1           37       2           54       2
   4        1           21       1           38       2           55       2
   5        1           22       1           39       4           56       3
   6        1           23       1           40       3           57       3
   7        1           24       1           41       3           58       4
   8        1           25       1           42       2           59       3
   9        1           26       1           43       2           60       2
  10        1           27       1           44       5           61       3
  11        1           28       1           45       2           62       2
  12        1           29       1           46       2           63       4
  13        1           30       1           47       2           64       5
  14        1           31       1           48       6           65       2
  15        1           32       1           49       2           66       3
  16        1           33       1           50       2           67       3
  17        1           34       1           51       2           68       3

There is a relationship between p, n and N that we shall discuss later when we construct a model for this problem.
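To make that relationship concrete, here is a rough back-of-the-envelope version (an illustrative sketch only, not the model developed later in the book): if every lizard is caught independently with the same probability p on each of the 14 occasions, a given lizard is never seen with probability (1 − p)^14, so the expected number of distinct lizards observed is about N(1 − (1 − p)^14). Plugging in the estimate of p reported below in Table 1.2 recovers the book's answer for N. The variable names in the snippet are illustrative.

```python
# Back-of-the-envelope capture-recapture estimate (illustrative sketch only).
# Assumes every lizard is caught independently with the same probability p
# on each of the 14 sampling occasions.

n_seen = 68        # distinct lizards actually observed (Table 1.1)
occasions = 14     # number of capture occasions
p = 0.117          # capture probability per occasion (posterior mode, Table 1.2)

# Probability that a given lizard is missed on every one of the 14 occasions
p_never_caught = (1.0 - p) ** occasions

# E[n_seen] is roughly N * (1 - p_never_caught); invert for a rough point estimate of N
N_rough = n_seen / (1.0 - p_never_caught)

print(f"P(never caught) = {p_never_caught:.3f}")   # about 0.17
print(f"rough estimate of N = {N_rough:.1f}")      # about 82, matching Table 1.2
```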

So, after all of that, what can we say about the lizard population size in this case? It turns out that we can say quite a lot even with no special ecology (or lizard) expertise. We can make estimates for N and p. Better yet, we can also determine a credible interval for each of these unknown quantities giving the probability that the estimate is too high or too low. How we can do this is the subject of this entire book so, in this chapter, we merely state the problem and give the answers. Those answers are presented (a bit rounded off) in Table 1.2.

Table 1.2: Horned Lizards (solution)

Unknown   Estimate (MaxProb)   Estimate (Mean)   95% CI Lower   95% CI Upper
N         82                   83.436            73             94
p         0.117                0.116             0.092          0.140

We find that there is a 95-percent probability (odds of 19:1 in favor) for the proposition that there were between 73 and 94 horned lizards in the experimental region with the most probable value being 82. If this is correct, then 14 lizards were never caught even after 14 tries. Since the credible interval for N is so wide, it is not likely (probability = 0.074) that N really was 82 but other values are even less likely. Still, we are claiming that we have a "seven-percent solution" for something that we never observed. Sherlock Holmes would have been proud!



Also, we find that the capture probability was (most likely) just 0.117 which means that, on any sampling occasion, any particular lizard would escape about eight out of nine times, not surprising given that they are very hard to find. This probability value is also uncertain, however.
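The probability statements above are straightforward to verify; a tiny sketch of the arithmetic (nothing here is model-specific):

```python
# Converting the probability statements above into odds and rough frequencies.
ci_prob = 0.95
print(round(ci_prob / (1 - ci_prob)))   # 19 -> "odds of 19:1 in favor"

p_capture = 0.117
print(round(1 - p_capture, 3))          # 0.883, i.e. escapes roughly 8 times in 9

print(f"{0.074:.0%}")                   # 7% -> the "seven-percent solution"
```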

There is a great deal more that we could say but it would require too many explanations at this point. Much will become apparent as we proceed.

1.4.2 Atmospheric CO2

It is well known today that, at some point in the eighteenth century, around the time that the industrial revolution began, the concentration of carbon dioxide (CO2) in the atmosphere began to increase rapidly. This is of concern because carbon dioxide is a greenhouse gas that enables the energy content of the atmosphere to increase. Since it is impossible for energy to do nothing, this excess concentration of CO2 leads unavoidably to significant climate change, most of which is undesirable from a human standpoint.

Scientific studies around the world have been tracking CO2 concentration for several years and the resulting data have been placed in the public domain. [64][65] By examining ice cores from several sites, the dataset has so far been extended back for 800,000 years. The data for the past two millennia are shown in Figure 1.4.

The lack of trend seen in the left half of this figure is maintained for all earlier epochs examined so it is fair to say that carbon dioxide was in a rough equilibrium for that entire period although the small fluctuations in the data are mostly real, not just measurement error.

Obviously, all of that changed in the eighteenth century.

Our task will be to take these 2,016 datapoints and determine what the concentration of CO2 (in parts per million by volume, ppmv) used to be (the baseline) and when it started accelerating.

Again, we want to be as quantitative as possible with respect to our estimates and their uncertainties.

We shall say nothing yet about our model except that it comes in two pieces: i) a flat line up until some unknown change time, tc, followed by ii) exponential growth. We are assuming that we have no more relevant information (expertise) so we pick the simplest mathematical description suggested by Fig. 1.4. Apart from the piecewise complication, this is what analysts would term a regression analysis.
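As a concrete sketch of that two-piece description, the snippet below encodes a flat baseline up to a change year tc followed by exponential growth, and shows that the unknowns can be recovered from noisy data of this general shape. It is illustrative only: the function, the synthetic data and the least-squares fit are assumptions made for the demonstration, not the Bayesian analysis the book performs on the real record.

```python
# Sketch of the two-piece mean curve: flat baseline until an unknown change
# year tc, exponential growth afterwards.  Synthetic data and parameter values
# are made up; this is not the book's model or dataset.
import numpy as np
from scipy.optimize import curve_fit

def co2_mean(year, baseline, tc, rate):
    """Flat at `baseline` until the change year tc, exponential growth afterwards."""
    year = np.asarray(year, dtype=float)
    growth = baseline * np.exp(rate * (year - tc))
    return np.where(year < tc, baseline, growth)

# Make a synthetic dataset that merely resembles Fig. 1.4 (assumed values).
rng = np.random.default_rng(1)
years = np.arange(1, 2017)
true_curve = co2_mean(years, baseline=279.0, tc=1734.0, rate=0.0013)
obs = true_curve + rng.normal(scale=2.0, size=years.size)

# Fit the piecewise model and recover the three unknowns.
popt, _ = curve_fit(co2_mean, years, obs, p0=[280.0, 1730.0, 0.001])
print("baseline ~ %.2f ppmv, tc ~ %.1f, rate ~ %.5f per year" % tuple(popt))
```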

The analysis output quantifies the two unknown parameters and their uncertainties (see Table 1.3).

Figure 1.5 shows the model (in blue) superimposed on the data. We shall see later that this simple model is a very good fit to the data, at least as far as trends are concerned. It does not try to "explain" the fluctuations in the baseline.

For a long time, the CO2 concentration was more or less flat at 279 ppmv.³ Then, around 1734, the concentration started to rise very rapidly. The industrial revolution began at roughly this time and that is most likely not a coincidence.

³ a measure which counts molecules, not their mass


Figure 1.4: Atmospheric Carbon Dioxide Concentration (1 – 2016 CE)

Table 1.3: Atmospheric Carbon Dioxide (partial solution)

Unknown    Estimate (MaxProb)   Estimate (Mean)   95% CI Lower   95% CI Upper
baseline   279.37               279.37            279.26         279.48
tc         1734.06              1734.17           1716.16        1751.06


Actually, the analysis results exhibit a fair amount of uncertainty in the year when the current exponential growth began. We can draw a graph showing the relative certainty for this unknown (see Figure 1.6).

For now, we defer making the ordinate of this graph quantitative but it does tell us which values of tc are likely and which are not. Clearly, values near 1734 are most likely. On the other hand, it is extremely unlikely that the change began before 1710 or after 1760. All of this can be made quantitative⁴ as we shall discuss in later chapters.

In contrast, our value for the concentration baseline is quite precise. One reason for this is that this estimate is, in effect, an average value while the start of exponential growth is a single point in time. We shall see in Section 5.3 that averages are always more precise than the numbers being averaged.

⁴ For instance, the odds that exponential growth began outside the range [1710, 1760] are 156:1 against.
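That remark about averages is easy to demonstrate numerically: for roughly independent errors, the spread of a mean of n values shrinks like 1/sqrt(n), which is why a baseline estimated from many datapoints can be so much tighter than any single measurement. A small simulation (all numbers here are made up for illustration):

```python
# Why an averaged quantity (such as the fitted baseline) is more precise than
# any single measurement: the spread of a mean of n values shrinks ~ 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
sigma, n_points, n_trials = 2.0, 1700, 2000   # illustrative values only

single_values = rng.normal(279.0, sigma, size=n_trials)
means_of_n = rng.normal(279.0, sigma, size=(n_trials, n_points)).mean(axis=1)

print(f"spread of single measurements: {single_values.std():.3f}")   # about 2.0
print(f"spread of means of n values:   {means_of_n.std():.4f}")      # about 0.05
print(f"theory, sigma/sqrt(n):         {sigma / np.sqrt(n_points):.4f}")
```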


Figure 1.5: Atmospheric Carbon Dioxide Concentration + Model

We conclude this example by noting that very specific information was extracted from fairly noisy data. This was achieved, as always, by utilizing a specific model. That the model is a good description of the trend in the data can be seen, literally, by comparing the blue curve in Figure 1.5 with the data. All data analysis proceeds in this fashion. Success will depend on the quality of the model.

1.4.3 A Game of ChessIn this analysis, we examine an old question related to the game of chess. Among strongplayers, there are a lot of draws. It is the general consensus that the probability of a draw,Prob(Draw), is greater with “Queen-pawn” (QP) openings [7] than with “King-pawn”(KP) openings.5 We shall assess the validity of that proposition.

Our data, which come from an online database [6], comprise all official games forthe year 2014 in which both players were rated Expert or above and in which there wereat least five moves (to discount forfeits, etc.). The data are listed, month by month, inTable 1.4. There are 12 datapoints each for the KP and QP games with a grand total of14,553 games. Each datapoint is a vector of two dependent observations, viz., {# drawngames, # games}.6

Since we are wondering whether the KP/QP distinction matters with respect to draws,

5 QP openings have ECO codes D and E and KP openings codes B and C.
6 dependent because # drawn games cannot exceed # games



Figure 1.6: Start of Exponential Growth in CO2 Concentration

we must consider two competing models. The first model, M1, will hypothesize that the distinction is meaningless while the second, M2, claims that it is valid. We shall not put any limits on "how valid" M2 must be (how much it matters), only that it is correct. Thus, in addition to estimating values for the unknowns in this problem, we also need to do a model comparison to decide which model is more credible. Here, "credible" means deserving of belief on its own merits (not because of some personal bias, etc.). As always, we want our answers to be quantitative and robust. Our final results for Prob(Draw), under the conditions described above, are shown in Table 1.5.

It is apparent that, with M2, there is almost no overlap in the credible intervals for KP and QP drawing probabilities which suggests that the hypothesized distinction is really valid. Here again, a picture is especially useful. Figures 1.7-1.8 show the relative certainty for various values of Prob(Draw) given our two models. It is interesting to compare these graphs with the numerical estimates in Table 1.5.

However, for complete understanding, pictures are not sufficient. Model comparison requires a quantitative answer. We shall describe two ways to obtain such a result: a simple, problem-specific approach and then the mathematically exact method. The latter is much more difficult to carry out but it provides a very robust, intuitive answer.


Table 1.4: Official Chess Results (2014, Expert and above)

Month     King-pawn (KP)               Queen-pawn (QP)
          Drawn   Decisive   Total     Drawn   Decisive   Total
   1        99      284       383        96      172       268
   2       195      340       535       197      275       472
   3       373      725      1098       314      545       859
   4       209      380       589       189      394       583
   5       136      156       292        97      121       218
   6       215      607       822       221      471       692
   7       171      422       593       130      247       377
   8       487     1183      1670       448      785      1233
   9       106      206       312       103      152       255
  10       327      688      1015       296      460       756
  11       147      232       379       146      178       324
  12       132      308       440       150      238       388

Table 1.5: Prob(Draw) in Chess (solution)

Model   Opening    Estimate               95% Credible Interval
                   MaxProb     Mean       Lower Limit    Upper Limit
M1      KP + QP    0.342       0.342      0.335          0.350
M2      KP         0.320       0.320      0.309          0.330
        QP         0.371       0.372      0.360          0.383



Figure 1.7: Draws in Chess (M1)


Figure 1.8: Draws in Chess (M2)


The simplest way to compare two models is to find some quantity that measures their validity directly and then look at the difference (or ratio) of the respective measures. In this example, we are interested in deltaP = Prob(QP Draw) − Prob(KP Draw). Table 1.6 and Figure 1.9 quantify the range of uncertainty for deltaP. Clearly, this range does not include zero with any reasonable probability7 so we can be very confident that QP draws are more likely than KP draws.

Table 1.6: Prob(QP Draw) − Prob(KP Draw) (solution)

Estimate               95% Credible Interval
MaxProb     Mean       Lower Limit    Upper Limit
0.052       0.052      0.037          0.068
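To illustrate this simple, problem-specific approach numerically, the sketch below (Python with SciPy) pools the monthly counts of Table 1.4 and computes the observed draw fractions together with a 95% interval from a flat-prior Beta distribution. This is only a rough stand-in for the analysis behind Tables 1.5 and 1.6, under the stated assumption of a Beta(1, 1) prior.

    from scipy.stats import beta

    # Column sums of Table 1.4
    kp_draws, kp_games = 2597, 8128      # King-pawn openings
    qp_draws, qp_games = 2387, 6425      # Queen-pawn openings

    def prob_draw(draws, games):
        """Draw fraction and a 95% interval from a Beta(draws+1, games-draws+1) posterior."""
        post = beta(draws + 1, games - draws + 1)
        return draws / games, post.ppf(0.025), post.ppf(0.975)

    p_kp = prob_draw(kp_draws, kp_games)     # roughly (0.320, 0.310, 0.330)
    p_qp = prob_draw(qp_draws, qp_games)     # roughly (0.371, 0.360, 0.383)
    print("deltaP =", p_qp[0] - p_kp[0])     # roughly 0.052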


Figure 1.9: Excess of QP Draws in Chess

These results for deltaP are an indication of the desired answer but, ideally, what we would really prefer is a number giving the probability that M2 is a better (more credible) model than M1 regardless of how many parameters there are in each model and what the parameter values might be. This sounds impossible and indeed, until the end of the last century, it was just wishful thinking. However, that is no longer the case. Thanks to

7 Prob(deltaP < 0.02) ≈ 3.9 × 10^−5


recent advances in data analysis, it is now possible to compute a value for this probability directly (although it is a lot of work!) Consequently, we can conclude, with mathematical robustness (no additional assumptions or approximations), that

Prob(M2 better than M1) = 0.99999998   (1.1)

In other words, the proposition that there is a greater chance of a draw with a QP opening than with a KP opening is a virtual certainty (odds of nearly 50 million to 1 in favor). How one performs such a computation will be discussed in Chapter 10.8

8 We have not shown that M2 is a good fit to the data but there are excellent reasons why it should be and, as we shall see, it is.


Part I

Data


GOING to high school for the first time was a big deal for us lowly eighth-graders. We looked forward to lots of new experiences and new things to learn. The former included getting used to taking seven courses at the same time, two of which were English and Latin. Freshman Latin was devoted to grammar and began, not surprisingly, with the first declension (nouns) and the first conjugation (verbs).

One of the first verbs we learned was do, dare which means to give as evident in several English cognates such as donate and donor. Its past participle is datum (plural data) meaning given, both of which carry over into English unchanged except for pronunciation. Hence, a datum or a set of data specify something that is given, that is, something supplied to an analyst and not under the analyst's control. As we shall see later, this turns out to have technical (mathematical) ramifications that are not always appreciated. Another common misunderstanding, noted earlier in Section 1.1, is that data are the same as information. We have already emphasized that this is not true, rather that data contain information. Unfortunately, data also contain error and that is a problem.

This Part of the book discusses data per se. Since we wish to be quantitative as much as possible, our focus will be on numerical data. We shall consider the different forms that data can take (discrete, continuous or something in between). It will be important to see, by examples, where one encounters various kinds of data and to know how such data may be described and summarized.

We tend to think of data as observations of individual (scalar) values. However, data may also consist of vectors or even matrices. Observed vectors and/or matrices may, in addition, represent some kind of relationship, most often between one quantity and another. Information then would include more than just the numbers themselves; it would describe any relationship(s) as well. In this book, we shall have the opportunity to consider many kinds of data and different kinds of information.

Most important, whatever kind of data we consider, we must always keep in mind that we are dealing with something observed, something real in the real world that contains (possibly hidden) information.


Chapter 2

Information is Real!

DATA contain information. Whenever we analyze any dataset, our goal is to extract the information contained in the data. It was noted in Chapter 1 that information is something physically real but this is not generally understood so it is worth a brief digression to test this idea by performing an experiment.

In this experiment, we shall guarantee that we know the correct answer by starting with that answer then keeping it a secret. We can do this by constructing a sequence of synthetic datasets (cf. sect. 6.3.1) using a specific formula, in this case y = Ax^3 + Bx^2 + Cx + D, with coefficients unknown to the subsequent analysis.1 Each synthetic dataset will contain N datapoints obtained by choosing N random values of x in the interval [0, 10], computing the value of y from the formula then adding some random jitter (noise) to give y some error and make the analysis harder. The task will be to estimate the value of A from datasets containing one point, two points, three points, etc.
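A minimal sketch of how such synthetic datasets can be generated (Python). The noise level and random seed below are arbitrary choices for illustration; they are not the settings used to produce Table 2.1.

    import numpy as np

    rng = np.random.default_rng(0)             # arbitrary seed
    A, B, C, D = 1.3, -3.5, 2.1, 4.0           # the "secret" coefficients (see footnote)

    def make_dataset(n_points, noise_sd=5.0):  # noise_sd is an assumed value
        """Return n_points (x, y) pairs from y = Ax^3 + Bx^2 + Cx + D plus random jitter."""
        x = rng.uniform(0.0, 10.0, n_points)
        y = A * x**3 + B * x**2 + C * x + D + rng.normal(0.0, noise_sd, n_points)
        return x, y

    x, y = make_dataset(1)     # even a one-point dataset contains some information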

If you have ever done any data analysis in the past then, right away, you can foresee a major obstacle. Our secret formula has four unknown parameters, A–D, and "everyone knows" that it will require four independent datapoints to get any answer at all, even a poor one. For instance, if you formulate this problem using linear algebra (solving four simultaneous equations), then you will need at least four independent points. Otherwise, you will end up dividing by zero. That would be a show-stopper.

However, think about this problem from the information perspective. Suppose, for instance, that our secret formula contained 100 unknown parameters and we had only 99 datapoints. The linear algebra approach would fail again for the same reason as before. Yet, we have 99 percent of the "necessary" information. Are we really to believe that getting just one more point, even a lousy one, will suddenly make solving the problem feasible? Isn't there something wrong with this picture?

Indeed there is because information is real. The trick is to use it properly. If we have 99 datapoints, those points contain a lot of information and we should be able to estimate a value for any unknown parameter quite well. In fact, there should be nothing special about getting as many points as unknowns (or more) in order to make the problem solvable. If

1 actual formula: 1.3x^3 − 3.5x^2 + 2.1x + 4


information is real, then the more we have, the better our estimate should be. That implies that we should be able to get some kind of estimate even if all we have is just one point! Even one point contains some information. Of course, one point will not produce a very good estimate of anything but the procedure itself should not fail.

This experiment was performed as described with N = {1, 2, 3, 4, 5, 10, 20, 50}. To make sure that we did not simply "get lucky", each analysis was carried out ten times with new input {x, noise} each time. Since we are still in show-and-tell mode, we shall postpone describing our methodology until later (Chap. 9, et seq.). The results are shown in Table 2.1.

Table 2.1: Estimated Values for Parameter A

# Points     Estimate                           Avg. 95% Credible Interval
in sample    Min         Max        Mean        Lower Limit   Upper Limit   Width
    0 (a)    −∞           ∞         0           −58.80         58.80        117.6
    1        −111.73     120.08     0.7705      −20.50         21.72         42.22
    2         −70.752     70.030    0.7128      −12.13         10.98         23.11
    3         −15.979     17.785    1.2088       −2.358         4.230         6.588
    4          −2.6426     5.3232   1.2997        0.8278        1.756         0.9282
    5          −1.1139     3.3306   1.3006        1.1113        1.4945        0.3832
   10          −0.1357     2.8883   1.2997        1.2173        1.3701        0.1528
   20           0.9661     1.6452   1.3001        1.2709        1.3301        0.0592
   50           0.7772     1.6126   1.2999        1.2781        1.3200        0.0419

(a) see pg. 168

The true value for A was 1.3. With just one datapoint, the average estimate (col. 4) was 0.77 with individual values considered (see Chap. 10) ranging over [-112, +120], quite a broad range. Still, if the only way you know to attack this problem is by traditional algebraic methods, then any result with just one datapoint should come as a revelation. The problem is not impossible after all.

The last column in this table is especially noteworthy. It shows the uncertainty in the estimate for A averaged over the ten replicates. With only one datapoint, the uncertainty in the estimate is much larger than the estimate itself (and more than 30 times larger than the true value of A). However, as the number of datapoints in the sample increases, the amount of information in that data increases so the average uncertainty in the estimate gets smaller monotonically.2 This is exactly the sort of behavior one would expect from a physical system. Moreover, as Figure 2.1 shows, nothing special happens going from three points to four points (in contrast to linear algebra). The uncertainty keeps going down smoothly with no big discontinuity anywhere. Notice also that even the Min and Max values in the table tend to get better as well.

2 Note: The accuracy need not decrease if the data are biased.



Figure 2.1: Parameter A Uncertainty vs. Information

All of this happens (on average) because information is a real, physical quantity. It resides in data and, in a successful data analysis, we isolate it from the error in the data. Other things being equal, the more data we have, the more information we have. However, it matters a lot how we attempt to carry out data analysis. All approaches are not created equal.

The physical reality of information is not just an interesting sidelight; it has important implications for data analysis. It highlights the fact that, whenever we analyze data, we are engaging in a process that is partly physical, not just a mathematical exercise, so any valid procedure cannot be arbitrary. There will be constraints of some sort. In Part III, those constraints will be spelled out in detail and they will be inviolable. We emphasize this point because ignoring these constraints is unfortunately quite a common practice and (at the time of writing) hard-coded in some commercial software.

Analysts have various opinions in this regard and some would argue that we are taking a "purist" approach to this issue but we make no apology for that. Math is unforgiving. If/when you begin to cheat on the rules in any way, you start down a path that leads eventually to big mistakes. Here, we shall endeavor to avoid mistakes as best we can.


Chapter 3

Discrete Data

DATA are termed discrete whenever they are integers or something non-numeric that can be encoded as integers, e.g., True/False encoded as 1/0, resp. Almost always, discrete data are positive integers, including zero. For mathematics generally and for data analysis in particular, the important thing about discrete values is that they are separated from each other. That is, they cannot be arbitrarily close to each other. As a result, the manner in which they may be described and analyzed will differ in some respects from data represented by real numbers. The latter are termed continuous and will be discussed in the next chapter.

When discussing data of any kind, we shall have to describe these data in some useful fashion, either numerically and/or pictorially. Numbers are useful for analysis but pictures are sometimes preferable depending upon the audience. Such pictures can often be quite elaborate, e.g., the well-known example in Figure 3.1. [45][67] These are justly famous when well-done but, in this book, we shall limit ourselves to simpler and more common types of pictures such as histograms and graphs.

3.1 Ordinal Data vs. Nominal Data

Data that are discrete are partitioned into two very different types: ordinal and nominal. Both types are encoded as positive integers but ordinal data are meant to be interpreted as such while nominal data are not. For instance, if {1, 2, 3} are intended to be ordinal, then the difference between 1 and 3 is twice as large as that between 1 and 2. However, if the same set of data is just an encoding of non-numeric groups/categories such as {red, white, blue}, then the data are nominal and must never be treated numerically (mathematically). They may be used only as category labels.

We shall focus mainly on ordinal data in this book but nominal data can often contain a lot of information, e.g., in epidemiology and the social sciences. However, such data must be handled in a special way. See Sections 3.4 and 14.7.


Figure 3.1: Russia's Conquest of Napoleon


3.2 Count Data

By far, the most common sort of discrete data are counts of some sort. To take a familiar example, suppose I throw a pair of ordinary dice ten times and record the total each time. I might get something like the following dataset:

5 7 8 5 10 5 7 5 7 6

These ten observations are discrete. Each is an integer. The possible range is [2, 12] but, with only ten values, we will see just a subset and, indeed, we see only five different values with several duplicates.

The little dataset above contains every bit of information available but, when sets of numbers get very large, it is not feasible simply to list all of the datapoints since people cannot absorb that much information all at once (unlike computers). Therefore, we need ways to shorten the input for human consumption. This is usually done by grouping the data in some fashion. For instance, we could display only the frequencies of the resulting totals. One way to do this is with a frequency histogram, in this case Figure 3.2.


Figure 3.2: Dice Totals: 10 Tosses

Whenever data are grouped, some information is lost. In the histogram above, we lose the order in which the totals were observed. Still, such summaries are sometimes necessary. On a different occasion, dice were tossed 100 times and then 1,000 times. We could list the totals one by one but that would be unwieldy. With many discrete datapoints, it is almost always better to report the observed frequencies as in Table 3.1 or Figure 3.3.
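As an aside, a frequency summary of this kind takes only a few lines of code. The sketch below (Python) tallies the ten totals listed above; the counts for the larger datasets in Table 3.1 would be produced the same way.

    from collections import Counter

    totals = [5, 7, 8, 5, 10, 5, 7, 5, 7, 6]     # the ten tosses above
    freq = Counter(totals)

    for value in range(2, 13):                   # all possible totals of two dice
        print(value, freq.get(value, 0))         # value and its observed frequency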


Table 3.1: More Dice Totals

   N      2    3    4    5    6    7    8    9   10   11   12
  100     2    9    8   12   19   16   13    7    7    6    1
 1000    30   58   86  111  137  164  143  106   78   65   22

Figure 3.3: More Dice Totals (N = 100 and N = 1000)

3.3 Summary Statistics

Grouping is not the only way to summarize a dataset. There are other summaries that have some very useful mathematical properties.

3.3.1 Moments

The most common of these are the moments of the data where the term moment comes from an analogy in physics.

There are four moments that apply to data analysis and the first of them should be familiar. It is termed the (arithmetic) mean and it equals the average. Given N datapoints, x_i, the formula for the mean, x̄, can be written as follows:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{1}{N} \sum_{i=1}^{N} (x_i - 0)   (3.1)

The second part of (3.1) is shown here to emphasize the fact that the datapoints are being compared to zero. With "higher" moments, discussed below, data values are usually compared to their mean. When comparing to zero, moments are often termed raw moments and, when comparing to their mean, central moments.

For our dataset of ten totals, the mean = 6.5 and this value serves to characterize the dataset with a number that quantifies, in a particular way, how different these data are from zero. Obviously, a lot of information is lost when using the mean to represent the entire dataset.


We can record more of the available information by computing the second moment, as follows:

\mathrm{mom}_2(x) = \frac{1}{N} \sum_{i=1}^{N} x_i^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - 0)^2   (3.2)

This is the second moment because x_i is raised to the second power. The second central moment, termed the variance, is a more useful quantity.

\mathrm{var}(x) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \mathrm{mom}_2 - \mathrm{mean}^2   (3.3)

The rightmost part of (3.3) is often remembered as "the average of the squares minus the square of the average". Variance is always ≥ 0. The square root of var(x), with the same units as x, is called the standard deviation, usually symbolized σ_x.

Mean(x) describes the average value of x, often called the expectation of x; the variance describes how a set of values spreads out about their mean. These two quantities are almost always independent of each other. That is, knowing the mean tells you nothing about the variance, and vice versa. Note that formula (3.3) is the sample variance and, strictly speaking, applies only to a sample of data. To estimate the variance of the entire population from which the sample was drawn, 1/N in (3.3) should be replaced with 1/(N − 1).

Finally, the third central moment is skewness and the fourth is kurtosis. Skewness describes how lopsided, about their mean, a set of numbers is. A histogram with a "right tail" has positive skewness and one with a "left tail" has negative skewness. If the data are perfectly symmetric about their mean, their skewness is zero. Kurtosis describes how "peaked" the histogram is: large kurtosis, like a teepee, or small kurtosis, like an anthill. Kurtosis is always ≥ 0. Both skewness and kurtosis are usually expressed in dimensionless form by dividing by variance raised to the 3/2 or second power, respectively.

Central moments, collectively, are a good way to summarize a set of scalar values. We shall see later that variance, in particular, has some very nice properties.
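As a concrete illustration, the following Python sketch computes these summaries for the ten dice totals of Section 3.2, using the 1/N (sample) forms defined above and the dimensionless versions of skewness and kurtosis.

    import numpy as np

    x = np.array([5, 7, 8, 5, 10, 5, 7, 5, 7, 6], dtype=float)

    mean = x.mean()                                # 6.5, as quoted earlier
    var  = ((x - mean) ** 2).mean()                # second central moment, Eq. (3.3)
    sd   = np.sqrt(var)                            # standard deviation
    skew = ((x - mean) ** 3).mean() / var ** 1.5   # dimensionless skewness
    kurt = ((x - mean) ** 4).mean() / var ** 2     # dimensionless kurtosis
    print(mean, var, sd, skew, kurt)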

3.3.2 Weighted Average

Equation 3.1 is not the only way to compute an average. Another way, which will be extremely important in subsequent chapters, is to utilize the data values, x_i, along with their respective frequencies, f_i. Doing this gives the following formula:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} f_i x_i = \sum_{i=1}^{N} w_i x_i   (3.4)

where w_i ≡ f_i/N. This formula assumes that all values are included so that the weights, w_i, sum to one. The weights are then said to be normalized.

If we apply (3.4) to the N = 1000 data of Table 3.1, the average is 6.960, the same answer (of course) that we get with (3.1).
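That check is easy to reproduce. A short Python sketch applying (3.4) to the N = 1000 frequencies of Table 3.1:

    values = list(range(2, 13))                                   # totals 2 through 12
    freqs  = [30, 58, 86, 111, 137, 164, 143, 106, 78, 65, 22]    # Table 3.1, N = 1000

    N = sum(freqs)                                 # 1000
    weights = [f / N for f in freqs]               # normalized weights, summing to one
    xbar = sum(w * v for w, v in zip(weights, values))
    print(xbar)                                    # 6.960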


3.3.3 Quantiles

If we sort a set of numbers, we can describe the set by means of its quantiles. Quantiles are numbers that split a sorted set of values into subsets of equal size, ignoring numerical values. Of special interest are the quartiles which divide the sorted set into four subsets. Dividing the sorted set in half is the median (the second quartile). The first quartile then divides the lower half in half while the third quartile divides the upper half in half. The difference between the first and third quartiles is the interquartile range which is often utilized as an alternate measure of how "spread out" the numbers are.

Sorted data are also often divided into 100 subsets, called percentiles.
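For example, in Python the quartiles and interquartile range of the ten dice totals can be obtained directly; note that different software uses slightly different interpolation conventions for quantiles of small samples, so the exact numbers may vary.

    import numpy as np

    x = np.array([5, 7, 8, 5, 10, 5, 7, 5, 7, 6])
    q1, median, q3 = np.quantile(x, [0.25, 0.50, 0.75])   # the three quartiles
    iqr = q3 - q1                                          # interquartile range
    print(q1, median, q3, iqr)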

3.3.4 Numerical Example

Moments and quartiles are generally useful for summarizing all scalar data, continuous as well as discrete. Table 3.2 lists their values for the N = 1000 data of Figure 3.3.

Table 3.2: Some Summary Statistics for 1000 Dice Throws

Statistic               Value
Mean                    6.960
Variance                5.8904
Standard Deviation      2.42701
Skewness                −0.00665
Kurtosis                2.33313
First Quartile          5
Median                  7
Third Quartile          9
Interquartile Range     4

3.3.5 Mode

Another very common summary is the mode of a dataset. This is simply the most common value in the data assuming that there is one. In both histograms shown in Figure 3.3, the mode = 7. Not every dataset has a mode while some datasets have more than one and are said to be multi-modal.

3.4 Indicator Variables

When a datum is nominal, analysis will require some sort of indicator (dummy) variable. Indicator variables are not numeric and can be used only to signify that an observation is in some category of the data. For example, data about the wins of a baseball team might include a Home/Away category since that may be an important factor.


For any category, values must always be disjoint (not overlap) so that all of the points are well-defined. They must also be comprehensive (include all possibilities). Given this partition, the corresponding indicator will be a Boolean value, either 1 or 0, indicating True or False, respectively.

Since the category values are comprehensive then, with k category values, only k − 1 of them can be unknown. In technical language, there are only k − 1 degrees of freedom. That is, if someone told you the values for k − 1 of the possibilities, you would automatically know the value for the final possibility, e.g., if it was not a Home game, then it must have been an Away game. This will be important later when we analyze such data.
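A tiny sketch of such an encoding in Python, using hypothetical Home/Away records (with k = 2 categories, a single 0/1 indicator carries the one available degree of freedom):

    # Hypothetical game records for the nominal Home/Away category
    games = ["Home", "Away", "Home", "Home", "Away"]

    # One Boolean indicator is enough: "Away" is implied whenever is_home == 0
    is_home = [1 if g == "Home" else 0 for g in games]
    print(is_home)      # [1, 0, 1, 1, 0]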

3.5 Missing, Censored and Truncated Data

We have neglected something with which anyone who has ever analyzed real data is all too familiar, namely, that data can be missing. This is apparent whenever the observations come from a sequence of known size or from a matrix with known dimensions. In such cases, data elements can be seen to be absent. For instance, if the data are supposed to be average math SAT scores from some jurisdiction, a missing datum will be obvious.

Data might also be censored. That is, some data values might be possible, and perhaps even observed, but those points are not available to the analyst for any number of reasons. Alternatively, it may be that some data values are physically impossible and our usual model for describing them must be truncated to exclude such values a priori.

These are all important issues but we shall say little about them except to note that such situations can be readily handled in conventional, albeit software-dependent, ways.

3.6 Errors

So far, we have said nothing at all about errors. All data in this chapter were discrete but that does not mean that they were necessarily correct. The lizard data (Sect. 1.4.1) were also discrete but it was assumed at the outset that there were going to be a lot of counting errors since the lizards were hard to find.

We can never be certain about the number much less the size of errors. For instance, the Chandra X-ray telescope, when making a long exposure of a faint source, might observe just one photon every four days. [23, pg. 107] Under such circumstances, one must ask, "Did we just see a photon, or not?" There is a lot of uncertainty in the answer even with discrete data and more so with continuous data. Consequently, we must defer any discussion of errors until Part II.


Chapter 4

Continuous Data

IN the last chapter, we described data that took the form of integers and were termed discrete since they were clearly separate from each other. In contrast, continuous data take on values that are not necessarily separate from each other. That is, there is no limit as to how close together they can be. Continuous data, therefore, take the form of real numbers. In recorded data, this means that there are nearly always some digits to the right of the decimal point in at least some of the datapoints.

Continuous datasets are of two different kinds: univariate (one number per datum) or multivariate (two or more numbers per datum). In the former case, each datum is typically a single measurement (observation) with repeated measurements termed replicates. In the case of multivariate data, the values in a datum may be observed either all at once or, perhaps, at various times by various observers.

Since continuous data are real numbers, their values can get very large or very small. Consequently, data errors can be harder to handle or even perceive. This is a topic for the next Part of this book but it is something to keep in mind as we discuss continuous data generally.

We shall discuss univariate and multivariate data separately. Almost everything we say about univariate data will apply to multivariate data as well but the latter will require some additional explanations.

4.1 Univariate Data

Good examples of continuous data can be seen in the three datasets shown in Table 4.1. Here, each entry is the position, {Longitude, Latitude} in ecliptic coordinates, of a visible star in the constellation Aries. The first two examples are mainly of historical interest while the third is current. All numbers are shown, by convention, as integers but they would be real numbers were they converted into decimal degrees.

Each position is actually two-dimensional but these dimensions are always measured separately so we can consider each column shown here as one set of continuous, univariate


Table 4.1: Aries Data Over the Ages

Star ID    Ptolemy (150 CE)          Tycho (1601 CE)           Current (2000 CE)
           Longitude   Latitude      Longitude   Latitude      Longitude   Latitude
           deg  min    deg  min      deg  min    deg  min      deg  min    deg  min
γ            6   40      7   20       27   37      7    8.5     33   11      7   10
β            7   40      8   20       28   23      8   29       33   58      8   29
α           10   40     10   30       32    6      9   57       37   40      9   58
η           11    0      7   40       32   34      7   23       38    8      7   24
θ           11   30      6    0       33   20      5   42.5     38   52      5   45
ι            6   50      5   50       27   57      5   24       33   31      5   28
ν           17   40      6    0       38   36      6    7       44    8      6   10
ε           21   20      4   50       42   57      4    8.5     48   30      4   11
δ           23   50      1   40       45   15      1   46.5     50   51      1   50
ζ           25   20      2   30       46   24      2   50       51   57      2   54
τ1          27    0      1   50       47   50.5    2   36       53   24      2   37
ρ3          19   40      1   30       41   22      1   12       46   56      1   11
π           ···  ···    ···  ···      39   35      1    7       45    8      1    8
σ           18    0     −1   30       39   23     −1   20       44   57     −1   18
o           ···  ···    ···  ···      37   52      0  −39       43   25      0  −35
µ           ···  ···    ···  ···      38   46      4    1       44   20      4    4
κ           ···  ···    ···  ···      31   41      9   13       37   15      9   15
33          19   10     10   40       40   35     10   50.5     46    8     10   53
35          19   40     11   10       41   23     11   16       46   56     11   19
38          15    0     −5   15       ···  ···    ···  ···      42   39     −3   21
39          21   20     12   40       42   51     12   25.5     48   22     12   29
41          21   40     10   10       42   40     10   24       48   12     10   27

data. As before, we would like to start out by describing the data in some fashion that does not require listing all of the individual values. Once again, we desire a summary.

4.1.1 Summary Statistics

Let us consider the current ecliptic longitudes, in decimal degrees, for the stars shown in Table 4.1. Aries is one of the signs of the Zodiac which, traditionally, stretch over 30 degrees of longitude (which is why there are twelve of them). The 22 datapoints comprise a set of similar numbers that we can summarize in the same way that we discussed before.

We can compute the usual moments, quartiles, etc. as we did in the previous chapter. For this dataset, we get Table 4.2. All of these statistics have the same definitions as with discrete data.

With continuous data, the Min and Max, here {33.184 deg, 53.396 deg}, are often added as additional data characteristics.


Table 4.2: Summary Statistics for Current Aries Longitudes

Statistic               Value
Mean                    43.5656
Variance                34.1558
Standard Deviation      5.84429
Skewness                −0.315312
Kurtosis                2.08211
First Quartile          38.128
Median                  44.6375
Third Quartile          48.203
Interquartile Range     10.075

4.1.2 Histogram

We can also create a frequency histogram (see Fig. 4.1). It is left-tailed (skewness < 0).


Figure 4.1: Current Stellar Longitudes for Aries

Unlike our histograms for discrete data, the bins in Figure 4.1 do not have unit width. Instead, we have a binwidth of five. Binwidth greater than one is typical of continuous data since values do not differ by an integer. With a binwidth of five, there is also additional loss of information. We cannot tell, for instance, how many values there are between 40 and 41. Of course, Table 4.1 tells us that there are none (in this dataset) but this histogram


does not. We could create a histogram with bins of different sizes that might show us more information but convention dictates that they all have the same binwidth. This facilitates subsequent analysis.

4.1.3 Boxplots


Figure 4.2: Boxplot for Longitudes

The three quartiles in Table 4.2 can be used to construct a boxplot. There is more than one form for such a plot but Figure 4.2 shows the essential features. The orange "box" shows the interquartile range with the median as a white line. The two "fences" outside the box can denote various kinds of limits. Here, they delimit 1.5 × the interquartile range. If there are "outliers" even farther away from the box, they are (optionally) shown as dots. In this dataset, there are no outliers.

Boxplots are especially useful when you want to compare several datasets side-by-side as in Figure 4.3 which shows stellar longitudes spanning nearly two millennia.


Figure 4.3: Historical Boxplots for Aries Longitudes

Clearly, longitudes are increasing slowly over time. This is due to the fact that the zero of ecliptic longitude (the celestial Greenwich) is defined to be the longitude of the vernal equinox which is precessing towards the west and longitudes increase towards the east.


4.2 Multivariate Data

As we saw in the last chapter, discrete data are usually counts of some sort and, thus, are scalar values. Continuous data occur in many more contexts and can be multivariate in nature, occurring as vectors or matrices. Often such multivariate data specify relationships between one quantity, termed the dependent variable, and one or more covariates.

4.2.1 Covariance

Of course, relationships may or may not be genuine. One useful metric for quantifying a putative relationship between two or more variables is their covariance. For a dataset of N bivariate points (x, y), sample covariance, cov(x, y), is defined as follows:

\mathrm{cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})   (4.1)

Equation 4.1 is for the sample alone and is biased low with respect to the whole population as noted in the previous chapter.

This formula can be generalized to more than two dimensions when necessary. In that case, it is convenient to summarize all of the possible bivariate pairs in a single covariance matrix of order two with each matrix element computed as in (4.1).

If we consider the current positions in Table 4.1 as 22 bivariate datapoints, we obtain their covariance matrix, M, as follows:

M \equiv \mathrm{cov}(lon, lat) = \begin{pmatrix} 34.1558 & -5.5918 \\ -5.5918 & 18.0863 \end{pmatrix}   (4.2)

In (4.2), M[1,1] is var(lon) and M[2,2] is var(lat). M[1,2] = M[2,1] is cov(lon, lat) as given by (4.1). Note that cov(x, y) is necessarily the same as cov(y, x) so a covariance matrix is always symmetric about its diagonal regardless of its size.
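The sketch below repeats this computation in Python. The longitudes and latitudes are the current Aries positions of Table 4.1 converted here to decimal degrees (to the nearest arc-minute), so the result should reproduce the matrix in (4.2) only up to small rounding differences.

    import numpy as np

    lon = np.array([33.183, 33.967, 37.667, 38.133, 38.867, 33.517, 44.133, 48.500,
                    50.850, 51.950, 53.400, 46.933, 45.133, 44.950, 43.417, 44.333,
                    37.250, 46.133, 46.933, 42.650, 48.367, 48.200])
    lat = np.array([ 7.167,  8.483,  9.967,  7.400,  5.750,  5.467,  6.167,  4.183,
                     1.833,  2.900,  2.617,  1.183,  1.133, -1.300, -0.583,  4.067,
                     9.250, 10.883, 11.317, -3.350, 12.483, 10.450])

    def sample_cov(x, y):
        return ((x - x.mean()) * (y - y.mean())).mean()    # Equation (4.1), 1/N form

    M = np.array([[sample_cov(lon, lon), sample_cov(lon, lat)],
                  [sample_cov(lat, lon), sample_cov(lat, lat)]])
    print(M)     # approximately the covariance matrix of Equation (4.2)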

4.2.2 Correlation

A covariance matrix can be converted to a correlation matrix, C, by dividing each element by the two associated standard deviations. In matrix C, each element is thus given by

\mathrm{corr}(x_i, x_j) = \frac{\mathrm{cov}(x_i, x_j)}{\sigma_{x_i} \sigma_{x_j}}   (4.3)

where i and j can be equal. As a result, the diagonal of C will contain all ones while the symmetric off-diagonal elements will be the correlation between x_i and x_j. Corr(x_i, x_j) is always in the range [-1, 1] and is usually symbolized by ρ_{i,j}.
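Continuing the example, a short Python sketch converting the covariance matrix of (4.2) into the corresponding correlation matrix via (4.3):

    import numpy as np

    M = np.array([[34.1558, -5.5918],
                  [-5.5918, 18.0863]])      # covariance matrix from Equation (4.2)

    sd = np.sqrt(np.diag(M))                # standard deviations of lon and lat
    C = M / np.outer(sd, sd)                # Equation (4.3) applied element-wise
    print(C)                                # ones on the diagonal; rho(lon, lat) off it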

If two variables are positively correlated, ρ > 0, then one variable will tend to increase (decrease) as the other increases (decreases); if they are negatively correlated, then one will tend to increase as the other decreases. The emphasis here is because this tendency is almost never perfect. There will be lots of exceptions for individual points.



Figure 4.4: Correlation (or not)

Correlation, therefore, can serve to quantify a putative relationship between two (or more) variables. If such a relationship exists, then one variable will change as the other changes and this tendency will get stronger and stronger as corr(x, y) gets closer and closer to 1 (or −1).

Figure 4.4 shows three examples of various amounts of correlation. In all three, there are 50 points. In the top plot, correlation is zero meaning that there is no significant relationship between the variables assigned to the two axes. Accordingly, the 50 points are roughly equally divided among the four quadrants of the plot.

In the middle plot, ρ = 0.75 (positive) so the two variables tend to increase and decrease together. However, the correlation is not close to one so this tendency is not very strong. One would say that these two variables are loosely correlated.

In the bottom plot, ρ = −0.95 (negative) so one variable tends to decrease as the other increases. Moreover, this tendency is strong. As a result, the 50 points are nearly in a straight line and two of the four quadrants are almost empty of points.

If a correlation were perfect, ρ = ±1, all of the points would fall on a straight line exactly.

4.2.3 Plotting Relationships

Relationships between two or three quantities allow the data to be displayed in the form of plots (graphs) showing how the dependent variable changes as the covariates change. Such plots take various forms depending, for example, on how the axes are scaled.

Linear plots

A simple example of a relationship between two variables was shown in Figure 1.5. There, the plot displays the relationship of CO2 concentration (dependent variable) over time, the (independent) covariate. Drawn this way, the plot is showing the "dependence" of CO2 concentration on time using both the data (in black) plus a mathematical function (in blue) purporting to show the same thing.

In this plot, both of the axes are drawn using a linear scale, like a ruler. Sometimes this


is inconvenient because there are datapoints observed over an extremely broad range and, if displayed linearly, very small values would be hard to discern.

Log plots

In astronomy, magnitudes are defined such that a difference of five in magnitude is equal to a factor of 100 in brightness, with larger values being dimmer. Thus, a difference of 1 in magnitude makes the object brighter (or dimmer) by a factor of 100^(1/5) ≈ 2.5. The brightest star in the sky, apart from the Sun, is Sirius with a magnitude of −1.5 while the farthest planet, Neptune, has a magnitude of 8 and cannot be seen with the naked eye.1
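This definition is easy to invert in code. The sketch below (Python) converts an apparent magnitude into brightness relative to a reference star; taking Polaris as the reference with a magnitude of about 2.0 (an approximate value assumed here) gives numbers of the kind plotted in Figures 4.5 and 4.6.

    def relative_brightness(m, m_ref=2.0):
        """Brightness relative to a star of magnitude m_ref.

        Five magnitudes correspond to a factor of 100 in brightness, so one
        magnitude is a factor of 100**(1/5), roughly 2.512; larger m is dimmer.
        """
        return 100.0 ** ((m_ref - m) / 5.0)

    print(relative_brightness(-1.5))   # Sirius: roughly 25 times brighter than Polaris
    print(relative_brightness(8.0))    # Neptune: a small fraction of Polaris's brightness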

Figure 4.5 shows a plot with several values of apparent magnitude on the abscissa and brightness, compared to the North Star (Polaris), on the ordinate. For magnitudes greater than two, it is hard to tell the brightness values apart.


Figure 4.5: Magnitudes of Relative Brightness

The usual solution to this problem is to plot the dependent variable using an ordinate scaled according to the common logarithm of the associated values. This log plot can show both large and small values at the same time. Figure 4.6 is such a log plot for the same data as Figure 4.5. The points in this log plot form a perfect straight line because the magnitude scale is perfectly logarithmic. Other well-known logarithmic scales include the Richter scale for earthquakes and the pH scale for acidity.

1 Uranus has a magnitude of 6 and is just barely visible but only if your eyes are really, really good.



Figure 4.6: Magnitudes of Relative Brightness (log plot)

In some cases, it is convenient to use a log scale on both axes, giving a log-log plot. This is often done when graphing a power-law relationship between two variables.

Three-variable plots

When there are two covariates plus the dependent variable, plots become three-dimensional and are harder to display. Here, we shall take a generic example with a dependent variable, z, and covariates, x and y. The 3-D analogue of the standard plot is shown in Figure 4.7.

There are two common alternatives for displaying such a plot. The first is a contour plot in which selected contours of constant z are displayed projected onto the page, as in Figure 4.8 in which the contour values are labeled explicitly. In the second, z values are encoded using color. In this case, it is necessary to include a legend showing how the colors map to z values. Figure 4.9 shows a plot of the same information encoded in this fashion, often termed a density plot. Sometimes, a density plot includes the contours as well.

Color coding is very common for displaying information in a plot especially when the variables are nominal as in the map on this webpage.

With more than three dimensions of information, displaying it becomes problematical which is why examples such as Figure 3.1 are so famous.


Figure 4.7: Standard 3-D Plot


Figure 4.8: A Contour Plot



Figure 4.9: A Color-coded Density Plot

4.2.4 Additivity of Variance

An important property of the variance is that it can be shown, using only straightforward algebra, that the variance of the sum (or difference) between two variables, x and y, is given by the following formula

var(x ± y) = var(x) + var(y) ± 2 cov(x, y)   (4.4)

provided, of course, that the values being added (or subtracted) all have the same units, a basic requirement of arithmetic. Using values from (4.2), we get

var(lon + lat) = 34.1558 + 18.0863 + 2 (−5.5918) = 41.0585   (4.5)
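The arithmetic in (4.5) is trivial to verify; a short Python check, using the same numbers, also gives the corresponding result for the difference:

    var_lon, var_lat, cov_ll = 34.1558, 18.0863, -5.5918    # from Equation (4.2)

    var_sum  = var_lon + var_lat + 2 * cov_ll    # var(lon + lat) = 41.0585, as in (4.5)
    var_diff = var_lon + var_lat - 2 * cov_ll    # var(lon - lat), the minus case of (4.4)
    print(var_sum, var_diff)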

Equation (4.4) is especially useful when x and y are independent. In that case, their covariance (and correlation) is zero, in theory, although it might not be exactly zero for


an observed sample. Conversely, variables with zero covariance (correlation) are almost always independent.

This additive property is of great utility in data analysis because what we observe is very often the net result of a combination of factors. To the extent that these factors are independent of each other, the variance of what we observe is equal to the sum of the variances of the individual factors.

One of the underlying questions asked of data is, "Why are these numbers not all the same?" This is the same as asking why their variance is greater than zero. In this sense, the variance of a dataset constitutes "what is to be explained", quantifying a major goal of data analysis that we shall address specifically when we discuss uncertainty in Part II.

4.3 Pseudo-continuous Data

Continuous data are real numbers, in principle. However, such numbers can, and usually do, have an infinite number of decimal digits. Obviously, these cannot all be written down. Recorded data must necessarily contain only a finite number of digits. This means that continuous data are usually truncated to some finite multiple of the unit described by the rightmost decimal digit. Hence, they are also quantized to an interval within which we do not distinguish.

In some circumstances, this quantization (truncation or rounding) can be so extreme that it must be taken into account explicitly when the data are analyzed. We shall see an example later in which continuous data were so heavily rounded off that they had to be treated as discrete bin frequencies.

It may also happen that data which are actually discrete are treated as though they were continuous. This very often happens when the data are rational (fractions). For instance, in Major League Baseball in the U.S., batting averages are routinely expressed as three-digit decimals even though batters rarely experience 1,000 official at-bats. [41] Giving their "average" to one part in a thousand (the third decimal digit) is therefore unjustified. Nevertheless, fractions are typically treated as though they were continuous.

In this book, we shall consider continuous data to be real-valued and discrete data to be integer-valued except where noted.


Part II

Uncertainty


THERE was a trick to it of course. I knew that there must be and, sure enough, there was. What you had to do, if you were right-handed, was to position the burette so that the stopcock was pointing to the right. Then, you could wrap your left hand around the bottom of the burette in order to manipulate the stopcock while swirling the flask counterclockwise with your right hand. If you were especially dextrous, you could set the flask down just prior to the endpoint and use two hands to twist the stopcock very quickly to get half-drops. I really needed those half-drops because I really needed the job.

This was late June between my Junior and Senior undergraduate years and I had found summer employment working in a chemistry laboratory for the U.S. Fish and Wildlife Service. We tested fish and sometimes water. I was lucky to get work at all that year and particularly so to find something that matched my college major. Getting paid for doing chemistry was the best I could have hoped for and a far cry from my first summer job, at age eleven, picking string beans on farms for three dollars a day.

The procedure in question was the final step in a day-long analysis to determine the percentage of protein in a fish. It was not that easy. The burette had engraved divisions, about a millimeter apart, marking off tenths of a milliliter.2 However, there are roughly 20 drops in a milliliter so you had to read between the lines because results (final reading − initial reading) had to be recorded to two decimal places. The experimental protocol required doing this acid-base back titration in three replicates and, in order to avoid bias, you had to carry out all three replicate titrations, making six measurements, before doing any of the three subtractions. If any of the three results then differed in the first decimal place, you had to repeat all three all over again. I had the feeling that too many such repetitions and I would have become unemployed. Fortunately, I had taken Quantitative Analysis the previous academic year so I had plenty of recent practice.

The point of such a rigorous protocol was that a) the final result was very important and b) measurements are never perfect. There is always some uncertainty, i.e., measurement error. Had our burettes not been periodically cleaned with chromic acid, there would also have been a systematic error since all measured volumes would have been too large.

This Part of the book addresses the subject of uncertainty in general. There is a lot to know even after all the observations are complete and the results tabulated. For instance, in the protein analysis described above, I had to average the three results. Today, everyone would do that step automatically but, once upon a time, no one would have known how to summarize discrepant observations with a single number. It was not until the eighteenth century that there was any good theory to serve as a guide. [53][63]

To explain this theory, we shall need to discuss the subject of probability and its two basic rules in some detail. This will provide us the tools we require to handle uncertainties mathematically and thus know how to carry out accurate and robust data analysis.

2 1 milliliter (ml) = 1 cubic centimeter (cc).


Chapter 5

Errors and Ignorance

SINCE data are never perfect and there is always some amount of error, we should begin by asking why. What is the origin of these errors? Can anything be "done" about them? If we cannot make them go away, can we make them smaller? If not, can we at least say something valid about how big they are? Is the size of error predictable in any reliable way?

Are errors the only source of uncertainty? What else might there be? Most important of all, can we "handle" uncertainties in some robust, mathematical fashion?

All of these questions are intimately related to the subject of data analysis. Fortunately, the answers are known and, for the most part, satisfactory for our purposes. We shall see many practical examples in forthcoming chapters. In the broadest terms, uncertainties come from errors and ignorance.

5.1 Errors

An error is a mistake of some sort, perhaps avoidable, perhaps not. We shall focus on two kinds of error, measurement error and systematic error; we shall ignore other possible kinds such as deliberate misinformation, occasional blunders and typos.

Measurement errors arise due to imperfections in a measuring apparatus and/or in the use of the apparatus. For example, we noted earlier that Tycho Brahe is famous for his astronomical instruments. They were the best in the world at the time. He also was expert in using them, making correction tables for any deficiencies in them that he discovered through years of use. His observations were much better than those of astronomers before him as Table 5.1 illustrates. The errors listed there are based on current knowledge of what should have been observed when those measurements were made. Not surprisingly, the errors of Ptolemy1 were much larger.

Astronomical data are a bit complicated to analyze so let us take the much simpler example shown in Table 5.2.

1 largely inherited from his predecessor, Hipparchus


Table 5.1: Aries: Observations and Errors

           Ptolemy (150 CE)                                        Tycho (1601 CE)
Star ID    Obs. (deg)         True (deg)         Error (min)       Obs. (deg)         True (deg)         Error (min)
           Long.     Lat.     Long.     Lat.     Long.    Lat.     Long.     Lat.     Long.     Lat.     Long.    Lat.
γ           6.667    7.333     7.454    7.087    −47.2    14.8     27.617    7.142    27.617    7.144      0.0    −0.1
β           7.667    8.333     8.238    8.418    −34.3    −5.1     28.383    8.483    28.402    8.470     −1.1     0.8
α          10.667   10.500    11.900    9.916    −74.0    35.0     32.100    9.950    32.088    9.952      0.7    −0.1
η          11.000    7.667    12.339    7.276    −80.3    23.4     32.567    7.383    32.548    7.375      1.1     0.5
θ          11.500    6.000    13.159    5.598    −99.5    24.1     33.333    5.708    33.309    5.720      1.5    −0.7
ι           6.833    5.833     7.787    5.342    −57.2    29.5     27.950    5.400    27.950    5.434      0.0    −2.0
ν          17.667    6.000    18.423    6.003    −45.4    −0.2     38.600    6.117    38.573    6.131      1.6    −0.9
ε          21.333    4.833    22.779    3.990    −86.7    50.6     42.950    4.142    42.933    4.133      1.0     0.5
δ          23.833    1.667    25.049    1.657    −72.9     0.6     45.250    1.775    45.270    1.786     −1.2    −0.7
ζ          25.333    2.500    26.235    2.731    −54.1   −13.9     46.400    2.833    46.380    2.855      1.2    −1.3
τ1         27.000    1.833    27.657    2.425    −39.4   −35.5     47.842    2.600    47.827    2.568      0.9     1.9
ρ3         19.667    1.500    21.093    1.147    −85.6    21.2     41.367    1.200    41.337    1.171      1.8     1.7
π            ···      ···     19.403    0.970      ···     ···     39.583    1.117    39.569    1.100      0.9     1.0
σ          18.000   −1.500    19.191   −1.453    −71.5    −2.8     39.383   −1.333    39.372   −1.330      0.7    −0.2
o            ···      ···     17.672   −0.742      ···     ···     37.867   −0.650    37.842   −0.617      1.5    −2.0
µ            ···      ···     18.600    3.913      ···     ···     38.767    4.017    38.763    4.025      0.2    −0.5
κ            ···      ···     11.538    9.119      ···     ···     31.683    9.217    31.683    9.218      0.0    −0.1
33         19.167   10.667    20.406   10.732    −74.4    −3.9     40.583   10.842    40.569   10.852      0.9    −0.6
35         19.667   11.167    21.231   11.138    −93.9     1.7     41.383   11.267    41.373   11.274      0.6    −0.4
38         15.000   −5.250    16.855   −3.451   −111.3  −107.9       ···      ···     37.067   −3.369       ···    ···
39         21.333   12.667    22.621   12.380    −77.3    17.2     42.850   12.425    42.797   12.458      3.2    −2.0
41         21.667   10.167    22.484   10.332    −49.0    −9.9     42.667   10.400    42.638   10.422      1.7    −1.3


Table 5.2: Ozone Data

std     obs
0.2     0.1
337.4   338.8
118.2   118.1
884.6   888.0
10.1    9.2
226.5   228.1
666.3   668.5
996.3   998.5
448.6   449.1
777.0   778.9
558.2   559.2
0.4     0.3
0.6     0.1
775.5   778.1
666.9   668.8
338.0   339.3
447.5   448.9
11.6    10.8
556.0   557.7
228.1   228.3
995.8   998.0
887.6   888.8
120.2   119.6
0.3     0.3
0.3     0.6
556.8   557.6
339.1   339.3
887.2   888.0
999.0   998.5
779.0   778.9
11.1    10.2
118.3   117.6
229.2   228.9
669.1   668.4
448.9   449.2
0.5     0.2

The data in Table 5.2 list concentrations of ozone supplied by the U.S. National Institute of Standards and Technology (NIST), std, along with corresponding concentrations, obs, measured by various commercial ozone monitors. [50] Ideally, the values in each pair should be the same but they are not.

One likely reason is that ozone is unstable and changes back to oxygen over time. If that were the only source of error, then all the obs values would be less than or equal to the std values, but that is not what happens here, so there must be other errors as well. There is nothing in the table alone that tells us how many kinds of error there are, much less gives us some clue as to their values. We cannot even predict whether they should be plus or minus so, in the end, we tend to just lump everything together, call them measurement errors and accept the fact that they are unpredictable.

Since measurement errors are unpredictable, it is customary to refer to them as “random” errors but this is too much of a simplification. Most errors are random in this sense but some are not. In addition to random errors, there can also be one or more biases in the observations. For instance, if your bathroom scale starts at 0.5 pounds instead of zero, then your measured weight will be half a pound too high. This is a kind of systematic error since it is largely predictable given sufficient prior information.

Even though this dataset exhibits some error, the errors seem to be small and there is a very strong correlation between the std and obs values, as there should be! (see Fig. 5.1) This figure tells us that there is a linear relationship between the quantities obs and std (5.1):

obs[i] = m std[i] + b (5.1)

Part III of this book will discuss how we should determine the values of m and b but, for now, we shall accept that they are 1.00209 and −0.2467, respectively. This gives us a formula for predicting obs given std, with the difference, obs[i] − pred[i], termed a residual. Using the values for m and b given above, we can obtain the residuals from our linear relationship. These are plotted in Figure 5.2. They also appear to be largely “random” errors which, although of a different kind, are still individually unpredictable.

The emphasis in that last sentence is the key to the rest of this book because, even though errors (and uncertainties in general) might be individually unpredictable, that does not mean that there is nothing useful that can be said about them.
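As a concrete illustration of the residual calculation, here is a minimal sketch (not the author's code) that applies the slope and intercept quoted above to a few of the std/obs pairs from Table 5.2; the variable names are illustrative only.

# A few std/obs pairs from Table 5.2 and the slope/intercept quoted in the text.
std = [0.2, 337.4, 118.2, 884.6, 10.1]   # NIST standard concentrations
obs = [0.1, 338.8, 118.1, 888.0, 9.2]    # monitor readings

m, b = 1.00209, -0.2467                  # slope and intercept given above

for s, o in zip(std, obs):
    pred = m * s + b                     # prediction from the standardization line
    resid = o - pred                     # residual = observed - predicted
    print(f"std = {s:7.1f}  obs = {o:7.1f}  pred = {pred:8.2f}  residual = {resid:6.2f}")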


Figure 5.1: Observed vs. Standard Ozone Concentrations (N = 36)

Figure 5.2: Residuals from Ozone Predictions (N = 36)


We can illustrate that what is unpredictable might not be completely so by sorting the random residuals from Figure 5.2 and plotting them in a different way (Fig. 5.3).

Figure 5.3: “Probability Plot” of Ozone Residuals (N = 36)

For now, the ordinate of this plot is simply labeled “Theory” because what is most important here is not what that theory is but that there is one—even for random quantities. Figure 5.3 shows that there must be because the residuals almost all fall on a straight line in this plot and that cannot be a coincidence. Even though we cannot predict individual random quantities, it is apparent that we can say something about them collectively. It is this knowledge of how random quantities behave, when considered as a group, that will enable us to make some sense of them and, ultimately, to extract information from data.

As the title of Figure 5.3 indicates, this ability comes from expertise in the subject of probability (chap. 6 and 7).
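The book does not spell out which theoretical quantiles the ordinate of Figure 5.3 uses, so the sketch below simply assumes the conventional choice, Normal quantiles at evenly spaced probability levels; the stand-in residuals and all names are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 1.0, size=36)    # stand-in data; 36 values, as in the ozone example

sorted_resid = np.sort(residuals)            # sort the "random" residuals
n = len(sorted_resid)
probs = (np.arange(1, n + 1) - 0.5) / n      # plotting positions
theory = stats.norm.ppf(probs)               # theoretical quantiles (the "Theory" axis)

# If the residuals follow the theoretical distribution, these pairs fall on a straight line.
for t, r in zip(theory[:5], sorted_resid[:5]):
    print(f"theory = {t:6.2f}   residual = {r:6.2f}")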

5.2 Ignorance

In addition to errors, uncertainty can also arise through ignorance. We could see right away, from Figure 5.1, that the obs and std values were related in a linear fashion, at least to a very good approximation. That gave us formula (5.1) for what is termed a standardization curve. However, it did not tell us the values of m and b, the slope and intercept. Looking at the figure, we can tell that the slope is positive but the 27 points near the origin are too “squeezed together” to tell whether the intercept is zero or not. We still have a lot of ignorance (uncertainty) concerning the formula for this line. Values for the slope and intercept were referenced above but, as we saw, the predictions we obtained using them were not perfect since the points did not fall on a perfect straight line (although that was hard to tell from Figure 5.1).

What we referred to earlier (Section 1.4) as our state of knowledge increased after we obtained values for the slope and intercept of the standardization curve, which means that our uncertainty regarding obs versus std decreased. We know now that obs and std are almost the same (slope ≈ 1). There is, however, a bias (intercept ≠ 0). Even when there is no ozone present (std = 0), we nevertheless predict a concentration that is not zero.2 This means that we have a systematic error3 as well as measurement and prediction errors. The measurement and prediction errors may be considered random. The bias is not random; as explained, it probably means that some ozone decomposed back to oxygen before testing.

Figure 5.3 showed that random errors are, to some degree, understandable and, as it happens, the vast majority of errors are indeed random errors. Therefore, we can expect that we should be able to “handle” uncertainties (errors or ignorance) using probability.

Stay tuned.

5.3 Accuracy vs. Precision

When discussing errors in general, it is important to recognize that random errors give wrong answers in individual observations but not collectively. Usually, random errors are unbiased so a large number of random observations of a fixed quantity will have errors that tend to cancel out and, the larger the set of observations, the better this cancellation. Thus, the average of many observations will be accurate even though individual observations are scattered about. This is known as the Law of Large Numbers.

When scattering of observations is small, we say that the observations exhibit good precision; when it is large, the precision is poor. Precision, therefore, is a measure of the repeatability of a measurement. Accuracy, on the other hand, measures how unbiased the average observation is from the truth.

In any experiment, measurements can be accurate and precise, accurate but not precise, etc. The four possibilities are illustrated in Figure 5.4.

No data analysis is complete without some indication of the reliability of the result. How good this indication is depends upon knowledge of what can go wrong. Random errors can be addressed by knowing something about how these errors behave. Figure 5.3 showed that this should be possible. In the absence of systematic errors, a measure of accuracy should be computable. Should there be systematic errors, an answer that appears accurate could still be significantly wrong (e.g., Fig. 5.4, top-right).

In the analyses in this book, we always include a measure of the reliability (accuracy) of the answer.

2 a prediction that is obviously impossible
3 like the bathroom scale


Figure 5.4: Accuracy vs. Precision

5.4 Significant Figures

In the fishy back titrations described earlier, I needed good accuracy and good precision so that the uncertainty in my final answer was both known and of acceptable quality. One way in which results are reported so as to record this uncertainty is by use of significant figures. Significant figures are traditionally comprised of all those digits in the answer that are correct and repeatable plus one more which is uncertain, with the proviso that zeroes included just to show where the decimal point is are never “significant”. The latter qualification is usually accommodated by reporting numbers in scientific notation.

My titration volumes were always in the range (0, 50) so I had to record each measured volume to four significant figures—in this case, two decimal places, e.g., 35.87. The last digit (here, 7) would be uncertain because I had to interpolate between the lines, by eye, to get it. The first three were repeatable (since I was good at titrating by then). There is nothing in the reported value (35.87) to show that these digits are significant; it is simply a convention.

Sometimes, when desirable, significant figures are reported in a more explicit fashion. For instance, in the official 2018 listing of fundamental physical constants, one of the least uncertain4 is that for the electron g factor. Its value is listed as −2.002 319 304 362 56(35). [49] This format tells us that the final two digits (56) should be read as 56 ± 35, meaning that the estimated standard deviation of the reported value is 3.5 × 10⁻¹³. We shall see, in the next two chapters, that knowing the standard deviation for a measured quantity tells us a lot about our uncertainty in the value of that quantity because there is another law, in addition to the Law of Large Numbers, that is a lot more quantitative when dealing with random errors. Stay tuned.

5.5 Summary

We cannot make errors go away but we can say something useful about them. It will turn out that we shall usually be able to predict how large an error might be given a valid measure of uncertainty. The same can be said of uncertainty due to ignorance. Of course, any such prediction will also be uncertain but we shall have an estimate of that uncertainty as well.

4 apart from those defined as exact


Chapter 6

Probability

THE idea of probability arose in the seventeenth century by considering the notion of a fair bet. A fair bet is one in which you will break even, on average, no matter which side of the bet you choose. Most bets are not fair; usually, one side has an advantage. A lot of very smart people thought long and hard about how one could change an unfair bet into a fair one.

Bets are made when (and because) the outcome is uncertain. Uncertainty, in turn, is the motivation for data analysis. Probability is the mathematical framework for addressing uncertainty, thus making data analysis possible. This chapter and the next describe some basic ideas from the study of probability.

6.1 Odds

Consider a game that involves tossing a pair of standard dice once and betting on the outcome (cf. sect. 3.2). Suppose you bet that a seven will come up. Your opponent, of course, is betting that it will not. Assuming that there is no third possibility, somebody is going to win. However, this bet is not “fair” because you are at a disadvantage. The reason is that there are 36 ways that the dice can come up and only six of them have a total of seven. With standard dice, all 36 ways are equally likely so the result will be seven less than half the time (i.e., on average).

We can summarize this situation and make everything more quantitative by looking at a few numbers. Let W be the number of ways to win and L be the number of ways to lose (all ways equally likely). Then one interesting number is the ratio, L/W, called the odds (against). In this example, we have

odds = L/W = 30/6 = 5 : 1 (against)    (6.1)

If you play this game repeatedly, betting on seven, then you will lose five times for each time you win, on average. This is not “fair”. It would be fair if and only if each player won the same amount on average. To make this game fair, you would have to receive odds of 5:1, meaning that, for each dollar you bet, you would receive five dollars if you won (with your original bet returned as well) while, if you lost, you would receive nothing and your opponent would keep your bet. With odds of 5:1, both you and your opponent would win nothing over the long run. Mathematically speaking, each of you would have an expectation of zero. Obviously, this is strictly true only if you play an infinite number of times. Otherwise, your winnings would get closer and closer to zero the more games you played. How fast this would happen is something that would need to be computed; it is seldom obvious.
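A quick way to check the numbers in this section is to enumerate all 36 outcomes directly; the short sketch below (illustrative only) counts the ways to win and computes the expectation of a one-dollar bet paid at 5:1.

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))        # all 36 equally likely rolls
wins = sum(1 for a, b in outcomes if a + b == 7)       # 6 ways to roll a seven
losses = len(outcomes) - wins                          # 30 ways to lose

print(f"odds against = {losses}:{wins}")               # 30:6 = 5:1

p_win = Fraction(wins, len(outcomes))                  # 1/6
expectation = p_win * 5 + (1 - p_win) * (-1)           # win $5 or lose $1
print("expected gain per $1 bet at 5:1 odds =", expectation)   # 0, a fair bet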

6.2 Defining Probability

We can use some of the ideas from the previous section to define the probability of some predefined outcome, A. If W means that A is true (happens) and L otherwise, then

Prob(A) = lim_{N→∞} W/N    (6.2)

where N = W + L.

This is the frequentist definition of probability since it involves relative frequencies. It takes the limit as N goes to infinity and, while this is impossible in practice, it is necessary for a definition because, in any finite sequence of trials, probability would be a rational number—a ratio of integers. In most cases, probability is a real number, not a rational number, so it cannot always be expressed as a ratio of integers.

This definition is exactly in accord with intuition but, for it to make sense, we have to assume that one could carry out a very large number, N, of trials, at least as a thought experiment. This assumption turns out to be false so (6.2) does not suffice as a definition of probability.

Imagine a doctor examining a very ill patient and thinking, “What is the probability that this patient will be alive a year from now?” This is a reasonable question and no one would doubt that the term probability is appropriate. However, (6.2) cannot be used to compute the answer. “This patient” is a unique individual who cannot be repeated/replicated a large number of times (to fit the frequentist definition). Likewise, “a year from now” is a unique, unrepeatable period of time. Nevertheless, the doctor's question is a good one and warrants a good answer.

Were this a math textbook, we would seek a rigorous definition of probability that was consistent with our intuitive notions about what it means. There are such definitions in the literature [14, chap. 1] but those that would please a mathematician tend to be a bit obscure and are of little help to the analyst who is simply trying to extract information from data. For data analysis purposes, we can accommodate (6.2) as well as more general situations by adopting the following Bayesian definition:

Probability is a measure of the credibility of a well-defined logical proposition.


Here, a well-defined proposition is one that is either true or false (no third possibility). A credible proposition is one that deserves to be accepted as true based on its own merits, not because of personal bias, etc. The measure will be a real number in the range [0, 1] with zero indicating total impossibility and one indicating complete certainty.

In practice, one should keep definition (6.2) in mind when it is appropriate but recall the Bayesian formulation whenever a more general definition is needed. We shall have more to say about this in Chapter 8.

6.3 Describing Probability

Probability is inherently mathematical so we shall have to describe it in a mathematical way. As with all math, that will take some explaining. In this book, most of our explaining is done using examples, at least at first, so we shall begin by examining random numbers (random variates) with a specific example. Given some general observations, we can then discuss probability the way a data analyst or mathematician would. After three centuries of study, there is quite a lot of theory but we shall discuss only the basics. That will be sufficient to enable us to analyze any dataset in the best way possible.

6.3.1 Random Numbers

We need an example of something random so that the result will be unpredictable. This will force us to employ probability instead of trying to make a deterministic prediction about what the result will be.

As a really simple example, suppose that we choose five numbers in the range [0, 1], then keep the smallest one and throw the rest away. If we choose the numbers so that all possible choices are equally likely then these choices are said to be Uniform(0, 1). It is difficult to obtain random numbers by guessing but it is quite easy for a computer to do so if it is programmed correctly. Using a computer implies that the result is not perfectly random but, after a great deal of effort, algorithms have been developed that can output pseudorandom numbers indistinguishable from truly random numbers and that is all we require. Such a software algorithm is called a pseudorandom-number generator (PRNG). Here, we shall utilize a PRNG to find our min of five selections N = 1,000 times, giving a synthetic dataset of a thousand random minima.

The relative-frequency histogram in Figure 6.1 shows how these data are distributed over their domain (support). In this histogram, the bin frequencies have all been divided by N so, using definition (6.2), the ordinate is labeled Probability.

The blue line in this figure connects the tops of the bins and is consequently termed the envelope of the histogram. With this dataset, it is not smooth because a sample of size 1,000 is a lot smaller than one of infinite size as required by the frequentist definition of probability. We can obtain a much more interesting and useful result by using a computer to generate a sample of a much larger size.
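For readers who want to reproduce this synthetic dataset, here is a minimal sketch of the experiment just described, using a PRNG; the seed and all names are arbitrary choices, not taken from the book.

import numpy as np

rng = np.random.default_rng(2019)                          # a PRNG, seeded for repeatability
N = 1_000
minima = rng.uniform(0.0, 1.0, size=(N, 5)).min(axis=1)    # keep the smallest of each five

# Relative-frequency histogram: bin counts divided by N (the ordinate of Figure 6.1).
counts, edges = np.histogram(minima, bins=20, range=(0.0, 1.0))
rel_freq = counts / N
print(rel_freq[:5])                                        # probabilities for the first few bins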


Figure 6.1: 1,000 Min of Five Uniform(0, 1)

6.3.2 Probability Density

Repeating our experiment with N = 10,000,000, we get the histogram shown in Figure 6.2.

Figure 6.2: 10⁷ Min of Five Uniform(0, 1)


The bins in this second histogram are much narrower since, with so many datapoints, narrower bins still contain enough points for the bin height to be precise. The value plotted on the ordinate is here probability per unit binwidth, which makes it probability density. Since probability is a dimensionless number (no units), probability density has units equal to the reciprocal of the units for the random variate. Therefore, probability equals binwidth times probability density, which means that, in histogram 6.2, the probability that a variate falls in any given bin equals the area of that bin.1

The envelope of this histogram is also much smoother than that in Figure 6.1. It looks just like a mathematical function. In fact, it can be shown that the limit of this envelope, as N → ∞ and binwidth decreases to the infinitesimal, dx, is indeed a probability density function (PDF), f(x). For this example, f(x) has the general formula

f(x) = n(1 − x)^(n−1)  for 0 ≤ x ≤ 1,  and 0 otherwise    (6.3)

in this case with n = 5.

A PDF is the basic expression connecting probability to other mathematics. In many instances, it has a fairly simple analytical form with which computations can be carried out. The key concept is the equivalence of probability to an area under a PDF.

6.3.3 Cumulative Distribution

To quantify the probability = area idea, we would like to have a formula for this area. We know from calculus that we can find an area under f(x) by computing the definite integral of f(x) over the range of interest. If our PDF is (6.3), then the cumulative distribution function (CDF), F(x), is the area (integral) for the range (−∞, x], here limited to [0, x]:

F(x) = ∫_−∞^x n(1 − x)^(n−1) dx = 1 + (−1)^(n+1) (x − 1)^n    (6.4)

For n = 5,

F(x) = 1 + (x − 1)^5    (6.5)

Notice that the total area under (6.3) equals one, which means that the probability that a minimum is somewhere in [0, 1] = 1. The PDF is then said to be normalized, which it should be; the probability that the minimum is somewhere in [0, 1] is completely certain. On the other hand, the probability that x > 0.5 is computed to be 0.0312500. Multiplying this by N gives an expected frequency of 312,500. In this dataset, the actual number of minima greater than 0.5 is 312,836. The difference from theory is, once again, due to the fact that 10,000,000 is less than infinity. All real datasets are likely to show some discrepancy from theory, even synthetic datasets like this one. With a lot more math, we could estimate how big a discrepancy to expect given a sample size of ten million. As noted above, there is a lot of theory on the subject of probability.

1 since the bin is a rectangle with area = height × width (assumes linear axes)
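The tail probability quoted above can be checked directly from the closed-form CDF (6.5) and, on a smaller scale than the ten million variates used in the text, by simulation; the sketch below is illustrative only.

import numpy as np

def cdf(x, n=5):
    # Same result as (6.4): F(x) = 1 + (-1)**(n + 1) * (x - 1)**n
    return 1.0 - (1.0 - x) ** n

p_tail = 1.0 - cdf(0.5)            # probability that a minimum exceeds 0.5
print(p_tail)                      # 0.03125

rng = np.random.default_rng(7)
N = 1_000_000
minima = rng.uniform(size=(N, 5)).min(axis=1)
print((minima > 0.5).sum(), "observed vs.", p_tail * N, "expected")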


6.4 Some Practical Theory

In this section, we describe two common applications of PDFs and CDFs.

6.4.1 Weighted Average Redux

Earlier (Sect. 3.3.2), we introduced the idea of a weighted average. In that discussion, the weights were taken to be a discrete set of positive numbers with a sum of one. In general, however, weights do not have to be discrete. A PDF can be thought of as a set of continuous weights for any arbitrary function of the random variate. That is, if x is distributed as described by some PDF, f(x), symbolized as x ∼ f(x), then the expected value of any g(x) can be found by summing (integrating) all of the infinite weight × value products in the domain of x.

Using (6.3) as our PDF, n = 5 and g(x) = x, we can find the expected value (arithmetic mean) of x as follows:

x̄ = ∫_0^1 x [5(1 − x)^4] dx = 1/6    (6.6)

We can find the second moment of x in the same way:

mom2 = ∫_0^1 x^2 [5(1 − x)^4] dx = 1/21    (6.7)

so var(x) = mom2 − x̄^2 = 5/252 = 0.0198413 (to six sig. fig.).

It does not matter what g(x) is. For instance, let g(x) = sin(x); then

E(sin(x)) = ∫_0^1 sin(x) [5(1 − x)^4] dx = 65 − 120 cos(1) ≈ 0.163723    (6.8)

where E(·) denotes expectation.

This procedure works even when there are other, possibly unknown, parameters in the PDF. For instance, the mean of x for arbitrary n is

x̄ = ∫_0^1 x [n(1 − x)^(n−1)] dx = 1/(n + 1)    (6.9)

This is a general way to find the expectation of anything related to any random quantity, provided only that the normalized PDF is known. In many cases, the integration does not yield a simple closed form. When that happens, one must use numerical integration.
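As a sketch of what such numerical integration looks like in practice, the snippet below re-evaluates the expectations (6.6) through (6.8) with a standard quadrature routine; the analytic answers above serve as a check.

import numpy as np
from scipy.integrate import quad

pdf = lambda x: 5.0 * (1.0 - x) ** 4                      # (6.3) with n = 5

mean, _ = quad(lambda x: x * pdf(x), 0.0, 1.0)            # (6.6): 1/6
mom2, _ = quad(lambda x: x ** 2 * pdf(x), 0.0, 1.0)       # (6.7): 1/21
esin, _ = quad(lambda x: np.sin(x) * pdf(x), 0.0, 1.0)    # (6.8): 65 - 120 cos(1)

print(mean, mom2 - mean ** 2, esin)                       # 0.16667, 0.019841, 0.163723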

6.4.2 Drawing Variates From a Distribution

In order to generate a synthetic dataset, or just a single random value, it is necessary to be able to do so for any PDF. Computers know how to pick a random number in the interval [0, 1] with all values equally likely but to pick a random value from any other PDF requires special techniques. [9]

The simplest of such techniques is the inverse-CDF method, which is feasible whenever the inverse of the CDF, F−1(x), has a closed form that can be readily evaluated. In general, pick a uniform random value, u, in [0, 1]. Then, F−1(u) is the quantile of u, the value of x for which F(x) = u. With CDF (6.5), solving u = F(x) even yields a closed form, x = 1 − (1 − u)^(1/5); when no closed form is available, the equation F(x) = u can be solved numerically. All quantiles generated in this way will be random values drawn from the target PDF.
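Here is a sketch of the inverse-CDF method applied to the min-of-five distribution, using the quantile function just noted; the function and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(11)

def draw_min_of_five(size, n=5):
    u = rng.uniform(size=size)              # u ~ Uniform(0, 1)
    return 1.0 - (1.0 - u) ** (1.0 / n)     # F^{-1}(u) for F(x) = 1 - (1 - x)^n

sample = draw_min_of_five(1_000_000)
print(sample.mean())                        # close to 1/(n + 1) = 1/6, per (6.9)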

6.5 Basic Continuous Distributions

There are a large number of named distributions with more being defined all the time as researchers in various areas encounter the need for some new, customized PDF. There are, likewise, a number of sources describing them, e.g., this collection and this Compendium. For convenience, some continuous PDFs useful for data analysis are described in this Section; some discrete distributions are described in the following Section.

6.5.1 Uniform

Uniform(a, b) is the distribution for random variates in the range [a, b] with all values equally likely.

Uniform(a, b) = 1/(b − a) ;  a ≤ x ≤ b    (6.10)

Computers know how to draw random variates ∼ Uniform(0, 1) and this is the standard starting point for obtaining variates from other distributions. Figure 6.3 plots two such PDFs with parameters shown in parentheses.

Another parametrization is Uniform(location, scale) where location is the midpoint, (a + b)/2, and scale the half-width, (b − a)/2, (or the width).

In general, a location parameter is one that positions the PDF as a whole somewhere along the abscissa. A scale parameter does not change this overall position but changes its width. Consequently, any normalized distribution with a very small scale will be very tall since the area underneath the PDF must equal one.

Note that it is possible for a PDF to consist of several separate pieces with disjoint (non-intersecting) areas. In that case, the total area under all pieces must still equal one.

6.5.2 Jeffreys

Jeffreys(a, b) describes variates that are ∼ Uniform(a, b) but in log space, not linear space. For instance, x ∼ Jeffreys(0.1, 10) means that x is in the interval [0.1, 10] but it has equal probability in each decade: [0.1, 1) and [1, 10], as shown in Figure 6.4.

Jeffreys(a, b) = 1/(x log(b/a)) ;  0 < a ≤ x ≤ b    (6.11)


Figure 6.3: Uniform PDF

Figure 6.4: Jeffreys Distribution: Linear Space vs. Log Space

The Jeffreys distribution has some useful technical properties but is most often used in data analysis to describe scale parameters for which there is so little information that even the order of magnitude is uncertain. [26, pg. 54]

With the example shown in the figure above, the probability (measuring uncertainty) is equally divided between the two halves so that the median = 1. The total area under the left sub-figure is equal to one but, on the right, with a log axis, it is not because the ordinate is no longer a probability density.


6.5.3 Gamma

Gamma(shape, scale) is a right-tailed distribution bounded on the left by zero, its location parameter. In addition to location and scale, it also has a shape parameter, α > 0. Shape parameters are usually exponents and therefore dimensionless.

Gamma(α, β) = x^(α−1) / (Γ(α) β^α) · exp(−x/β) ;  x > 0    (6.12)

where β = scale and Γ(·) is the (complete) gamma function. See Figure 6.5.

Figure 6.5: Gamma PDF

The Gamma distribution arises naturally in a variety of physical circumstances such as the distribution of speeds for molecules of a gas. It is commonly utilized to model an unknown variance or standard deviation (since these must be positive).

There are a number of special cases of the Gamma distribution distinguished by their parameters. A few of these are as follows:

Exponential

Exponential(scale) is a Gamma distribution with shape = 1.

Exponential(λ) = (1/λ) exp(−x/λ) ;  x ≥ 0    (6.13)


where λ = scale. In this parametrization, λ is also the mean and λ^2 is the variance. See Figure 6.6.

One special feature of the Exponential distribution is that it exhibits the memoryless property very commonly seen in, e.g., physics, chemistry and queueing theory.

Figure 6.6: Exponential PDF

ChiSquared

ChiSquared(ν) is the Gamma distribution with scale = 2 and shape = ν/2, where ν > 0 equals degrees of freedom.

ChiSquared(ν) = x^(ν/2 − 1) / (Γ(ν/2) 2^(ν/2)) · exp(−x/2) ;  x > 0    (6.14)

Here, degrees of freedom need not be an integer.

The ChiSquared distribution has several characterizations but it is most commonly utilized as a goodness-of-fit criterion in the Chi-Squared test. Goodness-of-fit is a measure that quantifies how well a model describes the observed data. It can be shown that certain kinds of data, especially data that fall into categories, will deviate from theory by an amount described by the ChiSquared distribution. We shall make use of this test in some of our later examples.


6.5.4 Normal

Normal(µ, σ) is the most famous of all non-uniform probability distributions.

Normal(µ, σ) = 1/(σ √(2π)) · exp[−(x − µ)^2 / (2σ^2)] ;  −∞ < x < ∞    (6.15)

where µ = mean (location) and σ = standard deviation (scale).

It is often written in standard form as follows:

Normal(0, 1) = 1/√(2π) · exp[−z^2/2] ;  −∞ < z < ∞    (6.16)

where z = (x − µ)/σ.

The Normal (Gaussian) distribution, sometimes called the “bell curve”, is actually a limiting case of the Gamma distribution, being the asymptotic limit of the Gamma as shape goes to infinity. However, it deserves a special section all to itself for a great many reasons and several books have been written about it. It is a symmetric distribution, Gamma origin notwithstanding. Two examples are plotted in Figure 6.7.

Figure 6.7: Normal PDF

The CDF of the Normal distribution cannot be written in simple, closed form but it is so important that it is used to define the error function, erf(z).

Two of the many reasons for this importance are the Central Limit Theorem and the Principle of Maximum Entropy.


Central Limit Theorem

The Central Limit Theorem (CLT) is a powerful theorem in probability that has been shown to be true. [32, pg. 246] In non-mathematical language, it states that, if you observe a large number of independent variates (uncertain quantities), then the distribution of the mean of those variates will approach a Normal distribution as the sample size approaches infinity (even if the original variates are non-Normal).

In data analysis, the CLT applies most strongly to uncertain location parameters since they are usually unlimited in scope, as is the Normal PDF. In scientific and engineering disciplines, “best practice” was traditionally taken to include the assumption of normally-distributed errors. In particular, the traditional 95-percent confidence interval was assumed to equal (and was almost always reported as) µ ± 1.96σ because that is the symmetric range containing 95 percent of the probability under a Normal PDF (cf. Chebyshev's inequality). State-of-the-art analysis, to be described in later chapters, makes this assumption unnecessary but it is very often a good assumption, especially when the sample size is large. Were it not, people would have noticed a long time ago!
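A brief simulation (a sketch, not taken from the book) makes the CLT tangible: means of samples drawn from a decidedly non-Normal distribution behave very much like Normal variates.

import numpy as np

rng = np.random.default_rng(3)
n, trials = 100, 50_000
means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)   # 50,000 sample means

# For a Normal distribution, about 95 percent of values lie within 1.96
# standard deviations of the mean; check how close the sample means come.
mu, sd = means.mean(), means.std()
coverage = np.mean(np.abs(means - mu) < 1.96 * sd)
print(mu, sd, coverage)          # mean ~ 1, sd ~ 1/sqrt(n) = 0.1, coverage ~ 0.95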

Principle of Maximum Entropy

The second, and even stronger, reason for the ubiquity of the Normal distribution in data analysis is the Principle of Maximum Entropy (MaxEnt). Explaining what entropy has to do with probability would take us beyond the scope of this book. [30, chap. 7] Suffice it to say that adopting the MaxEnt criterion means being maximally noncommittal regarding information that is unavailable—in other words, in a strict, mathematical sense, making the fewest assumptions. As one example, if the only thing that you know about some uncertain quantity is that it has a variance greater than zero, then MaxEnt requires that you describe that quantity with a Normal distribution. [26, pg. 198]

We shall encounter many examples in this book in which we know essentially nothing about some continuous quantity, q, and will describe our state of knowledge (ignorance) about q using a Normal PDF. Given MaxEnt, that description need not be justified any further.

HalfNormal

HalfNormal(scale) is the distribution that results when a Normal PDF, with mean = 0, is folded to the right about the origin. Thus, it describes the absolute values of Normal variates with mean = 0.

HalfNormal(s) = (1/s) √(2/π) · exp[−x^2/(2 s^2)] ;  0 ≤ x < ∞    (6.17)

where s = scale (but not the standard deviation).

Two examples are shown in Figure 6.8. They look like half of a Normal distribution and, except for the moments, can be thought of in this way.


Figure 6.8: HalfNormal PDF

6.5.5 LogNormal

LogNormal(µ, σ) is the distribution that is Normal(µ, σ) in log space, that is, the Normal analogue of Jeffreys vs. Uniform. Two examples are shown in Figure 6.9.

LogNormal(µ, σ) = 1/(x σ √(2π)) · exp[−(log(x) − µ)^2 / (2σ^2)] ;  0 < x < ∞    (6.18)

6.5.6 Laplace

Laplace(µ, s), also called DoubleExponential, is the distribution whose absolute deviation from its center, |x − µ|, is Exponentially distributed. Two examples are shown in Figure 6.10.

Laplace(µ, s) = 1/(2s) · exp(−|x − µ|/s)    (6.19)

The Laplace distribution, with scale = s, has wider tails than the Normal distribution. It will be used later to implement an important algorithm for inference as well as in the case study described in Part IV.


Figure 6.9: LogNormal PDF

Figure 6.10: Laplace PDF


6.5.7 Triangular

Triangular(a, b, c) is the distribution, shaped like a triangle, over the range [a, b] with mode at c. See Figure 6.11. Its PDF consists of two pieces separated by the mode:

Triangular(a, b, c) = 2(x − a) / [(b − a)(c − a)] ;  a ≤ x ≤ c
                      2(b − x) / [(b − a)(b − c)] ;  c < x ≤ b    (6.20)

Figure 6.11: Triangular PDF

A triangular distribution is a simple choice for describing an uncertain quantity that is doubly-bounded with a unique mode. We shall see an example later in one of our models.

6.5.8 Beta

Beta(α, β) is a much more flexible distribution for doubly-bounded quantities. The reason is that the standard form, defined for the range (0, 1), does not need location or scale parameters but has two shape parameters (greater than zero, as usual). The latter permit a great variety of shapes, from uniform to highly nonlinear.

Beta(α, β) = 1/B(α, β) · x^(α−1) (1 − x)^(β−1) ;  0 < x < 1    (6.21)


where B(α, β) = Γ(α)Γ(β)/Γ(α + β), known as the beta function. Here, α, β > 0 are shape parameters.

Figure 6.12 shows three examples of this distribution in the standard interval. There is also a generalized version of the Beta distribution defined over an arbitrary interval. (Compendium, pg. 23)

Figure 6.12: Beta PDF

6.5.9 Weibull

Weibull(k, λ) is a distribution, bounded on the left by zero, that is often used to describe failure rates and extreme events. (See Figure 6.13.)

Weibull(k, λ) = (k/λ) (x/λ)^(k−1) · exp[−(x/λ)^k] ;  0 ≤ x < ∞    (6.22)

for k (shape) and λ (scale). It is a generalization of the Exponential distribution.

6.5.10 BivariateNormal

BivariateNormal(µx, µy, σx, σy, ρ) is the 2-D case of the MultivariateNormal distribution. It is one of the simplest examples of a distribution for a vector—in 2-D, just (x, y). Each of the components has its own mean, µ, and standard deviation, σ, with the components


Figure 6.13: Weibull PDF

being connected by the correlation, ρ. (See Section 4.2.2.)

BivariateNormal(µx, µy, σx, σy, ρ) = 1/(2π σx σy √(1 − ρ^2)) · exp[−(zx^2 + zy^2 − 2ρ zx zy) / (2(1 − ρ^2))] ;  −∞ < zx, zy < ∞    (6.23)

where zx = (x − µx)/σx and similarly for zy.

With two independent variables, a plot of this PDF requires three dimensions. A 3-D plot and contour plot of BivariateNormal(1, 3, 1, 2, 0.8) is shown in Figure 6.14. The contours are ellipses with both the mean and mode occurring at (µx, µy). In this example, the PDF value at the mode is ≈ 0.132629.

When there are more than two components, it is hard to make a plot of the data and the PDF becomes a bit unwieldy. Still, there are datasets that require more than two dimensions, e.g., a distribution of points in space, and the math can be handled even if the points cannot be easily pictured on a flat page.


Figure 6.14: BivariateNormal PDF


6.6 Basic Discrete Distributions

Discrete distributions are defined over the positive integers or, in some cases, a contiguous subset. These distributions are, strictly speaking, described by a probability mass function, pmf, rather than a PDF since the concept of density does not really apply to “bins” of zero width and the formula for a pmf is dimensionless, unlike a PDF. However, in this book, we shall continue to use the term PDF to keep things simple.

6.6.1 DiscreteUniform

DiscreteUniform(a, b) is the discrete analogue of the Uniform distribution described above. It is defined for integer x in the range [a, b] and its PDF (pmf) is equally divided among the n = b − a + 1 integers, as shown in Figure 6.15.

DiscreteUniform(a, b) = 1/(b − a + 1) ;  x ∈ {a, a+1, . . . , b−1, b}    (6.24)

Figure 6.15: DiscreteUniform PDF

The CDF and all uses of the PDF of a discrete distribution are expressed as sums rather than definite integrals. For this distribution, the CDF has a closed form, as follows:

F(x) = (⌊x⌋ − a + 1) / (b − a + 1)    (6.25)

where ⌊·⌋ is the floor function.


6.6.2 Multinomial

Multinomial(p[], n) is a distribution describing how many observations fall into each of k+1 disjoint categories, usually numbered [0 . . . k], where n is the total for all categories and p[] is the vector of category (cell) probabilities. For this distribution to be well-defined, n must be known; otherwise, the moments are no longer unique. There are only k degrees of freedom. Once that many cells are given values, the last cell count must make the total equal to n.

Multinomial(p[], n) = n! / (x0! x1! · · · xk!) · p0^x0 p1^x1 · · · pk^xk ;  xi = 0, 1, 2, . . .    (6.26)

This distribution is clearly appropriate for describing data such as the horned lizard capture counts shown in Table 1.1 and we shall use it later for that purpose.

There are three special cases of the Multinomial distribution formed by restricting the number of categories, the sample size or both. These are described below:

Binomial

Binomial(p, n) describes cell probabilities when there are only two cells. The latter are variously labeled {success, failure}, {True, False} or any other disjoint pair. Once again, parameter n is assumed to be known. Conventionally, p is the probability of success, however that is defined.

Binomial(p, n) = C(n, x) p^x (1 − p)^(n−x) ;  x = 0, 1, 2, . . . , n    (6.27)

where C(n, x) = n!/(x!(n − x)!) is a binomial coefficient, the number of ways to choose x out of n.

The mean of this distribution is np and the variance is np(1 − p). Obviously, any observed mean could be the product of an infinite number of np pairs, which is why n needs to be known. Typically, p is unknown and is the target of the analysis.

Table 6.1: Cell Probabilities

x      P(x)
0      0.018228
1      0.089782
2      0.198993
3      0.261365
4      0.225281
5      0.133151
6      0.054652
7      0.015382
8      0.002841
9      0.000311
10     0.000015
Total  1.000001

If an experiment has a probability of success of 0.33 and we carry out that experiment ten times, the probabilities for observing success {never, one time, two times, . . . , ten times} are ∼ Binomial(0.33, 10). These values are listed in Table 6.1 (to 6 sig. figs.) and plotted in Figure 6.16.

The expected number of successes, x, can be found by using the formula np or by computing the weighted average of x (discrete version), as discussed in Section 3.3.2. The latter is just the dot product of x and the vector for P(x). In either case, the answer is 3.3.
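Table 6.1 and the expected count are easy to reproduce; the following sketch computes the Binomial(0.33, 10) cell probabilities and their weighted average (the dot product described above).

from math import comb

p, n = 0.33, 10
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]   # (6.27)

for x, prob in enumerate(pmf):
    print(f"{x:2d}  {prob:.6f}")                 # matches Table 6.1

print("sum  =", sum(pmf))                        # 1.0 (up to rounding)
print("mean =", sum(x * prob for x, prob in enumerate(pmf)))    # n*p = 3.3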

The Binomial distribution is almost as ubiquitous as the Normal distribution and we shall utilize it in several of the examples discussed in Part III.


Figure 6.16: Binomial PDF

Categorical

Categorical(p[]) is the Multinomial(p[], n) distribution with n = 1. With this substitution, (6.26) is simplified as follows:

Categorical(p[]) = p_i if x = i, and 0 if x is an invalid index    (6.28)

This distribution is used to describe the probability for a single observation that could be any of several possibilities.

Bernoulli

Bernoulli(p) is the distribution describing a single trial with a binary {True, False} set of possible outcomes. Thus, it is a Binomial(p, n) distribution with n = 1, as follows:

Bernoulli(p) = p if True, and 1 − p if False    (6.29)

where p = P(True).

Although trivial, the Bernoulli distribution is very useful in data analysis because it encodes a True/False outcome as a distribution. We shall see some examples in Part III.


6.6.3 Geometric

Geometric(p) is the distribution describing the number of failures prior to the first success, given P(success) = p.

Geometric(p) = p (1 − p)^x ;  x = 0, 1, 2, . . .    (6.30)

With this definition, the Geometric distribution is the discrete analogue of the Exponential distribution (see above).

Alternatively, the Geometric distribution can be defined as the distribution describing the number of trials needed to realize the first success. In that case,

Geometric(p) = p (1 − p)^(x−1) ;  x = 1, 2, 3, . . .    (6.31)

Using the first definition with p = 0.3, we get the plot shown in Figure 6.17.

Figure 6.17: Geometric PDF

6.6.4 Poisson

Poisson(λ) describes the number of times a specified event occurs in a fixed time interval if the events themselves have an interarrival time ∼ Exponential(λ).

Poisson(λ) = (λ^x / x!) exp(−λ) ;  x = 0, 1, 2, . . .    (6.32)


The Poisson distribution is often used as the starting point for a description of any kind of event that arrives in any sort of queue. Parameter λ is the mean of Poisson(λ) as well as the mean of Exponential(λ).

Figure (6.18) shows a plot of the Poisson(λ) distribution with λ = 2.5.

Figure 6.18: Poisson PDF

6.7 Mixture Distributions

A simple distribution such as those discussed in the last Section is often insufficiently flexible to describe real-world data. When this happens, it is sometimes useful to use a weighted combination of more than one distribution. It is even possible to combine discrete distributions with continuous distributions. Any such weighted combination is termed a mixture (distribution).

Mixtures can be partitioned into two classes depending upon whether the number of components is finite or infinite.

6.7.1 Discrete Mixtures

Discrete mixtures are formed from a weighted sum of a finite (usually small) number of component distributions. The weights must be greater than zero and must sum to one. Mixtures pose some special problems for data analysis because they are often ill-defined. For this reason, they have a separate chapter of their own, Chapter 15.


6.7.2 Continuous Mixtures

A continuous mixture, M(·), also called a compound distribution, is a distribution, f(θ, ·), with an infinite number of values for parameter θ. The weights, w(θ), are described by another (normalized) distribution. The combination is expressed as a definite integral as discussed in Section 3.3.2.

M(·) = ∫_0^∞ w(θ) f(θ, ·) dθ    (6.33)

where M(·) is no longer a function of θ but, rather, the average over all values of θ.

Compound distributions are used to allow each datapoint in a sample to have a different value of θ. We discuss two examples, one continuous and one discrete.

Student’s t

StudentsT or, simply, the t distribution is the distribution one obtains when the precision (1/variance) of a Normal distribution with mean zero is itself ∼ Gamma(ν/2, 2/ν), symbolized

Normal(0, τ) ∧_τ Gamma(ν/2, 2/ν)

The integral (6.33) can be evaluated in closed form as follows:

Normal(0, τ) = (√τ/√(2π)) · exp(−τ x^2 / 2)    (6.34)

Gamma(ν/2, 2/ν) = (ν/2)^(ν/2) · τ^(ν/2 − 1)/Γ(ν/2) · exp(−ν τ / 2)    (6.35)

StudentsT(ν) = ∫_0^∞ (ν/2)^(ν/2) · τ^(ν/2 − 1)/Γ(ν/2) · exp(−ν τ / 2) · (√τ/√(2π)) · exp(−τ x^2 / 2) dτ
             = Γ[(ν + 1)/2] / (√(ν π) Γ[ν/2]) · (x^2/ν + 1)^(−(ν+1)/2)    (6.36)

where ν > 0.

When parameter ν is an integer, it is equal to the degrees of freedom for the problem. In general, it is real and greater than zero, and then the plot of the PDF looks like a standard Normal distribution but with much heavier tails, as illustrated in Figure 6.19.

A special case of (6.36), with ν = 1, is called the standard Cauchy distribution. This unusual distribution arises naturally in some physical situations but it has no moments (i.e., no mean, no variance, . . . !) because the corresponding integrals do not converge. With random variates drawn from the Cauchy distribution, the Law of Large Numbers fails. Nevertheless, the Cauchy PDF can be normalized since that integral converges.


Figure 6.19: StudentsT PDF

BetaBinomial

BetaBinomial(α, β, n) results when parameter p of Binomial(p, n) is itself ∼ Beta(α, β), that is

Binomial(p, n) ∧_p Beta(α, β)

Proceeding as above, we have

BetaBinomial(α, β, n) = ∫_0^1 [p^(α−1) (1 − p)^(β−1) / B(α, β)] · C(n, x) p^x (1 − p)^(n−x) dp
                      = C(n, x) · B(x + α, n − x + β) / B(α, β)    (6.37)

where B(·, ·) is the beta function defined in Section 6.5.8.

The BetaBinomial distribution is used with data that are nominally ∼ Binomial(p, n) but, with a single value for p, the goodness-of-fit is poor. The BetaBinomial choice results in an overdispersed Binomial-type distribution.

Figure 6.20 shows PDF (probability) values for BetaBinomial(6, 2, 10) which can be compared to those in Figure 6.16 for Binomial(0.33, 10). The overdispersion of the former gives a broader range of values.


Figure 6.20: BetaBinomial PDF

6.8 Truncated and Censored Distributions

In addition to mixtures, another common method for modifying a distribution in order to make it a better (or valid) description of data is by restricting the domain over which the distribution is defined. This procedure gives rise to two closely related modifications of a distribution: truncated and censored.

6.8.1 Truncated Distributions

A truncated distribution is a (usually univariate) distribution with part of one or both tails of its domain removed. One example is the HalfNormal(s) distribution described above in which the (−∞, ∞) domain of a Normal(µ, σ) distribution is reduced to [0, ∞).

Truncation is appropriate when the tails removed are invalid for the data observed so their probability must be zero. In this case, the PDF, f(x), must be renormalized by dividing it by the integral over the valid range, giving g(x)—the distribution truncated to the domain (a, b].

g(x) = f(x) / (F(b) − F(a)) ;  a < x ≤ b    (6.38)

In some special cases, like HalfNormal(s), the new domain may be closed at both ends.

Often, truncated distributions must be coded manually, as is done in the example on page 146.
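Coding a truncated distribution “by hand” amounts to applying (6.38); the sketch below does this for a Normal PDF restricted to an interval. The choice of distribution and bounds is purely illustrative, not the example from page 146.

from scipy.stats import norm

mu, sigma = 0.0, 1.0
a, b = 0.0, 2.0                                                  # keep only the interval (0, 2]

norm_const = norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma)     # F(b) - F(a)

def truncated_pdf(x):
    if a < x <= b:
        return norm.pdf(x, mu, sigma) / norm_const               # renormalized as in (6.38)
    return 0.0

print(truncated_pdf(1.0), truncated_pdf(3.0))                    # nonzero inside, zero outside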


6.8.2 Censored Distributions

A censored distribution is appropriate under circumstances different from those requiring truncation.

One example: Imagine a quantity that can take on values in a broad range, even (−∞, ∞), but that the instrument used to measure these values has a smaller range. This is reasonable since any instrument has a lower bound below which it is not sensitive and an upper bound above which the instrument will fail (with or without smoke coming out the back). When it tries to measure a value that is out of range, the “pointer” will point to a bound, not to the true value. True values in the missing tails exist but are now “censored”.

Under these circumstances, it would be incorrect to truncate the tails, thus declaring them impossible. The usual alternative is to combine the non-censored range with a point mass at the bound (one-sided or two-sided). As with truncated distributions, there is no simple formula for a censored distribution so it must be implemented by manual coding.

Figure 6.21 shows a plot of the PDF of Exponential(3) right-censored at 5. The area under the curve plus the height (probability mass) of the bound equals one.
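The point mass at the bound is just the censored tail probability; a short sketch (illustrative names, small simulation) confirms this for the Exponential(3) example right-censored at 5.

import numpy as np

scale, bound = 3.0, 5.0
point_mass = np.exp(-bound / scale)            # P(X > 5) = exp(-5/3), about 0.189

rng = np.random.default_rng(5)
x = rng.exponential(scale, size=100_000)
censored = np.minimum(x, bound)                # the "pointer" stops at the bound

print(point_mass, np.mean(censored == bound))  # theoretical vs. simulated mass at the bound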

Figure 6.21: Right-censored Exponential PDF

6.9 An Experiment in Probability

The preceding sections have described a number of theoretical ideas and formulas but without really proving any of them. Therefore, we close this chapter with an experimental investigation that has a known answer. In doing so, we hope to show that

• Random variates can be drawn from a known distribution

• That these variates can be utilized in a Monte Carlo process to estimate the value of a (known or unknown) quantity

• That the Law of Large Numbers assures that this estimation gets more precise as the sample size increases

• That the additivity-of-variance formula (Sect. 4.2.4) gives correct results

If we can demonstrate the first three of these things, it will constitute a justification, in part, for the data-analysis methodology to be described in Part III. The fourth will provide some additional assurance that our theoretical results have validity.

6.9.1 The Experiment

Figure 6.22: A Quarter Disk

Consider a unit disk at the origin. Inscribe this disk in a square, then consider just the first quadrant, giving Figure 6.22. The area of the square is one unit and the area of the quarter disk is π/4 units. Therefore, the ratio of the two areas (disk/square) is π/4. If we could find this ratio by some computational process, we could estimate the value of π.

Employ a Monte Carlo loop in which we choose x and y, each ∼ Uniform(0, 1), and use them to define a point in the quadrant. Keep a count, n, of those points which fall inside the disk.

After N times through the loop, n/N will approximate π/4 because the random points chosen are unbiased. Thus, (4 n)/N will be our running estimate of π, which should get better and better as N increases.

We could stop here and start the experiment but, in fact, bullet #4 above suggests an even better procedure. We could employ antithetic variates.

Recall the formula for adding the variances of two quantities, viz.,

var(x + y) = var(x) + var(y) + 2 cov(x, y)    (6.39)

This says that the variance of the sum will be worse (greater) than the sum of the two variances whenever the components are positively correlated. But, what happens if they are negatively correlated? Then, the variance of the sum will be better, on average!


This suggests an interesting modification to our procedure. Suppose that every time we choose q ∼ Uniform(0, 1), we compute q* = (1 − q) and pass it to a parallel Monte Carlo process. We can estimate π in this parallel process at the same time as in our original process. That gives us two estimates of π each time through the loop. These estimates are negatively correlated by construction. Therefore, the average of the two estimates = sum/2 should have a lower variance, on average, than either estimate by itself. In other words, we should expect this average to be an improved estimate of π, even better, in fact, than the estimate we would get by just doubling N. [57, sect. 4.3.5]

This is what we shall do.
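A sketch of the antithetic modification (illustrative only; the variable names are ours): every pair (x, y) also feeds its mirror image (1 − x, 1 − y) to a parallel count.

    # Antithetic-variate version of the quarter-disk estimator.
    import random

    def estimate_pi_antithetic(N, seed=0):
        random.seed(seed)
        n_a = n_b = 0
        for _ in range(N):
            x, y = random.random(), random.random()
            xa, ya = 1.0 - x, 1.0 - y                # antithetic partners
            if x * x + y * y <= 1.0:
                n_a += 1
            if xa * xa + ya * ya <= 1.0:
                n_b += 1
        pi_a, pi_b = 4.0 * n_a / N, 4.0 * n_b / N
        return pi_a, pi_b, 0.5 * (pi_a + pi_b)       # chains a, b and their average

    print(estimate_pi_antithetic(1_000_000))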

6.9.2 The Results

Just to make sure that we tried hard enough, this Monte Carlo loop was executed a trillion (10¹²) times. Periodic results from the parallel processes (chains a and b) are listed, along with the error from the true value², in Table 6.2.

Table 6.2: Estimating π Using a Monte Carlo Loop

log10(N)      πa          πb         Average      |Error|
    1      2.8000000   3.6000000   3.2000000   0.05840735
    2      3.0800000   3.2400000   3.1600000   0.01840735
    3      3.1280000   3.1800000   3.1540000   0.01240735
    4      3.1568000   3.1608000   3.1588000   0.01720735
    5      3.1540800   3.1356800   3.1448800   0.00328735
    6      3.1445880   3.1432240   3.1439060   0.00231335
    7      3.1416284   3.1425488   3.1420886   0.00049595
    8      3.1416218   3.1417872   3.1417045   0.00011187
    9      3.1416181   3.1415963   3.1416072   0.00001458
   10      3.1415895   3.1415906   3.1415900   0.00000262
   11      3.1415905   3.1415893   3.1415899   0.00000273
   12      3.1415941   3.1415921   3.1415931   0.00000047

Clearly, this procedure tends to provide a closer estimate to π as N increases and the average of the two chains is usually better than either chain alone in accordance with the theoretical expectation. Figure 6.23 shows a plot of these results.

Note that this expectation is not realized every time. That is because an expectation is not the same as a guarantee. The output of any stochastic process will be subject to a variance of its own. Measurement errors are one source of stochastic variance (errors). The goal is to make that variance as small as possible. That is one reason why it is important to quantify uncertainties in any experimental result in a valid and robust fashion. In our examples in this book, that is something that we shall always do.

² Encoded in the words of a well-known mnemonic, “May I have a large container of coffee—black?”


[Plot: |estimate − π| versus log10(N), on logarithmic scales, for chains πa, πb and their Average]

Figure 6.23: Estimating π via a Monte Carlo Procedure


Chapter 7

Rules of Probability

THIS chapter contains some mathematical symbology and formulas necessary for computing probabilities under conditions that can become complicated. This will happen whenever the events to be discussed are not simple but composed of two or more sub-events with their respective sub-probabilities. Even if you use some computer software to do an analysis, you will still need to understand this material in order to set up the problem and tell the computer what to do.

We noted earlier that probability is an intuitive concept and nearly everyone has some feeling for what it means (or should mean), with or without mathematics. Therefore, we begin with an opportunity for readers to test their own intuition on a very easy (but instructive) problem. The solution will be presented at the end of this chapter.

7.1 Pop Quiz

A bag contains one poker chip that is either red or blue (equally likely). You add a second, blue chip to the bag, shake the bag and, without looking, remove one of the chips.

• What is the probability that the chip you remove is blue?

• What is the probability that you remove a blue chip and the other chip is also blue?

• If you removed a blue chip, what is the probability that the other chip is also blue?

Write down your answers then see if the contents of this chapter make the problem easier.

7.2 Symbology

Probability must, of course, be the probability of something. In order to speak in general terms, we shall refer to this “something” as an event. This event could be an occurrence or it could be something more abstract but uncertain such as a proposition or hypothesis. In any case, it must be uncertain either intrinsically or because it will happen in the future.


In the following list of definitions, A and B are events.

P(A) = the probability of A, i.e., that A will happen (or that A is true)

P(Ā) = the probability that A will not happen (or that A is false)

P(A, B) = the joint probability that A and B will both happen (or are both true)

P(A + B) = the probability that A is true or B is true or both are true

P(A|B) = the probability that A will happen (or is true) given that B has already happened (or is true)

These symbols are all that we shall use in this book. There are some additional symbols (and concepts) in the literature and many relations that can be derived using them.

7.3 Numerical Examples

To get some concrete feeling for the symbology defined above, it is helpful to consider a few very simple examples using a standard deck of 52 playing cards. If we imagine drawing one card at random from the deck, we can state the following in accordance with either the frequentist or Bayesian definitions of probability:

• P(ace) = 4/52

• P(red card) = P(black card) = 26/52

• P(face card) = 12/52

• P(not a face card) = 40/52

• P(red face card) = P(red card, face card) = 6/52

• P(red card or face card or both) = P(red card + face card) = 32/52

• P(face card given that it is a red card) = P(face card|red card) = 6/26 = 12/52

Note that the last of these is not the same as the probability that the card is a red face card because it states that you already know that the card is red so a lot of the uncertainty is removed (greater probability).
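These values are easy to confirm by brute-force enumeration of the deck; the following sketch (illustrative only, not part of the original text) does exactly that.

    # Enumerate a 52-card deck and confirm the probabilities listed above.
    from fractions import Fraction

    ranks = ['A','2','3','4','5','6','7','8','9','10','J','Q','K']
    suits = ['hearts','diamonds','clubs','spades']       # hearts and diamonds are red
    deck = [(r, s) for r in ranks for s in suits]

    red      = [c for c in deck if c[1] in ('hearts', 'diamonds')]
    face     = [c for c in deck if c[0] in ('J', 'Q', 'K')]
    red_face = [c for c in deck if c in red and c in face]

    print(Fraction(len(face), 52))                # P(face card)            = 12/52
    print(Fraction(len(red_face), 52))            # P(red card, face card)  =  6/52
    print(Fraction(len(red_face), len(red)))      # P(face card | red card) =  6/26 = 12/52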

7.4 The Two Rules

As far as data analysis is concerned, there are only two rules of probability that matter since they are both necessary and sufficient for all of our needs. These two rules are the Sum Rule and the Product Rule shown in Figure 7.1.


Sum Rule
    P(A) + P(Ā) = 1

Product Rule
    P(A, B) = P(A) P(B|A)
            = P(B) P(A|B)

Figure 7.1: Basic Rules of Probability

The Product Rule can be rearranged to give a formula for conditional probability, P(A|B). Also, given any probability, p, the Sum Rule implies that odds = p/(1 − p).

We shall not derive the rules but proofs are available. [36][26, chap. 2] These two rules will constitute the basis of our methodology for inference to be discussed in Part III.

7.5 Extensions

The Sum Rule and Product Rule have some corollaries that will be useful later. These are not new rules, just consequences that can be derived from the two basic rules.

7.5.1 Extended Sum Rule

The sixth bullet in the previous Section asks for the total probability of two possible events. In this bullet, these events are not disjoint because a card can be both red as well as a face card (e.g., Alice’s nemesis, the Queen of Hearts). The Extended Sum Rule can be derived from the two rules with about a third of a page of Boolean algebra. [26, Eq. 2.23]

P(A + B) = P(A) + P(B) − P(A, B)

The truth of this formula is evident if one draws a Venn diagram such as Figure 7.2. In this diagram, points representing the ways that A and B can be true are shown and it is clear that there is overlap so the two sets are not disjoint. If they were, P(A + B) would be the sum of the two probabilities. Here, however, simply adding the probabilities together would double-count the points in the overlap, P(A, B). Consequently, to get the union of the points, A or B, that overlap must be subtracted out once.

Consider again P(red card + face card). The simple sum, 26/52 + 12/52 = 38/52, is the wrong answer since it double-counts the six red face cards. Subtracting the overlap once gives the correct result: 26/52 + 12/52 − 6/52 = 32/52.

7.5.2 Marginals


Figure 7.2: Venn Diagram for P(A + B)

In Equation 6.37, the first line of the definition of the BetaBinomial PDF is a function of four parameters: α, β, p and n. However, parameter p is a nuisance parameter because the expression already contains a formula for p so we need not spell it out twice. It is sufficient here to compute the average behavior for p by marginalization, that is, by “integrating out” that variable. What remains is a marginal distribution—everything except p (second line of (6.37)).

Later on, we shall obtain distributions with many parameters. The whole function will be defined in a space of many dimensions and it will be of interest how each parameter looks all by itself. Geometrically, this is like taking a two-dimensional slice through the parameter space. Mathematically, this is achieved by marginalizing (integrating out) all other parameters.

We showed such a 2-D slice in Figure 1.6 for parameter tc, the time when atmospheric CO2 started to increase exponentially.

7.5.3 Extending the Conversation

The procedure of “extending the conversation” is the reverse of marginalization. Starting with some PDF = f(a, b, . . .), we add one or more variables not there originally. Why anyone would want to do such a thing is best explained with an example.

Shuffle a standard deck of cards, turn them face down and choose one at random. Look at this card. Next, choose a second card at random. What is the probability, P(R2), that this second card is red?

Answer: It depends. If the first card was red, P(R2) = 25/51. If the first card was black, then P(R2) = 26/51.

Now, repeat the experiment from the beginning but, this time, do not look at the first card. What is P(R2) in this case? It might seem that we cannot get a numerical answer since we do not know how many red cards remained for the second choosing. Is it possible to compute probabilities when relevant information is missing? Poker players the world over would like to know.

And they would be happy to know that the answer is “Yes”. We can obtain a numerical answer by extending the conversation.


We begin by using the Sum Rule for the disjoint events, R1 and R̄1. The combination of these two has a probability of one, something we knew already of course. We extend the conversation by adding R1 + R̄1 to our question about R2. Since R1 + R̄1 is certain, it does not constitute a constraint of any kind. Hence, the first line of our computation:

P(R2) = P(R2, R1 + R̄1)
      = P(R2, R1) + P(R2, R̄1)                      ; obvious, yes?
      = P(R1) P(R2|R1) + P(R̄1) P(R2|R̄1)
      = (1/2)(25/51) + (1/2)(26/51)
      = (1/2)(25/51 + 26/51) = (1/2)(1) = 1/2        (7.1)

Since we knew nothing about the first card, the result is the same as though we never drew it from the deck at all.
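A quick simulation (illustrative only; the function name is ours) makes the same point: if the first card is never examined, the relative frequency with which the second card is red settles near 1/2.

    # Draw two cards without replacement; never look at the first one.
    import random

    def second_card_red(trials=200_000, seed=0):
        random.seed(seed)
        deck = ['R'] * 26 + ['B'] * 26
        hits = 0
        for _ in range(trials):
            first, second = random.sample(deck, 2)   # first card never examined
            hits += (second == 'R')
        return hits / trials

    print(second_card_red())                         # ~0.5, as derived in (7.1)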

There is another really important lesson here. The probability of any event is a function of available information. Therefore, had we shown the first card to someone else, that person would get P(R2) ≠ 1/2 no matter what the first card was. In general, probability always depends on the information available to the analyst computing that probability. If two analysts have different information, their values for some probability might well be different. To this extent, probability is always “subjective”. It is unavoidable and should never be considered a criticism.

We shall encounter another example of “extending the conversation” in Chapter 9.

7.6 Pop Quiz: Solution

There are two chips in the bag and two choices for which is removed. Thus, there are four possible sequences that can occur. These are listed in Table 7.1 where B1 means “first chip blue”, etc. Sequences I–IV are equiprobable and disjoint so probabilities add.

Table 7.1: Possible Chip Sequences

        Bag        Removed    Result
  I     B1 B2         1         B3
 II     B1 B2         2         B3
III     R1 B2         1         R3
 IV     R1 B2         2         B3

Question #1

Of the four possibilities, three result in “blue chip removed”. Thus, the probability is 3/4.


Question #2

This question is asking for the joint probability that a blue chip is removed and, also, the remaining chip is blue. The only way that this can happen is for both chips to be blue to begin with. This is true for two of the four sequences so the probability is 1/2.

More formally,

P(B1, B2) = P(B1) P(B2|B1) = P(B1) P(B2)   ; B1 and B2 are independent
          = (1/2)(1) = 1/2                                            (7.2)

Question #3

This question is asking for the conditional probability that the remaining chip is blue given that you have already removed a blue chip. Since B2 is certain, we need ask only about B1, as follows:

P(B1|B3) = P(B1, B3) / P(B3) = (1/2) / (3/4) = 2/3        (7.3)

This same result can be seen directly from Table 7.1.
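Readers who prefer experiment to argument can check all three answers with a short simulation such as the following sketch (illustrative only, not part of the original text).

    # Simulate the pop quiz and confirm 3/4, 1/2 and 2/3.
    import random

    def chip_quiz(trials=200_000, seed=0):
        random.seed(seed)
        blue_removed = blue_removed_and_blue_left = 0
        for _ in range(trials):
            bag = [random.choice(['R', 'B']), 'B']   # unknown first chip plus the added blue chip
            i = random.randrange(2)                  # which chip is removed, without looking
            removed, remaining = bag[i], bag[1 - i]
            if removed == 'B':
                blue_removed += 1
                if remaining == 'B':
                    blue_removed_and_blue_left += 1
        return (blue_removed / trials,                       # Question #1: ~3/4
                blue_removed_and_blue_left / trials,         # Question #2: ~1/2
                blue_removed_and_blue_left / blue_removed)   # Question #3: ~2/3

    print(chip_quiz())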


Part III

Inference


THE previous two Parts of this book, Data and Uncertainty, may seem like a rather tedious digression to any reader interested in the procedure for data analysis itself. However, they comprise the context/framework for the material in this Part, none of which can be appreciated without the requisite background. We have tried to describe enough of that background to set the stage for discussion of the revolutionary changes in the state of the art that have occurred in recent years.

People have been analyzing data ever since there were data to analyze. Traditionally, data analysis was done using statistics, a discipline developed over many decades, with much success, and culminating in a vast repertoire of algorithms filling huge handbooks with techniques designed to solve almost every imaginable problem.¹ However, technical methodologies tend to improve over time and data analysis is no exception. The state of the art today is, in many ways, much better and more powerful than the traditional, frequentist approach. Why and how is the subject of this Part of the book in which we discuss inference, the procedure for extracting information from data. It turns out that, from a mathematical point of view, there is only one, unique approach for performing inference correctly, with full rigor and without ad hoc approximations, and that approach is fundamentally different from the frequentist approach.

There is, perhaps surprisingly, a very general, mathematical solution for inference, the F = ma of data analysis which, used correctly, works every time for any problem with full mathematical robustness—no approximations, no exceptions, no special cases. This advance was a long time coming but it has finally arrived.

It has been a familiar transition. After Albert Einstein developed his two theories of Relativity, Special and General, they effectively replaced the dynamical model formulated by Isaac Newton. However, Newtonian dynamics did not suddenly stop working. Rather, it was supplanted, when necessary, by something that worked better and all the time. Thus, Newtonian dynamics was seen to be an approximation. A really good approximation, to be sure, but still sub-optimal. In much the same way, state-of-the-art data analysis is supplanting traditional (frequentist) statistics. Here, the improvement is more obvious though we shall not be able to do it justice in this short book. The examples we shall discuss are all very simple and give only a hint of the underlying power—like seeing airplanes taxi but never seeing them take off. For that, you will just have to read the literature. Still, there should be enough here to give you a good introduction.

Welcome to the world of world-class data analysis!

¹ In 2018, the Encyclopedia of Statistical Sciences contained 16 volumes with 9,686 pages.


Chapter 8

Data Analysis

IN this chapter, we come to grips with very fundamental concepts. What exactly do we mean by the term “data analysis”? What are we really doing whenever we analyze data? While this might seem to belabor the obvious, the literature for the past century or so would suggest otherwise. Computational approaches for data analysis have often engendered serious controversies among proponents of conflicting procedural strategies reflecting philosophical differences which persist even to this day. [44][61, chap. 1]

Much of the controversy arose from an improper perspective coupled with some very unfortunate terminology. The latter is due to the appropriation of everyday language for purposes of technical description, in particular, the word “belief” to characterize some aspects of modern data analysis. This word is loaded with linguistic connotations that can be almost impossible to ignore. Many textbooks continue to use this word as if it were somehow appropriate. It is not.

If you are asked to name the best movie of all time, then your personal belief/opinion has some relevance. On the other hand, if you are tasked with analyzing data to determine the orbit of a near-Earth asteroid, then your beliefs, however firmly held, are not going to keep the asteroid from falling on your house if that is, indeed, its true orbit. Nothing in the real world depends on anybody’s beliefs.

The continued use of “belief” in this context is due, in turn, to an improper perspective on what is actually happening when data are analyzed. If you try to determine the orbit of an asteroid, that orbit is the same when you are finished as it was before you began. It is unaffected by your analysis. What is affected, what has changed is your state of knowledge regarding this orbit—what you can say about it. Before you analyzed a new dataset, you had some (possibly negligible) amount of knowledge about the orbit, perhaps only that it was some conic section in space. After the analysis, you now have estimates for the parameters of that conic section. Your former uncertainty has decreased.

Data analysis is all about reducing uncertainty.

With this perspective, all the nonsense about “beliefs” vanishes. Analysis has nothing to do with beliefs; it has to do with taking an initial state of knowledge and utilizing the information in some data to reduce the uncertainty in that state of knowledge.


It should be evident that there is always an initial state of knowledge since even total ignorance is included among the possibilities. It should also be obvious that the initial state of knowledge must be valid. Equating personal opinion to state of knowledge will not work unless it is based on actual experience. Otherwise, one merely initializes the classic garbage-in, garbage-out scenario.

8.1 Reducing Uncertainty

The process of utilizing the information from data to reduce the uncertainty in a state of knowledge is what we mean by “inference”. Let us take a big step back and look at the process at a very high level. What we see are

• A “problem” under consideration

• An “analyst” who is to do the considering

• A set of newly available (and relevant!) data

The problem can be anything at all; it need not be numerical but, bearing in mind the words of Lord Kelvin, it usually is. The analyst, likewise, can be anyone trying to understand the problem in the light of the new data which, perhaps, might be new only to that analyst. The analyst then reduces his/her uncertainty using some procedure for inference. A flow diagram for the whole process would therefore comprise three sequential phases:

1. State of knowledge before examining the new data

2. Incorporating new information, if any, from the new data

3. State of knowledge after considering the new information

Phases 1 and 3 are termed prior and posterior, resp., referring explicitly to the analyst. That is, the process of inference affects nothing in the real world except what this analyst learned (information gained) from the data. Unfortunately, this qualification is not always acknowledged but it is crucial. Different analysts may have different degrees of expertise and what is new to some might be elementary to others. Therefore, when we speak of state of knowledge, it is necessarily with reference to the individual(s) performing the inference. Some would complain that this makes inference subjective but this is hardly a valid complaint since it is unavoidable. There really is such a thing as domain expertise.

Quantifying a state of knowledge requires a well-defined measure of knowledge. We shall not attempt to define “knowledge” directly but, instead, shall quantify it in terms of its complement, uncertainty. By now, we should have a good handle on uncertainty and equating lack of uncertainty to knowledge accords with the intuitive notion of the latter.

Uncertainty, the degree to which something, X, is not known perfectly, is quantified by assigning a probability to X. As noted earlier, this does not imply that X is random.


We employ probability as our measure of uncertainty because it is intuitively appropriate and mathematically convenient. Here, X will always be some well-defined proposition or hypothesis. Thus, we shall apply probabilities only to the truth of propositions, never to data or to anything in the real world. Here are some examples of well-defined propositions:

• It will rain a week from today in London, England.

• Female spiders are usually bigger than male spiders.

• The half-life of carbon-14 is in the range [5690, 5770] years.

• Adult literacy is greater in Europe than in North America.

As a specific numerical example, recall the population, N, of horned lizards discussed in Sect. 1.4.1. It was unknown, something that we were uncertain about. However, it was clearly some integer and might even have been known, with near certainty, to someone who had counted every lizard very carefully. Using the data, we described (omitting the details) a process of inference that decreased our level of uncertainty (increased our state of knowledge) about N after which we could make some quantitative statements regarding various propositions such as N = 83 or N = 100. Each of these propositions could now be assigned a probability. The value of N, however, remained whatever it was regardless of our analysis. It did not, and does not, have a probability. It is only propositions that can be said to have probabilities because they can be uncertain. Things that are real are never uncertain.¹

8.2 Desiderata of Valid Inference

We have discussed probability at some length already but more is required to formulate a valid procedure for inference. The most basic requirements are what are known as the Polya-Cox-Jaynes desiderata. The term desiderata implies that these are not axioms, taken to be true by definition, but characteristics that are desirable if the inference procedure is to be mathematically consistent and successful in accordance with fundamental logic. There are various forms for these desiderata. The list shown below (Table 8.1) is adapted from Ref. [26, pg. 30].

The desiderata are largely self-explanatory, and perhaps even self-evident, but a few comments are in order. Item #1 does not specify a range for probability but mathematical consistency requires that the range be [0, 1] [36, pg. 37]. In item #2, deductive limit means that probability must, when appropriate, converge to something inside the [0, 1] bounds. In item #3b, it is understood (or should be) that “all information” means valid information, not bias, personal opinion² or deliberate misinformation. A consensus of expert opinion, however, is legitimate and often solicited at the outset of an analysis.

¹ unless quantum mechanics or chaotic behavior is dominant
² even if that person is your boss or customer


Table 8.1: Desiderata for Valid Inference

1. Probability is represented as a real number.

2. Probability must exhibit qualitative agreement with rationality. For any true (false) proposition, it must increase (decrease) continuously and also monotonically as new information is supplied. A deductive limit must be obtained where appropriate.

3. Consistency

• Structural consistency: If a conclusion can be reached in more than one way, every possible way must yield the same result.

• Propriety: The theory must take into account all information relevant to the question.

• Jaynes consistency: Equivalent states of knowledge must be represented by equivalent probability assignments (thus assuring objectivity).

Any possible confusion or dispute regarding objectivity is resolved by item #3c. An analysis is deemed to be objective if and only if all analysts with identical information reach identical conclusions, expressed as probabilities. If this does not occur then, clearly, something other than information (something subjective) was inserted into the analysis.

8.3 Summary

The target of any data analysis is the state of knowledge of the analyst(s). The goal is to increase this state of knowledge which means reducing the initial uncertainty to some extent. The extent to which the state of knowledge increases depends upon the amount of information extracted from the data.

It now remains to specify a procedure for inference consistent with the desiderata listed above. In this procedure, knowledge will be quantified as the complement of uncertainty and measured in terms of probabilities.


Chapter 9

Making an Inference

ANY valid procedure for inference must be consistent with known mathematics and the desiderata listed in the previous chapter. Those desiderata suggest that we do our computations in probability space. Since the laws of probability are well established (see Chaps. 6, 7), they must constitute our starting point. All of this comprises an area of research that has been carried out over hundreds of years beginning with the work of Thomas Bayes in the eighteenth century. Here, we give the results of that effort.

9.1 Bayes’ Rule

In the language of Chapter 8, we want to describe our posterior state of knowledge about a proposition, H, given our prior state of knowledge plus some new data, d. Moreover, we want to describe (encode) everything in terms of probabilities. Therefore, our posterior is a conditional probability, symbolized P(H|d). Our prior state of knowledge (before we ever examined the data) is symbolized P(H). Our posterior is therefore (see Sect. 7.4)

P(H|d) = P(H, d) / P(d)        (9.1)

which can be rearranged as

P (H,d) = P (H|d)P (d) (9.2)

By definition, P (H,d) = P (d, H). Therefore,

P (H|d)P (d) = P (H)P (d|H) (9.3)

and, finally,

P(H|d) = P(H) P(d|H) / P(d)        (9.4)

Equation (9.4) is Bayes’ Rule. Obviously, it is just the Product Rule of probability but, here, we employ it as our fundamental rule of inference. It constitutes the necessary and sufficient approach for increasing our state of knowledge by examining new data.


Necessary: If you want your analysis to be correct, in the mathematically rigorous sense, then you must utilize Bayes’ Rule (actually, the Sum Rule as well as the Product Rule) to carry out that analysis. Alternate approaches, e.g., statistics, might approximate the correct answer in many cases but such alternatives are not guaranteed to work every time. The Sum and Product Rules do work every time.

Sufficient: Bayes’ Rule (plus the Sum Rule) is all that you ever need to do any kind of data analysis. It does not matter what kind of data you have or what sort of conclusions you want to draw. The rules of probability are universally valid and Bayes’ Rule, when employed correctly, will produce the right answer every time given your prior information (or lack thereof) and some new data.

Note: This does not imply that using Bayes’ Rule will be easy in every case.

Analyzing (data + prior info) using (9.4) is termed Bayesian inference. [15][26][61] It is very different from frequentist inference. Some procedural differences are as follows:

9.1.1 Bayesian Inference:

• Computations employ the Sum Rule and Product Rule and nothing else.

• Analyses begin by encoding everything in terms of probabilities.

• Bayes’ Rule is then used to generate the posterior (the answer) which encodes the complete, post-data state of knowledge.

• Since the posterior is defined in probability space, there is no need to “translate” analysis results into probabilities—no “hand-waving” required.

9.1.2 Frequentist Inference (traditional statistics):

• Computations are based on the data alone; domain expertise is totally deprecated.

• Data are assumed to be a random sample drawn from a parent population which is often fictitious, sometimes impossible.

• A statistic (a metric having no free parameters) is computed from the data.

• The observed value of this statistic is assessed using its expected distribution¹ in the (hypothetical) parent population.

• The probability of this value, along with the defining parent population, produces the inference. For instance, a very large, improbable value would indicate that the assumed parent population did not describe the data.

¹ often, only asymptotically correct


Three of the expressions in Bayes’ Rule are PDFs:

P(H|d) The posterior—the answer to the problem describing our state of knowledge after we have incorporated the new data.

P(H) The prior—whatever we knew before examining the new data. It is equal to the joint PDF describing our pre-data uncertainty for all of the unknowns in the problem.

P(d|H) The likelihood—in effect, the joint probability of the data given some specific values for the unknowns.

The denominator, P(d), is just a number. Formally, it is the probability of the data given everything else. Quantitatively, it equals the integral of the numerator, so its reciprocal is the constant that normalizes the posterior. It has various names, viz., marginal likelihood, global likelihood, evidence. If there are only two unknowns, then a plot of the numerator of the right-hand side (RHS) of Bayes’ Rule will be a 3-D graph. In that case, P(d) equals the volume under that graph. With more dimensions, this is harder to visualize but the idea (hypervolume) is the same.

Note that the prior is not a function of the data. In other words, one must never use the new data to define (say anything about) the prior. Doing so violates Bayes’ Rule and, therefore, the Product Rule of probability. Given sufficient data, it is permissible to use a small fraction of it for exploratory analysis and, thus, influence the prior. However, that subsample must not be used again since it is no longer new data.

Finally, since P(d) is just a number, it will often be sufficient to employ a simpler form of Bayes’ Rule:

P (H|d) ∝ P (H)P (d|H) (9.5)

9.2 Traditional Rules

Since we shall be discussing state-of-the-art, Bayesian methodology for data analysis, it is important to have a basis for comparison. Until the algorithmic and software advances at the end of the last century, the only practicable way to analyze data was by employing frequentist methods. The typical problem was to model the data using an appropriate likelihood distribution then find the “best” parameters for that distribution given the data. The likelihood could then be utilized to obtain further information.

There are two ways to find “optimum” likelihood parameters: maximum likelihood and the method of moments.

9.2.1 Maximum Likelihood (M-L)


The frequentist method considered ideal is the maximum-likelihood procedure. Since the likelihood quantifies the joint probability of the data given the model, H, it is reasonable to assert that the best parameters are those that maximize the probability of what was, in fact, observed. Observations are almost always considered to be independent of each other so, for N datapoints and PDF = f(·), the likelihood, L, is just the product of N PDF values taken over the datapoints.

L = ∏_{i=1}^{N} f(x_i|H)        (9.6)

where x_i could be a scalar or a vector. To avoid numerical overflow, the computation is always carried out in log space, giving the logLikelihood.

logLikelihood = ∑_{i=1}^{N} log[f(x_i|H)]        (9.7)

The parameters that maximize log[f(·)] will also maximize f(·).

A problem with the M-L approach, hardly ever acknowledged, is that the “most likely” parameters are not necessarily likely at all. Suppose you toss 100 coins in the air and predict how many “heads” you will get. Assuming the coins are fair, P(head) = 1/2 and the M-L prediction is 50 heads. However, from the Binomial distribution, it is clear that the actual probability of this result is less than eight percent.

P(50 heads) = (100 choose 50) (0.5)^100 ≈ 0.0795892        (9.8)

Nevertheless, maximum likelihood is almost always the frequentist method of choice.
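The figure in (9.8) is easy to verify; the following two lines (illustrative only) use exact integer arithmetic for the binomial coefficient.

    # Check Eq. (9.8): probability of exactly 50 heads in 100 fair tosses.
    import math
    print(math.comb(100, 50) * 0.5 ** 100)    # ~0.0795892, i.e., less than eight percent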

9.2.2 Method of Moments

A second, inferior but sometimes necessary, approach is to compute moments from the data and set up simultaneous equations with their theoretical expressions. Solving these equations yields the parameters that guarantee a match between theoretical and observed moments which is enough to justify the method in many cases.

9.3 Bayes’ Rule: Numerical Example

Bayes’ Rule may be unfamiliar so it is instructive to examine a very simple numerical example in which all necessary information is known. The results might be surprising.

Let us suppose that there is some disease that afflicts one person in ten thousand and let us further suppose that there is a new test for this disease that is an excellent diagnostic. In particular, suppose that it exhibits only one percent false positives and one percent false negatives. In other words, if you have the disease, it will say you do not (false negative) one percent of the time. Likewise, if you do not have the disease, it will say you do (false positive) one percent of the time. This is far better than most such tests.

Now, you go to a doctor and get this test and the test comes out positive. How worried should you be? What is the probability that you have this disease?


If you do not know Bayes’ Rule, then you are likely to think that it is 99-percent certain that you have the disease. After all, the test is guaranteed to have P(false positive) = 0.01. However, your conclusion would be wrong because it ignores prior information. To perform the computation correctly, you must utilize all available prior information plus the data = the fact that the test was positive (tp). Here, prior information is listed in Table 9.1. Given the data, you can then find the correct answer by applying Bayes’ Rule (9.4).

Table 9.1: Prior Information for Test

• Prior probability of disease, P (D) = 0.0001

• Prior probability of no disease, P(D̄) = 1 − 0.0001 = 0.9999

• P(test positive given disease), P (tp|D) = 0.99

• P(test negative given disease), P (tn|D) = 0.01 ; false negative

• P(test positive given no disease), P(tp|D̄) = 0.01 ; false positive

• P(test negative given no disease), P(tn|D̄) = 0.99

What you really want to know is P(D|tp). That is, you want the inverse of the third bullet above but that quantity is not in the list. To find it, apply Bayes’ Rule and the known rules of probability, as follows:

P(D|tp) = P(D) P(tp|D) / P(tp)                                       ; Bayes’ Rule
        = P(D) P(tp|D) / P(tp, D + D̄)                                ; extending the conversation
        = P(D) P(tp|D) / [P(tp, D) + P(tp, D̄)]
        = P(D) P(tp|D) / [P(D) P(tp|D) + P(D̄) P(tp|D̄)]
        = (0.0001 · 0.99) / (0.0001 · 0.99 + 0.9999 · 0.01)          ; prior info
        ≈ 0.0098                                                      (9.9)

So you need not worry too much after all. There is less than one chance in a hundred that you have the disease! Are you surprised? This result is typical of Bayesian inference. Unless the prior equals zero or one, the posterior will always be some compromise between the prior and the data. In this case, the prior, P(D), is very small.

If you are still worried, you can get a second opinion (a second, independent test). Now, your prior is no longer 0.0001. It is your most recent posterior, viz., 0.0098.


Bayesian inference is often employed in this sequential fashion as new data are accumulated over time. Indeed, in many applications, this is the only way in which it is used. [44]

If the test comes out positive again, then your P(D|tp) ≈ 0.495. If the second test comes out negative, then your P(D|tn) ≈ 0.0001, the same as the original prior.
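A few lines of code (illustrative only; the function names are ours) reproduce the arithmetic of (9.9) and the sequential updates just described.

    # Bayes' Rule for the disease-test example, used sequentially.
    def posterior_given_positive(prior, false_pos=0.01, false_neg=0.01):
        p_tp_D  = 1.0 - false_neg                  # P(test positive | disease)
        p_tp_nD = false_pos                        # P(test positive | no disease)
        return prior * p_tp_D / (prior * p_tp_D + (1.0 - prior) * p_tp_nD)

    def posterior_given_negative(prior, false_pos=0.01, false_neg=0.01):
        p_tn_D  = false_neg                        # P(test negative | disease)
        p_tn_nD = 1.0 - false_pos                  # P(test negative | no disease)
        return prior * p_tn_D / (prior * p_tn_D + (1.0 - prior) * p_tn_nD)

    first = posterior_given_positive(0.0001)
    print(first)                                   # ~0.0098
    print(posterior_given_positive(first))         # second positive test: ~0.495
    print(posterior_given_negative(first))         # second negative test: ~0.0001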

9.4 Complete Bayesian Models

The example above is correct as far as it goes but it is unrealistic because all of the inputs were, by assumption, perfectly known. This never happens in the real world where nothing is perfect. There is always some uncertainty somewhere and, as noted in the last chapter, the goal of data analysis is to reduce that uncertainty.

In order to reduce uncertainty, we need a complete model specifying an analytical expression describing the uncertainty for the observations themselves and for each of the unknowns (uncertain quantities) appearing in the likelihood and the prior(s). Each of these expressions will be a probability distribution of one sort or another because that is how we encode uncertainty. There is no rule for choosing these distributions; that is left to the expertise of the analyst. The final model becomes the RHS of Bayes’ Rule from which the posterior is computed.

9.5 Computing the Posterior

Computation of the posterior means finding one multivariate distribution that subsumes everything on the RHS of Bayes’ Rule given the chosen model. This is almost always extremely difficult, usually impossible, to do analytically so we require another approach. Fortunately, there is such an approach. The next two chapters will discuss it in some detail. None of this methodology will really become clear, however, until we apply it to practical examples which we do beginning in Chapter 12.


Chapter 10

Computational Details I

THE laws of probability insist that making an inference (drawing conclusions from data) requires setting up Bayes’ Rule for the problem. This is a necessary but formal step. It specifies the posterior but it does not, by itself, output quantitative answers to our original questions. Given the model, the data and the Bayes’ Rule expression adapted for the problem, we somehow have to use the rule to tell us what we can say about

1. Values for the unknowns (uncertain quantities) in the problem

2. The model in general irrespective of values for the unknowns

The answer to #1 is obtained from the numerator on the RHS of Bayes’ Rule and the answer to #2 is obtained from the denominator. We shall describe both of these along with pseudocode for a simple variant of the basic computational approach. There are several other, more efficient variants for which we shall provide brief descriptions.

10.1 Data and Prior Information

To make everything as clear as possible, we are going to describe the computations using a concrete, numerical example. Moreover, we shall employ a synthetic dataset so that the “ground truth” is available for comparison. This dataset consists of the thirty independent datapoints shown in Table 10.1. Also, we shall assume the following prior information:

• The observations are ∼ Binomial(p,N ) with constant probability, p.

• The Binomial parameter, N , comes from a Poisson process with constant mean, λ.

• Nothing further is known about parameters p and λ.

Our discussion will be divided into three parts: i) estimating values for the unknowns,¹ ii) computing the marginal likelihood for the model as a whole and iii) testing whether the model is a good one, here, assessing whether a Binomial distribution with constant p reproduces the data sufficiently well.

¹ For purposes of later comparison, note that this dataset was synthesized using p = 0.7 and λ = 8.3.


Table 10.1: A Synthetic Dataset: k Successes in N Trials

 k   N     k   N     k   N     k   N     k   N     k   N
 9  12     7  10     5   6     5   5     5   9     5  10
 3   4     3   5     6   9     7   9     2   6     6   9
 5   5     5   9     5   7     4  11     3   6     7  10
 5   5     3   3    11  12    11  12     9  13     8   9
 5   6     3   6     6   7     3   4     2   5     6   8

10.2 The Unknowns (I)

The unknowns can be investigated using just the numerator of Bayes’ Rule, viz.,

P (Θ|D, I) ∝ P (Θ|I)P (D|Θ, I) (10.1)

where Θ symbolizes the unknowns, D the data and (optional) I other prior information.

In writing down a specific, computable form of Bayes’ Rule for this problem, we begin with the data. In any Bayesian analysis, one must always begin with the data because the likelihood for the data must be specified before any parameters of that likelihood, and their priors, can be specified.

Here, each datum contains two quantities, N and k. Thus, each observation is actually a vector, (N, k), so the likelihood expression must specify its joint probability. Assuming, as usual, that the observations are independent, the final likelihood (proportional to the joint probability of all datapoints) will be the product of the individual probabilities.

The joint probability for a single observation is given by the Product Rule.

P (N, k|D, I) = P (N |D, I)P (k|N,D, I) (10.2)

Our prior information states that the rightmost factor in (10.2) is Binomial with constant (unknown) probability, p.

P(k|N, D, I) = (N choose k) p^k (1 − p)^(N−k)        (10.3)

We are also given that N ∼ Poisson with constant (unknown) mean, λ.

P(N|D, I) = (λ^N / N!) exp(−λ)        (10.4)

Substituting, we get a computable expression for the likelihood, L.

L = P(D|p, λ, I) = ∏_{i=1}^{30} (λ^Ni / Ni!) exp(−λ) (Ni choose ki) p^ki (1 − p)^(Ni−ki)        (10.5)

The two unknown parameters, p and λ, must have a (joint) prior describing what we knew about them before we saw this dataset.


Parameter p is a probability so it must have a value in the range [0, 1]. Since we know nothing else about it, we are forced to choose a vague (noninformative) prior. Here, we use the simplest one, Uniform(0, 1) with PDF = 1. We know nothing about parameter λ except that it is, from theory, the mean of the associated Poisson distribution, hence greater than zero. We have to use a vague, but positive, prior for λ as well. In order to keep the math simple, we shall use a Uniform(0, 20) distribution with PDF = 1/20. This is just a guess so we shall have to go back later to verify that it really was vague, that it did not constrain the answer in any way.

Since we knew nothing about parameters p and λ a priori, we are also forced to treat them as independent. Otherwise, we would have to specify how they are correlated and we cannot do that.² This gives us an expression for our priors:

P(Θ|I) = P(p, λ|I) = P(p|I) P(λ|I) = (1)(1/20)   ; by independence        (10.6)

We can finally write down the posterior for this simple problem (numerator only).

P(p, λ|D, I) ∝ P(p|I) P(λ|I) P(D|p, λ, I)
             = (1/20) ∏_{i=1}^{30} (λ^Ni / Ni!) exp(−λ) (Ni choose ki) p^ki (1 − p)^(Ni−ki)        (10.7)

There are several points worth emphasizing about expression (10.7). First, appearances notwithstanding, this is an exceptionally simple example—a posterior having only two parameters with trivial priors. It is not uncommon to find Bayesian analyses reported in the literature that utilize dozens of parameters and many elaborate analytical forms. Second, it should be noted, once again, that the data appear only in the likelihood factor (10.5), not in any of the priors (10.6). This is a mathematical requirement of the Product Rule and violating this Rule (e.g., choosing priors by looking at the data) invalidates all of the subsequent conclusions. Third, expression (10.7) still does not get us all of the way to providing answers to question #1 above. That is, it does not specify numerical values for any of the unknowns. To estimate these values, to the extent possible, we must carry out the analysis one step further.

In order to output what can be said about an unknown, e.g., parameter p, we must take expression (10.7) and isolate that unknown from everything else in a manner that is consistent with the posterior. Formally, this means integrating out (marginalizing away) all of the other unknowns. As described (Sect. 7.5.2), isolating parameter p means carrying out the process shown in (10.8) with the integration limits filled in using the prior for λ.

The resulting marginal for p quantifies our posterior state of knowledge regarding this unknown given everything else. To obtain the marginal for λ, one would instead integrate out parameter p.

² It is always possible that the data might induce a (posterior) correlation.


marginal for p = P(p|D, I)
               = ∫_0^20 P(p|I) P(λ|I) P(D|p, λ, I) dλ
               = (1/20) ∫_0^20 ∏_{i=1}^{30} (λ^Ni / Ni!) exp(−λ) (Ni choose ki) p^ki (1 − p)^(Ni−ki) dλ        (10.8)

Occasionally, an integration such as that shown in (10.8) can be done in closed form. There might be, for example, sufficient statistics that will collapse the entire integrand into something more manageable. Typically, however, the integrals arising in this fashion are analytically intractable. This is one of the main reasons why Bayesian inference was not practicable until the 1990’s when a new procedure was developed to the point that such computations became feasible. This new technique, a truly world-class achievement, is known as Markov-chain Monte Carlo. [20][4]

10.3 Markov-chain Monte Carlo (MCMC)

A Bayesian posterior such as (10.7) takes the form of a conditional joint probability density function (often unnormalized) for the unknowns in the problem given the data and any other prior information. It is, most often, an expression that is impossible to integrate in closed form. Fortunately, for most operations, including marginalization, one does not require a closed form for a posterior distribution provided one can draw samples of unlimited size from it. Any such sample constitutes an empirical distribution consistent with the posterior from which anything knowable about the unknowns can be extracted with little effort.

In practice, the sample takes the form of a 2-D table with each row being a point in the multi-dimensional space containing the posterior and each column containing the values from a single dimension of that space. The trick (a very great trick, indeed!) is to guarantee that the relative frequency of points in this table is exactly in accord with the posterior for the problem given the limited precision of the table. This precision increases with the size of the sample/table. When executed correctly, the MCMC procedure will construct this table with the desired guarantee. Each column in the table will constitute a marginal for one of the unknowns—our posterior state of knowledge for that quantity.

In other words, given the data and a model, MCMC will solve any problem of inference with total validity. No exceptions.

10.3.1 How It Works



Markov-chain Monte Carlo is a computer simulation that traverses the parameter space containing the posterior repeatedly, keeping a record (trace) of its path through this multi-dimensional space. Each point (state) in this space is completely described by a unique vector of parameter values. The traversal goes from one state to another with a transition probability (the Monte Carlo part) that depends only on the current state and the identity of the next (proposed) state. Except for the recorded trace, no memory/history is retained of previous states visited nor does this history influence the transition probabilities in any way (the Markovian part). The guarantee that the final, sorted trace corresponds to a histogram of parameters from the posterior for the problem (and possibly other related quantities) is achieved by virtue of the analytical form for a transition probability (cf. Sect. 10.3.2).

A valid, albeit silly, mental image for the process just described is that of a frog sitting on one of many lily pads in a large pond. Imagine that the frog has black ink on its feet. From time to time, it jumps to another lily pad. Each time it lands, it adds another set of black footprints to the target lily pad. After a very long time, with thousands or even millions of jumps, a bird’s-eye view of the pond would reveal a 2-D pattern of footprints. An MCMC trace for a 2-D problem, such as the one presented at the beginning of this chapter, would look much the same provided we allow for an infinite number of lily pads since both p and λ are continuous. The big difference is that the frog jumps at random (more or less) while MCMC jumps around the space with a guarantee in advance that the footprints will converge to the correct shape for the posterior no matter how many dimensions there might be or whether they are continuous, discrete or a mixture of both.

There are several different algorithms for MCMC in current use with varying degrees of complexity. One of the oldest and most intuitive of these is the Metropolis algorithm, later generalized to the more elaborate Metropolis-Hastings (M-H) algorithm. We shall describe one variant of the Metropolis algorithm in detail. In the next chapter, some of the alternate algorithms will be outlined briefly, focusing on the differences between them.

Metropolis algorithm

The Metropolis algorithm is a very simple form of MCMC. The specific variant described below is the Metropolis Random Walk (RWM) algorithm. RWM is by no means the best or most efficient implementation of MCMC but it is the easiest to explain. Also, the pseudocode presented here is not meant to be complete and robust. Working code requires several checks to ensure that nothing goes wrong without signaling an error. Readers with computer-programming skills can use the pseudocode below as a starting point. Others can use available software (see Sect. 11.4). In either case, it should not be assumed that lack of an error message implies success. Output must always be checked to make sure that the MCMC simulation traversed the parameter space successfully (see Chap. 11).

The RWM procedure consists of the high-level steps shown in Algorithm 1. Detailed pseudocode is given in Algorithms 2 and 3.

Metropolis MCMC moves from one state to another in two steps. First, a random proposal is made for the next state. Then, a decision is made whether to move to that new state or to reject the proposal and remain in the current state. This sequence is contained in the function MetropolisLoop and constitutes a single iteration in the simulation whether or not the proposal is accepted.


Algorithm 1 Metropolis (high-level)

procedure METROPOLIS(NburnIn, Nsample, Nthin)
    Choose initial state                          ▷ must be valid
    TuneProposal(0.25, 0.35)                      ▷ until P(acceptance) falls within limits shown
    MetropolisLoop(NburnIn, Nthin, False)         ▷ burn-in phase
    MetropolisLoop(Nsample, Nthin, True)          ▷ sampling phase
    Output trace to file
end procedure

In a Metropolis algorithm, the proposal is completely arbitrary with one condition. The proposal process must be symmetric: P(propose X → Y) = P(propose Y → X).⁴ This is not automatic; the proposal distribution must ensure it.

A Metropolis random walk is a kind of traversal in which the proposed state is not a vector of new values but, instead, a vector of displacements from the current values. This can be achieved by centering the proposal distribution on the current state. In Algorithm 2, the function Propose(currentState) utilizes a Laplace(mean, scale) distribution. Since this distribution is symmetric, the Metropolis condition is satisfied.

Before the Laplace proposal distributions can be used effectively, their respective scales must be tuned. Experience suggests that Metropolis MCMC works best when the fraction of proposals accepted is about 0.3 meaning that, most of the time, the proposal is rejected. Function TuneProposal(lo, hi) is therefore a desirable preliminary. It does not matter how the tuning is carried out. Any convenient optimization technique will likely suffice. When there are very many parameters, tuning can take longer than the rest of the simulation so an efficient tuning technique is preferable.

In Algorithm 2, the unknown parameters are assumed to be independent. When this is not the case, the proposal distribution must be explicitly multivariate so that correlation can be incorporated. If this multivariate distribution is not symmetric, the Metropolis algorithm cannot be used.⁵

The crucial step in each MCMC iteration is the question, “To move or not to move?” If this is not answered correctly, the output will not correspond to the desired posterior. The Accept(·) function provides the answer. The form of Accept(·) shown in Algorithm 2 is specific to Metropolis MCMC and is mathematically correct if and only if the proposal distribution is symmetric.

The trace is saved as Algorithm 2 proceeds (see Algorithm 3). At the end of the run, the output matrix/table will contain Niter rows with one column for each quantity in the state plus one for each Extra quantity. Sometimes, a zeroth element in each state is the logarithm of the value of the numerator on the RHS of Bayes’ Rule (hereinafter logPosterior).
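Once such a table exists, each column can be summarized directly. The small sketch below (illustrative only; the made-up column of sampled p values is hypothetical) computes a posterior mean and a central credible interval from one trace column.

    # Summarize one column of an MCMC trace: mean and a central credible interval.
    def summarize(column, level=0.95):
        s = sorted(column)
        n = len(s)
        mean = sum(s) / n
        lo = s[int((1.0 - level) / 2.0 * (n - 1))]
        hi = s[int((1.0 + level) / 2.0 * (n - 1))]
        return mean, (lo, hi)

    p_column = [0.68, 0.71, 0.70, 0.69, 0.72, 0.70, 0.67, 0.73, 0.71, 0.70]   # invented values
    print(summarize(p_column))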

Algorithm 2 is a stochastic procedure so results will vary very slightly from run to run.Hence, longer runs are better.

4In other words, forward and backward proposals, not moves, are equally likely.5The computer program will still run but the answer will not match the posterior for the problem.


Algorithm 2 Metropolis (low-level)

function METROPOLISLOOP(Niter, Nthin, Sampling)                  ▷ generic iteration
    acceptCount ← 0                                              ▷ needed for tuning
    for iter = 1 to Niter do
        for j = 1 to Nthin do
            proposal ← Propose(state)
            if Accept(state, proposal) then
                state ← proposal                                 ▷ else do not move
                compute Extras                                   ▷ ancillary quantities of interest (optional)
                acceptCount ← acceptCount + 1
            end if
        end for
        if Sampling then
            DoOutput                                             ▷ thinned iterations discounted
        end if
    end for
    return acceptCount                                           ▷ includes thinned iterations (optional)
end function

function PROPOSE(currentState)                                   ▷ random walk with Laplacian perturbations
    for param = 1 to Nparams do                                  ▷ assumes independent parameters
        newState[param] ← Laplace(currentState[param], scale[param])
    end for
    return newState
end function

function ACCEPT(current, proposed)                               ▷ Metropolis-specific
    probAccept ← min(1, posterior(proposed)/posterior(current))  ▷ compute in log space
    u ← Uniform(0, 1)
    accepted ← u ≤ probAccept                                    ▷ probAccept might be zero
    return accepted
end function

procedure TUNEPROPOSAL(lo, hi)                                   ▷ scale factors global
    repeat
        for param = 1 to Nparams do
            Perturb(scale[param])                                ▷ Perturb(·) unspecified
        end for
        Naccepted ← MetropolisLoop(Ntries, Nthin, False)         ▷ Ntries is tunable
        fracAccepted ← 1.0 * Naccepted / (Ntries * Nthin)
    until fracAccepted > lo and fracAccepted < hi
end procedure


Algorithm 3 Increment trace

procedure DOOUTPUT
    outputVector ← state + Extras               ▷ concatenation
    Append outputVector to OutputMatrix         ▷ trace
end procedure
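For readers who would rather see working code than pseudocode, the following is a minimal NumPy sketch of Algorithms 1–3 with the tuning step omitted. The names and interfaces are illustrative assumptions, not taken from any particular package; log_posterior() is presumed to return the logarithm of the numerator on the RHS of Bayes' Rule, and the parameters are treated as independent, as in Algorithm 2.

import numpy as np

rng = np.random.default_rng(1)

def metropolis_loop(log_posterior, state, scale, n_iter, n_thin, sampling, trace):
    """Algorithm 2: generic iteration with Laplace proposals centered on the current state."""
    accept_count = 0
    log_p = log_posterior(state)                        # initial state must be valid (finite)
    for _ in range(n_iter):
        for _ in range(n_thin):
            proposal = state + rng.laplace(0.0, scale)  # Propose(): symmetric random-walk step
            log_p_new = log_posterior(proposal)
            # Accept(): probAccept = min(1, ratio), tested in log space
            if np.log(rng.uniform()) <= log_p_new - log_p:
                state, log_p = proposal, log_p_new      # move; otherwise stay in current state
                accept_count += 1
        if sampling:
            trace.append(state.copy())                  # Algorithm 3: DoOutput
    return state, accept_count

def metropolis(log_posterior, init, scale, n_burn, n_sample, n_thin):
    """Algorithm 1 minus TuneProposal: burn-in phase, then sampling phase."""
    trace = []
    state = np.asarray(init, dtype=float)
    state, _ = metropolis_loop(log_posterior, state, scale, n_burn, n_thin, False, trace)
    metropolis_loop(log_posterior, state, scale, n_sample, n_thin, True, trace)
    return np.array(trace)                              # one saved (thinned) state per row

Handing this a two-parameter log-posterior and per-parameter proposal scales, for example metropolis(log_post, [0.5, 5.0], [0.1, 0.3], 1000, 100000, 10), reproduces the flow of the run described later in this chapter.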

10.3.2 Why It Works

The MCMC algorithms described above are very different from the kind of computations carried out in traditional data analysis (i.e., statistics) and, when first encountered, the fact that they output the correct answer seems almost magical. How can a traversal of the parameter space using one fixed, strictly local rule compute a desired global result: the true posterior for a problem (in tabular form)? What's the secret?

The secret is a really good one and not obvious at all. To explain it will take a little math and a short detour into the world of chemistry.

A chemical excursion

Chemistry is not a familiar subject to most people but nearly everyone knows that all substances are composed of molecules6 which, in turn, are made from atoms; that pure substances are composed of identical molecules and that substances can undergo chemical reactions that change them into different substances. Sometimes a molecule can rearrange its own atoms to form a different molecule. There are many examples but, to keep things simple, imagine that substance A, when heated, rearranges to give substance B.

One of the less familiar facts about chemical reactions is that every chemical reaction is reversible although most are so lop-sided that the reversibility is not apparent. Thus, while molecules of A are changing into B, molecules of B are simultaneously changing back into A. It is a bit like a tug-of-war. So, how does this tug-of-war turn out? For that, we shall need a bit of mathematics.

Suppose that we start at some arbitrary time t = 0 with a solution containing arbitrary initial concentrations of A and B, symbolized [A]0 and [B]0. We can describe the change in [A] as a function of time as follows:

\[
\frac{d[A]}{dt} = -k_1 [A] + k_{-1} [B]\,; \qquad \text{(constants)}\; k_1, k_{-1} > 0 \tag{10.9}
\]

Equation 10.9 says that the concentration of A decreases at a rate proportional to its current value and increases in proportion to the current concentration of B. Typically, the values of k1 and k−1 are very different. If we let kT = k1 + k−1, the solution to this differential equation, describing the concentration of A at time t, is

\[
[A] = \frac{\exp(-k_T t)}{k_T}\Bigl([A]_0 k_1 - [B]_0 k_{-1} + ([A]_0 + [B]_0)\,k_{-1}\exp(k_T t)\Bigr) \tag{10.10}
\]

6 except for the noble gases: helium, neon, argon, krypton, xenon, radon and oganesson


Consequently, the concentration of B at time t is

\[
\begin{aligned}
[B] &= [B]_0 + ([A]_0 - [A]) \\
    &= [A]_0 + [B]_0 - \frac{\exp(-k_T t)}{k_T}\Bigl([A]_0 k_1 - [B]_0 k_{-1} + ([A]_0 + [B]_0)\,k_{-1}\exp(k_T t)\Bigr)
\end{aligned} \tag{10.11}
\]

To discover how the tug-of-war turned out in the end, consider the ratio [B]/[A]. After a few steps and a fair amount of algebra, we obtain

\[
\frac{[B]}{[A]} = \frac{-[A]_0 k_1 + [B]_0 k_{-1} + ([A]_0 + [B]_0)\,k_1\exp(k_T t)}{[A]_0 k_1 - [B]_0 k_{-1} + ([A]_0 + [B]_0)\,k_{-1}\exp(k_T t)} \tag{10.12}
\]

Letting t go to infinity, we get a very simple result:

\[
\lim_{t\to\infty}\frac{[B]}{[A]} = \lim_{t\to\infty}\frac{-[A]_0 k_1 + [B]_0 k_{-1} + ([A]_0 + [B]_0)\,k_1\exp(k_T t)}{[A]_0 k_1 - [B]_0 k_{-1} + ([A]_0 + [B]_0)\,k_{-1}\exp(k_T t)} = \frac{k_1}{k_{-1}} \tag{10.13}
\]

Equation 10.13 says something truly amazing. It says that, no matter how much A or B we start with (provided at least one of them is not zero), the ratio of concentrations will eventually become constant! The only way that can happen is for A to be changing into B at the same rate at which B is changing into A. Putting this into symbols,

\[
\frac{d[A]}{dt} = 0 = -k_1 [A] + k_{-1} [B] \tag{10.14}
\]

so that we have an equilibrium between A and B.7

\[
k_1 [A]_{\mathrm{eq}} = k_{-1} [B]_{\mathrm{eq}} \tag{10.15}
\]

Needless to say, it does not take an infinite amount of time to reach equilibrium. Most often, it takes only a few seconds. If not, then (in a controlled environment) one can raise the temperature since all reactions, forward or backward, go faster at higher temperature.
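A quick numerical check of this result is easy to carry out (a sketch with made-up rate constants and initial concentrations): integrate Equation 10.9 by simple Euler steps and watch the ratio [B]/[A] settle at k1/k−1.

# Euler integration of d[A]/dt = -k1*[A] + k-1*[B]; the ratio [B]/[A] -> k1/k-1.
k1, km1 = 2.0, 0.5          # forward and reverse rate constants (arbitrary)
A, B = 1.0, 0.0             # initial concentrations [A]0 and [B]0 (arbitrary)
dt = 1.0e-4
for _ in range(200_000):    # 20 time units, far more than enough to equilibrate here
    dA = (-k1 * A + km1 * B) * dt
    A, B = A + dA, B - dA   # A + B is conserved
print(B / A, k1 / km1)      # both are approximately 4.0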

Note that, in showing that A and B come to equilibrium, we have said nothing about any other reactions that might be happening at the same time. This is because all such reactions proceed to equilibrium as well. All analogues of Equation 10.15 will be satisfied simultaneously in accordance with their respective rate constants, k. This condition is known as detailed balance. When several reactions are going on at the same time, the chemistry becomes complicated and equilibrium concentrations will be affected should reaction pairs have substances in common. However, the net result is that equilibrium will prevail everywhere, eventually.

7 The temperature (and pressure)-dependent constant, k1/k−1, is usually denoted Keq.


Detailed balance in MCMC

The phenomenon of detailed balance described above carries over almost exactly into the realm of Markov-chain Monte Carlo. The only new element to be added is a "walker" that meanders throughout the parameter space.8 Apart from that, the analogy holds in detail with nothing but the interpretation of the elements altered.

In MCMC, the concentration of a substance is replaced with the probability density of a parameter. Probability density, much like concentration, can take on any positive value.9 The set of concentrations at any time during approach to equilibrium or afterwards corresponds to a state visited by a walker. In almost all data analyses, there are an infinite number of states just as in the chemical analogy. The expression k[A] corresponds to the transition between states. In standard MCMC, there is no concept of temperature10 so equilibrium state-transition probabilities are permanently fixed and the Markov process is said to be stationary thereafter.

The approach to equilibrium in the chemical analogy corresponds to the burn-in phase of MCMC. In both cases, equilibrium is achieved (assuming that MCMC was carried out successfully). In the chemical analogy, equilibrium means that the total amount of all substances is partitioned into respective concentrations that remain fixed thereafter. In MCMC, after burn-in, the total posterior probability (= 1) is concentrated into a sub-volume of the parameter space with fixed marginals which show how each parameter is concentrated into a portion of its component axis.

The reason this happens in MCMC is because the probAccept rules for state transitions have been designed to achieve detailed balance (as in chemistry). These rules guarantee that, at equilibrium, the rate at which a walker enters any state equals the rate at which it leaves that state (cf. Equation 10.15). Consequently, the walker moves randomly through the parameter space at constant speed so the record of its itinerary (trace) contains states exactly in proportion to their inherent probability with no biases induced by remaining in a state too long or too briefly. Thus, the trace is a (tabular) histogram of the true posterior, as desired.

10.4 The Unknowns (II)

Random-walk Metropolis was performed using the algorithms described in Section 10.3.1 with the dataset in Table 10.1 as input. Tuning and burn-in were carried out followed by the sampling phase (see Algorithm 2). Tuning, to get an acceptance probability of about 0.3, was done with a simple grid search. Proposal scale values were then set to 0.1 for p and 0.3 for λ. Niter was set to 100,000 and Nthin to 10 so, after burn-in, one million iterations were carried out with every tenth state saved (duplicate or not) to the output array (trace). This sample size is typical of MCMC.

8 In ensemble MCMC, described in the next chapter, there are many simultaneous walkers.
9 Concentration cannot really become infinite but probability density can, at least in principle.
10 However, a temperature does appear in some ancillary procedures of MCMC.


The output array is a multi-dimensional histogram, in tabular form, for the posterior for the problem. Its precision is proportional to the number of rows in the table. In this case, there are only two unknowns, p and λ, so a simple contour plot can be drawn (see Figure 10.1). Here, the contours are labeled with corresponding values of logPosterior. The maximum of logPosterior = −125.408 is labeled with a '+' symbol. These are raw data so the plot is not smooth.

[Figure 10.1: Contour Plot for logPosterior — contours labeled −136 to −126 in the (p, λ) plane; '+' marks the maximum]

Even in this tiny problem, posterior values are on the order of 10−57 which is why MCMC computations are always carried out in log space. Otherwise, they could be out-of-range given the double-precision, floating-point number format in a computer.
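The need for log space is easy to demonstrate (a sketch with made-up numbers): multiplying a few hundred modest likelihood factors underflows double precision, while summing their logarithms does not.

import numpy as np

factors = np.full(300, 1e-3)        # 300 likelihood factors of 0.001 each
print(np.prod(factors))             # 0.0 -- underflows (true value is 1e-900)
print(np.sum(np.log(factors)))      # about -2072.3, perfectly representable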

It is customary to cache the best state found during a run, that is, the state with the highest value of logPosterior. This state is known as the maximum a posteriori (MAP)11 state and is often used as a point summary of the entire posterior.

11 previously called MaxProb

Since each state in the trace was visited in toto, any implicit correlations between the parameters are automatically taken into account. Therefore, each column in the output trace corresponds to the marginal for that parameter conditioned on everything else in the problem, including prior information. This marginal can be plotted to show the posterior uncertainty in the value of the parameter. In principle, anything that is knowable about an unknown can be computed from the trace: moments, mode, credible intervals, etc., with a precision limited only by the length of the trace. Accuracy will be good if and only if the model is good.
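For instance, if the trace is held as a NumPy array with one row per saved state, the usual point summaries are one-liners. This is only a sketch; the column layout (logPosterior first, then the parameters) is an assumption, not a fixed format.

import numpy as np

# trace: one row per saved state; column 0 = logPosterior, columns 1.. = the parameters
def summarize(trace):
    log_post, params = trace[:, 0], trace[:, 1:]
    return {
        "MAP": params[np.argmax(log_post)],      # best state seen during the run
        "mean": params.mean(axis=0),             # posterior mean of each marginal
        "median": np.median(params, axis=0),
        "sd": params.std(axis=0, ddof=1),        # posterior standard deviations
    }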

Marginals for p and λ are shown in Figures 10.2–10.3. Figure 10.3 reveals that the uncertainty range for λ is well within its Uniform(0, 20) prior confirming that the latter was quite vague, as intended.

[Figure 10.2: Marginal for Parameter p]

10.4.1 Credible Intervals

Table 10.2 shows the solution for this problem in our usual format, giving estimates for the unknowns along with an indication of posterior uncertainties.

Recall that the (p, λ) values used to synthesize our dataset were (0.7, 8.3), resp. With only 30 datapoints, we do not have much information12 so our posterior uncertainties are still a bit large.

12 As a rough rule-of-thumb, 30 points per parameter is a bare minimum for accurate results.


[Figure 10.3: Marginal for Parameter λ]

Table 10.2: Synthetic Dataset (solution)

Unknown    Estimate (MAP)    Estimate (Mean)    95% CI Lower Limit    95% CI Upper Limit
p          0.707             0.705              0.648                 0.764
λ          7.73              7.77               6.81                  8.80

Even the most likely estimates (MAP values) are not as close to the true values as one might wish. In fact, the posterior parameter means are here a bit closer to ground truth. Whether to use means or MAP values (or medians, . . . ) as point estimates should be decided a priori on a case-by-case basis. Typically, posterior means are used since they are, by definition, the expected value of the unknown.

Uncertainty in the final result is usually reported by giving the range of values within which the true answer lies. The wider this "credible" interval, the less we can say about the true value. Typically, one desires the interval in which there is a 95-percent probability of finding the true value.13

There are two kinds of credible interval, central intervals and highest-posterior-density (HPD) intervals. For a 95-percent central interval, a marginal vector is sorted from low to high and the bottom 2.5 percent and top 2.5 percent ignored. What remains is the central credible interval. Other central intervals are found in analogous fashion.

13 N.B., The frequentist 95-percent "confidence interval" does not actually provide this.


The 95-percent HPD interval is found by scanning the sorted marginal vector from one end and choosing the shortest interval that includes 95 percent of the rows. HPD intervals make the most sense when the marginal is unimodal (almost always the case when MCMC succeeds).

With a large number of iterations, there is usually little difference between central and HPD credible intervals unless the marginal is noticeably skewed. That can happen when the marginal is intrinsically asymmetric or when there is only a small number of datapoints. For the examples in this book, we shall employ the term "credible interval" generically although they will all be HPD intervals.
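Both kinds of interval are easy to compute from a single marginal (one column of the trace). The following is a sketch, assuming the marginal is supplied as a NumPy vector of samples.

import numpy as np

def central_interval(x, level=0.95):
    """Equal-tailed interval: drop (1-level)/2 from each end of the sorted marginal."""
    lo = (1.0 - level) / 2.0
    return np.quantile(x, [lo, 1.0 - lo])

def hpd_interval(x, level=0.95):
    """Shortest interval containing `level` of the samples (sensible if unimodal)."""
    xs = np.sort(x)
    n_in = int(np.ceil(level * len(xs)))                   # samples inside the interval
    widths = xs[n_in - 1:] - xs[:len(xs) - n_in + 1]       # width of every candidate window
    i = np.argmin(widths)                                  # left edge of the shortest one
    return np.array([xs[i], xs[i + n_in - 1]])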

10.5 Model Comparison

Finding estimates for the unknowns is not the only thing of interest in an analysis. There is also the topic of model comparison (see Sect. 1.4.3) which, ideally, is assessed using the marginal/global likelihood. This quantity cannot be found easily from the trace alone. Its computation is much more difficult and requires special attention.

10.5.1 Global Marginalization

We have seen that any parameter can be integrated out of the RHS of Bayes' Rule for the problem. If there are k parameters, then the result is a function of the remaining k − 1 parameters. Clearly, this process may be repeated so that, ultimately, we can integrate out all of the parameters. The final result is then a simple number. That number is equal to the denominator on the RHS of Bayes' Rule, usually called the marginal likelihood.

The marginal likelihood is not a function of (is independent of) any parameter values but, given the data, it is a function of three other things:

• The model as a whole

• The vagueness of the priors

• The number of parameters

The larger the marginal likelihood, the better the model as a whole, irrespective of parameter values. On the other hand, as the priors become more and more vague or, even worse, as the parameters become superfluous, the marginal likelihood decreases rapidly. Thus, the marginal likelihood is an excellent measure of the quality of the model and it is worth the effort to compute it whenever practicable.

It is not always practicable. The computation is a multi-dimensional integration and, if there are too many dimensions, then round-off error will eventually dominate even the best integration procedures. In fact, MCMC software usually does not even try to compute the marginal likelihood.14 In this section, we describe one good numerical procedure for computing the marginal likelihood as an adjunct to an MCMC run. Subsequent examples will show how it is used.

14 MacMCMC being a notable exception

10.5.2 Nested Restricted Monte Carlo Integration

Monte Carlo integration is a procedure for obtaining the value of a definite integral given some computable code for the integrand and integration limits in each dimension. Briefly, it samples a very large number of points in the domain and estimates the integral as the average value of the integrand over those points multiplied by the hypervolume of the domain. In standard Monte Carlo, the points are chosen randomly; in Quasi-Monte Carlo, more efficient space-filling points are utilized.
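As a reminder of the basic idea (the plain procedure, not NRMC itself), here is a sketch of a Monte Carlo estimate of a two-dimensional definite integral: average the integrand over random points and multiply by the volume of the box.

import numpy as np

rng = np.random.default_rng(0)

def mc_integrate(f, lo, hi, n=1_000_000):
    """Estimate the integral of f over the box [lo, hi] from n random points."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pts = rng.uniform(lo, hi, size=(n, lo.size))     # random points in the domain
    return np.prod(hi - lo) * f(pts).mean()          # hypervolume times mean integrand

# Example: integral of exp(-(x^2 + y^2)) over the unit square (about 0.558)
est = mc_integrate(lambda p: np.exp(-(p ** 2).sum(axis=1)), [0, 0], [1, 1])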

When applied to MCMC, there are some other considerations. For instance, it can happen that the posterior is multi-modal with some of the peaks tall but very narrow and, perhaps, rarely sampled. Even more important, it is very desirable to be able to "watch" the integration converge in order to recognize when it has done so. Both of these concerns can be addressed by utilizing Nested Restricted Monte Carlo (NRMC) integration. [28]

For computing a marginal likelihood specifically, one begins by setting nested limits in each dimension using credible intervals obtained from the trace. For example, a starting set of multi-dimensional integration limits might be the {30, 50, 70, 90, 95, 99, 99.9, 99.99}-percent credible intervals computed at the end of an MCMC run plus an epsilon criterion to define a convergence test. Monte Carlo integration is then carried out for the nested multi-dimensional "shells" defined by the corresponding inner and outer limits. The hypervolume obtained will increase as the limits increase. If the integration has not converged even with the widest limits, the latter can be expanded until the number of rows in the trace is exhausted. Thus, this integration procedure is both nested and restricted.

NRMC integration, using Quasi-Monte Carlo, was carried out for the example that was discussed at the beginning of this chapter. Figure 10.4 shows the convergence of marginal likelihood as more outer shells are added. In this plot, the abscissa is the fraction of the total hypercube containing the parameter space. Clearly, this posterior occupies only about ten percent of the hypervolume defined by the priors.

A nice feature of NRMC integration is that the shells can be integrated in parallel. This can be a great saving in execution time. The problem of accumulating round-off errors with increasing dimensionality, however, is unavoidable.

Note: Monte Carlo integration cannot be carried out in log space since addition of logs corresponds to multiplication. In order to avoid numerical overflow, an offset15 must be subtracted in log space from every logPosterior computed before the rest of the integration loop is executed in normal space. This offset must be added back at the end.
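In code, the offset trick described in the Note looks like this (a sketch; log_vals stands for the logPosterior values evaluated at the integration points):

import numpy as np

def log_mean_exp(log_vals):
    """log of the average of exp(log_vals), computed without leaving a safe numeric range."""
    offset = log_vals.max()                      # e.g., the MAP value of logPosterior
    return offset + np.log(np.mean(np.exp(log_vals - offset)))

In a shell-by-shell scheme such as NRMC, each shell's Monte Carlo average would be computed this way before being scaled by that shell's hypervolume.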

For our example, log(marginal likelihood) = −127.761 (using natural logs). This value will always be smaller (worse) than the MAP value of the logPosterior because it includes "wasted" parameter space. The ratio of the two is called the Occam factor. [26, sect. 3.5]

15 such as the MAP value of the logPosterior


[Figure 10.4: NRMC Convergence — log(marginal likelihood) vs. fraction of the total hypervolume integrated]

10.5.3 Quantifying Model Comparison

We desire a valid number quantifying the probability that one model, M1, is better than another, M2. If there are more than two alternatives, we can pick one as reference and compare the others to it pairwise. The relative probabilities are then normalized to give absolute probabilities for each.

This is a problem that can, and should, be addressed using Bayes' Rule. In doing so, we shall assume that we have no prior information suggesting that one model is better than another in which case the prior probability for every model considered is the same. Given the data, D, we have the following for M1 (and likewise for M2).

\[
P(M_1 \mid D) = \frac{P(M_1)\,P(D \mid M_1)}{P(D)} \tag{10.16}
\]

where the likelihood term is the marginal likelihood so that it does not depend on any particular parameter value(s).

Since the priors are equal and the data likewise, the ratio of the posteriors for M1 and M2, called the Bayes factor, is therefore just the ratio of marginal likelihoods.

\[
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)}{P(D \mid M_2)} \tag{10.17}
\]

In log space, this ratio becomes the difference in log(marginal likelihood).


Using the example above, we can illustrate this procedure by defining an alternate model. Let M2 be the same as that already defined, M1, except that parameter p is assumed to be 0.8 which we (secretly) know should be a bit too large.16

When MCMC is run for M2, log(marginal likelihood) = −130.866. Therefore, we can compare the two models directly.

\[
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \exp\bigl(-127.761 - (-130.866)\bigr) = 22.3 \tag{10.18}
\]

Given our dataset, our original model is 22 times more credible than our hypothesized alternative even though the latter has fewer parameters. This is a robust result with no further arguments required to "prove" it.

We need only add that, by convention, anything less than a five-fold improvement is considered not worth talking about while an improvement of 100 or more is considered decisive. Here, we found something in between so we would formulate any conclusions accordingly.

We shall have more to say about model comparison in Section 11.3.3.

16 Table 10.2 says much the same.


Chapter 11

Computational Details II

THE previous chapter described a very general approach to inference, viz., Bayesian inference implemented via Markov-chain Monte Carlo (MCMC). MCMC generates the posterior for the problem which quantifies our state of knowledge now that we have seen new data. While MCMC can solve any problem of inference, when carried out properly, it can have problems of its own. These problems fall into two categories:

• Bad model

• Bad traversal

We begin this chapter with a discussion of these issues.

A second topic concerns the MCMC algorithm itself. The RWM algorithm that we described earlier (pg. 104) is just the simplest. There are others which are more efficient but more complicated. We provide brief descriptions of some alternate MCMC strategies and some reasons for using them.

Finally, we conclude this chapter with a partial survey of free, general-purpose MCMC software.1

11.1 Bad Model

Every data analysis requires a model with which that analysis will be carried out. What the form of that model should be is left up to the expertise of the analyst; it is not a part of MCMC per se. We shall see many examples of models of various kinds beginning in the next chapter. Here, we concern ourselves with how a model might be bad.

11.1.1 Model Errors

The reasons why a model might be bad are of two types: syntactic and semantic.

1 available at the time of writing


Syntactic errors arise when software is used to perform MCMC. The model must be encoded somehow prior to the MCMC run and errors may occur due to a misunderstanding of what the software requires or, simply, a typo. A software package may or may not catch all such errors.

Semantic errors are much more serious. These arise due to conceptual mistakes and it is not possible to list all of the ways in which they might happen. Here are a few obvious (and simplistic) examples:

• Prior information is ignored.

• The form of the likelihood is inappropriate.

• A parameter is given a prior that does not reflect the real-world situation.

• Two priors are modeled as independent when they are significantly correlated.

• A regression is modeled as least-squares when it should be Poisson.

• A least-squares model ignores significant errors in the independent variable.

• Degrees-of-freedom are modeled incorrectly.

In other words, the model is wrong in some way that matters. This wrongness may or may not be obvious and can arise because of lack of real-world knowledge or because the analyst did not know how to encode that knowledge in correct, probabilistic language.

11.1.2 Goodness-of-fit

It should go without saying that a model is not a model unless it describes the observed data. One should see a "good fit" between what the data look like and what the model says they should look like. No data analysis is complete without quantifying goodness-of-fit. Since this is a fairly large topic, it will be discussed separately in Chapter 13.

11.2 Bad Traversal

MCMC proceeds as a "walker" traverses the parameter space, keeping a record (trace) of the states visited. As explained in Section 10.3.2, the walker must move through this space at constant speed so that the relative frequencies of states in the trace reflect their intrinsic probabilities, not some artifact of the traversal. If this speed is not constant, the trace will not describe the desired posterior. In other words, the answer will be wrong.

There are several ways in which the traversal can go wrong. First, the "burn-in" phase may be incomplete. Second, there can be too much autocorrelation in the states visited. In addition to these common occurrences, it may be that there is something in the model that is causing difficulty in moving from one state to another. Most traversal problems will become apparent by plotting the raw (unsorted) trace a parameter at a time. We address some of these issues in this Section along with a few suggested tests and remedies. The literature contains much more information on this subject. [4][20]

11.2.1 Problem Detection

In order to detect a traversal problem, it is usually necessary to carry out two (or more) independent MCMC runs on the same problem under the same conditions. Nearly all MCMC software packages do this automatically and will plot parameter traces for these "chains" superimposed on each other with color-coding. If the traversal is good, the two chains will not show any systematic deviations from each other although they will always exhibit the jitter characteristic of any stochastic process. Figure 11.1 is a screenshot2 showing traces for a parameter from two MCMC chains with good performance.

Figure 11.1: Good Trace

Plots such as this cannot show every state because the states outnumber the pixels across the display. Therefore, raw traces are sampled periodically. This is typical behavior since MCMC chains are often very long.

Pictures provide immediate feedback but more robust measures are desired as well. This involves post-processing of the recorded trace(s), usually with pre-existing software. One diagnostic tool that is very widely used is the CODA package for the freely-available R language, about which more is said below. CODA requires that the trace be recorded in a specific format so the original trace may have to be rearranged to fit these requirements.

CODA incorporates several MCMC diagnostic tools and its documentation should be consulted for descriptions of these.

11.2.2 Incomplete Burn-in

As noted in Section 10.3.2, detailed balancing requires equilibrium. The burn-in phase of MCMC must be long enough for this to happen. Once equilibrium is achieved, it remains in that condition thereafter. However, if the burn-in phase is too short, one or more chains may still be proceeding towards equilibrium. If the model is poorly defined in some way or, perhaps, highly inappropriate given the data, it may occasionally happen that equilibrium is never achieved.

2 All screenshots in this chapter are from MacMCMC.


When burn-in is incomplete, it should be apparent from the trace plot of at least one of the parameters (or other monitored unknowns). Figure 11.2 shows an example.

Figure 11.2: Incomplete Burn-in

In this figure, the trace shown in red seems to have reached equilibrium but the one in black does not do so until later. One possible remedy for this situation is to make the burn-in phase longer. How one does this is software-specific.

11.2.3 Excessive Autocorrelation

Every Markov process is going to be autocorrelated to some extent by construction. The CODA package referenced above will quantify autocorrelation and determine how long it persists.

Autocorrelation will not be a problem so long as it "dies away" quickly enough that a long chain will overwhelm its effects. If it persists too long, the walker will not "forget" where it came from fast enough for the posterior to be valid.

Two possible remedies for excessive autocorrelation are i) make the sample bigger (longer chain) and ii) increase the amount of thinning. Thinning skips over most of the states in the traversal and this tends to alleviate autocorrelation. Typically, Nthin − 1 states are skipped, as shown in Algorithm 2.
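Outside of CODA, the autocorrelation of a single parameter's trace can be estimated directly; the following is a sketch, with x standing for one column of the trace.

import numpy as np

def autocorr(x, max_lag=100):
    """Sample autocorrelation of one trace column for lags 0..max_lag."""
    x = np.asarray(x, float) - np.mean(x)
    denom = np.dot(x, x)                            # lag-0 term; normalizes rho[0] to 1
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

If the resulting curve decays slowly, more thinning or a longer chain is indicated.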

11.2.4 Other Problems

There can be additional problems with a traversal but most of them will show up in the trace plots. Figure 11.3 shows a plot of a parameter trace that did not work well at all.

Figure 11.3: Bad Trace

The reason for this behavior need not be obvious; it is certainly not obvious from the plot. All one can say in general is that the two chains differ much more than they should.


A specific test for this problem is the Gelman-Rubin statistic. [17] It compares variance within a chain to variance between chains and should have a value very close to one, implying that the chains are all very similar. Usually, the value is on the order of 1.01.3 This test is another in the suite of tests provided by CODA.

Plots of individual marginals could also reveal that something has gone wrong. It may be that the traversal was poorly executed or it could be that a marginal is truncated unexpectedly because a prior was incorrect.

In any case, one must always examine all MCMC output carefully. At the very least, any important analysis should be carried out more than once, perhaps with different choices for priors.
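For reference, the basic form of the Gelman-Rubin computation for one parameter monitored in several chains is short enough to sketch here (CODA and most packages report a slightly refined version of this statistic):

import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (m, n) -- m chains, n post-burn-in samples of one parameter."""
    chains = np.asarray(chains, float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()            # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)          # between-chain variance
    var_hat = (n - 1) / n * W + B / n                # pooled variance estimate
    return np.sqrt(var_hat / W)                      # close to 1 when the chains agree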

11.3 Alternate MCMC Algorithms

In this Section, we discuss some additional strategies developed to implement the general MCMC concept. We confine our attention to those algorithms suitable for addressing all problems of inference and omit those that were designed for specific applications even when the latter are themselves fairly general, e.g., Bayesian networks. There are many such application-specific MCMC algorithms, with more being developed all the time, and they are usually very efficient in their specialized domains.

11.3.1 Metropolis-Hastings

The Metropolis Random Walk is not the most general of the Metropolis algorithms. The Metropolis algorithm can be used to propose states in an absolute sense, not just as offsets to the current state. As with RWM, it is still necessary for the proposal distribution to be symmetric.

The Metropolis-Hastings algorithm relaxes this restriction, permitting a non-symmetric proposal distribution. However, in order to guarantee the correct posterior, the transition probability must be modified to restore the symmetry needed to ensure detailed balance. This is achieved with the following modified acceptance criterion for moving from state s to state y:

\[
\text{probAccept}(s, y) = \min\!\left(1,\; \frac{p(y)\,q(s \mid y)}{p(s)\,q(y \mid s)}\right) \tag{11.1}
\]

where

• p(y)/p(s) is the ratio of proposed to current posterior values.

• q(y|s) is the probability for going to y from s in the proposal distribution.

• q(s|y) is the same for the reverse move.
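As an illustration of how the correction enters the code (a sketch, not any package's implementation), consider a common asymmetric choice: a log-normal random-walk proposal for a strictly positive parameter.

import numpy as np

rng = np.random.default_rng(2)

def mh_step(log_posterior, s, step=0.2):
    """One Metropolis-Hastings step with a log-normal random-walk proposal."""
    y = s * np.exp(step * rng.normal())          # proposal for a strictly positive parameter
    # q(y|s) is log-normal centered on log(s); for this proposal q(s|y)/q(y|s) reduces to y/s
    log_accept = log_posterior(y) - log_posterior(s) + np.log(y / s)
    return y if np.log(rng.uniform()) <= log_accept else s

Without the log(y/s) term, the sampler would be subtly biased; with it, detailed balance is restored.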

3 Very rarely, round-off error might give a value a bit less than one.


11.3.2 Gibbs Sampling

Gibbs sampling is a special case of Metropolis and is probably the most popular approach to MCMC. It is also the oldest, dating from 1984 [18]. In its purest form, it depends on being able to

• Construct the full-conditional distribution for each parameter in the analytical form of the unnormalized posterior

• Draw a random variate from each of these full-conditional distributions

This is usually easier said than done but offers significant advantages and, in the literature, much effort is expended to design models for which Gibbs sampling can be utilized.

These advantages arise because the full-conditional for parameter θ is the univariate distribution constructed by taking the unnormalized posterior and setting all parameters constant except θ. If the model is valid, using the full-conditional as a proposal distribution for θi will give a new value, θi+1, that is always acceptable by construction. Since MCMC can be carried out stepwise, one parameter at a time, this guaranteed acceptability breaks the so-called "curse of dimensionality". Execution times with Gibbs sampling tend to be much faster than with other forms of MCMC.

Once again, this is a lot easier to understand and appreciate with a real example. Given the popularity of this technique, we shall describe a fairly trivial example in detail.

For a dataset, we shall use some published data for normal body temperature for adult males. [60] These temperatures (minus 98 F), yi, are presented in Table 11.1 (N = 65).

Table 11.1: Normal Body Temperatures (−98 F)

−1.7  −1.3  −1.1  −1.0  −0.9  −0.9  −0.9  −0.8  −0.7  −0.6  −0.6  −0.6
−0.6  −0.5  −0.5  −0.4  −0.4  −0.4  −0.3  −0.2  −0.2  −0.2  −0.2  −0.1
−0.1   0     0     0     0     0     0     0.1   0.1   0.2   0.2   0.2
 0.2   0.3   0.3   0.4   0.4   0.4   0.4   0.5   0.5   0.6   0.6   0.6
 0.6   0.6   0.6   0.7   0.7   0.8   0.8   0.8   0.9   1     1     1
 1.1   1.2   1.3   1.4   1.5

Our model will be extremely simple, just a Normal(µ, τ) likelihood with µ = the true value of (offset) temperature and τ = precision = 1/variance, a common parametrization used to simplify the math. Our prior for µ will be Normal(0, τµ) and that for τ will be Gamma(a, r), a Gamma distribution with shape a and rate (= 1/scale) r.4

The unnormalized posterior, with constants deleted, is shown in (11.2), cf. (10.7).

\[
P(\mu, \tau \mid D, I) \;\propto\; \exp\!\left(-\frac{\tau_\mu}{2}\mu^2\right)\tau^{N/2+a-1}\exp(-r\tau)\,\exp\!\left(-\frac{\tau}{2}\sum_{i=1}^{N}(y_i-\mu)^2\right) \tag{11.2}
\]

4 This example is discussed in Reference [20, pg. 75].


Since (11.2) is just a proportion (all we need for parameter estimation), we can find the full-conditionals by including just those factors containing the parameter of interest. Thus, the full-conditional for µ is

\[
\begin{aligned}
\text{fcMu} &\propto \exp\!\left(-\frac{\tau_\mu}{2}\mu^2\right)\exp\!\left(-\frac{\tau}{2}\sum_{i=1}^{N}(y_i-\mu)^2\right) \\
            &\propto \exp\!\left(-\frac{N\tau+\tau_\mu}{2}\mu^2 + \tau\mu\sum_{i=1}^{N} y_i\right)
\end{aligned} \tag{11.3}
\]

after expanding the summand and deleting constants again.

Letting the final summation in (11.3) = S, the exponential argument, Z, is

\[
\begin{aligned}
Z &= -\frac{N\tau+\tau_\mu}{2}\mu^2 + \tau\mu S \\
  &= -\frac{N\tau+\tau_\mu}{2}\left(\mu^2 - \frac{2\,\tau\mu S}{N\tau+\tau_\mu}\right)
   \;\propto\; -\frac{N\tau+\tau_\mu}{2}\left(\mu - \frac{\tau S}{N\tau+\tau_\mu}\right)^{\!2}
\end{aligned} \tag{11.4}
\]

by completing the square and deleting resulting constants.

Putting Z back where it came from, we have

\[
\text{fcMu} \;\propto\; \exp\!\left[-\frac{N\tau+\tau_\mu}{2}\left(\mu - \frac{\tau S}{N\tau+\tau_\mu}\right)^{\!2}\right] \tag{11.5}
\]

Expression (11.5) looks like a whole lot of algebra with nothing to show for it unless you happen to notice that it describes an unnormalized Normal distribution with mean = (τS)/(Nτ + τµ) and precision = Nτ + τµ. It is easy, then, to select (not just propose!) the next value for µ since it is just a Normal variate.

Proceeding in the same fashion, we can derive the full-conditional distribution for τ, starting with (11.2).

\[
\begin{aligned}
\text{fcTau} &\propto \tau^{N/2+a-1}\exp(-r\tau)\,\exp\!\left(-\frac{\tau}{2}\sum_{i=1}^{N}(y_i-\mu)^2\right) \\
             &= \tau^{N/2+a-1}\exp\!\left[-\left(r + \frac{1}{2}\sum_{i=1}^{N}(y_i-\mu)^2\right)\tau\right]
\end{aligned} \tag{11.6}
\]

which is just another Gamma distribution with shape = N/2 + a and rate = the expression in large parentheses. Selecting a Gamma variate is another well-known operation.

We can now carry out MCMC via Gibbs sampling. First, we specify the priors by setting τµ = 2, a = 2 and r = 1/3, all of which are sufficiently vague. Also, we set the final sample size Nsamp = 100000 along with Nburnin = 1000 and Nthin = 10, much as in Algorithm 2.
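A minimal Python sketch of this Gibbs sampler is shown below; it simply draws µ and τ in turn from (11.5) and (11.6). The data values are those of Table 11.1, the prior constants are as just stated, and the random seed is arbitrary; this is an illustration, not the book's own program.

import numpy as np

rng = np.random.default_rng(3)

# Table 11.1 temperatures minus 98 F (N = 65)
y = np.array([-1.7, -1.3, -1.1, -1.0, -0.9, -0.9, -0.9, -0.8, -0.7, -0.6, -0.6, -0.6,
              -0.6, -0.5, -0.5, -0.4, -0.4, -0.4, -0.3, -0.2, -0.2, -0.2, -0.2, -0.1,
              -0.1,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.1,  0.1,  0.2,  0.2,  0.2,
               0.2,  0.3,  0.3,  0.4,  0.4,  0.4,  0.4,  0.5,  0.5,  0.6,  0.6,  0.6,
               0.6,  0.6,  0.6,  0.7,  0.7,  0.8,  0.8,  0.8,  0.9,  1.0,  1.0,  1.0,
               1.1,  1.2,  1.3,  1.4,  1.5])
N, S = len(y), y.sum()
tau_mu, a, r = 2.0, 2.0, 1.0 / 3.0               # prior constants, as in the text
n_burn, n_samp, n_thin = 1000, 100000, 10

mu, tau = 0.0, 1.0                               # any valid starting state
trace = []
for i in range(n_burn + n_samp * n_thin):
    prec = N * tau + tau_mu                      # full-conditional for mu, Eq. (11.5)
    mu = rng.normal(tau * S / prec, 1.0 / np.sqrt(prec))
    rate = r + 0.5 * np.sum((y - mu) ** 2)       # full-conditional for tau, Eq. (11.6)
    tau = rng.gamma(N / 2.0 + a, 1.0 / rate)     # NumPy's gamma() takes a scale = 1/rate
    if i >= n_burn and (i - n_burn) % n_thin == 0:
        trace.append((mu, 1.0 / np.sqrt(tau)))   # save mu and sigma as the "Extra"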

The entire computer program is so straightforward that it can be written in about one page (see Appendix A). The results are shown below with τ converted to sigma on the fly as an "Extra". Figure 11.4 shows that the prior for µ was very vague. Expected normal temperature is, of course, µ + 98; sigma is not affected by this (location) offset.

Table 11.2: Normal Temperature (solution)

Unknown    Estimate (Mean)    95% CI Lower Limit    95% CI Upper Limit
µ          0.1029             −0.0649               0.2722
sigma      0.6927             0.5862                0.8236

[Figure 11.4: Marginal for µ via Gibbs Sampling (prior vs. posterior)]

This example was chosen for its simplicity (believe it or not). In most applications of Gibbs sampling, one or more of the full-conditionals does not have the form of a standard distribution from which variates may be easily selected. When that occurs, there are two ways to proceed. One way is to select variates using a generic algorithm, e.g., rejection sampling or slice sampling. These are both less efficient, in general, than distribution-specific algorithms. The second way is to utilize a Metropolis(-Hastings) algorithm for any parameter with a non-standard full-conditional, termed Metropolis-within-Gibbs. Unlike Gibbs sampling, this will result in many acceptance failures for each Metropolis step so the parameter-space traversal will take longer. Nevertheless, this is common practice with most available MCMC software (designed to work every time).

For many good examples, see Reference [5].

Conjugate priors

A conjugate prior is a prior with an analytical form that, when multiplied by the likelihood, yields a form of the posterior that is the same as that of the prior but with its parameters altered in a known fashion. The referenced hyperlink lists several examples.

A conjugate prior is very useful not only because it allows the posterior to be evaluated in closed form but also because, when the data are accumulated incrementally, updating the posterior becomes trivial.

The difficulty is that most likelihood forms do not have a simple conjugate prior. One has a fair amount of flexibility in choosing a prior but, usually, little or none in choosing the likelihood.
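A standard textbook example (a sketch, not tied to any analysis in this book) is the Beta prior for a Binomial success probability: the posterior is again a Beta, with the observed counts simply added to the prior's parameters, so incremental updating is a matter of addition.

# Beta(alpha, beta) prior for a Binomial proportion; observing k successes in n trials
# gives a Beta(alpha + k, beta + n - k) posterior -- same family, updated parameters.
def update_beta(alpha, beta, k, n):
    return alpha + k, beta + (n - k)

a, b = 1.0, 1.0                       # flat Beta(1, 1) prior
a, b = update_beta(a, b, 7, 10)       # first batch of data: 7 successes in 10 trials
a, b = update_beta(a, b, 3, 10)       # a second batch updates the same way
posterior_mean = a / (a + b)          # 11/22 = 0.5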

11.3.3 Reversible-jump MCMC

Reversible-jump MCMC (RJMCMC) was designed to handle the situation in which the number of unknowns is (sadly) one of the unknowns. [4, chap. 3][24][55] RJMCMC is a generalization of Metropolis-Hastings MCMC. The primary utility of RJMCMC is for constructing an MCMC chain that incorporates several competing models with parameter spaces that have different sizes. As the walker moves from state to state, the parameter space might grow or shrink. If so, mapping the parameters from one state to the next is problematical. RJMCMC solves this problem.

Most often, competing models are run separately with the best one chosen in post-processing. However, there are situations in which it is desirable to consider all models simultaneously along with weighting factors giving the probability of each. Of course, this increases the number of parameters significantly. Therefore, RJMCMC is apt to require a lot of data.

For an example, see Reference [40].

11.3.4 Hamiltonian MCMC

Hamiltonian MCMC5 utilizes the equations of Hamiltonian dynamics to construct the MCMC procedure. [4, chap. 5] This strategy is different from any Metropolis algorithm because it defines a "momentum" for each parameter in addition to its value ("position"). There are some advantages in doing this and such procedures can be quite efficient even though they are more trouble to implement.

For an example, see Reference [29].

5 sometimes called hybrid MCMC


11.3.5 Ensemble MCMC

Ensemble MCMC is a general description for MCMC that simulates several parallel chains simultaneously, each with its own, independent walker.

One particular strategy is that of Goodman and Weare. [22] At each iteration, for each walker, this approach generates a proposed state from a linear combination of the states of two randomly-selected co-walkers. By doing so, problems due to correlation between parameters are alleviated to a large extent. Moreover, the multiplicity of walkers avoids the necessity of tuning the proposals; this becomes automatic.

The downside to this approach is that the number of walkers (several per parameter) can become impracticable even when many chains are executed in parallel. On the other hand, if the dimensionality is not too large, this method can be quite efficient.

For an example, see Reference [3].

11.4 MCMC Software

Ideally, the best software for any application is that which you code yourself. However, it is usually more convenient to use existing packages even though they may not be optimally adapted to the analysis. There are four classes of published MCMC software:

• General-purpose, standalone apps/tools

• General-purpose tools that generate C/C++ code to be compiled by the user

• Apps/tools for a particular application or domain

• Packages for use within a broader language, e.g., R or Python

Here, we shall restrict ourselves to the first two of these categories with the added condition that everything in the package is available for free. At the time of writing, there appear to be five such software packages. Brief descriptions are given below.6 All take model pseudocode (see Chapter 12) and data as input.

11.4.1 Standalone

BUGS (WINBUGS, OpenBUGS)

BUGS is an application for WindowsTM platforms. It was the first software developed for the general application of MCMC to data analysis. Originally, it focused on Gibbs sampling as the method of choice. It has evolved over the years to several more recent incarnations. Its features continue to vary as well. A variant of BUGS is also available as a package for the R language.

The BUGS package comes with three volumes of interesting examples.

6 This list is not guaranteed to be comprehensive.


JAGS

JAGS is a multi-platform tool with a console interface. It was originally intended to make WinBUGS functionality available for all computer platforms and to be extensible. Its download package includes sample code for the BUGS examples to facilitate comparison.

There is also a variant for use with the R language.

MacMCMC

MacMCMC is an MCMC application developed for MacOSTM. It has the features characteristic of MacOS document apps and includes several capabilities not found in other MCMC packages.

The models in succeeding chapters will all exemplify MacMCMC syntax. Internally, MacMCMC utilizes ensemble MCMC.

11.4.2 Code Generators

MCSim

MCSim is a package that converts model pseudocode into C which must then be compiled. The software itself is a GNU package which must be compiled (makefile provided) before use.

Stan

Stan is a package of several tools that convert model pseudocode to C++ and carry out analysis of the results. These must all be compiled by the user (makefile provided). There is a console interface as well as variants for incorporation into other programs, e.g., R and Python. Features depend somewhat on the interface.

Internally, Stan utilizes Hamiltonian MCMC.


Chapter 12

Modeling

THE previous two chapters provided algorithmic descriptions of the mathematical operations that one would have to carry out in order to evaluate the RHS of Bayes' Rule and thereby output the posterior given a model for a specified problem. In practice, these operations are not performed directly but, rather, by using some software in which the mathematics is hidden and only the topmost level of the RHS visible to the user. This is fortunate because, for any serious analysis task, simply creating a valid model for the posterior is usually not easy to do and the mathematics is formidable. There are no algorithms for creating an MCMC model. The only useful guidelines one might offer are to begin with the likelihood function, examine its parameters, then try to incorporate prior expertise by devising models that "explain" why these parameters have the values that are needed to describe the data. As we shall see, this process is recursive.

Since there are no rules for creating MCMC models, the best way forward is to study many examples of successful models and the data that they were trying to analyze. This chapter and those following will describe a sequence of such models using real data sets and real analytical problems. The analyses included here are far from comprehensive and do not even span the space of possibilities. As was noted earlier, Bayesian inference is not only the correct way to analyze data, it is sufficient for any such task whatever the field of application. In this book, most examples are taken from the sciences partly because there are many to choose from and partly because scientific problems are usually quantitative in character. It is hoped that the reader will begin to get an idea of how models should be created by studying this sequence of models, from the very simple to the not-so-simple.

12.1 Melting Point of n-Octadecane

There is probably no simpler task in data analysis than to estimate the value of an uncertain quantity from repeated measurements given very little prior information. In this example, we analyze a set of measurements with "known" measurement errors. The quantity being measured is the standard melting point (MP) of a natural substance.


Table 12.1: C18H38 MP Data

Year    MP (K)    σmeas
1999    301.65    0.2
1985    301       0.1
1970    301.22    0.15
1968    300.95    0.5
1967    300.7     0.5
1957    301       3
1956    300.3     0.5
1955    301.4     0.3
1955    301.31    0.02
1955    301.31    0.02
1955    301.33    0.01
1954    301       2
1953    301.35    0.15
1951    301.23    0.2
1951    301.25    0.2
1951    301.25    0.2
1950    301.15    1.5
1947    301       0.5
1946    301       0.5
1946    300.85    0.3
1944    301.3     0.6
1943    300.15    1.5
1943    301.2     0.6
1943    300.9     0.6
1942    301.2     0.3
1940    300.3     0.5
1938    300.09    0.5
1932    300       2
1932    301.01    0.3
1925    301       1.5
1921    301       2
1915    301       2
1886    301       2
1882    301       2
1882    301       3

The substance in question is n-octadecane, a component of petroleum with the chemical formula H3C(CH2)16CH3.1 This substance is solid but just barely. If you held some in your hand, it would quickly melt. Experimental melting points and their uncertainties are shown in Table 12.1. [48] The "official" melting point for n-octadecane is reported as 301.0 ± 0.7 K.2 This 301.0 is just the average value. The official uncertainty, 0.7, is a 95-percent confidence interval3 computed by estimating the net standard deviation. [48]

In analyzing this dataset, we shall assume that we have no further information. For instance, we have no way of knowing why some measurements are better than others or whether different techniques were used in different years so our analysis cannot include models for those things. All we know is what is in Table 12.1 plus the official literature value for the melting point.

What we want to know, of course, is the true value for the melting point of n-octadecane since physical properties are always important. It is tempting to select the value with the lowest error, viz., 301.33 K, but there are at least two problems with that. First, as uncertain as observed values are, their measurement errors are even more uncertain. That is the nature of experimental errors generally. So 301.33 K might not, in fact, be the best value. Secondly, we have 35 measurements and, even though some are very clearly better than others, the total information contained in these 35 observations will necessarily exceed the information in any one of them. Therefore, we are better off using all of the available information, not just some of it. Moreover, this is what the Desiderata of Valid Inference require [Sect. 8.2]. Therefore, we shall just take the data presented here at face value and see what they tell us about the true value for the melting point of n-octadecane.

As always, we require a model to extract the desired information from the data. In this model, we describe the observations using a likelihood function and the parameters of that likelihood using priors. The latter might be vague (describing ignorance) or they might be very specific. It all depends upon the amount of prior information available.

1 also found (not so naturally) in diesel fuel
2 K indicates the Kelvin scale = Celsius + 273.15.
3 ±1.96 sigma


12.1.1 Choosing a Likelihood Function

How should we describe observations of a constant, real-valued quantity? If we have negligible prior information, MaxEnt (pg. 61) requires us to describe each observation as a Normal (Gaussian) distribution (Sect. 6.5.4). The parameters of this distribution include a nominal (mean) value as well as a quantity characterizing the error that occurs when we try to observe this value. In other words, what we observe is a value more or less close to the nominal value but not quite; there will be some measurement error. Given a mean (location parameter) and a standard deviation (scale parameter), the Normal distribution describes the probability (density) for measurement errors of various sizes, all the way from −∞ to +∞. Also, the Normal distribution is symmetric about its mean so, by using it in the likelihood function with a mean equal to the true/unknown melting point, we are assuming that an error of −ε is just as likely as an error of +ε. Otherwise, our errors would be biased and we would have to model that somehow. Our prior information here gives no indication that measurement errors are biased.

Thus, we describe each measurement, yi (Table 12.1, col. 2), as Normally distributed:

\[
y_i \;\sim\; \frac{1}{\sigma_i\sqrt{2\pi}}\exp\!\left[-\frac{(y_i-\mu)^2}{2\sigma_i^2}\right] \tag{12.1}
\]

Here, each measurement has a 1-sigma uncertainty of its own (assumed known, not uncertain) so σ has an index, i, but there is only one mean, µ, because this represents the constant, unknown, true value of the melting point.

In English, relation (12.1) says that every MP value we observe is Normally distributed about the true value, µ, with (possibly) different errors in the observations.4 It does not require that the true melting point be unknown but, for this analysis, it is.

The complete likelihood function, L, is proportional to the joint probability of all the datapoints. As usual, we consider these to be independent, so the Product Rule states that this joint probability is just the product of all the individual probabilities:

\[
\mathcal{L} = (2\pi)^{-\frac{N}{2}}\prod_{i=1}^{N}\frac{1}{\sigma_i}\exp\!\left[-\frac{(y_i-\mu)^2}{2\sigma_i^2}\right] \tag{12.2}
\]

where N = 35.

As noted earlier (Sect. 10.4), computations are always done in log space; otherwise, the likelihood function will usually underflow, producing a runtime error.5 To avoid this, we utilize the (natural) logarithm of (12.2) giving the logLikelihood:

\[
\log(\mathcal{L}) = -\frac{N}{2}\log(2\pi) + \sum_{i=1}^{N}\left[-\log(\sigma_i) - \frac{(y_i-\mu)^2}{2\sigma_i^2}\right] \tag{12.3}
\]

4 That is, not just different in these observations but different on average. Note: Nothing prevents σi values from being equal.
5 Even with our small dataset, L is on the order of 10−70.
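In code, (12.3) is a few lines (a sketch; MP and sig_meas stand for the two data columns of Table 12.1):

import numpy as np

def log_likelihood(mu, MP, sig_meas):
    """Equation (12.3): Normal log-likelihood with known, per-point measurement errors."""
    N = len(MP)
    return (-0.5 * N * np.log(2.0 * np.pi)
            - np.sum(np.log(sig_meas))
            - np.sum((MP - mu) ** 2 / (2.0 * sig_meas ** 2)))

Adding the logarithm of the prior chosen in the next subsection (a constant inside its range, −∞ outside) turns this into the logPosterior that an MCMC sampler such as the random-walk sketch of Section 10.3.1 needs.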


12.1.2 Choosing the Priors

Our likelihood function has only one unknown (uncertain) parameter, µ, so we need only one prior. The literature value for the melting point is 301.0 K and we may properly consider that prior information, i.e., known before looking at Table 12.1. Nevertheless, we expect some posterior uncertainty in this value. To play it safe, we shall adopt a vague (ignorance) prior for µ:

µ ∼ Uniform(295, 305) (12.4)

In English, we are saying that µ might have any value in the range [295, 305] andthat, as far as we know, all values in that range are equally likely. After we finish thecomputation, we shall have to go back and make certain that (12.4) really was vague, thatit did not influence posterior knowledge of µ to any significant extent. If it did, then weshall have to repeat the analysis with a wider (more vague) prior.

Note that not only is (12.4) a vague prior, it is also a fixed prior. This means that weare not attempting to develop a model for the prior itself along with our model for thelikelihood so (12.4) does not contain any hyperparameters of its own, just numbers. Werewe able to suggest a model (explanation) for the prior for µ, then we would have to havespecific prior information to create that model and so the prior would be, to some extent,informative, not vague.

We should emphasize yet again that we must not utilize anything from Table 12.1 tocreate our prior for µ. If, in the future, we should come across more data in addition toTable 12.1, it would then be not only permissible but proper to take our posterior from thisanalysis and use it as a prior for any new analysis so long as that new analysis used onlythe new data. Breaking this rule invalidates the results of the analytical procedure.6

Given a sufficiently large amount of data, one could select a random subset of that data, analyze this subset with a vague prior, then use an analytical form of the resulting posterior as a prior for the rest of the data. Usually, there is nothing to be gained by doing this. One might as well just use all of the available data at once along with a vague prior.

12.1.3 Specifying the Complete Model

We now have the two pieces of our desired model: likelihood and prior. Next comes the hard part—obtaining a large sample from the posterior. This could not have been done very easily before the MCMC technique and the computing power to implement it became available but, in order to employ MCMC, we must specify the model in some way so that it can be used as input, along with the data, to whatever software we plan to use (see Sect. 11.4).

The exact format for specifying the model depends on the requirements of the software employed but, typically, it will be expressed not in mathematics but in pseudocode in a text file structured more or less as shown in Model 12.1 (line numbers added).7

6 Since it violates the Product Rule of probability.
7 The format shown here is that of MacMCMC. See also Appendix B.


Model 12.1: Melting Point (real-valued scalar)

1   Constants:
2     N = 35;  // # of points
3   Data:
4     MP[N], sig.meas[N];
5   Variables:
6     trueMP, i;
7   Priors:
8     trueMP ~ Uniform(295, 305);
9   Likelihood:
10    for (i, 1:N) {
11      MP[i] ~ Normal(trueMP, sig.meas[i]);
12    }
13  Extras:
14  Monitored:
15    trueMP;

All of the necessary components are evident in this pseudocode and the entire listing is very much what one would expect in an ordinary computer program. This, of course, is by design. In some MCMC software, the order of the statements matters; in others, it does not. Also, different software packages define distributions (see lines 8 and 11) in different ways and it is crucial to use the definitions in the software documentation.

With some software packages, there may be additional input files required besides those for model and data. Data will likewise have a software-dependent format.

12.1.4 Computing the Posterior

All that remains is to specify MCMC details such as sample size and thinning, etc. Once this is done, it is time to “launch the frog” (see Sect. 10.3.1). Depending on the number of datapoints and the complexity of the model, this phase might require a fair amount of patience. A short run might take only a few seconds but the repeated traversal of the parameter space could easily take hours or even overnight. It pays to check everything carefully before you hit the button.

As the traversal proceeds, each non-thinned state visited will be written to the output array.8 When finished, this trace will be saved and available for post-processing. With some software, a summary output file and graphs of marginals may also be available.

When this example was run,9 there were one million iterations with every tenth state saved to output. Runtime was about four seconds. The beginning of the trace is shown in Table 12.2.

8 Some software packages output “thinned” states as well and ignore them in post-processing.
9 All examples in this book were run using MacMCMC.



Table 12.2: Partial MCMC Trace

logPosterior    trueMP
-24.9832        301.318
-25.7053        301.329
-26.2872        301.306
-25.5320        301.328
-25.0296        301.322
-25.5707        301.310
-25.3664        301.312
-25.1623        301.324
-24.9795        301.320
-24.9750        301.320
...             ...

As explained in Section 10.3, this trace is the posterior for the problem recorded as a tabular histogram. The output precision of this posterior is proportional to the number of states (rows) in this table. All MCMC software packages allow the user to vary this sample size as needed.10

12.1.5 Melting Point: solution

The complete posterior, all 100,000 rows starting with those shown in Table 12.2, describes our state of knowledge after having seen the data. In principle, anything knowable can be found from this posterior. What we wanted to know most of all was the true melting point of n-octadecane.

Each column in a tabulated posterior records values for model parameters and other quantities of interest corresponding to the states visited during the traversal and takes into account everything else in the problem, including data, prior information, correlations, etc. Here, column 2 contains marginal values for trueMP, our posterior state of knowledge for the melting point of n-octadecane. A plot of this marginal (Figure 12.1) shows, in graphical form, what we can say about this melting point now that we have seen the data.

There are several points of interest regarding this plot. First, it should be noted that the location, width and shape of this curve did not arise by using some theory to tell us where it should be or what it should look like. The curve arose naturally and automatically by moving over the entire parameter space11 at constant speed and recording states visited

10 Expect that any serious analysis will require some trial-and-error.
11 Here, the parameter space is just one-dimensional. In more complicated models, there may be dozens or even thousands of dimensions, one dimension for each parameter.


[Figure 12.1: Marginal for True MP of n-Octadecane — posterior PDF vs. true n-octadecane MP (K)]

along the way (see Sect. 10.3.2). Its position on the temperature axis, its height and the fact that it looks very much like a Normal or Student's t distribution is a result of the Product Rule of probability, not an assumption or approximation of any kind.

Second, the literature melting point is 301.0 K but our results are different since the marginal goes to zero well above this temperature. In addition, our answer is much more precise. The literature uncertainty is ±0.7 K but this plot shows clearly that the posterior uncertainty is in the second decimal place so we have gained an additional significant figure. Also, its narrow range confirms that our prior for trueMP was vague. If we sort the values in the complete Table 12.2, col. 2 from low to high, we can find the 95-percent HPD credible interval as described in Section 10.4.1. Likewise, we can find the mean of the values in column 2 to get the mean point estimate for the melting point. The MAP value has to be computed by interpolation. This can be done by finding the row in the posterior with the largest logPosterior, along with nearby values, and optimizing the logPosterior using some maximization algorithm, e.g., the Nelder-Mead simplex algorithm. All of these numerical results are presented in Table 12.3 which shows three decimal places for later comparison.12
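A minimal sketch (Python) of the trace post-processing just described, assuming the trueMP column of the trace is available as a list; the values below are placeholders, not the actual 100,000-row trace.

    def hpd_interval(samples, mass=0.95):
        """Shortest interval containing `mass` of the sorted samples (the HPD interval)."""
        s = sorted(samples)
        n = len(s)
        k = int(mass * n)              # number of samples spanned by the interval
        widths = [s[i + k] - s[i] for i in range(n - k)]
        i = widths.index(min(widths))  # left edge of the narrowest window
        return s[i], s[i + k]

    trace_trueMP = [301.318, 301.329, 301.306, 301.328, 301.322]   # placeholder trace column
    mean_est = sum(trace_trueMP) / len(trace_trueMP)
    lo, hi = hpd_interval(trace_trueMP)
    print(mean_est, lo, hi)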

That this marginal plot is not smooth is not a mistake. Most marginal plots have a bumpy appearance because such plots are the envelope of a histogram, not the graph of a mathematical function. We could get a smoother curve by outputting a bigger sample.

12 Note that the mean estimate for trueMP is not equal to the mean of the data (300.983). This remains true even with a much wider prior.


Table 12.3: Melting Point (solution)

Unknown    Estimate               95% Credible Interval
           MAP        Mean        Lower Limit    Upper Limit
trueMP     301.319    301.319     301.303        301.334

Another question is especially noteworthy. Suppose that we had skipped all of this MCMC computation and just selected the best experimental result (Table 12.1, row 11). How does the solution above compare with that?

The answer is that our solution is much better. Although there is almost no difference in the point estimate, the original datapoint has a 1-sigma uncertainty of 0.01 K. With a Gaussian likelihood for that single datum, traditional (frequentist) methodology gives a 95-percent confidence interval of ±1.96σ, a range of 0.0392 K. Table 12.3, on the other hand, gives a 95-percent credible interval that is nearly 21 percent narrower (0.031 K) so, by taking advantage of the entire dataset via Bayesian inference, we have done that much better than even our “best” datapoint.

The desiderata discussed in Section 8.2 state that we should utilize all available, valid information to make the best inference. This is a good illustration of that principle.

12.2 Outliers

We cannot end our discussion of this melting-point example without saying something about outliers. As it happens, the data in Table 12.1 are not really complete; there is a datapoint missing. An experimental observation of 299 K with σmeas = 3 K was omitted when computing the literature value. [48] Presumably, this was done because it was an outlier, i.e., too different from the remaining observations. One common rule-of-thumb is to label a datapoint an outlier whenever it is more than 3 sigma away from the sample mean. It is worth asking whether throwing data away for this reason is a good thing to do.

Declaring a datapoint an “outlier” means that, for some reason, it does not belong in the same dataset as the rest of the sample. It might have been included by mistake or it could even be a typo. Clearly, those are legitimate reasons for excluding it. On the other hand, it might simply be a relatively poor measurement in which case it is not at all clear that discarding it is correct procedure.

Assuming normality (as is customary), the probability that any measurement will be outside the ±3 sigma interval is 0.0027. This seems quite low and perhaps a point of such low probability should be considered an outlier. However, including this “outlier”, our dataset contains 36 points, not just one and, assuming that these points are independent, the probability that all 36 will fall within the 3-sigma interval = (1 − 0.0027)^36 = 0.907. Thus, there is almost a 10-percent chance that one or more of the three dozen points will be outside this interval. If we see such a point, does it really deserve to be thrown away? Probably not, unless we have some additional reason(s) to do so.
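This arithmetic is easy to check; a minimal sketch (Python):

    p_outside = 0.0027                        # P(|error| > 3 sigma) for a Normal distribution
    p_all_36_inside = (1.0 - p_outside) ** 36
    print(p_all_36_inside)                    # ~0.907
    print(1.0 - p_all_36_inside)              # ~0.093, i.e., almost a 10-percent chance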


What happens if we do throw it away? Well, one thing that would certainly happen is that we would reduce the average measurement error. That is, we would be pretending that the measurements are a bit more accurate than they really were. This is very unfair and is, in fact, a form of cheating. All data analysis presupposes that we are telling the truth as best we know it. If experimental observations are not all wonderful then, other things being equal, we are obliged to say so. Otherwise, our final credible interval will be too narrow. Our final point estimates might change as well although this is not as predictable since relationships in models are usually highly nonlinear.

If we do not throw this particular point away then the results change very little. The comparison is shown in Table 12.4.

Table 12.4: Melting Point (solution #2)

Unknown    # Datapoints    Estimate               95% Credible Interval
                           MAP        Mean        Lower Limit    Upper Limit
trueMP     35              301.319    301.319     301.303        301.334
           36              301.319    301.319     301.303        301.335

The only difference in the results when this “outlier” is included is that the upper limit of the 95-percent credible interval increases by 0.001. Since there are only two significant decimal places in this analysis (see Table 12.1), this means that there is really no difference at all. Figure 12.2 shows the marginal plot for the true melting point with the “outlier” included. It actually looks a bit better than our previous plot.

The moral here is that one should not be too quick to discard outliers or even to declare that some point is an outlier in the first place. You need a really good reason to do that.

12.3 Something Incredible

In the last section, we drew attention to the situation in which one of the points was an outlier. This is a common occurrence but things could be even worse. What if the only observed datapoint you have contradicts what you already know or think you know? What if prior information and data are completely incompatible? What do you do then?

If you have understood our presentation up until now, you know the answer. You do the same as always; you use Bayesian inference, the necessary and sufficient approach to any data analysis.

Consider a genuine example. Between 1924 and 1968, the melting (freezing) point of methane, CH4 (a gas at room temperature and pressure), was measured four times. [48] The results showed that the 95-percent credible interval for the true melting point was [90.31, 90.81] K and the 99-percent credible interval was not much wider, [90.24, 90.89] K, so it is a good bet that the true melting point of methane was “known” to be in the interval [90, 91] K as of 1968. However, in 1971, the melting point of methane was measured again. [48] This time, the result was 85.7 K with σmeas = 0.2 K.


[Figure 12.2: Marginal for True MP of n-Octadecane (including “outlier”) — posterior PDF vs. true n-octadecane MP (K)]

Suppose that you were the chemist who carried out the 1971 experiment and that you did not know the previous results in detail, only that the true melting point was supposed to be in the range [90, 91]. What would you make of your own result? Supposedly, it is not credible, perhaps even impossible. Would you throw it away?

One might think that an “impossible” result is useless and deserves to be thrown away but we know by now that every relevant datum contains information. In this example, the 1971 result and the prior, Uniform[90, 91], both contain information. It is conceivable that the 1971 result is really correct and that there was something wrong with the earlier measurements. From a data analysis perspective, the only fair approach is to put everything together and see what the posterior says afterwards. This is done in Model 12.2.

This model is identical to our model for the melting point of n-octadecane except for the number of points and the prior (lines 2 and 8). The resulting marginal and solution summary are shown below in Table 12.5 and Figure 12.3.

Table 12.5: Methane MP Solution From One “Impossible” Point

Unknown    Estimate           95% Credible Interval
           MAP     Mean       Lower Limit    Upper Limit
trueMP     90      90.009     90             90.028

There are some important features worth noting in this solution beginning with the (perhaps surprising) fact that there is a non-trivial solution! After all, prior information


Model 12.2: Melting Point (one “impossible” datapoint)

1   Constants:
2     N = 1;  // # of points
3   Data:
4     MP[N], sig.meas[N];
5   Variables:
6     trueMP, i;
7   Priors:
8     trueMP ~ Uniform(90, 91);
9   Likelihood:
10    for (i, 1:N) {
11      MP[i] ~ Normal(trueMP, sig.meas[i]);
12    }
13  Extras:
14  Monitored:
15    trueMP;

[Figure 12.3: Marginal for True Methane MP — posterior PDF vs. true methane MP (K); annotations: Prior info: MP ∈ [90, 91]; Observed: MP = 85.7 ± 0.2]

said that the probability of an answer outside the range [90, 91] was zero and the observed MP ∼ Normal(85.7, 0.2) has cumulative probability in the prior range of just 7.8 × 10^−103.
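The cumulative probability quoted above can be checked directly; a minimal sketch (Python, using scipy):

    from scipy.stats import norm

    # P(90 <= X <= 91) for X ~ Normal(85.7, 0.2): the prior range lies 21.5 to 26.5
    # standard deviations above the observed value, so the naive cdf difference would
    # round to zero; the survival function handles the far tail.
    p_range = norm.sf(90, loc=85.7, scale=0.2) - norm.sf(91, loc=85.7, scale=0.2)
    print(p_range)   # on the order of the 7.8e-103 quoted above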


To a very good approximation, these two domains do not overlap.

However, as we have pointed out before, we prefer to avoid approximations in data analysis whenever possible. When we rely solely on Bayes' Rule, it turns out that our posterior does have a non-zero range and its shape was shown in Figure 12.3. The glaring disconnect between prior and data has not gone away. The joint probability of the two has a maximum at trueMP = 90 but this maximum is tiny.

P(observed MP = 85.7, trueMP = 90) = 8.4 × 10^−101    (12.5)

Given this prior and this single observation, it should be clear that further investigation is necessary. The real lesson here is that Bayesian inference works even in a pathological case like this one.

12.4 Pumps

The melting point example above addressed real-valued observations. We can carry out Bayesian inference just as easily when the data are discrete, for instance when we are simply counting something.

For a lot of reasons, count data are often assumed to be ∼ Poisson. An example, discussed several times in the literature, concerned the reliability of pumps used in power plants. The task was to estimate their failure rates, θi, by recording the number of failures, xi, when observed for ti kilohours. [19][5, vol. 1] Relevant data are shown in Table 12.6.

Table 12.6: Pumps Data

Pump   1      2      3      4     5      6      7      8      9     10
xi     5      1      5      14    3      19     1      1      4     22
ti     94.3   15.7   62.9   126   5.24   31.4   1.05   1.05   2.1   10.5

The Poisson distribution has only a single parameter equal to the expected value of the count. Here, count = (failures) khr^−1 × khr so our likelihood is

xi ∼ Poisson(θi ti) (12.6)

with θ a vector of unknowns.13

We require a prior for each unknown failure rate. The worst pump, apparently, is #10 with more than two failures per khr so it is probably safe to assume that

θi ∼ Uniform(0, 10) (12.7)

This is a very simplistic prior but it will suffice for now.

The complete model is shown below. In this model, the unknown rates have a subscript so there are ten of them, possibly different but not necessarily.

13 Note that the parameter of a Poisson distribution is dimensionless, as befits a count.


Model 12.3: Pumps (Poisson data)

1   Constants:
2     N = 10;  // # of pumps
3   Data:
4     x[N], t[N];
5   Variables:
6     theta[N], j;
7   Priors:
8     for (j, 1:N) {
9       theta[j] ~ Uniform(0, 10);
10    }
11  Likelihood:
12    for (j, 1:N) {
13      x[j] ~ Poisson(theta[j]*t[j]);
14    }
15  Extras:
16  Monitored:
17    theta[];

12.4.1 Pumps: solution

This model gave the results presented in Table 12.7. In this table, the nominal value shown is just the observed xi/ti failure rate.

Table 12.7: Pumps (solution)

Pump   Failure Rate
       Nominal    Estimate              95% Credible Interval
                  MAP       Mean        Lower Limit    Upper Limit
1      0.053      0.052     0.064       0.019          0.116
2      0.064      0.058     0.128       0.003          0.304
3      0.079      0.078     0.095       0.027          0.172
4      0.111      0.111     0.119       0.062          0.181
5      0.573      0.610     0.764       0.133          1.512
6      0.605      0.629     0.637       0.371          0.920
7      0.952      1.049     1.906       0.045          4.555
8      0.952      0.642     1.904       0.034          4.535
9      1.905      2.039     2.383       0.591          4.495
10     2.095      2.157     2.194       1.338          3.109
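The nominal rates in Table 12.7 are just xi/ti from Table 12.6; a minimal check (Python):

    x = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]                            # failures (Table 12.6)
    t = [94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]   # kilohours observed
    nominal = [xi / ti for xi, ti in zip(x, t)]
    print([round(r, 3) for r in nominal])    # 0.053, 0.064, 0.079, ... (cf. Table 12.7)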


12.5 Draws in Chess

In Section 1.4.3, we examined some data for draws in chess and, in particular, whether draws were more likely in Queen-pawn (QP) openings than in King-pawn (KP) openings. In this section, we describe the model from which we obtained our results.

Our dataset (Table 1.4) contains twelve monthly observations for the number of draws in each case, Ndraws[j], and the number of games played, Ngames[j]. Distinguishing the two cases, we can write the observations as 2-D quantities, Ndraws[j,k] and Ngames[j,k], where index k = {1 (KP), 2 (QP)}.

Since a chess game is either drawn or not drawn, the games played can be partitioned into two subsets, draws and non-draws, and we have the observed frequencies as data. This suggests immediately that the likelihood should be Binomial, that is,

Ndraws[j,k] ∼ Binomial(probDraw[k], Ngames[j,k])    (12.8)

Indeed, it would be difficult to justify any other likelihood function.

The Binomial distribution takes an argument, probDraw[k], describing the probability that there will be Ndraws[j,k] when Ngames[j,k] are played. We shall assume that we have no idea what probDraw[k] should be in either the KP or QP case so we shall use a completely vague prior for each and just let the data decide:

probDraw[k] ∼ Uniform(0, 1) (12.9)

Once again, this says that all possible values of probDraw[k] are equally likely a priori.

We are going to add one extra feature to our model. As described in Chapter 1, we would like to track (monitor) the difference, deltaP, between the probability of QP draws and that of KP draws to see if it might actually be zero. As an MCMC traversal proceeds, it is a trivial matter to compute such an ancillary quantity on the fly and save it to the output along with the unknown parameters (see below). With this addition, our model is now complete and can be written as shown in Model 12.4.

The model shown need not have been written in exactly this way. For instance, we could have defined variables pKP and pQP as the unknowns instead of the array probDraw[2] (line 6). This would eliminate the inner loop in the likelihood (lines 13–15) and replace it with two lines similar to (12.8) but using pKP and pQP separately.

If you study the structure defined in this model, it should be apparent that there is an unavoidable connection between the form of the model and the format of the input data. The model indexes the datapoints from 1 to N with a second index, k, inside of that. The data file must be structured accordingly. How this is accomplished will vary from one MCMC software package to another and warrants careful attention.

12.5.1 Draws in Chess: solution

We have already presented the full solution to this problem in Section 1.4.3.


Model 12.4: Draws in Chess (binomial)

1   Constants:
2     N = 12;  // # of months
3   Data:
4     Ndraws[N][2], Ngames[N][2];
5   Variables:
6     probDraw[2], deltaP, j, k;
7   Priors:
8     probDraw[1] ~ Uniform(0, 1);
9     probDraw[2] ~ Uniform(0, 1);
10  Likelihood:
11    for (j, 1:N) {
12      for (k, 1:2) {
13        Ndraws[j][k] ~ Binomial(probDraw[k], Ngames[j][k]);
14      }
15    }
16  Extras:
17    deltaP = probDraw[2] - probDraw[1];
18  Monitored:
19    probDraw[], deltaP;

12.6 Model Comparison

The chess-draws example described above defined a variable, deltaP, as a measure of the difference between the probability of QP draws and KP draws. The hypothesis was that this difference is reliably greater than zero and the marginal for this metric (Fig. 1.9) showed this hypothesis to be true.

However, this is not the best method for comparing models. The ideal procedure, described in Section 10.5.3, is to output the log(marginal likelihood) and use it to compute the odds ratio for the various model alternatives (from competing hypotheses). This gives the relative probabilities of the alternatives directly, independent of any parameter values.

In this example, the model used one probability for KP draws and another for QP draws, allowing (but not forcing) them to be different. We could, instead, hypothesize that this distinction is spurious, that KP or QP does not matter (see Model 12.5).

This new model has only one parameter since probDraw[1] determines probDraw[2] (line 9). As such, probDraw[2] does not have a prior even though we define it in the Priors: part of the model.14

We have thus reduced the dimensionality of our parameter space from 2-D to 1-D. This fact alone would increase the log(marginal likelihood) significantly and, other things being

14 We could have put it in the Likelihood part. (MCMC software need not split the model into parts.)


Model 12.5: Draws in Chess #2 (binomial)

1   Constants:
2     N = 12;  // # of months
3   Data:
4     Ndraws[N][2], Ngames[N][2];
5   Variables:
6     probDraw[2], j, k;
7   Priors:
8     probDraw[1] ~ Uniform(0, 1);
9     probDraw[2] = probDraw[1];
10  Likelihood:
11    for (j, 1:N) {
12      for (k, 1:2) {
13        Ndraws[j][k] ~ Binomial(probDraw[k], Ngames[j][k]);
14      }
15    }
16  Extras:
17  Monitored:
18    probDraw[1];

equal, would indicate a more credible model.

However, other things are definitely not equal. Model 12.4 is much more credible than Model 12.5, something we saw graphically on page 14. We can quantify this by inserting the log(marginal likelihood) values (Table 12.8) into the formula for the odds ratio.

Table 12.8: Draws in Chess: log(marginal likelihoods)

Model                              log(marginal likelihood)
M1: probDraw[2] ≠ probDraw[1]      -143.585
M2: probDraw[2] = probDraw[1]      -161.163

Let oddsM2 be the odds that the simpler model, M2, is better (more credible) than the original model, M1, and move to real space from log space.

oddsM2 = exp(−161.163 − (−143.585)) = 2.32258 × 10^−8    (12.10)

Then,

P(M2 better than M1) = oddsM2 / (1 + oddsM2) = 2.32258 × 10^−8    (12.11)

Therefore, using the Sum Rule,

P(M1 better than M2) = 1 − 2.32258 × 10^−8 ≈ 0.99999998    (12.12)


As a practical matter, had we asked whether M1 was better than M2 initially, most software would have output Prob = 1 due to round-off error.
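A minimal sketch (Python) of this odds-ratio calculation, starting from the log(marginal likelihood) values in Table 12.8:

    import math

    logML_M1 = -143.585        # Table 12.8
    logML_M2 = -161.163

    odds_M2 = math.exp(logML_M2 - logML_M1)   # odds that M2 is better than M1
    p_M2 = odds_M2 / (1.0 + odds_M2)          # ~2.3e-8, eq. (12.11)
    p_M1 = 1.0 - p_M2                         # ~0.99999998, eq. (12.12), by the Sum Rule
    print(p_M2, p_M1)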

The fact that M1 is the better model even though it doubles the dimensionality of the problem is especially impressive. Of course, we already knew that from Figure 1.8. Still, for good data analysis, it is always preferable to see the numbers.

This illustrates the correct method for comparing models but, often, the integration required to compute the log(marginal likelihood) is too computationally intensive. As the dimensionality of a problem increases, this integration gets exponentially more difficult. There are suggestions in the literature for faster, albeit approximate, alternatives.

12.7 PGA Tour Championship, 2013

The U.S. Professional Golfers Association (PGA) holds a tour championship every year. In 2013, 29 players finished all four rounds (72 holes). [21] There was the usual assortment of strokes per hole as listed, by player, in Table 12.9.

Table 12.9: Golf Strokes Over Four Rounds

Eagle  Birdie  Par  Bogey  DblBogey+      Eagle  Birdie  Par  Bogey  DblBogey+
1      16      45   7      3              0      17      41   11     3
0      10      50   12     0              0      17      45   10     0
0      13      53   5      1              1      13      48   9      1
0      9       45   15     3              0      11      49   11     1
0      12      51   6      3              0      14      48   8      2
0      15      46   9      2              0      13      48   10     1
2      14      43   10     3              0      18      42   10     2
0      12      47   12     1              0      18      43   6      5
0      18      43   9      2              0      16      40   16     0
0      13      50   7      2              0      15      47   10     0
0      14      42   14     2              1      18      44   8      1
0      20      45   7      0              0      12      51   7      2
0      12      50   9      1              0      18      46   7      1
1      13      48   9      1              0      10      58   4      0
0      18      40   13     1

This is a very simple example in which count data fall into disjoint categories, each with a fixed probability. Very often, such categories contain nominal data (e.g., True, False) so cannot be quantified directly. The simplest cases have only two categories and, as such, are binomial (see Section 12.5). In general, there can be any number of categories. The likelihood is then multinomial with each category having a separate probability (see Section 6.6.2) although there is nothing to prevent category probabilities from being equal.

There are various ways to implement a multinomial model but all of them have two important constraints. The defined categories must include all possibilities and the total


probability for all categories must equal one. In most problems, we seek values for one or more of the probabilities. In the model shown below, the required normalization is done manually. This is not the most elegant solution but it is easy to understand and is obviously extendable to any number of categories. Here, we have five categories and we adopt a vague prior for all of the probabilities, p[i], with indices corresponding to the ordering in Table 12.9.

Model 12.6: Strokes vs. Par (multinomial)

1   Constants:
2     Nplayers = 29, Nholes = 72;
3   Data:
4     Tally[Nplayers][5];
5   Variables:
6     p[5], theta[3], i;
7   Priors:
8     // Enforce normalization of p[] manually.
9     theta[1] ~ Uniform(0, 1);
10    theta[2] ~ Uniform(0, 1);
11    theta[3] ~ Uniform(0, 1);
12    p[1] ~ Uniform(0, 1);  // Eagle
13    p[2] = theta[1]*(1 - p[1]);
14    p[3] = theta[2]*(1 - p[1] - p[2]);
15    p[4] = theta[3]*(1 - p[1] - p[2] - p[3]);
16    p[5] = 1 - p[1] - p[2] - p[3] - p[4];  // DblBogey+
17  Likelihood:
18    for (i, 1:Nplayers) {
19      Tally[i][] ~ Multinomial(p[], Nholes);
20    }
21  Extras:
22  Monitored:
23    p[];
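The manual normalization in lines 12–16 always yields a probability vector that sums to one; a minimal check (Python) with arbitrary placeholder values for p[1] and theta, not fitted estimates:

    def stick_break(p1, theta):
        """Mimic lines 12-16 of Model 12.6: build p[2]..p[5] from p[1] and theta[1..3]."""
        p = [p1]
        remaining = 1.0 - p1
        for th in theta:                 # p[2], p[3], p[4]
            p.append(th * remaining)
            remaining -= p[-1]
        p.append(remaining)              # p[5] takes whatever is left
        return p

    p = stick_break(0.003, [0.2, 0.8, 0.85])   # placeholder values only
    print(p, sum(p))                           # sum is 1 (up to round-off)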

A model with only vague, Uniform priors will necessarily give a posterior with MAP parameters equal to the M-L estimates so that, in this example, point estimates for the desired probabilities are easy to obtain. This is true for this problem since each marginal category (cell) probability is Binomial and the M-L estimate for a binomial probability is just the observed probability (relative frequency). However, credible intervals cannot be so easily estimated unless we use MCMC. This is especially true when priors are no longer vague because we are not totally ignorant, a priori.

The model above gives the solution in Table 12.10 with marginals in Figure 12.4.


Table 12.10: Strokes vs. Par (solution)

Probability    Estimate             95% Credible Interval
               MAP       Mean       Lower Limit    Upper Limit
p[1]           0.0029    0.0034     0.0012         0.0059
p[2]           0.2002    0.2009     0.1840         0.2184
p[3]           0.6461    0.6447     0.6239         0.6649
p[4]           0.1297    0.1296     0.1154         0.1441
p[5]           0.0211    0.0214     0.0153         0.0276

[Figure 12.4: Marginals for Stroke Probabilities — PDF vs. probability; curves labeled p[1] through p[5]]

The M-L parameter estimates are {0.0029, 0.2007, 0.6456, 0.1298, 0.0211} (with some round-off error). As expected, these are the same as the MAP values apart from some slight noise. Since MCMC is a stochastic process, there is always some low-level noise. This means that replicate MCMC runs will not be exactly the same. They should be very close, however, if the parameter-space traversal is working properly.

We noted that the marginal probability for any given cell is binomial. Therefore, if we marginalize (integrate out) all other cells, we get a model with just one degree of freedom. Suppose we marginalize all cells except that for Pars. This gives just two categories, {Pars, Others}. If we partition the data in the same way, Model 12.6 reduces to a model analogous to Model 12.5. Then the solution is that shown in Table 12.11—essentially the same as that given above for p[3]. There is less noise here, compared to the M-L estimate, primarily because of the simplicity of the parameter space.


Table 12.11: Pars vs. Others

Parameter    Estimate             95% Credible Interval
             MAP       Mean       Lower Limit    Upper Limit
probPar      0.6456    0.6455     0.6250         0.6657

It is interesting to compare this result for probPar, especially its credible interval, with that obtained from theory. If you input the combined data, 1348 pars out of 2088 holes, into the online calculator provided here, the latter will output the “exact” 95-percent credible interval, that is, the HPD interval computed in closed form as well as the central interval.
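A minimal sketch (Python, using scipy) of the corresponding closed-form result: with a Uniform(0, 1) prior, 1348 pars out of 2088 holes give a Beta(1349, 741) posterior for probPar, whose central 95-percent interval can be read off directly.

    from scipy.stats import beta

    pars, holes = 1348, 2088
    posterior = beta(pars + 1, holes - pars + 1)     # Beta(1349, 741)
    lo, hi = posterior.ppf([0.025, 0.975])           # central 95% interval
    print(posterior.mean(), lo, hi)                  # close to Table 12.11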

Finally, if we use MCMC on the combined data, the binomial model with one datapoint outputs the same answer for probPar, with essentially the same credible interval, as shown in Table 12.11. The log(marginal likelihood), however, is now much larger (−7.64 versus −85.58). No surprise since, other things being equal, the joint likelihood of 29 points is going to be a lot smaller than the likelihood of one point.

12.8 A Triple Point

In our first example in this chapter, we examined data for the melting point of n-octadecane seeking, as usual, a value for the true melting point. It was assumed that we knew nothing useful about this value so our prior for this number, trueMP, was very vague (Model 12.1). In most cases, however, we are not so ignorant. Almost always, we have some expertise in the subject by the time we perform a data analysis. To see this in action, consider some data for a closely related substance, n-nonadecane, H3C(CH2)17CH3.

Here, we are not interested in the melting point but in a different physical property. The melting point of any pure substance is the temperature at which solid and liquid phases are in equilibrium. This temperature is a function of pressure. In contrast, suppose we insist that solid, liquid and gas phases all have to be in equilibrium simultaneously. That turns out to be possible but just barely. The constraints of thermodynamics require that such a triple-point be unique, that is, the temperature and pressure at which this can happen are fixed for three phases of any pure substance.15 There are no degrees of freedom left.

When it comes to triple-point temperatures of closely related substances, there is often a reliable pattern. Knowledge of such a pattern constitutes prior information and this can (and should!) be included in any model for the true triple-point temperature, trueTp. For normal alkanes in the region of interest, this pattern is shown in Figure 12.5. [48]

The triple-point temperature for n-nonadecane appears to be unknown so finding its value might make a good undergraduate chemistry experiment. Let us imagine such a scenario and suppose that this experiment was carried out by four students and that their results were {303, 307, 314, 304} (K) with an uncertainty in all cases of 2.5 K. The quality of undergraduate lab experiments is typically poor and uncertainties very dubious.

15 But not any three phases.


[Figure 12.5: Related Triple-point Temperatures — triple-point temperature (K) vs. # carbons in n-alkane]

We shall imagine that the value 2.5 K was the value reported by two chemistry majors in the same fraternity who performed this experiment the previous year so, for the four new experiments, this may be taken as prior information. The pattern is monotonic and the triple-point temperatures below and above C-19 are {301.0, 309.6} K, with uncertainties in the second decimal place. We shall assume that the value for C-19 must fall in between.

None of this is factual except the values for C-18 and C-20. The rest was contrived so that we may show how such a problem could be analyzed correctly.

We can safely adopt a Normal distribution for the prior for trueTp with a mean at the midpoint of the allowed range (305.3 K). For the scale factor, we can use 2.5 K (for lack of anything better). However, this gives a prior which extends into regions that prior information declares are invalid. Consequently, the prior for trueTp must be truncated as shown in Figure 12.6 (see sect. 6.8.1).

All well and good. So far, this is very much like our n-octadecane example. But how do we specify a truncated prior with arbitrary truncation points? It turns out that this all depends on the software package being used. The code in Model 12.7 uses MacMCMC syntax and lines 8–10 require some explanation.

First, MacMCMC implements a Generic distribution computing log(PDF) from a user-supplied formula.16 Also, this software returns standard values for Boolean expressions: True = 1 and False = 0. In line 8, the log(·) term adds either zero or −∞, resp. Needless to say, if log(prior) = −∞, the proposal will never be accepted and will never be included in

16 The second and third parameters are bounds used to initialize the MCMC chain.


[Figure 12.6: A Truncated Prior — prior PDF for trueTp (K); shaded areas invalid]

Model 12.7: Triple-point Temperature (truncated prior)

1   Constants:
2     N = 4;  // # of points
3   Data:
4     Tp[N];
5   Variables:
6     trueTp, sigma, i;
7   Priors:
8     trueTp ~ Generic(log((trueTp >= 301.0) && (trueTp <= 309.6))
9       - 1.83523 - 0.08*(trueTp - 305.3)^2 + 0.0893039, 301.0, 309.6);
10    sigma ~ Jeffreys(0.1, 20);
11  Likelihood:
12    for (i, 1:N) {
13      Tp[i] ~ Normal(trueTp, sigma);
14    }
15  Extras:
16  Monitored:
17    trueTp, sigma;


the posterior trace. Finally, the value −1.83523 is the log of the Gaussian normalization constant and the value 0.0893039 is an additional log(normalization constant) needed so that the valid portion of the prior will integrate to one. All truncated distributions need this extra constant if they are to be normalized. Note: In unfavorable circumstances, this term can take a long time to compute.17
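The two constants in line 9 can be reproduced from the definition of this prior; a minimal check (Python, using scipy):

    import math
    from scipy.stats import norm

    mu, sd = 305.3, 2.5
    log_norm_const = -math.log(sd * math.sqrt(2.0 * math.pi))        # -1.83523
    coef = 1.0 / (2.0 * sd ** 2)                                     # 0.08, the quadratic coefficient
    mass_inside = norm.cdf(309.6, mu, sd) - norm.cdf(301.0, mu, sd)  # probability in the valid region
    extra_const = -math.log(mass_inside)                             # 0.0893039
    print(log_norm_const, coef, extra_const)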

The Jeffreys prior in line 10 is a very common choice for scale parameters—for several reasons, Uniform priors are not ideal for describing scale parameters when the prior is vague. For instance, a Uniform prior for σ would give different results than a (properly adjusted) Uniform prior for variance, σ². One recommended solution, especially when even the order of magnitude is uncertain, is to use a prior Uniform in log space such as a Jeffreys prior. [26][15] The Jeffreys parameters shown here are defined in real space and, as we shall see, are sufficiently vague.

That the truncated prior for trueTp worked as desired is evident from a histogram of its trace in the posterior (Fig. 12.7). The (invalid) tails are missing (cf. Fig. 12.6). This effect is real, not just the result of bin quantization. Point estimates for the solution are shown in Table 12.12.

[Figure 12.7: trueTp: Posterior Marginal Histogram — PDF vs. trueTp (K)]

Some comparisons are of interest. Using frequentist statistics, we find that the M-L estimate, the mean of the data, is 307.0 K with a 95% central confidence interval, from the t distribution, of {299.1, 314.9}—the latter obviously not credible. Using MCMC with

17 Recall, however, that normalization constants are unnecessary if the marginal likelihood is not going to be computed.


the same prior for trueTp but untruncated, we get a mean estimate of 306.11 K with a 95% credible interval of {302.37, 309.63}. This credible interval leaks into the invalid region so the truncated prior was worth the trouble in a practical sense, not just theoretically.18

Table 12.12: Triple-point Temperature (solution)

Parameter    Estimate            95% Credible Interval
             MAP       Mean      Lower Limit    Upper Limit
trueTp       306.36    306.03    302.83         309.28
sigma        3.89      5.75      2.26           11.17

The posterior for sigma is heavily skewed to the right (mean > MAP). It usually is but the skewness is exaggerated when there is little data (Fig. 12.8). We wanted the Jeffreys prior to be vague and it was evidently vague enough. There are more elegant ways to make a prior vague which we shall encounter in Chapter 14. Finally, note that the (multivariate) MAP value of a parameter need not equal the mode of its marginal.

[Figure 12.8: Marginal for sigma — PDF vs. sigma]

18 Only tradition suggests the 95% interval. The 99% interval, here {301.08, 310.86}, would have been even less credible.


12.9 Height vs. Weight Correlation

In previous examples in this chapter, the observables were all scalars but, occasionally, this is not the case. It may happen that two or more quantities are always observed together because they have (or are thought to have) some intrinsic association. The observable is then a vector with, typically, correlated components.

To illustrate this possibility, we consider data for the height (cm) and weight (kg) of 247 adult men. [31] A scatterplot of the data (Fig. 12.9) shows that these two observed quantities are correlated with a slightly positive trend. There are several ways in which these data could be described but we shall model them using a BivariateNormal likelihood (see sect. 6.5.10). That is, we shall assume that both height and weight are marginally Normal with unknown means and standard deviations plus an unknown correlation, rho. The latter parameter must be in the range [−1, 1] and we shall pretend that we do not know that it is positive (just to be fair). With all priors vague we get Model 12.8 with the posterior result shown in Table 12.13.

[Figure 12.9: Height versus Weight for 247 Adult Men — scatterplot of weight (kg) vs. height (cm)]

A contour plot using mean parameter estimates is shown in Figure 12.10.


Model 12.8: Height vs. Weight (bivariate normal)

1   Constants:
2     N = 247;  // # of points (rows)
3   Data:
4     WtHt[N][2];
5   Variables:
6     m1, sig1, m2, sig2, rho, row;
7   Priors:
8     m1 ~ Uniform(50, 100);     // weight
9     sig1 ~ Jeffreys(0.1, 20);
10    m2 ~ Uniform(150, 250);    // height
11    sig2 ~ Jeffreys(0.1, 20);
12    rho ~ Uniform(-1, 1);
13  Likelihood:
14    for (row, 1:N) {
15      WtHt[row][] ~ BivariateNormal(m1, m2, sig1, sig2, rho);
16    }
17  Extras:
18  Monitored:
19    m1, sig1, m2, sig2, rho;

Table 12.13: Height vs. Weight (solution)

Parameter    Estimate             95% Credible Interval
             MAP       Mean       Lower Limit    Upper Limit
µ1           78.18     78.15      76.84          79.46
σ1           10.48     10.55      9.63           11.49
µ2           177.78    177.75     176.84         178.64
σ2           7.13      7.21       6.58           7.86
ρ            0.53      0.53       0.44           0.62
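A contour plot such as Figure 12.10 just evaluates the BivariateNormal density at point estimates; a minimal sketch (Python, using scipy) of that density, built from the mean estimates in Table 12.13:

    from scipy.stats import multivariate_normal

    m1, sig1 = 78.15, 10.55      # weight (kg), Table 12.13
    m2, sig2 = 177.75, 7.21      # height (cm)
    rho = 0.53

    cov = [[sig1**2,            rho * sig1 * sig2],
           [rho * sig1 * sig2,  sig2**2          ]]
    fitted = multivariate_normal(mean=[m1, m2], cov=cov)
    print(fitted.pdf([78.0, 178.0]))   # density at a point near the center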


[Figure 12.10: BivariateNormal Contour Plot — contours of the fitted density, weight (kg) vs. height (cm)]

12.10 Models and Information

Section 1.3 listed several reasons why we need models to perform data analysis. Initially, they are a way to extract information from data and, in the present chapter, that is primarily all that we have done with them.

In the melting-point example, we were given three dozen experimental observations of what should be a fixed, real-valued quantity along with estimates for the accuracy of each observation and asked to determine the true value of this quantity. We had almost no other information so there was nothing available for model construction except the maximally conservative strategy to express each observation as normally distributed. The prior was assumed so vague that it did not have any effect on the final answer. Consequently, all of the posterior knowledge came from the data alone.

In spite of this initial ignorance, the results were very precise even when we included an “outlier” observation. This illustrates that the Sum and Product Rules of probability are quite sufficient for data analysis provided that we have sufficient data (hence, sufficient information) for the purpose intended.

In the chess-draws example, the same conclusion is evident when the data are discrete rather than continuous. In this case, we assumed that we had no prior information at all, not enough “expertise” to provide even a hint of the answers. In fact, we considered two competing hypotheses and used the Product Rule to compute a quantitative estimate that


one hypothesis (model) was better than the other.

These two examples offered little scope for exhibiting the power of data analysis via Bayesian inference and MCMC and the same is largely true with all the other examples. Practically everything we concluded came from whatever information was present in the data alone. Also, the answers sought were of the most rudimentary kind. In subsequent chapters, we shall ask much more difficult questions. For these, we shall require much more elaborate models.


Chapter 13

Goodness-of-fit

In Chapter 12, we went into some detail about how to create a simple model to extract some desired information out of a dataset. We looked at several examples and the results were fairly impressive. However, we did not address one issue that is very important in any data analysis. We did not ask whether the models found were, in fact, good models. A good model is one which, first of all, describes the data. Regardless of how nice some marginals may appear, they do not mean much if the model that generated them does not describe the observed data.

How well a model describes the data is termed goodness-of-fit. Determining the latter is often called model validation or model checking. There are many suggested approaches for checking a model but we shall focus on just one.

If a model is good and describes the data well then it follows that one could use the model to answer questions instead of obtaining more data. If the model is equivalent to additional data then the model should be able to generate a valid dataset all by itself. Any dataset synthesized in this fashion would not be unique. Chances are that predicted datasets would all be different. Nevertheless, all of them would be legitimate provided that the model is a good one.

To make all of this quantitative and convincing, we require two things: i) a measure or indicator showing the amount of discrepancy between the model predictions and the data and ii) a valid procedure for determining the distribution of predicted results. We discuss one such procedure using our chess-draws example (Sect. 1.4.3) plus one new example.

13.1 Discrepancy Measure

The measure used to exhibit the discrepancy between model predictions and data can be whatever is convenient provided it has the power to reveal the discrepancy of interest when it is present. This measure can be a numerical metric or, when appropriate, a picture.

Sometimes, there is a feature of the likelihood that can provide a numerical metric directly. For instance, if the likelihood is Poisson, then a characteristic of this distribution is that its mean and variance are equal. If we were to find that the mean and variance


of our model were appreciably different, then we could justifiably conclude that the data were not Poisson after all. We would then need to find a better model.

When the likelihood does not provide a metric directly, one can often define a metric that quantifies some property of the data that we expect to be true and which is important in the problem being analyzed. Since this is task-dependent, it is impossible to generalize on the nature of such a metric.

Quite often, the purpose of testing goodness-of-fit is simply to convince some audience that the proposed model is really a good one. In this context, a picture might well be worth the proverbial “thousand words”.

Whatever the selected measure might be, we need a mathematically valid method to make the model predictions to which we shall compare the data.

13.2 Posterior-predictive Checking

If the posterior truly describes the data, then it should be successful in predicting future observations. This is, indeed, a common practice. Better yet, if the posterior can predict future observations then it could have predicted the observed data as well! This ability motivates the use of the posterior-predictive distribution for model goodness-of-fit.

To check the quality of a model using a sample (size = m) from the posterior-predictive distribution estimated using the trace array (size = n), we proceed as follows:

Algorithm 4 Posterior-predictive Model Checking
procedure GOODNESS(m, n)
    Define some discrepancy measure of interest.
    Apply it to the original data, D.                 // ideally, before modeling the data
    ndx[ ] ← RandomInteger(1, n)[m]                   // m indices, uniformly random
    for j = 1 to m do
        R ← row[ndx[j]] in trace
        Use the parameters in R to synthesize a dataset, S, the same size as D by
            drawing random variates from the likelihood.
        D[j] ← discrepancy measure applied to S
    end for
    Take D[1 . . . m] (sorted) as the empirical distribution of the discrepancy measure
        conditioned on all else in the problem.
    Compare the observed measure from D to its distribution, D.
end procedure
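A minimal sketch (Python) of Algorithm 4; the trace, the data and the discrepancy measure below are placeholders chosen only to make the sketch self-contained, not the book's MacMCMC output.

    import random

    def posterior_predictive_check(trace, data, synthesize, discrepancy, m=100):
        """Algorithm 4: distribution of a discrepancy measure under the posterior."""
        observed = discrepancy(data)                 # measure applied to the original data
        n = len(trace)
        draws = []
        for _ in range(m):
            params = trace[random.randrange(n)]      # one random state from the trace
            fake = synthesize(params, len(data))     # dataset drawn from the likelihood
            draws.append(discrepancy(fake))
        return observed, sorted(draws)

    # Illustration with a Normal likelihood (cf. Model 12.1); everything below is placeholder.
    trace = [{"mu": 301.32 + 0.01 * random.random()} for _ in range(1000)]
    data  = [301.30, 301.33, 301.31, 301.35]
    sig   = 0.02

    synth = lambda p, size: [random.gauss(p["mu"], sig) for _ in range(size)]
    disc  = lambda d: max(d) - min(d)                # range of the dataset, as an example measure
    obs, dist = posterior_predictive_check(trace, data, synth, disc)
    print(obs, dist[:5])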

If the measure is numeric, its empirical distribution, D, should provide an estimate of the probability of the value of this measure with the original data. This probability will not be extreme if the model is good. If the measure is a picture, then the comparison should be evident graphically.


13.3 Checking the Chess-draws Model

For this model, we shall use a posterior-predictive check with a graphical measure. We start with a plot of the data (cf. Table 1.4).

[Figure 13.1: Draws in Chess: Data — Ndraws vs. data index]

In this plot, the KP draws and QP draws are indexed separately (1…12, 13…24). There is considerable scatter (total variance) so there is much for the model to “explain”. We test whether we have explained it by using the MCMC trace to generate posterior predictions for a relevant measure.

The measure that is most intuitive is just the number of draws predicted by the model versus the number seen in the data. Model predictions are stochastic not deterministic so there will be significant variance in predictions even with a perfect model. Even worse, there is posterior uncertainty in the two model parameters, inflating the variance of the predictions even further.1 To make the predictions, Algorithm 4 was used with m = 100, giving 100 predictions of Ndraws for each of the 24 datapoints. The predictions versus data comparison is shown in Figure 13.2 (predictions shown in gray).

Except for points five and six, all of the datapoints fall within the spread of the model predictions. It is probably fair to say that this two-parameter model does describe the data reasonably well. It is interesting, however, that Ndraws[5] is so unusual but it was unusual in the original dataset as well. Perhaps there is yet more to be explained.

1 The latter variance is usually ignored in classical bootstrap procedures.


[Figure 13.2: Draws in Chess: Predictions vs. Data — Ndraws vs. data index; posterior predictions shown in gray]

13.4 Continuous Mixture

This example deals with count data—daily usage counts for the online Bayesian calculator referenced earlier (pg. 145) for the years 2015–2017. The data analyzed are shown in the frequency histogram (binwidth = 2) in Figure 13.3. The latter data do not comprise the entire dataset. There were actually 1,104 datapoints but 50 random points were removed for exploratory analysis for reasons discussed below.

Count data are usually described as Poisson (see sect. 6.6.4). Quite often, however, the Poisson(λ) distribution is too simplistic to describe real counts accurately. In fact, the frequency histogram for these data looks nothing like a Poisson dataset. There is too much variation in the bin frequencies. In such cases, one can try to model the counts as Poisson(λ) but with a mixture of values for λ. With a finite number of values, one gets a discrete mixture; with a continuous (infinite) number of values, the result is a continuous mixture.

In a continuous mixture, the likelihood parameter(s) are themselves modeled which makes the overall model hierarchical (see Chap. 14). The model for the parameter(s) is itself a (joint) distribution which acts as a continuous weighting factor (see sect. 6.7.2).

With overdispersed Poisson counts, as these appear to be, a common approach is to model λ as Gamma(α, β), the Gamma parameters being shape and scale, resp. If we do that, then λ becomes redundant and can be marginalized out, being replaced with α and β, as shown below.


[Figure 13.3: Calculator Usage: Data — frequency histogram of daily usage counts (binwidth = 2)]

13.4.1 A Continuous Poisson Mixture

Our desired mixture, PG, is symbolized

Poisson(λ) ∧_λ Gamma(α, β)

Its derivation can be carried out in closed form as follows:

PG = ∫_λ Gamma(α, β) Poisson(λ)

   = ∫_0^∞ [ λ^(α−1) exp(−λ/β) / (Γ(α) β^α) ] [ exp(−λ) λ^x / x! ] dλ

   = β^x Γ(α + x) / [ (1 + β)^(α+x) Γ(α) x! ]    (13.1)

Letting α = n, β = (1 − p)/p and recalling that z! = Γ(z + 1), we have

PG = C(n + x − 1, n − 1) p^n (1 − p)^x    (13.2)

which is called the NegativeBinomial(p, n) distribution.
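This equivalence is easy to verify by simulation; a minimal sketch (Python, using numpy) comparing Poisson counts whose rates are drawn from a Gamma distribution with direct NegativeBinomial draws. The shape and scale values are placeholders, not the fitted parameters.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, beta_ = 3.0, 5.5            # placeholder shape and scale
    n, p = alpha, 1.0 / (1.0 + beta_)  # the substitution alpha = n, beta = (1 - p)/p

    lam = rng.gamma(alpha, beta_, size=200_000)          # lambda ~ Gamma(shape, scale)
    mixed = rng.poisson(lam)                             # x ~ Poisson(lambda)
    direct = rng.negative_binomial(n, p, size=200_000)   # x ~ NegativeBinomial(p, n)

    print(mixed.mean(), direct.mean())   # both ~ alpha * beta
    print(mixed.var(),  direct.var())    # both ~ alpha * beta * (1 + beta)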


13.4.2 Formulating the Model

The NegativeBinomial(p, n) distribution usually describes the number of failures, x, with probability (1 − p), that occur prior to n successes. In that case, both x and n are integers. However, here we are not talking about successes and failures; we are just taking advantage of a closed-form solution for our desired mixture so n need not be an integer in which case the NegativeBinomial distribution is often referred to as the Polya distribution.

Although we now have a closed form for our likelihood, there is a technical problem. Like the Binomial(p, n) distribution, the parameters of the NegativeBinomial cannot both be unknown. That leads to an infinite number of solutions and MCMC marginals would then be meaningless. The only practicable remedy is to use priors that are informative which, in this example, we can do only by looking at the data. Since we are not permitted to use the same data twice, this means that we must remove a small, random subset of our datapoints and do some preliminary, exploratory analysis. Here, 50 of the original 1,104 points were removed. Our priors for p and n (lines 8–9) were based on the M-L NegativeBinomial parameters using these 50 selected points. These M-L parameters are accessible because the M-L solution is unique with NegativeBinomial data.

Our model is now almost trivial:

Model 13.1: Online-calculator Usage (continuous mixture)

1   Constants:
2     N = 1054;                  // # of points
3   Data:
4     count[N];
5   Variables:
6     n, p, i;
7   Priors:
8     p ~ HalfNormal(0.1);
9     n ~ Normal(2.4, 0.1);      // need not be integer
10  Likelihood:
11    for (i, 1:N) {
12      count[i] ~ NegativeBinomial(p, n);
13    }
14  Extras:
15  Monitored:
16    p, n;

13.4.3 Results

The solution with this model is shown in Table 13.1, with marginals in Figure 13.4.

Table 13.1: Online-calculator Usage (solution)

            Probability Estimate          95% Credible Interval
            MAP         Mean              Lower Limit    Upper Limit
p           0.1536      0.1537            0.1438         0.1635
n           2.415       2.417             2.264          2.568

Figure 13.4: Marginals for p and n

13.4.4 Posterior-predictive Check

The solution and marginals look quite good but the true test is whether this “good” model actually describes the data. Again, a really good model is one that could have generated a dataset like our observed data. That will be our test.

Everything is done in post-processing just as in the chess-draws example except that we shall use every row in the trace. For each row, we generate 1,054 random variates from the likelihood using the {p, n} values in that row. Here, we generate variates from the NegativeBinomial distribution using an appropriate algorithm. [9, pg. 543] We then bin these counts, average over all rows in the trace, and use this average to predict the bin frequencies in a histogram like that in Figure 13.3. In this case, we get Figure 13.5 where the average posterior-predicted frequencies are shown as black dots.
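In code, this post-processing might look like the sketch below; p_trace and n_trace are hypothetical names for the exported trace columns, and the placeholder arrays stand in for the real trace.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_trace = rng.normal(0.154, 0.005, size=5000)   # placeholders only; use the real MCMC trace in practice
n_trace = rng.normal(2.42, 0.08, size=5000)

bins = np.arange(0, 102, 2)                     # binwidth = 2, as in Figure 13.3
avg_freq = np.zeros(len(bins) - 1)

for p, n in zip(p_trace, n_trace):
    sim = stats.nbinom.rvs(n, p, size=1054)     # one replicated dataset per trace row
    freq, _ = np.histogram(sim, bins=bins)
    avg_freq += freq

avg_freq /= len(p_trace)                        # average posterior-predicted bin frequencies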

The fit is not perfect, of course, but it is quite acceptable. We can safely conclude that our data are well-described by this continuous mixture of Poisson distributions.

Figure 13.5: Online-calculator Usage: Model vs. Data

13.5 Summary

There are many ways to use the MCMC trace to assess goodness-of-fit and here we have described just two of them. The key concept is that the posterior contains whatever is knowable given the data and prior information so it should be sufficient to serve as a basis for testing the quality of the model. This is one place where much depends upon the requirements of the specific problem and the expertise/imagination of the analyst.

In general, model checking is not always easy and research in this area remains very active.

Chapter 14

Hierarchical Models

All of the models we discussed in the previous two chapters were extremely simple, developed in order to analyze some data about which we had, by assumption, virtually no prior information. Consequently, the priors employed were vague and, necessarily, fixed (containing only constants) since we could say nothing more about them. In real-world situations, this is hardly ever the case. By the time a problem has been defined and data are ready to be analyzed (or collected), almost everyone involved has some non-trivial amount of expertise in the domain being investigated. This expertise constitutes prior information and, as we have seen, all valid information should be incorporated into any model used to analyze data. Otherwise, the output will be suboptimal.

Easy to say, but how can one incorporate prior information (expertise) into a model? The best way is to use Bayesian inference, via MCMC, together with a hierarchical model. A model may become hierarchical either because the parameters of the likelihood need a lot of explaining and/or because they are not constant. The final model structure would then have multiple levels and look something like this:

• The likelihood describes the (relative probability of the) observations/data (Level 1).

– The likelihood contains parameters.

– These parameters must have priors.

• Level-1 priors may have hyperparameters (Level 2). (hierarchy begins here)

– Level-2 hyperparameters must have hyperpriors.

• Level 3, Level 4, . . . ?

Eventually, we run out of information to explain further so the topmost hyperpriors, called founder nodes, are fixed and vague. In practice, this recursive structure is not very deep because there is hardly ever enough information in the data to say anything useful at higher levels. As a practical matter, you can tell that there is no explanatory information for an unknown when its marginal looks very much like its prior. The model is then overly elaborate and its marginal likelihood unnecessarily small due to wasted parameter space.

Constructing a hierarchical model is, to a limited extent, something of an art form. There is usually no unique answer, no universal template into which one can “plug in” the unknowns and their relationships. This is another area requiring expertise and experience.

As in earlier chapters, the procedure is best illustrated by actual examples. We shall begin with some familiar data and a familiar type of analysis.

14.1 Triple-point Temperature Redux

In Section 12.8, we analyzed data to find the triple-point temperature of n-nonadecane. We used a model that assumed a fairly vague prior for the answer, trueTp. One thing we did know, a priori, was that there was a monotonic trend in triple-point temperature with increasing carbon number (see Fig. 12.5) and so we used a truncated Normal prior for the unknown. Suppose that we go further and assume that this trend is linear in the region of interest. This seems reasonable even with the new (student) data added (see Fig. 14.1). If we consider a linear regression using all of these points, that would constitute an “explanation”, that is, a formula/model for triple-point temperature, Tp, as a function of carbon number including the unknown Tp for n-nonadecane.

Figure 14.1: Triple-point Temperatures (old and new)

This new model uses more of the available prior information than simply knowing about the strict, monotonic trend in Tp. Also, this model (14.1) is hierarchical because the expected value of a triple-point temperature, mu, is no longer constant but a function of carbon number with its own, Level-2 hyperparameters. In our previous analysis, we considered only the C19 datapoint and a fixed prior for that unknown was reasonable.

Model 14.1: Triple-point Temperature (linear trend)

1   Constants:
2     n = 6, N = 10;                   // # of points {literature, All}
3   Data:
4     carbons[N], Tp[N];               // C19 points at end
5   Variables:
6     mu[N], slope, intercept, sigma[2], pred19, i, k;
7   Priors:
8     intercept ~ Uniform(100, 300);
9     slope ~ Uniform(0.1, 10);
10    sigma[1] ~ Jeffreys(0.1, 10);    // literature data
11    sigma[2] ~ Jeffreys(0.1, 20);    // student data
12  Likelihood:
13    for (i, 1:N) {
14      k = 1 + (i > n);
15      mu[i] = intercept + slope*carbons[i];
16      Tp[i] ~ Normal(mu[i], sigma[k]);
17    }
18  Extras:
19    pred19 = intercept + 19*slope;   // trueTp for C19
20  Monitored:
21    intercept, slope, sigma[], pred19;

This model distinguishes the old (six points) and new (four points) data but supposes all of them to follow a trend which we are assuming is linear. This is standard regression and each observation is assumed to have a Gaussian uncertainty about its mean. Of course, the literature data are almost certainly more accurate than the student data so we allow for the possibility that the sigmas of the respective datapoints might be different by defining two different scale priors, both vague (lines 10–11). The linear trend parameters, intercept and slope, are also given vague priors.

Line 14 deserves some comment. It is a simple switch that chooses the scale parameter, in line 16, corresponding to the old or new data as a function of the data index. Here, the data file has been arranged with the four new points at the end. How one implements such a switch depends on the MCMC software being used. As before, True = 1, False = 0.

The model also defines an ancillary (extra) variable, pred19, and monitors it so that we may later construct its marginal and estimate its credible interval. pred19 is a prediction for the triple-point temperature of n-nonadecane and we estimate this value in the manner shown because this is the preferred way to do it. One might think that you could just take point estimates for slope and intercept and plug in 19, as in line 19, to get a prediction for the desired temperature but this is not true. A prediction made using parameters found for all of the points is not necessarily the same as one made conditioned on specific values of the independent variables—a fact not widely appreciated. [1, Sect. 8.1.3]
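In other words, pred19 should be evaluated row by row from the trace so that its full marginal (and hence its credible interval) is available, rather than from point estimates. A minimal sketch, assuming the trace columns have been exported as arrays named intercept and slope (hypothetical names, with placeholder values here):

import numpy as np

rng = np.random.default_rng(0)
intercept = rng.normal(205.35, 5.7, size=10000)   # illustrative stand-ins for the real trace columns
slope = rng.normal(5.28, 0.33, size=10000)

pred19 = intercept + 19.0*slope                   # evaluated per row, exactly as in line 19 of Model 14.1
mean_est = pred19.mean()
lo, hi = np.percentile(pred19, [2.5, 97.5])       # a central 95% interval as a simple summary
print(mean_est, lo, hi)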

Before comparing the new solution to what we found before, it is helpful to express the pseudocode in Model 14.1 as a directed acyclic graph (DAG)1 (Fig. 14.2) so that the hierarchy explaining Tp is more obvious. If a model cannot be drawn this way, it is invalid.

The hierarchy shows up as stochastic input nodes (slope, intercept) that are outside the likelihood box. By convention, a DAG omits fixed hyperpriors.

[DAG nodes: slope, intercept, carbons[i], mu[i], sigma[k], Tp[i], pred19; plate: for (i, 1:N)]
Figure 14.2: DAG: New Tp Model

14.1.1 New Solution for n-Nonadecane

More prior information should mean a better (less uncertain) posterior. Did this happen? The new result is shown in Table 14.1; our previous result (the trueTp row) is included for comparison.

It is evident that the new mean estimate, 305.60 K, is better than our previous estimate, 306.03 K, since it has a narrower credible interval. It is also interesting that this new credible interval does not even come close to the truncation interval, [301.0, 309.6], that we needed before to keep the answer in the known range. It happened automatically.

1 Many more examples in Reference [5].

Table 14.1: Triple-point Temperature (new solution)

Unknown      Estimate                 95% Credible Interval
             MAP        Mean          Lower Limit    Upper Limit
pred19       305.51     305.60        303.75         307.56
trueTp       306.36     306.03        302.83         309.28
intercept    205.70     205.35        194.01         216.29
slope        5.25       5.28          4.64           5.95
sigma[1]     0.96       1.53          0.63           2.87
sigma[2]     4.16       5.68          2.29           10.92

Not surprisingly, our prior suspicion that the student data warranted a larger sigma has been verified; their marginals are compared in Figure 14.3. These values are “total sigmas” because the uncertainty in the likelihood (line 16) includes both measurement error and any error attributable to assuming that the trend is linear. Recall (Sect. 4.2.4) that variance is additive (if and only if components are independent).

Figure 14.3: Literature vs. Student Sigmas

Did we benefit by including the (not so great) student measurements? Unless those (valid) datapoints were infinitely uncertain, they should have contributed some information and thereby decreased posterior uncertainty in pred19. If we rewrite Model 14.1 so that it uses only the six literature measurements, we get pred19 = 305.36 K (mean estimate) with a credible interval of [303.19, 307.56], fifteen percent wider than the interval shown in Table 14.1, line 1. As always, we are better off using all of the data provided it is valid and we analyze it correctly.

We should also ask whether the regression line from our new model fits the raw data. Judging by the plot below, it seems to be quite a reasonable fit.

Figure 14.4: Linear Trend Goodness-of-fit (mean estimates)

It is important to note that this model is termed “linear” not because the graph is a straight line but, more technically, because the formula for mu (line 15) is a linear function of the unknowns (here, slope and intercept). The plotted curve could be nonlinear even with a linear model. For instance, the model for the results shown in Table 2.1 came from Model 14.2. The formula for mu (line 15) is a cubic polynomial so its graph would not be a straight line. Nevertheless, this formula is linear in the unknowns, A–D, making it a linear model in the mathematical sense. In the next Section, we shall consider a nonlinear regression model, i.e., a model that is intrinsically nonlinear in one or more unknowns.

Finally, we recognize that the first row in Table 2.1 (for zero points) comes, of course, from the prior for parameter A which is what we knew before there were any datapoints to analyze. In this model, all of the parameters were given very vague priors so that any conclusions came from the data alone. Thus, the credible interval for zero points is simply the 95-percent HPD interval of the Normal(0, 30) prior.

Model 14.2: Information in Data

1   Constants:
2     N = 1;                     // # of points in first case
3   Data:
4     x[N], y[N];
5   Variables:
6     mu[N], A, B, C, D, sigma, i;
7   Priors:
8     A ~ Normal(0, 30);
9     B ~ Normal(0, 30);
10    C ~ Normal(0, 30);
11    D ~ Normal(0, 30);
12    sigma ~ Jeffreys(0.01, 10);
13  Likelihood:
14    for (i, 1:N) {
15      mu[i] = A*x[i]^3 + B*x[i]^2 + C*x[i] + D;
16      y[i] ~ Normal(mu[i], sigma);
17    }
18  Extras:
19  Monitored:
20    A;

14.2 Daytime

In this example, we examine data for the length of daytime (sunrise to sunset, in minutes) at Boston, MA, USA. The data are presented in Table 14.2. [13] This table contains data for the first day of each month, over three years, plus the longest and shortest days of the year, all rounded off to the nearest minute. Figure 14.5 shows a plot of these points. It looks more or less like a sine wave, something we could have guessed a priori.

In defining a regression model for these data, we shall proceed in much the same way as we did for the triple-point temperatures in the last Section. The datapoints will be assumed independent and the likelihood Gaussian with a mean determined by the variable, day. A Gaussian (Normal) likelihood is nearly always the default assumption for real-valued measurements and a very good choice unless there is reason to do otherwise.

Our Gaussian likelihood has two parameters, mean (location) and sigma (scale), both unknown or, at least, uncertain. Do we know anything about these parameters a priori? Can we say anything about them without looking at the data? Unlike earlier examples, the answer is definitely Yes. Daytime and nighttime are very familiar as is the length of the year and the seasons. The two equinoxes and two solstices are also familiar. Therefore, our priors should no longer be vague. We should incorporate our prior knowledge.

Table 14.2: Length of Daytime (Boston, MA, USA)

Date         Day   Daytime (min)      Date          Day    Daytime (min)
1/1/1995       1   545                               579   865
              32   595                               610   782
              60   669                               640   698
              91   758                               671   614
             121   839                               701   554
             152   901                12/21/1996     721   540
6/21/1995    172   915                1/1/1997       732   545
             182   912                               763   597
             213   867                               791   671
             244   784                               822   760
             274   700                               852   839
             305   616                               883   902
             335   555                6/21/1997      903   915
12/22/1995   356   540                               913   912
1/1/1996     366   544                               944   865
             397   595                               975   783
             426   671                              1005   699
             457   760                              1036   615
             487   840                              1066   554
             518   902                12/21/1997    1086   540
6/21/1996    538   915                1/1/1998      1097   545
             548   912

14.2.1 Prior for Mean

The likelihood mean changes from day to day so we make it an explicit function of time, µ(t). As in previous examples, µ(t) symbolizes the true value of daytime which we choose to model as sinusoidal (14.1).

\[
\mu(t) \;=\; A \sin\!\left[\frac{2\pi t}{P} + \phi\right] + D
\tag{14.1}
\]

In our model, we shall use this equation to set the value of µ(t) deterministically so µ has no prior. Note that P and φ enter (14.1) in a nonlinear way so that makes this a nonlinear regression.

The sine function describing µ(t) has four hyperparameters, {A, P, φ, D}. These are unknown (but not totally) so each of them will require a hyperprior. Let us consider them one by one, think about their physical meaning and see what, if anything, we already knew about them before examining Table 14.2.

Figure 14.5: Daytime Data

Hyperparameter A

A is the amplitude of the sine function. It describes the height of the sine wave above a horizontal line drawn through the middle of the wave (parallel to the abscissa). It is best to be conservative, so we shall assume that this amplitude might be as much as half a day so its value, measured in minutes, is in the range [0, 720]. Our hyperprior will be Uniform in this range meaning that, as far as we know, all values in this range are equally likely.

Hyperparameter P

P is the period of the sine wave (in days). We knew already that the seasons vary over a year so this hyperparameter should have a value close to 365.25, allowing for leap years. We must also allow for some uncertainty however because our model is imperfect and some of this imperfection will leak into estimated parameter values. Therefore, we shall use a Normal prior for P with a mean of 365.25 and a sigma of one day. This should be sufficiently vague but we must check for vagueness later.

Hyperparameter φ

The physical meaning of φ requires some further knowledge but does not require anything in our dataset. Astronomically, the year begins at the Spring equinox (ca. March 21) so this should be t = 0 and the previous Winter solstice (ca. December 22) should be t = −π/2 (in radians). In the data, our days have values with zero set to January 1. To adjust March 21 on that scale so that it comes out zero, we must “add” an offset (termed the phase) in the range [−π/2, 0]. We shall give φ a hyperprior Uniform in that range.

Hyperparameter D

D is used to change the standard minimum of the sine function (−1) to some other value. It is another kind of offset. D must be positive but not more than one day (1440 min) so we shall use a hyperprior Uniform in [0, 1440]. This is maximally vague.

14.2.2 Prior for Sigma

Sigma is the scale parameter in our likelihood function. We are assuming all points equally precise (an unweighted regression) so sigma is some constant symbolizing our uncertainty regarding the likelihood function itself plus any other uncertainty not yet “explained”. Can we say anything about sigma a priori?

Yes, we can. Without looking at specific daytime values, we know that our source for these numbers rounds them off to the nearest minute. Therefore, there is quantization error due to this rounding. In data analysis, it is common practice to quantify sources of error in terms of their variances. As we have seen (4.2.4), variances for independent components are additive regardless of the model we are using. For our likelihood, one of the components contributing to the total variance will be the variance due to rounding. If the quantization interval is q, then rounding variance, qVar, is equal to the variance of a Uniform distribution with width = q.

\[
qVar \;=\; \frac{q^{2}}{12}
\tag{14.2}
\]

In this example, q equals one minute so qVar = 1/12.

We must allow for more uncertainty than this, however. We know that our sinusoidal model is imperfect and there might be other uncertainties as well. Consequently, we must have a second component of total variance for the likelihood. We are talking about a scale parameter, not a location parameter, so we shall adopt a Jeffreys distribution, assigning a hyperprior, procVar, in the range [0.1, 50] (in real space).

The total error variance will be the sum of procVar (intrinsic error variance) and qVar (quantization error variance). We need to take the square root of this sum2 given our definition of the Normal distribution. Once again, when the MCMC run is finished, we must check to make sure that this prior really was vague.
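Both the rounding contribution and the RSS combination are easy to verify directly; a small sketch with illustrative numbers only:

import numpy as np

rng = np.random.default_rng(42)
true_minutes = rng.uniform(540, 915, size=100000)   # arbitrary "true" daytimes spanning the observed range
rounded = np.round(true_minutes)                    # quantization interval q = 1 minute

q_err = rounded - true_minutes
print(q_err.var())                     # ~ 1/12 = 0.0833..., as in (14.2)

proc_var = 17.0                        # an illustrative value for procVar, not a fitted result
sigma = np.sqrt(proc_var + 1.0/12.0)   # the RSS combination used in Model 14.3
print(sigma)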

14.2.3 Complete Hierarchical Model

Putting everything together, we get Model 14.3 with sigma = sqrt(procVar + qVar) defined outside the likelihood loop so that it is computed just once per iteration.

2 This root-sum-squared (RSS) procedure is common practice in data analysis generally.

Model 14.3: Daytime (informative hyperpriors)

1   Constants:
2     N = 43,                    // # of points
3     qVar = 1/12;               // quantization variance
4   Data:
5     day[N], daytime[N];
6   Variables:
7     mu[N], A, P, phi, D, procVar, sigma, i;
8   Priors:
9     A ~ Uniform(0, 720);
10    P ~ Normal(365.25, 1);
11    phi ~ Uniform(-Pi/2, 0);   // with Pi a known constant
12    D ~ Uniform(0, 1440);
13    procVar ~ Jeffreys(0.1, 50);
14    sigma = sqrt(procVar + qVar);   // standard RSS
15  Likelihood:
16    for (i, 1:N) {
17      mu[i] = A*sin(2*Pi*day[i]/P + phi) + D;
18      daytime[i] ~ Normal(mu[i], sigma);
19    }
20  Extras:
21  Monitored:
22    A, P, phi, D, sigma;

The hierarchy is evident in the computation of mu[i]. The latter is just one parameter in the (Normal) likelihood but it has four hyperparameters, {A, P, φ, D}. We had some idea of what the values of these hyperparameters should be and we encoded that prior information in the four corresponding hyperpriors. Parameter procVar, on the other hand, has a fixed prior because we really knew nothing about it a priori so its prior is quite vague.

14.2.4 Daytime Results

The MCMC run for this problem output a trace with more than 250,000 rows and the summary results shown in Table 14.3. We can use the MAP values for {A, P, φ, D} for the best plot of a sine wave together with the data (see Figure 14.6).

It is apparent that this simple model does a good job of describing the data even though we know that it is far from perfect. Any suitable astronomy reference will specify a much more elaborate model. [59, chap. 9] Our sine wave does not quite reach the data during the solstices (high/low points). This failure would have been much more noticeable had we picked a location with more variation in daytime, placing more demands on the model.

Table 14.3: Daytime (solution)

Unknown      Estimate                 95% Credible Interval
             MAP        Mean          Lower Limit    Upper Limit
A            183.34     183.32        181.51         185.11
P            365.49     365.46        364.71         366.18
φ            −1.39      −1.39         −1.41          −1.37
D            728.43     728.44        727.05         729.85
sigma        4.14       4.52          3.56           5.57

Figure 14.6: Daytime (MAP Result)

As mentioned above, we should check that our “vague” priors were actually vague. If we compare the 95-percent credible intervals in Table 14.3 with the priors in our model, it should be clear that we were very conservative in our prior estimates. The results have much narrower ranges.

Parameter sigma is of special concern since it tells how well our model matches the data, i.e., how much variation remains unexplained. We specified a broad prior range for procVar using a Jeffreys prior. Posterior sigma is in the range [3.6, 5.6] so we were conservative here as well. Figure 14.7 shows the marginal for sigma. It is almost perfectly smooth even though it is the envelope of a raw histogram. This is a natural consequence of having a large trace array.3 Note also that it is right-skewed which is typical for scale parameters since they cannot be negative.

3 Always assuming that the MCMC run was successful.

Figure 14.7: Daytime (marginal for sigma)

14.3 Hale-Bopp CN

With very minor changes to the previous example, we can perform a regression analysis in which the data are not equally precise, i.e., a weighted regression. Such models can get very complicated but we shall consider a simple case to illustrate some important points. Our data, extracted from a published plot, consist of measured rates for the release of cyanide, CN, from comet Hale-Bopp as it approached the sun. [54] The observed rates, along with their 1-sigma uncertainties, are listed in Table 14.4 and plotted in Figure 14.8.

Table 14.4: Hale-Bopp CN Data

Rate                             Distance   Uncertainty
(molecules per second)/10^25     AU^a       (molecules per second)/10^25
130                              2.9        40
190                              3.1        70
 90                              3.3        20
 60                              4.0        20
 20                              4.6        10
 11                              5.0         6
  6                              6.8         3

a astronomical unit = Earth–Sun distance

Figure 14.8: Hale-Bopp CN Release vs. Distance

In this analysis, there are only seven datapoints and most of them are very uncertain (large error bars). Therefore, these data contain only a little information. With only a little information, we cannot expect a lot in the way of conclusions. This is a general truth that cannot be fully appreciated unless robust credible intervals are computed which is why our solution tables always include them. A limited amount of information will support only a simple model with very few unknowns since any available information will get spread out (“diluted”) over the posterior (hyper)volume and this volume increases exponentially with the number of parameters.

Here, the points closest to the sun are at a distance of about 3 AU—approximately the distance of the asteroid belt. The comet still has a long way to go to get close to the sun so the rate of CN release could increase a lot more. This suggests that we could model the observed rate as exponential, going to zero at infinity but getting very large at small distances. That would require only two parameters, A and B, as shown in (14.3).

\[
\mathrm{rate} \;=\; A \exp(-B\,\mathrm{distance}); \qquad A, B > 0
\tag{14.3}
\]

Clearly, A and B must be positive but we do not know much else about them so their priors will have to be quite vague. All of our conclusions must come from the data. The model, this time with a measurement uncertainty for each datapoint plus some extra work to quantify goodness-of-fit, is shown in Model 14.4. Note: We do not include a parameter to describe uncertainty for the model as a whole because there is not enough information in this dataset to justify it (cf. sect. 12.6).

Model 14.4: Hale-Bopp CN (weighted regression)

1   Constants:
2     N = 7,                     // # of points
3     TSS = 628.66;              // total_sum_squares
4   Data:
5     rate[N], distance[N], sig[N];
6   Variables:
7     mu[N], A, B, err, ESS, RSquared, i;
8   Priors:
9     A ~ Jeffreys(100, 10000);
10    B ~ Jeffreys(0.1, 10);
11  Likelihood:
12    for (i, 1:N) {
13      mu[i] = A*exp(-B*distance[i]);
14      rate[i] ~ Normal(mu[i], sig[i]);
15    }
16  Extras:
17    ESS = 0;
18    for (i, 1:N) {
19      err = (rate[i] - mu[i])/sig[i];   // weighted error
20      ESS = ESS + err*err;              // increment error_sum_squares
21    }
22    RSquared = 1 - ESS/TSS;             // fit metric
23  Monitored:
24    A, B, RSquared;

Given the nature of this model, the priors for A and B needed a little trial-and-error (exploratory data analysis) to ensure that they were sufficiently vague. Ultimately, Jeffreys priors were chosen but Uniform priors could also have been used. We should not expect to get the same answers for both. That happens only when there is so much information in the data that it effectively overwhelms the priors. Not so in this case; we do not have enough information to say, a priori, which prior form should be preferred.

As part of exploratory analysis, we could get a hint about the values of A and B by taking (natural) logs of both sides of (14.3) giving (14.4) and doing the resulting linear regression. Indeed, there are some “scientific” calculators that will perform this transform automatically then output the M-L parameters.

\[
\log(\mathrm{rate}) \;=\; \log(A) - B\,\mathrm{distance}
\tag{14.4}
\]

Sadly, there is a catch. The implication that the M-L values of A and B resulting from this least-squares procedure using (14.4) will be the same as the values from a nonlinear least-squares analysis using (14.3) is not true. When you take logs of (14.3), the dependent variable and its “error” from the model get reduced in value and this reduction is nonlinear, that is, it affects different datapoints differently. The net result is that the M-L values for any parameters will change. This problem is not fixed by using MCMC. Whatever you do, (14.3) and (14.4) are not equivalent with respect to parameter estimation.
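The point is easy to demonstrate with the data in Table 14.4. The sketch below is only a rough illustration (the exact weighting behind the quoted M-L numbers is not specified here): it fits (14.3) by weighted nonlinear least squares and (14.4) by ordinary least squares on the logs, and the two parameter sets differ substantially.

import numpy as np
from scipy.optimize import curve_fit

rate = np.array([130., 190., 90., 60., 20., 11., 6.])          # Table 14.4
distance = np.array([2.9, 3.1, 3.3, 4.0, 4.6, 5.0, 6.8])
sig = np.array([40., 70., 20., 20., 10., 6., 3.])

# weighted nonlinear least squares on (14.3)
def f(d, A, B):
    return A*np.exp(-B*d)

popt, _ = curve_fit(f, distance, rate, p0=(1000.0, 1.0), sigma=sig, absolute_sigma=True)
A_nl, B_nl = popt

# unweighted linear least squares on the log form (14.4)
slope, logA = np.polyfit(distance, np.log(rate), 1)
A_log, B_log = np.exp(logA), -slope

print(A_nl, B_nl)      # weighted nonlinear estimates of A and B
print(A_log, B_log)    # noticeably different: the log transform re-weights the points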

In addition to parameter values, our model outputs extra quantities that will help assess goodness-of-fit. We shall discuss results for these two outputs separately.

14.3.1 Results for Parameters

As the credible intervals in Table 14.5 show, the large measurement uncertainties in the data have produced large uncertainties in the parameter estimates.

Table 14.5: Hale-Bopp CN (solution)

Unknown      Estimate                 95% Credible Interval
             MAP        Mean          Lower Limit    Upper Limit
A            1664.2     3242.4        399.4          7644.7
B            0.9027     1.0369        0.7232         1.3781
RSquared     0.9924     0.9905        0.9843         0.9935

For comparison, the M-L solution, using (weighted) least-squares with model (14.3), is {A = 2926.1, B = 1.0464}. Had we used model (14.4) and linear (weighted) least-squares, the M-L solution would have been {A = 7399.6, B = 3.7666}.

14.3.2 Results for Goodness-of-fit

A plot of the MAP and Mean results is shown in Figure 14.9. The mean estimates, usually preferred, give a plot that goes through the 1-sigma error bars for all seven points so this fit would be deemed acceptable provided that the uncertainties in the data are taken into consideration.

We can do better than just a plot, however. We went to the trouble of computing the R-squared goodness-of-fit metric (see chap. 13) as part of the model, something that could also be done in a post-inference phase starting with the trace if all needed variables are monitored. The total sum squares, TSS = N*sample variance, was computed in advance and input as a constant (line 3). The error sum squares, ESS, and the R-squared metric were computed as an extra. Table 14.5, line 3 shows that, given the uncertainties, this exponential model “explains” more than 99 percent of TSS (what there is to be explained). Obviously, RSquared would have been a lot smaller, and the model much worse, had the data uncertainties been ignored since the model comes close to the error bars but not to the datapoints.
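A sketch of how these quantities arise is shown below. For consistency with the weighted errors in the model (and with the constant 628.66 on line 3), TSS here appears to be the sum of squared weighted deviations of the rates from their mean; the (A, B) pair used is simply the mean estimate from Table 14.5, for illustration.

import numpy as np

rate = np.array([130., 190., 90., 60., 20., 11., 6.])       # Table 14.4
distance = np.array([2.9, 3.1, 3.3, 4.0, 4.6, 5.0, 6.8])
sig = np.array([40., 70., 20., 20., 10., 6., 3.])

TSS = np.sum(((rate - rate.mean())/sig)**2)                  # ~ 628.7, matching line 3 of Model 14.4

A, B = 3242.4, 1.0369                                        # mean estimates, used here as one posterior draw
mu = A*np.exp(-B*distance)
ESS = np.sum(((rate - mu)/sig)**2)                           # weighted error sum of squares
RSquared = 1.0 - ESS/TSS
print(TSS, ESS, RSquared)                                    # RSquared close to the values in Table 14.5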

Figure 14.9: Hale-Bopp CN Fit (MAP and Mean)

14.4 Carbon-14 Dating

The previous example discussed a weighted, nonlinear regression in which the weights applied, as usual, to the dependent variable. Measurement errors were supplied as data and we assumed no additional model error because the errors were already quite large and the small amount of data would not support too many unknowns. In this example, we go further by considering a problem in which there are errors in the independent variable as well as the dependent variable, a so-called errors-in-variables problem.

Our dataset goes back all the way to 1949. At that time, the technique of carbon dating was just being developed and tested against old carbon samples with “known” ages. [2] In principle, if you know the specific activity (counts per minute per gram) of fresh samples of carbon-14 (e.g., from living wood) then, given the half-life of C-14, you can predict the activity of old samples in which some of the C-14 would have decayed to nitrogen-14 according to the well-known process

\[
{}^{14}_{\;6}\mathrm{C} \;\longrightarrow\; {}^{14}_{\;7}\mathrm{N} + e^{-} + \bar{\nu}_{e}
\]

These data were assembled in order to test this idea but there are some issues, most notably the quotation marks in the preceding paragraph. Ancient samples of wood are not found saying “2,500 B.C.” on them! Their assigned ages are just some expert's opinion and there is typically a fair amount of uncertainty in such estimates. Nevertheless, we might want to compare the age of old samples with the age predicted from measurements of residual C-14 in the samples. We shall allow for error in known age as well as experimental error. Modeling error should be negligible since the law governing this radioactive decay is also well known (14.5).

\[
y \;=\; A \exp\!\left(-\frac{\ln 2}{h}\,t\right)
\tag{14.5}
\]

where ln2 = (natural) log(2) and h is the half-life.

In 1949, the C-14 half-life was thought to be 5,720 ± 47 years and the parameter for the activity of fresh C-14, A, had been measured4 as 12.5 ± 0.23 min−1g−1. Both of these values will be considered prior information. The data (seven samples with replicate measurements) are as follows:

Table 14.6: Carbon-14 Test Data

Sample   Sp. Activity (y)   σy     Age (t)   σt
1        11.10              0.31   1372      50
         11.52              0.35
         11.34              0.25
         10.15              0.44
         11.08              0.31
2         9.50              0.45   2149      150
3         8.97              0.31   2624      50
          9.03              0.30
          9.53              0.32
4         8.81              0.26   2928      52
          8.56              0.22
5         7.73              0.36   3792      50
          8.21              0.50
6         7.88              0.74   4650      75
          7.36              0.53
7         6.95              0.40   4575      75
          7.42              0.38
          6.26              0.41
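As a quick sanity check, (14.5) with the 1949 prior values already predicts activities in the right range. For Sample 3 (assigned age 2624 years), for instance:

import numpy as np

A, h = 12.5, 5720.0          # 1949 values for fresh specific activity and half-life
t = 2624.0                   # assigned age of Sample 3 (years)
y = A*np.exp(-np.log(2.0)*t/h)
print(y)                     # ~ 9.1 min^-1 g^-1, comparable to the measured 8.97-9.53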

The model, most of which should look familiar, is Model 14.5. As in the previous example, we omit a parameter to describe uncertainty for the model as a whole, in this case because we know that the model is correct.

The hyperpriors for A and h are informative (not vague) because of prior information. The hyperpriors for trueT are all extremely vague, so much so that we have used a Jeffreys prior in order to give all orders of magnitude equal probability.

The solution is presented in Table 14.7. Marginals will not be shown since they appear approximately Gaussian. Goodness-of-fit for this model is shown in Figure 14.10.

4 Adjusted for background counts.

Model 14.5: Carbon-14 Dating (errors in variables)

1   Constants:
2     N = 18, ln2 = log(2);
3   Data:
4     y[N], sigY[N], t[N], sigT[N];
5   Variables:
6     trueT[N], mu[N], A, h, i;
7   Priors:
8     A ~ Normal(12.5, 0.23);
9     h ~ Normal(5720, 47);
10    for (i, 1:N) {
11      trueT[i] ~ Jeffreys(100, 10000);
12    }
13  Likelihood:
14    for (i, 1:N) {
15      t[i] ~ Normal(trueT[i], sigT[i]);
16      mu[i] = A*exp(-(ln2/h)*trueT[i]);
17      y[i] ~ Normal(mu[i], sigY[i]);
18    }
19  Extras:
20  Monitored:
21    A, h;

Table 14.7: Carbon-14 Dating (solution)

Unknown      Estimate                 95% Credible Interval
             MAP        Mean          Lower Limit    Upper Limit
A            12.67      12.70         12.50          12.90
h            5717       5709          5617           5799

Some points of particular interest: First, the posterior curve is consistent with prior information for A since the curve goes through its 1-sigma error bar even though this conclusion requires significant extrapolation. Second, the new value for the half-life of carbon-14 is a bit less than the prior value but, as might be expected from uncertain data, not too much different. The new 95-percent credible interval implies a 1-sigma uncertainty of 46 years (down from 47) assuming that this interval is the standard Gaussian ±1.96 sigma. However, perhaps the most noteworthy thing about this analysis is that we could do it at all. Nonlinear regressions can be tricky and this one had errors in both variables. There is a fair amount of literature discussing such analyses and they can be approached in several different ways. Model 14.5 was used here because it is simply an extension of Model 14.4.

Figure 14.10: Carbon-14 Dating Fit (prior for A shown in red)

14.5 Atmospheric CO2

In Chapter 1, we discussed the problem of modeling atmospheric carbon dioxide, CO2. This is a more difficult analysis than those discussed above because it requires a piecewise regression combining a linear baseline with an exponential function. Here, we begin to get a glimpse of the power of Bayesian inference and MCMC.

The data were shown in Figure 1.4. This graph exhibits two distinct regions: a constant baseline with relatively small fluctuations, lasting for at least 800 millennia, followed by a sudden increase sometime in the eighteenth century and continuing to the present. This is not a shape for which any simple mathematical function will suffice. Instead, we need to describe the two regions with separate functions and connect them at the proper time, tc. Unfortunately, we do not know the value of tc nor, as usual, do we know the parameters of the models describing each region.

As stated earlier, we shall focus on trends and ignore minor fluctuations. Thus, we shall model the left region as a constant baseline. The region on the right looks exponential so, for purposes of illustration, we shall assume it is. We shall have to connect these two parts together to get the final expression for each mu[i]. Setting the start of the exponential region at x = 0, this yields the piecewise (segmented) regression model (14.6).

\[
y \;=\;
\begin{cases}
\text{baseline} & t < t_c \;\; (x < 0)\\[4pt]
A \exp(x/B) & t \ge t_c \;\; (x \ge 0)
\end{cases}
\tag{14.6}
\]

This is all straightforward but, even if we knew the value of tc in advance, there would still be a problem because there is nothing to guarantee that the baseline region and the exponential region will connect up properly. When the exponential begins (x = 0), y = A and it is extremely unlikely that the baseline value will also be A. Therefore, the model will not be continuous which it must be if it is to describe a real atmosphere.

This is where a little prior information of the mathematical variety comes to the rescue. If a function is exponential, then its derivative is also exponential. This is, in fact, the defining characteristic of the exponential function. If we describe the right-hand region in terms of its (exponential) derivative, we gain an additional degree of freedom. We can use this degree of freedom to specify that y(0) = 0. Thus, we have the system

\[
y' \;=\; A \exp(x/B); \qquad y(0) = 0
\tag{14.7}
\]

with the solution
\[
y \;=\; A B \left[\exp(x/B) - 1\right]
\tag{14.8}
\]

Using (14.8) for the right-hand region, we can now model mu[i] as follows:

\[
mu[i] \;=\; \text{baseline} + \bigl(year[i] \ge tc\bigr)\,A B \left[\exp\bigl((year[i] - tc)/B\bigr) - 1\right]
\tag{14.9}
\]

where (year[i] ≥ tc) is, again, a standard Boolean expression.

This expression for mu[i] makes sense only because, in MCMC, we do a traversal of the parameter space in which every state visited has a unique value for tc that can be immediately compared to year[i]. Therefore, the exponential term will be added if and only if it is appropriate. Note that tc is still unknown.
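A sketch of this piecewise mean function in ordinary code (NumPy comparisons return 0/1 exactly as the Boolean switch does in the pseudocode); the parameter values in the example call are the mean estimates quoted later in Table 14.8:

import numpy as np

def mu_co2(year, baseline, A, B, tc):
    """Piecewise mean of (14.9): constant baseline, exponential growth after tc."""
    year = np.asarray(year, dtype=float)
    switch = (year >= tc)                 # Boolean: 0 before tc, 1 at or after tc
    return baseline + switch*A*B*(np.exp((year - tc)/B) - 1.0)

print(mu_co2([1700, 1800, 2000], baseline=279.37, A=0.013, B=54.61, tc=1734.17))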

14.5.1 Priors

All of our likelihood parameters must be greater than zero and we shall try to give vague priors to those we know nothing about. We do know, a priori, that tc is almost certainly sometime in the eighteenth century so we adopt a Normal(1750, 15) prior for it. A three-sigma deviation would then take us almost to the end of the century in either direction. We shall use Jeffreys priors for A, B and sigma.

The priors for A, B and sigma had to be chosen with some care. Equation 14.9 shows that parameter A appears nowhere by itself, only as the product AB and there will be an infinite number of AB pairs that give the same product. Since both parameters require vague priors, there is little to constrain the value of A. Indeed, were it not for the fact that B appears by itself elsewhere in the formula, the MCMC process would never converge! If this model is run with priors that are “too vague”, this lack of constraint will likely give spurious solutions, that is, solutions found that have no physical credibility.5 As in the last example, we rely on prior information.

5 There are genuinely multi-modal cases. See, for example, Ref. [27].

14.5.2 Complete Model

Putting everything together, we get Model 14.6.

Model 14.6: Atmospheric CO2 (piecewise regression)

1   Constants:
2     N = 2016;                  // # of points
3   Data:
4     year[N], CO2conc[N];
5   Variables:
6     A, B, tc, baseline, mu, sigma, i;
7   Priors:
8     A ~ Jeffreys(0.001, 0.1);
9     B ~ Jeffreys(10, 100);
10    tc ~ Normal(1750, 15);
11    baseline ~ Uniform(250, 300);
12    sigma ~ Jeffreys(1, 10);
13  Likelihood:
14    for (i, 1:N) {
15      mu = baseline +
16           (year[i] >= tc)*A*B*(exp((year[i] - tc)/B) - 1);
17      CO2conc[i] ~ Normal(mu, sigma);
18    }
19  Extras:
20  Monitored:
21    A, B, tc, baseline, sigma;

14.5.3 Results

Partial results were provided in Table 1.3. More complete results are listed in Table 14.8. Clearly, the graph shown earlier (1.5), which uses parameter means for point estimates, looks like a good fit for the trend. The value for tc, 1734, is quite reasonable but is still somewhat uncertain. In general, parameters that connect two models, either as here or as weighting factors, are often the most uncertain in a posterior result. The results for sigma (small, with a narrow range) verify that the model is quite good and that the prior for sigma (line 12) was sufficiently vague.

Table 14.8: Atmospheric CO2 (solution)

Unknown      Estimate                 95% Credible Interval
             MAP        Mean          Lower Limit    Upper Limit
A            0.013      0.013         0.008          0.017
B            54.56      54.61         53.59          55.64
baseline     279.37     279.37        279.26         279.48
tc           1734.06    1734.17       1716.16        1751.06
sigma        2.42       2.42          2.35           2.50

14.6 Horned Lizards

In Chapter 1, we described the problem of estimating the size of a closed population of horned lizards in a given region. The experimental capture-recapture data were presented in Table 1.1. However, this problem is simple enough that we can get by with grouped data as shown in Table 14.9. The unknown of interest is the size, N, of the lizard population. The capture probability, p, is taken as an unknown constant. The results were presented in Table 1.2 and we can now describe the model, M0, from which they were obtained.

Table 14.9: Horned Lizards: Grouped Data

# Times Captured (k)   1    2    3    4   5   6   7   8   9   10   11   12   13   14
Frequency (nk)         34   16   10   4   2   2   0   0   0   0    0    0    0    0

From Table 14.9, it is apparent that the grouped data for all Nobs = 68 lizards fall into J = 14 disjoint categories giving the number of lizards, nk, which were captured k times. These 14 categories are not complete because the category n0 is, of course, missing. Were it not, vector n[ ] would be Multinomial. We provide a hierarchical explanation for these multinomial cell probabilities, pk, specifying them to be Binomial(p, J), hence (14.10).

\[
n[\,] \;\sim\; \frac{N!}{\prod_{k=0}^{J} n_k!}\;
\prod_{k=0}^{J} \left[\binom{J}{k} p^{k} (1-p)^{J-k}\right]^{n_k}
\tag{14.10}
\]

Except for k = 0, the denominator in the fraction above is constant. Since we do not plan to do any model comparisons, we do not need the marginal likelihood so we can simplify the algebra a bit by reducing this denominator to its first factor, n0! = (N − Nobs)!. The corresponding factor in the second product is [(1 − p)^J]^(N − Nobs). Omitting all constants, we now have the logLikelihood as (14.11).

\[
\begin{aligned}
\mathrm{logLik} \;=\;& \log(N!) - \log\bigl((N - N_{\mathrm{obs}})!\bigr) + J\,(N - N_{\mathrm{obs}})\log(1-p)\\
&+ \sum_{k=1}^{J} n_k \bigl(k \log(p) + (J-k)\log(1-p)\bigr)
\end{aligned}
\tag{14.11}
\]

in which every term contains an unknown, N or p.

This formula is not a standard Multinomial since we have no data for n0. Therefore, we shall have to do some (software-dependent) customizing in the pseudocode in order to implement it. Also, since our indices start at one, we must augment the data vector with a fifteenth “observation” for k = 0.
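For reference, (14.11) translates directly into ordinary code; a sketch using SciPy's gammaln for the log-factorials (log z! = gammaln(z + 1)), with arbitrary trial values for the unknowns:

import numpy as np
from scipy.special import gammaln

def log_lik(N, p, nk, J=14, Nobs=68):
    """logLikelihood (14.11), omitting constants; nk[k-1] = number of lizards caught k times."""
    k = np.arange(1, J + 1)
    terms = nk*(k*np.log(p) + (J - k)*np.log(1.0 - p))
    return (gammaln(N + 1) - gammaln(N - Nobs + 1)
            + J*(N - Nobs)*np.log(1.0 - p) + terms.sum())

nk = np.array([34, 16, 10, 4, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0])   # Table 14.9
print(log_lik(N=82, p=0.12, nk=nk))                             # arbitrary trial values of N and p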

Our priors will be vague. For N, we shall use a DiscreteUniform(68, 150) and, for p, a Uniform(0, 1) hyperprior. The final model is Model 14.7.6

We have already discussed the solution coming out of this model but it is interesting to see the complete marginals for N (discrete) and p (Figs. 14.11 and 14.12). These figures verify that our priors were vague, as intended.

We should emphasize again that the syntax in line 22 of this model is specific to the software being used to implement MCMC. Any MCMC software application contains a parser to interpret the pseudocode supplied by the user and such parsers contain whatever rules the developer decided to use given the desired functionality of MCMC models. It is vitally important that users read the appropriate software documentation thoroughly and not assume that such packages are all similar.

6 In this MacMCMC syntax, a terminal underscore signifies that the variable will be treated as an integer.

Model 14.7: Horned Lizards (M0)

1   Constants:
2     J = 14, J1 = J+1, Nobs = 68;
3   Data:                         // augmented for unobserved
4     n_[J1];
5   Variables:
6     N_, p, logLikTerm[J1], logp, log1mp, k;
7   Priors:
8     N_ ~ DiscreteUniform(68, 150);
9     p ~ Uniform(0, 1);
10  Likelihood:
11    // logLikelihood computation (omitting constants)
12    logp = log(p);
13    log1mp = log(1-p);
14    for (k, 1:J) {              // binomial bin probability
15      logLikTerm[k] = n_[k]*(k*logp + (J-k)*log1mp);
16    }
17    // remaining terms
18    logLikTerm[J1] = logFactorial(N_) - logFactorial(N_ - Nobs) +
19                     J*(N_ - Nobs)*log1mp;
20    // likelihood statement
21    for (k, 1:J1) {
22      n_[k] ~ Generic(logLikTerm[k], 0, 100);
23    }
24  Extras:
25  Monitored:
26    N_, p;

Figure 14.11: Horned Lizards (marginal for N)

Figure 14.12: Horned Lizards (marginal for p)

14.7 Hockey Goals

Standard “least-squares” gets its name from the fact that the M-L fit can be found by minimizing the sum of the squared deviations, along the ordinate, of the datapoints from the curve. This strategy assumes that the likelihood of the data is ∼ Normal(µi, σi) as was done with the previous examples in this chapter. However, this may not be appropriate. For instance, if the observations are integers with small values, their discreteness prohibits pretending that they are continuous and so they cannot be normally distributed about the curve even as a first approximation.

Here, we discuss one alternative in which we take some discrete observations and try to predict them with a linear function—a generalized linear model (glm). [16, chap. 6]

The data refer to the game of ice hockey, specifically the number of goals scored by the winning team in each regular-season game over one season (U.S. NHL, 2017). [47] A summary is presented in Table 14.10.

Table 14.10: Goals by the Winning Team (NHL, 2017)

# Goals      1    2     3     4     5     6     7    8    9   10
Frequency    39   173   286   337   249   118   52   13   0   2

We shall consider just two explanatory variables: the number of ShotsFor taken by the winner and whether or not it was a Home game. ShotsFor ranges from 15 to 52 and will be assumed continuous. Home, a nominal datum, must be implemented as an indicator variable (see Sect. 3.4). We do not really have much prior information, not even enough to know for certain that these “explanatory” covariates really explain the number of goals so we shall consider four models with increasing numbers of covariates including one interaction quantity equal to the product of our two covariates. In all cases, we shall adopt a Poisson likelihood just as we did with our simple Pumps model (Sect. 12.4) but subtracting 1 from Goals in order to begin at zero where the Poisson support begins.

This glm strategy produces four alternate models to predict a value for the Poisson parameter, λi, where i indexes the datapoints. As is often done and to ensure that λi is always predicted to be positive, we compute it in log space, as follows:

\[
\begin{aligned}
\log(\lambda_i) &= c_0 && (14.12)\\
\log(\lambda_i) &= c_0 + c_s\,\mathrm{ShotsFor}[i] && (14.13)\\
\log(\lambda_i) &= c_0 + c_s\,\mathrm{ShotsFor}[i] + c_h\,\mathrm{Home}[i] && (14.14)\\
\log(\lambda_i) &= c_0 + c_s\,\mathrm{ShotsFor}[i] + c_h\,\mathrm{Home}[i] + c_i\,\mathrm{ShotsFor}[i]\,\mathrm{Home}[i] && (14.15)
\end{aligned}
\]

The coefficients in each case are all unknown but we should be very chagrined if cs turns out to be unimportant and/or negative! We shall assess these four models using their respective marginal likelihoods, as usual.

14.7.1 NHL Goals Model

The second model, with appropriately vague priors, is listed below.

Model 14.8: NHL Goals Model m01 (Poisson regression)

1   Constants:
2     N = 1269;
3   Data:
4     GoalsM1[N], ShotsFor[N], Home[N];   // recorded stats
5   Variables:
6     logLambda[N], c0, cs, i;
7   Priors:
8     c0 ~ Normal(0, 1);
9     cs ~ HalfNormal(0.1);               // to avoid overflow
10  Likelihood:
11    for (i, 1:N) {
12      logLambda[i] = c0 + cs*ShotsFor[i];
13      GoalsM1[i] ~ Poisson(exp(logLambda[i]));
14    }
15  Extras:
16  Monitored:
17    c0, cs;

14.7.2 Results

Here are the results for model m01.

Table 14.11: NHL Goals Model m01 (solution)

Unknown      Estimate                       95% Credible Interval
             MAP          Mean              Lower Limit    Upper Limit
c0           0.811467     0.811595          0.649792       0.969875
cs           0.00855442   0.0085398         0.0037553      0.012634

The coefficients in a Poisson regression must be interpreted with care. Apart from c0, each coefficient gives the change in log(lambda) per unit change in the respective variable, all other things being equal. Thus, with Model m01, we expect log(lambda) to increase by 0.0085398 for each additional ShotFor by the winning team. Lambda is equal to the product of the exponentiated model terms so, with one more ShotFor, the average number of goals by the winning team is predicted to increase by a factor of exp(0.0085398) ≈ 1.009 if all other covariates are unchanged. Here, it is the average which is affected because, in a Poisson distribution, the average is equal to lambda.

In our model, we have used the covariates in their raw form. Often, with glm analyses especially, it is useful to convert each to its standard form by subtracting its mean then dividing by its standard deviation. This allows MCMC to converge more efficiently but makes interpretation a bit harder. In this example, standardization does not change any of our conclusions.

14.7.3 Model Comparison

Model m01 is not our only possibility. Shown below are some results for the alternatives described above along with their marginal likelihoods and normalized probabilities.

Table 14.12: NHL Goals: Model Comparison

Model    Mean Estimate                                          log(MargLik)   Probability
         c0         cs          ch          ci
m0       1.08671                                                −2314.25       0.049
m01      0.811595   0.0085399                                   −2311.37       0.918
m012     0.813437   0.0086486   −0.009770                       −2314.73       0.031
m0123    0.730338   0.011279    0.13707     −0.0045638          −2317.37       0.002

It is apparent (cf. Sect. 10.5.3) that Model m01 is by far the best of those considered. It might seem that Models m012 and m0123 should subsume model m01 and be at least as good. However, as discussed earlier, there is a price to pay for using parameters that do not contribute significantly to the model. So it is with the Home covariate and the interaction term. Of course, c0 by itself is not much good either. That would imply that ShotsFor had nothing to do with the number of goals scored!

14.7.4 Goodness-of-fit

Although Model m01 is the best of our alternatives, we must always be mindful that “best” does not imply “good”. It is possible that Model m01 is not a good model after all. We shall investigate this by a posterior-predictive method (see Sect. 13.2).

We do this check by post-processing the m01 trace with a double loop. In the outer loop, here repeated 100 times, we choose a random row of the trace. With these parameters, we loop over all of the datapoints, computing lambda for each and then selecting a random number of GoalsM1 from the resulting Poisson distribution. The totals for the whole procedure were divided by 100 giving the comparison shown in Table 14.13 and Figure 14.13.
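A sketch of that double loop, with placeholder arrays standing in for the exported m01 trace and the recorded ShotsFor values:

import numpy as np

rng = np.random.default_rng(7)
c0_trace = rng.normal(0.8116, 0.082, size=250000)    # stand-ins for the real trace columns
cs_trace = rng.normal(0.00854, 0.0023, size=250000)
ShotsFor = rng.integers(15, 53, size=1269)           # stand-in for the recorded stats

counts = np.zeros(11)                                          # predicted frequencies for GoalsM1 = 0..10
for row in rng.choice(len(c0_trace), size=100, replace=False): # outer loop: 100 random rows
    lam = np.exp(c0_trace[row] + cs_trace[row]*ShotsFor)       # lambda for every datapoint
    sim = rng.poisson(lam)                                     # replicated GoalsM1 values
    counts += np.bincount(np.clip(sim, 0, 10), minlength=11)   # values above 10 lumped into the last bin

counts /= 100.0                                       # average predicted bin frequencies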

What can we say in conclusion? Clearly m01, our best model, is not wonderful but it is not terrible either. We identified only one covariate that was usefully informative, viz., ShotsFor, but that was obvious from the beginning.

Table 14.13: Goals by the Winning Team (NHL, 2017)

Goals − 1    0    1     2     3     4     5     6    7    8    9   10
Observed     39   173   286   337   249   118   52   13   0    2   0
Predicted    66   195   287   283   211   125   62   26   10   4   1

Figure 14.13: NHL Goals from Poisson Regression

at home is a decided advantage. Here it apparently is not although a dataset covering many years might change that conclusion. Since Home was not informative, the interaction term was understandably useless as well.

One could easily elaborate this model with additional covariates. Hockey fans (and others) are, of course, welcome to try.

14.8 About Predictions
Our examples thus far have shown some of the power of Bayesian inference and, while the results are impressive, none of them are very surprising which is a bit unfair because surprises do happen from time to time. For instance, suppose that you have observed probabilities for the occurrence of some events and you want to make predictions of future events. The most likely (M-L) estimate for any proportion, assuming prior ignorance, is just the observed proportion. An example is the dataset listed in Table 14.14. These data show free-throw success for the best free-throw shooters in the U.S. National Basketball


Association (NBA) for the first ten games of the 2009–10 regular season. [62]

Table 14.14: Free Throw Success: Games 1–10

Team   Free Throws              Team   Free Throws
       Made    Attempted               Made    Attempted
ATL     13       15              MIL     11      14
BOS     16       18              MIN      5       8
CHA     24       32              NJN     29      35
CHI     24       31              NOH      6       6
CLE     24       27              NYK     22      25
DAL     75       82              OKC     84      97
DEN     55       61              ORL     14      17
DET     58       69              PHI     15      19
GSW     11       13              PHX     27      29
HOU     32       41              POR     12      15
IND     55       68              SAC     12      14
LAC     18       25              SAS     45      53
LAL      5        9              TOR     10      13
MEM     26       32              UTA     13      17
MIA     24       28              WAS     14      15

Suppose, back then, you had wanted to predict free-throws made, given number of attempts, for these same players over the rest of that season. You might think that the set of probabilities for the first ten games would apply, approximately, to the rest of the season. If your goal was to make the best prediction for just one or two players, then their historical probabilities would be the best choice for making those predictions. In general, however, if you want to make the best set of predictions for more than two things, then you can always do better than to trust historical proportions. [11] This situation is extremely common. It occurs, for example, in any study in which the events of interest happen in many places and you need a "global" metric of some sort for decision-making purposes.

The reason why individual historical proportions do not suffice to optimize this metric is quite interesting and is exemplified with this dataset. These 30 best free-throw shooters are not random basketball players; they comprise an elite group. If our model for free-throw success ignores this "group-ness" or "family resemblance", then the model is sub-optimal since it discounts meaningful prior information.

The standard likelihood for any proportion is Binomial(p, A) where, in this case, p is the probability of making a free throw and A is the number of attempts. Here, we have a vector, p[], of size 30. We could just define this variable in our model with separate Uniform(0, 1) priors. The model would then output the usual M-L result. To incorporate the notion of "group-ness" we shall instead use a hierarchical model in which we have a single hyperprior for all 30 elements of p[]. This hyperprior can be thought of as the "parent" of 30 children so the "family resemblance" will be built in automatically.


The usual prior for a Binomial likelihood is the Beta(a, b) distribution because it is the conjugate prior (Sect. 11.3.2) of the Binomial distribution and is, therefore, very useful for Gibbs sampling (Sect. 11.3). However, its parameters pose some problems. Both are shape parameters and it is difficult to get an intuitive feel for what their values should be. Also, they are strongly correlated and this is deleterious to good MCMC mixing. One solution, often mentioned in the literature, is to parametrize the Beta differently, using more familiar quantities from which the standard parameters, a and b, can be easily computed. The suggested quantities are the mean, mu, of the Beta = a/(a + b) and the denominator of the latter, sz, which is very roughly the size of the sample being modeled. We shall adopt this suggestion. Our DAG will then look like Figure 14.14.
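To see that this {mu, sz} parametrization behaves as advertised, a short SciPy sketch (illustrative values only, not results from this analysis) confirms that a + b = sz and that the mean of the resulting Beta is mu:

    from scipy import stats

    mu, sz = 0.83, 120          # illustrative values only
    a = mu * sz                 # a = mu*sz
    b = sz - mu * sz            # b = (1 - mu)*sz, so a + b = sz

    print(a, b, stats.beta(a, b).mean())   # mean comes back as mu = 0.83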

Figure 14.14: DAG: Free-throw Model (hyperparameters mu and sz determine a and b; p[i] ~ Beta(a, b); M[i] depends on p[i] and A[i]; plate: for (i, 1:30))

14.8.1 Complete model
Since the 30 priors for p[] all come, by design, from a Beta hyperprior, the only additional priors we require are those for mu and sz. For mu, we can use a Uniform[0.5, 1] prior


because the players of interest will surely make more than half their free throws. The sz parameter must be greater than zero and have reasonable density stretching well beyond our population of 30. A simple choice is HalfNormal(s) with s = 100. As always, we must check the resulting marginals to make sure that these choices were conservative since they are meant to be vague.

Everything else is as usual and the complete pseudocode is presented in Model 14.9. Note, in particular, the computation of Beta parameters a and b in lines 10–11. This deterministic sequence corresponds to the dashed arrows in the upper part of the DAG.

Model 14.9: Free-throw Model (making predictions)

 1  Constants:
 2    N = 30;                          // # of points/teams
 3  Data:
 4    A[N], M[N];                      // A = Attempted, M = Made
 5  Variables:
 6    p[N], mu, sz, a, b, i;
 7  Priors:
 8    mu ~ Uniform(0.5, 1);
 9    sz ~ HalfNormal(100);
10    a = mu*sz;
11    b = sz - mu*sz;
12    for (i, 1:N) {
13      p[i] ~ Beta(a, b);
14    }
15  Likelihood:
16    for (i, 1:N) {
17      M[i] ~ Binomial(p[i], A[i]);
18    }
19  Extras:
20  Monitored:
21    p[], mu, sz, a, b;

14.8.2 Results
Let us first check whether the marginals for mu and sz are acceptable (different from their priors). Plots for these marginals are shown below (Fig. 14.15). Both of these plots are very different from their priors implying that there was sufficient information in the data to say something useful about them. Also, it is apparent that our priors for them were conservative. Figure 14.16 shows what the Beta prior itself looks like. This Beta


distribution describes all of the p[] values collectively, i.e., the p[] "family". Clearly, it is not vague. Also, its posterior domain seems very much as expected for NBA free-throw champions.

Figure 14.15: Beta Hyperprior (marginals for mu and sz)

Figure 14.16: Beta PDF ({mu, sz} → {a, b}), with Mean and MAP estimates marked

Table 14.15 shows the complete solution for the Beta parameters plus the min and max of the 30 values for mean free-throw probability where indices 13 and 6 correspond to LAL (Los Angeles) and DAL (Dallas), respectively.


Table 14.15: Free Throws (solution)

Unknown    Estimate                95% Credible Interval
           MAP        Mean         Lower Limit    Upper Limit
mu         0.847      0.833        0.803          0.861
sz         335.5      119.6        24.02          231.8
a          284.1      99.8         19.7           193.9
b          51.4       19.8         4.2            38.3
p[13]      0.836      0.809        0.720          0.889
p[6]       0.861      0.869        0.916          0.922

14.8.3 Checking the Predictions
So, how well did we do? In order to answer this question, we need to define, a priori, a metric for model goodness. It is no fair looking at the results then choosing a metric to test their quality. If you want people to believe your analysis, you should state your intentions in advance. In this example, that is easy to do because comparing a vector of observed frequencies, xo, with the predictions of a model, xe, is almost always done using the χ²

(chi-squared) metric defined as follows:

χ² = ∑_i (xo[i] − xe[i])² / xe[i]        (14.16)

The larger the value of χ², the worse the model. Under some fairly mild assumptions, the expected value of χ² is equal to the number of degrees of freedom, df, appropriate to the problem. In our example, there are no constraints on total frequency so df = 30. A value much greater than this will signify a poor model while one much less will indicate a good model. With some standard assumptions, this assessment can be made more quantitative.
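In code, the metric of Equation 14.16 is a one-liner; the Python sketch below checks it against the first three M-L entries of Table 14.16 (whose individual terms are 0.00, 0.17 and 0.22):

    import numpy as np

    def chi_squared(observed, expected):
        # Equation 14.16
        o = np.asarray(observed, dtype=float)
        e = np.asarray(expected, dtype=float)
        return np.sum((o - e) ** 2 / e)

    print(chi_squared([55, 215, 78], [55, 209, 74]))   # about 0.39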

Our 30 comparisons are listed in Table 14.16 with the worst ones shown in red. Here, we give the actual results for the rest of the season (cols. 2–3), the predictions made ignoring "group-ness" (col. 4) and the predictions made using our hierarchical model (cols. 6 and 8). The χ² terms for these predictions are given as well (cols. 5, 7, 9, resp.) with the total metrics at the bottom in bold type. Given the latter, it should be apparent that the M-L predictions are very poor while the predictions of our model are quite good whether we use the MAP or the Mean point estimates for p[i]. Any standard χ² table will say that, with df = 30, the probability of getting a chi-squared value as large as 44.45 is about four percent while getting one as small as either of our point estimates is about one percent. These percentages come from a frequentist analysis but they are extreme enough that a Bayesian analysis would not contradict them to any significant extent.

How would one do such a Bayesian analysis? Simply by computing the metric at every iteration as an "extra" and outputting it as a column in the trace. That would give us the marginal for χ² under the conditions of our specific problem (i.e., without additional assumptions) from which we could obtain a proper credible interval to which we could


Table 14.16: Free Throws: Prediction Comparison

Team   Games 11–82             M-L Result            MAP Result            Mean Result
       Made    Attempted       Pred.    χ² term      Pred.    χ² term      Pred.    χ² term
ATL     55       64             55       0.00         55       0.00         54       0.02
BOS    215      235            209       0.17        196       1.84        198       1.46
CHA     78       99             74       0.22         83       0.30         80       0.05
CHI    158      189            146       0.99        159       0.01        155       0.06
CLE    153      171            152       0.01        145       0.44        145       0.44
DAL    461      504            461       0.00        434       1.68        438       1.21
DEN    411      451            407       0.04        388       1.36        387       1.49
DET    140      161            135       0.19        136       0.12        135       0.19
GSW     90      101             85       0.29         87       0.10         84       0.43
HOU    213      257            201       0.72        214       0.00        210       0.04
IND    308      360            291       0.99        302       0.12        296       0.49
LAC    104      120             86       3.77        101       0.09         97       0.51
LAL     94      112             62      16.52         94       0.00         91       0.10
MEM    186      230            187       0.01        194       0.33        190       0.08
MIA    168      212            182       1.08        179       0.68        178       0.56
MIL    116      126             99       2.92        106       0.94        104       1.38
MIN     56       62             39       7.41         52       0.31         51       0.49
NJN    104      118             98       0.37        101       0.09         98       0.37
NOH     90      101            101       1.20         86       0.19         85       0.29
NYK    229      282            248       1.46        241       0.60        238       0.34
OKC    672      743            643       1.31        627       3.23        631       2.66
ORL     57       65             54       0.17         55       0.07         54       0.17
PHI     80       95             75       0.33         80       0.00         78       0.05
PHX    184      196            182       0.02        169       1.33        168       1.52
POR     73       83             66       0.74         70       0.13         69       0.23
SAC    106      127            109       0.08        108       0.04        106       0.00
SAS    264      302            256       0.25        257       0.19        253       0.48
TOR    193      228            175       1.85        195       0.02        188       0.13
UTA    147      176            135       1.07        149       0.03        145       0.03
WAS    115      130            121       0.30        111       0.14        110       0.23

                               χ² =     44.45                  14.38                 15.50

compare the original, M-L metric (44.45). Alternatively, we could just look at the χ² column in the trace and determine the fraction of entries less than 44.45, i.e., the quantile of the original metric (Sect. 3.3.3).

The values in Table 14.16 deserve some more comment. First, it is important to note that the M-L predictions are sometimes better than our model predictions. Our model is better when the entire set of predictions is taken into consideration. Moreover, it is better


in expectation meaning that, if you do a lot of these analyses, from time to time you will do worse overall with the hierarchical model (another way of saying that the χ² metric is not perfect). Also, how much better the hierarchical model is depends on the size of the population under study. The smaller the group, the weaker the "group-ness" effect.

This effect, known for a long time, is often called regression toward the mean implying that quantities much less than average tend to increase in the future while those much greater than average tend to decrease. In other words, the global average acts somewhat like a magnet as far as predictions are concerned. This helps to explain why the M-L prediction for LAL was so bad since that prediction was well below average and based on very little data (see Table 14.14). In contrast, the M-L prediction for OKC was well above average but based on a lot of data. In hindsight, forcing it down toward the average was unfair so our model did not do well in that instance.

These (possibly surprising) results show again the importance of considering all valid prior information. Sometimes this takes a bit of imagination but it is worth it.

14.9 Pumps (Hierarchical)
In Section 12.4, we considered a sample of ten counts assumed to be ∼ Poisson. Since we had no prior information distinguishing the pumps, we gave a trivial prior for each one separately. However, considering the discussion for the previous example, it is apparent that we could have done better; we could have acknowledged that the pumps were similar so the ten θ parameters should have some similarity as well.

Just as with our free-throws, such similarity is usually described by letting the similar parameters come from a common prior instead of being assumed totally separate. Of course, this means additional parameters in the model and, other things being equal, this would make the model worse (smaller marginal likelihood). On the other hand, if the "family resemblance" is genuine and important, the model might get better in spite of the extra parameters. Let us see whether this is the case for our ten pumps.

We shall use the same data and likelihood as before but let θj be given a common Gamma(a, b) prior. Since we lack information about the values of the hyperparameters of this Gamma distribution, we shall have to give them vague hyperpriors. More parameters and still vague might sound like a bad idea but it is not as bad as it sounds because the vagueness is further removed from θj. In a sense, we are removing the vagueness from the "parents" of θj and giving it to their "grandparents". The DAG would show this graphically as it did in the free-throws example. In the end, we are hoping that "family resemblance" among the θj will save the day and the new model will be better than the original. The new model is shown below.

The Gamma hyperpriors are in lines 8–9. Both are intended to be vague so, before we look at results for θj, we should consider whether these hyperpriors were indeed vague. Figure 14.17 shows that neither marginal looks anything like its prior and both have a range smaller than their priors would indicate. Therefore, it is probably fair to conclude


Model 14.10: Hierarchical Pumps Model (related Poisson counts)

 1  Constants:
 2    N = 10;                          // # of points
 3  Data:
 4    x[N], t[N];
 5  Variables:
 6    theta[N], a, b, j;
 7  Priors:
 8    a ~ Exponential(1);
 9    b ~ HalfNormal(2);
10    for (j, 1:N) {
11      theta[j] ~ Gamma(a, b);        // was Uniform(0, 10) before
12    }
13  Likelihood:
14    for (j, 1:N) {
15      x[j] ~ Poisson(theta[j]*t[j]);
16    }
17  Extras:
18  Monitored:
19    theta[], a, b;

that they were sufficiently vague and that there was enough information in the data to give them values.

Figure 14.17: Gamma Hyperprior (marginals for a and b)

It will also turn out that the Gamma(a, b) prior itself is virtually flat (a < 1) everywhere that matters so it is very vague as well but, this time, not by construction.


14.9.1 Results
The results for the Gamma hyperparameters are shown in Table 14.17. Table 14.18 shows the new results for the failure rates, θj.

Table 14.17: Gamma Hyperparameters (solution)

Unknown    Estimate                95% Credible Interval
           MAP        Mean         Lower Limit    Upper Limit
a          0.7033     0.7543       0.2693         1.3379
b          0.6271     1.2407       0.2482         2.7042

Table 14.18: Pumps (solution #2)

Pump    Failure Rate
        Nominal    Estimate                95% Credible Interval
                   MAP        Mean         Lower Limit    Upper Limit
 1      0.053      0.052      0.060        0.017          0.111
 2      0.064      0.043      0.104        0.0003         0.260
 3      0.079      0.066      0.090        0.023          0.165
 4      0.111      0.102      0.116        0.060          0.176
 5      0.573      0.344      0.594        0.091          1.204
 6      0.605      0.583      0.609        0.355          0.884
 7      0.952      0.395      0.893        0.003          2.183
 8      0.952      0.230      0.894        0.003          2.186
 9      1.905      1.017      1.529        0.315          2.999
10      2.095      1.661      1.965        1.183          2.811

14.9.2 Model Comparison
We shall compare the results from our two models in two ways. First, we shall compare values for θj directly, especially the widths of their credible intervals. Then, we shall see what the marginal likelihoods have to say about relative model credibility overall.

A comparison of posterior mean values for θj is given in Table 14.19. This table shows not only that the credible intervals are collectively better in our new model but that each one of them is smaller than in our original model in Section 12.4.

Even more interesting is what we see when we compare the width of the credible intervals to the time over which data were observed (see Table 12.6). It appears that as this time increases, the credible interval tends to decrease. Thus, pumps 7, 8 and 9 have the shortest observation times and the worst credible intervals. We see, however, that this effect is smaller with model #2, not unexpected given our discussion in the free-throws


Table 14.19: Pumps (model comparison)

Pump    Failure Rate
        Nominal    Mean                    95% Credible Interval
                   Model 1    Model 2      #1 Width    #2 Width
 1      0.053      0.064      0.060        0.097       0.094
 2      0.064      0.128      0.104        0.301       0.260
 3      0.079      0.095      0.090        0.145       0.142
 4      0.111      0.119      0.116        0.119       0.116
 5      0.573      0.764      0.594        1.379       1.113
 6      0.605      0.637      0.609        0.549       0.529
 7      0.952      1.906      0.893        4.510       2.180
 8      0.952      1.904      0.894        4.501       2.183
 9      1.905      2.383      1.529        3.904       2.684
10      2.095      2.194      1.965        1.771       1.628
                                  Total =  17.276      10.929

example. With our new model, the common prior for θj enables individual rate parameters to "borrow strength from the ensemble" and regress closer to group behavior.

As we have shown before, it is possible to compute the probability that our new model, M2, is better (more credible) than our original model, M1. The results include values for log(marginal likelihood) for M1 and M2, viz., −47.6965 and −34.5292, resp. The value for M2 is larger so it is the better model. We can make this conclusion quantitative as we did before (pg. 141). The numbers speak for themselves.

oddsM1 = exp(−47.6965 − (−34.5292)) = 1.91 × 10⁻⁶        (14.17)

P(M1 better than M2) = oddsM1 / (1 + oddsM1) = 1.91 × 10⁻⁶        (14.18)

Therefore,

P(M2 better than M1) = 1 − 1.91 × 10⁻⁶ ≈ 0.999998        (14.19)

14.10 SAT and Horsepower
In traditional data analysis, one of the most important problems is to take a dataset (a set of numbers) and partition it into categories (and, perhaps, subcategories) such that the nature of the categories reveals something meaningful about the real world. A familiar example would be a medical study in which the categories {treatment, placebo} are of primary interest. There, the proposition would be that the "treatment" has a real benefit for curing some disease. The data analysis problem, in general, is to take any combination of data and categories and output a credible conclusion regarding a proposed difference between


categories. All real datasets exhibit variability, quantified as variance, and the traditional approach to these "category" problems is called Analysis of Variance (ANOVA).

To illustrate this type of problem, we shall use data for the average (combined) SAT scores for students in the high schools of five counties in Northern Virginia, U.S.A. for the academic year 2013–2014. [42] The dataset is shown in Table 14.20 with counties labeled s1, s2, . . . . The problem, as always, is to find out whether the categories (counties) are different with respect to average SAT score. Since this is such a classic problem, we shall begin by looking at the classic approach and solution.

Table 14.20: Average HS SAT Scores by County

s1      s2      s3      s4      s5
2182    1666    1887    1591    1559
1830    1656    1724    1582    1517
1774    1641    1686    1564    1504
1767    1623    1417    1556    1493
1759    1619            1527    1472
1749    1615            1501
1740    1610            1502
1673    1608            1478
1664    1597            1436
1660    1595            1370
1651    1586            1296
1647    1576
1647    1432
1634
1617
1615
1598
1564
1520
1519
1498
1491
1479
1464
1426

14.10.1 Traditional ANOVA
The basic idea of ANOVA is to partition the total variance of a dataset into some between-category component and within-category component. If the between-category component


is a "significant" fraction of the total variance, then it is fair to say that the categories really matter in some way. To characterize the variance between categories, the means of the categories are computed and the between-category component set equal to the between-means variance. The within-category component is the total variance of the datapoints from their respective category mean.

There are a number of assumptions in the traditional ANOVA procedure not least of which is the test used to provide a probability for the between-within comparison. Such tests are a characteristic of frequentist methods and the standard test in this case is the F-test or variance-ratio test. Converting the F metric into a probability is done via the theoretical sampling distribution of the F metric.

Since this is such a common analysis, it is hard-coded into many spreadsheet programs. Figure 14.18 is a screenshot showing the output for this problem from a spreadsheet in the usual ANOVA format where the qualifier "One-Way" signifies a single set of categories (groups) with no subcategories.

Figure 14.18: SAT: ANOVA From Spreadsheet

The F-metric is computed here as 62638.8/15600.4 = 4.015. The p-level is the area under the right tail of the sampling distribution, i.e., F(4, 53) > 4.015 which (supposedly) gives the probability that the observed F metric would exceed this value if the categories did not matter (the null hypothesis). The conclusion here would be that the probability that these five counties are indistinguishable, based on these SAT scores, is less than one percent. In other words, the difference between counties is real. Whether the size of this difference is meaningful is another question, one that ANOVA cannot answer.
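As a cross-check, that tail area is easy to reproduce with SciPy; the sketch below uses the mean squares quoted in the spreadsheet output (the exact p-level depends on how the F metric is rounded):

    from scipy import stats

    F = 62638.8 / 15600.4                  # about 4.015
    p_level = stats.f.sf(F, dfn=4, dfd=53)
    print(F, p_level)                      # p-level should come out near 0.006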

One shudders to think of all of the assumptions underlying this conclusion. That a single sampling distribution could actually describe any categorization of any dataset, even with all the standard caveats considered, seems very unlikely. Worse still is the canonical


frequentist practice of deprecating the inherent variability of the procedure itself. Even if 0.00644 is the most likely value for the answer in this problem, our coin-flipping example (pg. 95) showed that the most likely answer need not be likely at all and pretending that it is the answer is a really bad idea. Given the algorithmic advances of the past quarter century, namely MCMC, we can now do much better.

14.10.2 A Bayesian Approach
Model 14.11 goes back to first principles and describes a procedure for analyzing this dataset without tests of any kind. As always, everything is done in (log) probability space from the outset so we do not need any sampling distributions or associated tests.

The variables in this model should be mostly self-explanatory and all unknowns are given vague priors (lines 8–10) which assume only that the values must be positive. The five category means, mu[N], are used in the likelihood to describe the SAT scores within the categories and the mu[] mean is called theta (line 13). We again assume a common scale, sig, for the cells. The key variable is fb which equals the fraction of the total variance attributable to the difference between categories. Rather than test this with some sampling distribution, we estimate its 95-percent credible interval directly. The solution is presented below (Table 14.21) and some marginals (smoothed histograms) in Figures 14.19–14.20.7

Table 14.21: SAT ANOVA (solution)

Unknown        Estimate                95% Credible Interval
               MAP        Mean         Lower Limit    Upper Limit
fb             0.00159    0.301        0.0512         0.558
var.between    26         7316         300            16766
var.within     16321      15320        10399          20786
theta          1602       1586         1499           1671

The one thing most obvious in this result is the width of the fb credible interval. The message is that, even after analyzing the data, there is very little information in it that says much about the difference between counties. Also, the mean estimate for fb is well under 0.5 so most of the variance comes from within the categories not because of the categories. In other words, the counties might be different or they might not but this dataset does not suffice to resolve the issue. Allowing the scale for the mu[] to be different from sig gives a worse model (odds against ≈ 100:1) while making only a negligible change to the mean value of fb so our conclusions remain the same.

Compare this Bayesian result to the frequentist analysis (Fig. 14.18) which concluded that there was a greater than 99-percent chance that the categories were important. How

7 Note: Smoothed histograms do not always go to zero even when they should.


Model 14.11: SAT Model (ANOVA)

 1  Constants:
 2    N = 5, n1 = 25, n2 = 13, n3 = 4, n4 = 11, n5 = 5;
 3  Data:
 4    s1[n1], s2[n2], s3[n3], s4[n4], s5[n5];
 5  Variables:
 6    mu[N], theta, var.within, var.between, sig, fb, hs, s;
 7  Priors:
 8    theta ~ HalfNormal(10000);
 9    var.within ~ HalfNormal(10000);
10    var.between ~ HalfNormal(10000);
11    sig = sqrt(var.between);
12    for (hs, 1:N) {
13      mu[hs] ~ Normal(theta, sig);
14    }
15  Likelihood:
16    sig = sqrt(var.within);          // reusing sig OK in this syntax
17    for (s, 1:n1) {
18      s1[s] ~ Normal(mu[1], sig);
19    }
20    for (s, 1:n2) {
21      s2[s] ~ Normal(mu[2], sig);
22    }
23    for (s, 1:n3) {
24      s3[s] ~ Normal(mu[3], sig);
25    }
26    for (s, 1:n4) {
27      s4[s] ~ Normal(mu[4], sig);
28    }
29    for (s, 1:n5) {
30      s5[s] ~ Normal(mu[5], sig);
31    }
32  Extras:
33    fb = var.between/(var.within + var.between);
34  Monitored:
35    theta, var.within, var.between, fb;

could these two analyses be so at odds? The answer must be that there are assumptions in the traditional procedure that turn out not to be true with this dataset.

The procedure defined in Model 14.11 is very clearly what we are trying to examine

Page 224: Data, Uncertainty and Inference - causaScientia · 2019-05-31 · For deeds do die, however nobly done, And thoughts of men do as themselves decay, But wise words taught in numbers

CHAPTER 14. HIERARCHICAL MODELS 206

Figure 14.19: SAT: Fraction Variance-between

Figure 14.20: SAT: Variance Comparison (Within vs. Between)

in this analysis and its MCMC output is also clear: Even if the counties are truly different, there is so much "noise" in this problem that we would need a much larger amount of data to prove it.

The next example shows that this is not always the case.


14.10.3 Cars and Horsepower
Let us consider some data in which the categories should be more intuitive. Table 14.22 summarizes two columns from a published dataset listing several characteristics for a large number of automobiles. [52] The datapoints describe the horsepower rating for 393 cars distinguished as having four, six or eight cylinders (the categories). One would expect that, in this case, the category (with sample size n[i]) must surely be of some significance with horsepower increasing, at least on average, with number of cylinders.

Table 14.22: Automobile Horsepower Summary

# Cylinders    n      Min    Max    Mean     St. Dev.
4             202     46     115    74.5     15.0
6              83     72     165    101.5    14.3
8             108     90     230    158.5    27.9

Apart from the number of categories, the model is almost unchanged. The priors, however, were tweaked a bit. Since the datapoints are an order of magnitude smaller than in the SAT example, the scale factors were adjusted accordingly. Also, now that we have some idea of the shape of a marginal for variance, the HalfNormal(10000) priors were replaced with Gamma(3, 1000) priors. The Gamma mean = scale*shape (here, 3000) and, if shape ≥ 1, mode = scale*(shape - 1) (here, 2000).
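Those two numbers are easy to confirm; a two-line SciPy sketch (note that SciPy's gamma takes the shape as "a" and the scale as "scale"):

    from scipy import stats

    shape, scale = 3, 1000
    print(stats.gamma(a=shape, scale=scale).mean())   # 3000.0  (= scale*shape)
    print((shape - 1) * scale)                        # 2000    (mode, since shape >= 1)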

The full solution is shown below and the marginal for fb in Figure 14.21.

Table 14.23: HP ANOVA (solution)

Unknown        Estimate                95% Credible Interval
               MAP        Mean         Lower Limit    Upper Limit
fb             0.542      0.579        0.416          0.732
var.between    430        545          230            918
var.within     364        373          323            427
theta          112.5      112.9        86.1           140.0

The results in this example are more satisfying, probably because the categories really are important and, also, there are a lot more data. First, the marginal for fb is fairly symmetric and different from its prior which suggests not only that MCMC worked well but that there was enough information in the data to quantify fb to a useful extent. The same can be said for theta since its MAP and Mean estimates are nearly equal. The credible interval for fb is no doubt wider than one would prefer so there is still quite a lot of noise in this analysis. Nevertheless, this interval spans less than a third of its prior domain which is a big improvement over what we saw in Figure 14.19. Better yet, most of the posterior density for fb is greater than 0.5.8 This means the variability in the data is due mainly to the categories just as one would expect/hope given physical considerations.

8 82.6 percent, by counting trace rows


Figure 14.21: HP: Fraction Variance-between

It is difficult to overestimate the importance of ANOVA analyses. The number of applications is enormous. These two examples should demonstrate, however, that one must be careful in how the analysis is done and how the results are interpreted.

14.11 Predicting a Draw
In an earlier example, a large database of chess games was used to determine whether Queen-pawn (QP) openings resulted in a draw more often than King-pawn (KP) openings. That is the conventional wisdom and it turned out to be correct (see Fig. 1.8). A more interesting question is whether a draw can be predicted in advance. Clearly, the answer in general is No. If the outcome of a game could be predicted, then there would be no point in playing it! So, let us make the question a bit easier. Can one predict the probability of a draw in a chess game? Now, the answer is Yes.

We shall use the same source of 2014 data as before but consider all openings, not just KP and QP, subject to the same restrictions (Expert or above, etc.). The entire dataset contains N = 19,539 games but we shall analyze only the games from the first six months of the year (n = 9,332), leaving the rest unseen so that we can test the predictive power of our resulting model. Designing this model will depend on what we know a priori about chess in general and what factors, available from the data, might be useful in predicting whether a game would be decisive or a draw.


14.11.1 Relevant Factors
In addition to the result of the game {1–0, 0–1, 1/2–1/2}9, the dataset gives the Elo rating of each player which, as described earlier, is a reliable indicator of playing strength. It also gives the name of the opening. Since the latter is determined after only a few moves, long before the outcome of the game is decided, we shall consider it a legitimate "predictor".

From these data elements, we define three (hopefully) relevant factors:

delRating The magnitude (absolute value) of the difference in player rating. Given the data, delRating should be in the range [0, 800].10 We expect a draw to be more likely if the players are of equal strength.

hrW The higher-rated player had the White pieces {True (1), False (0)}. The player with the White pieces makes the first move, a known advantage (which tournament pairings are designed to cancel out). If the higher-rated player had the White pieces, this should make a draw less likely.

QP The opening was Queen-pawn {1, 0}.11 We know, from our earlier example, that this makes a draw more likely.

14.11.2 Model and Solution
We wish to predict a probability using a combination of the three factors (one ordinal, two nominal) described above. Generally, this combination is linear but a linear combination could output any value in [−∞,∞] whereas probability is limited to the range [0, 1]. In order to map the former to the latter, we utilize a link function, f(·). For this example, the link function that we shall use is the logit, defined as the (natural) log(odds). By inverting the logit, the corresponding probability, p, is easily recovered: logit → odds → p (see pg. 82). With P(D) the probability of a draw, we get the following expression, another example of a generalized linear model.

P(D) = invLogit(c0 + c1*delRating + c2*hrW + c3*QP)        (14.20)

The values of the coefficients are unknown but, if the model makes sense, then we should expect that c1, c2 < 0 and c3 > 0. We shall make all priors vague. Each datapoint is a single game so the likelihood of a draw is ∼ Bernoulli(P(D)). The model is presented below as Model m0123.

The only comment needed for this model is that care must be taken when describing the priors. We want them to be vague since we know nothing useful about the coefficients but the unknowns are of different scales so what is vague in one case might not be so in another. At the same time, it is risky to make priors "too vague" because that might allow

9 White won, Black won, Draw
10 In fact, the maximum was only 712.
11 ECO code D or E


Model 14.12: Predicting a Draw: m0123 (logit)

 1  Constants:
 2    N = 9332;                        // # of games
 3  Data:
 4    Draw[N], delRating[N], hrW[N], QP[N];
 5  Variables:
 6    p, c0, c1, c2, c3, i;
 7  Priors:
 8    c0 ~ Normal(0, 1);
 9    c1 ~ Normal(0, 0.01);
10    c2 ~ Normal(0, 1);
11    c3 ~ Normal(0, 1);
12  Likelihood:
13    for (i, 1:N) {
14      p = invLogit(c0 + c1*delRating[i] + c2*hrW[i] + c3*QP[i]);
15      Draw[i] ~ Bernoulli(p);
16    }
17  Extras:
18  Monitored:
19    c0, c1, c2, c3;

spurious local maxima in the posterior. This is especially true for unknowns like c0. This term describes the "Other" category for factors affecting the probability of a draw that we have not considered. It is always a good idea to give some thought to the range of priors and to try to make them reasonable.

The solution for m0123 is summarized in Table 14.24.

Table 14.24: Chess Draw (m0123 solution)

Unknown    Estimate                  95% Credible Interval
           MAP          Mean         Lower Limit     Upper Limit
c0         −0.0454      −0.0465      −0.138          0.0469
c1         −0.00420     −0.00420     −0.00466        −0.00373
c2         −0.239       −0.237       −0.325          −0.149
c3         0.182        0.183        0.0901          0.275

We shall not attempt to interpret these coefficients too exactly, just try to see whether they make sense. Parameters c1 and c2 are both negative; as delRating and hrW increase, the probability of a draw decreases. However, c3 is positive. All of this makes perfect sense.


14.11.3 Model Comparison
A question which often arises in logit models concerns the importance of the putative factors. We might think that they should be relevant but we could be mistaken. As we have seen (Sect. 10.5.3), the best way to compare models is by comparing their marginal (global) likelihoods. With this dataset, all eight combinations of included factors were examined. Table 14.25 summarizes the results.

Table 14.25: Chess Draw: Model Comparison

Model    Mean Estimate                                        log(MargLik)   Probability
         c0          c1           c2          c3
m0       −0.699                                               −5935.19       6 × 10⁻⁸⁴
m01      −0.0972     −0.00424                                 −5758.08       5 × 10⁻⁷
m02      −0.576                   −0.248                      −5922.37       2 × 10⁻⁷⁸
m03      −0.764                               0.188           −5930.03       1 × 10⁻⁸¹
m012     0.0141      −0.004218    −0.231                      −5747.92       0.0121
m013     −0.158      −0.004228                0.173           −5754.44       2 × 10⁻⁵
m023     −0.639                   −0.2568     0.200           −5916.22       1 × 10⁻⁷⁵
m0123    −0.0465     −0.00420     −0.237      0.183           −5743.52       0.9879

In all cases, the signs of the coefficients are just as expected. The model with all of the three factors included, m0123, is the overwhelming favorite with m012 a distant second. At least, that is what our results indicate but these are only relative probabilities. Model m0123 is far better than any of the alternatives but it could still be a poor model in the absolute sense. We need to know how well predictions made using m0123 fare when faced with ground truth.

14.11.4 Testing the Model Predictions
Testing the predictions of model m0123 is not entirely straightforward because what this model predicts is not an observable, just a probability. If we apply it to a single game, it could easily give the wrong answer even if it is a good model. Therefore, we have to apply it to many games all at once so that random errors get a chance to cancel out. The literature describes several ways to do this. The Hosmer-Lemeshow test, in particular, is very popular. However, we were lucky enough to have a lot of data and prudent enough to set aside half of it to serve as a test. Our test data, N = 10,207, should be very similar to our training set. There was no opportunity for any bias.

We still need to be careful. The delRating factor is almost continuous so it is very unlikely that any logit and, therefore, any P(D) value will appear more than once during our test. To get samples of reasonable size, we shall have to partition all N P(D) values into bins and make predictions for the number of draws per bin using a single, collective P(D) value for each bin. With model m0123, the minimum P(D) is zero and the maximum


(delRating = hrW = 0) is 0.5339. We want a big sample in each bin but we also want to see the model working, that is, see different results in different bins. Therefore, we shall define eleven bins with binwidth = 0.05 giving predicted P(D) in the range [0, 0.55]. Overall, the test procedure will be as follows:

1. With the new, unseen data, use the m0123 coefficients in Table 14.25 to compute the logit for each of the N = 10,207 games.

2. Convert each logit to probability and place that game into the corresponding bin, bj .

3. Predict the number of draws expected for bj, j = 1, 11.

4. Compare the observed number of draws in each bj to the predicted number.

To assess the quality of m0123, compare the results from this test to the results using model m0.
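A rough Python sketch of Steps 1 and 2, assuming the test-set columns are available as NumPy arrays, could be (the text numbers the bins 1–11; the sketch uses 0–10):

    import numpy as np

    def inv_logit(x):
        return 1.0 / (1.0 + np.exp(-x))

    # posterior-mean coefficients for m0123 (Table 14.25)
    c0, c1, c2, c3 = -0.0465, -0.00420, -0.237, 0.183

    def bin_games(del_rating, hrw, qp, binwidth=0.05):
        # Step 1: logit for each game; Step 2: P(D) and the corresponding bin index
        p_draw = inv_logit(c0 + c1 * del_rating + c2 * hrw + c3 * qp)
        return np.floor(p_draw / binwidth).astype(int)    # 0..10 for P(D) in [0, 0.55)

    # usage with hypothetical test-set arrays:
    # bins = bin_games(delRating_test, hrW_test, QP_test)
    # observed_draws_per_bin = np.bincount(bins[Draw_test == 1], minlength=11)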

In order to implement Step 3 of this test, we need to know the value of P(D) given bj. We can find these values using Bayes' Rule.

P(D|bj) = P(D) P(bj|D) / P(bj)        (14.21)

The denominator can best be understood by expanding it using the Sum and Product Rules.

P(bj) = P(bj, D + D̄) = P(bj, D) + P(bj, D̄)
      = P(D) P(bj|D) + (1 − P(D)) P(bj|D̄)        (14.22)

For P(D), we might as well use p = 0.332, the value from model m0. We do not need high precision since we are binning the answers. Everything else that we need can be tallied from the training data. The m0123 values for P(D|bj) are shown in Figure 14.22. Values range from near zero for bin[1] to more than 0.5 for bin[11]. This should make for a good test. Also, it is gratifying that the plot is nearly linear. It suggests that our factors were well chosen.
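In code, Equations 14.21–14.22 amount to a few tallies over the training set. A hedged sketch, reusing the hypothetical bin_games helper from the previous sketch, might be:

    import numpy as np

    p_D = 0.332                             # P(D) taken from model m0

    def p_draw_given_bin(train_bins, train_draw, n_bins=11):
        # tally P(bj|D) and P(bj|not-D), then apply Equations 14.21-14.22
        drawn = train_draw == 1
        p_bin_given_D    = np.bincount(train_bins[drawn],  minlength=n_bins) / drawn.sum()
        p_bin_given_notD = np.bincount(train_bins[~drawn], minlength=n_bins) / (~drawn).sum()
        p_bin = p_D * p_bin_given_D + (1 - p_D) * p_bin_given_notD
        return p_D * p_bin_given_D / p_bin  # empty bins would need guarding against 0/0

    # usage with hypothetical training arrays:
    # pDgb = p_draw_given_bin(bin_games(delRating_tr, hrW_tr, QP_tr), Draw_tr)
    # predicted_draws_per_bin = pDgb * np.bincount(test_bins, minlength=11)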

The results of all of these tests, summarized in Table 14.26, speak for themselves. It is apparent that model m0123 is far better at predicting the probability of a draw than is the featureless m0 model. The difference is even more obvious when the cumulative errors are plotted (see Fig. 14.23). Our modeling efforts have been a success.

14.11.5 DIY
The one thing our model has not explained, by construction, is the value of parameter c0. We were limited by the information in our database so we needed this category. However, it is not difficult to think of other factors that might reduce the need for this term and make the model even better. For instance, since we know the nature of the opening in each game, we could elaborate the QP factor into something more detailed, partitioning the set of openings into bins of their own perhaps (at the expense of additional unknowns).

This idea is left as an exercise for the reader.


Figure 14.22: Probability of Draw Given Bin[j]

Table 14.26: Chess Draw: Test Results

        Number of Draws
Bin     Observed     Predicted               Errors
                     m0        m0123         m0       m0123
 1        0            1          0            1         0
 2        5           33          7           28         2
 3       33          130         48           97        15
 4      110          235        120          125        10
 5      297          432        294          135         3
 6      487          547        453           60        34
 7      636          602        612           34        24
 8      679          578        677          101         2
 9      607          463        596          144        11
10      402          289        373          113        29
11      125           78        119           47         6
                                  Total =    885       136

Page 232: Data, Uncertainty and Inference - causaScientia · 2019-05-31 · For deeds do die, however nobly done, And thoughts of men do as themselves decay, But wise words taught in numbers

CHAPTER 14. HIERARCHICAL MODELS 214

Figure 14.23: Draw Test: Cumulative Errors (m0 vs. m0123)

14.12 Logit: Another Perspective
In the previous section, we used the logit (log odds) simply as a mathematical transform to map an arbitrary value in (−∞,∞) to the range (0, 1) so that it could then be interpreted as a probability. There is another, more elegant approach that is both intuitive and extensible to situations in which there are more than two possible values for the observable. However, in describing it, we shall continue to focus on the binomial (true/false) scenario to keep the math as simple as possible. In fact, we shall reanalyze the chess data from the last example but in a new way.

14.12.1 A Latent Variable
The new approach begins by hypothesizing/pretending that there is a continuous, latent (i.e., hidden) quantity in this chess-draw analysis. Here, "latent" means that the quantity is intrinsically unobservable. We cannot see it directly; we can see only its effects. It is useful to think of this latent variable as a score, s, that is related to the probability of a draw so that, as s increases, the probability of a draw increases. Thus, there will be some threshold, t, such that the result of game[i] will be a draw if and only if si > t.

We do not, of course, imagine that this scoring function or the associated threshold are real. They are just a mathematical device adopted for the sake of trying to make sense of


the data. We are constructing a model and all that matters, in the end, is that the model

• does not contradict prior information

• gives reasonable answers

• provides some insight into the problem

The scoring function, S(s), is almost always a linear function of some covariates plus a term, f(z), for residual uncertainty where f ∼ Logistic(0, 1) or f ∼ Normal(0, 1). Here, we shall adopt the Logistic distribution12 with PDF and CDF as follows:

fL(z) = (1/b) exp(−z) / [1 + exp(−z)]² ,   z = (s − µ)/b        (14.23)

FL(z) = 1 / [1 + exp(−z)]        (14.24)

If we compute some metric, mui, to describe the probability of observing a draw in game[i], then our scoring function for that game, S(s)i, becomes

S(s)i = mui + fL(0, 1) = fL(mui, 1)        (14.25)

where f(·) is the standard µ = 0, b = 1 Logistic distribution. Given a drawing threshold, t, common to all games, P(draw in game[i]), pi, is then

pi = 1 − 1/[1 + exp(−(t − mui))]        (14.26)

as shown in Figure 14.24. Thus, when mui is much lower than t, a draw is unlikely. Finally, given log(odds) = λ, we have

p = odds/(1 + odds) = exp(λ)/[1 + exp(λ)] = 1/[1 + exp(−λ)]        (14.27)

Comparing this to Equation 14.24 tells us how the logit got its name (cf. probit).
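A tiny Python sketch makes the relationship of Equation 14.27 explicit, with invLogit playing the role of the Logistic(0, 1) CDF and acting as the inverse of the logit:

    import numpy as np

    def logit(p):
        return np.log(p / (1.0 - p))        # log odds

    def inv_logit(lam):
        return 1.0 / (1.0 + np.exp(-lam))   # Equation 14.27, the Logistic(0, 1) CDF

    p = 0.332
    print(np.isclose(inv_logit(logit(p)), p))   # True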

14.12.2 A New Model
We can now use Equation 14.26 to develop a model for our chess-draw training set just as we did in the previous example. However, if we just use Model 14.12, line 14 as before, we encounter a big problem. Looking at Figure 14.24, it should be apparent that, if we add any arbitrary constant, α, to parameter c0, then P(Draw) will be unchanged so long

12 The Normal distribution is preferred whenever one is doing Gibbs sampling (Sect. 11.3) because the full-conditionals can then be written in closed form.


Figure 14.24: Logistic(µi, 1) with Draw Threshold, t

as we add α to the threshold as well. All this does is slide the whole PDF curve along the s axis. Letting both c0 and t be free parameters will result in an infinite number of {c0, t} pairs, all of which give the same probability, p. This is one kind of what is called an identifiability problem. Were we to ignore this problem, our MCMC process would never converge since almost any {c0, t} pair would be just as good as any other.

This problem can be easily solved in two ways. We could set either c0 or t to zero in advance. In what follows, we shall do both and compare the results. Since we have already drawn Figure 14.24, we shall start by setting c0 to zero and keeping t a free parameter. Also, as an Extra, we shall determine how well our model does in predicting a draw in each game. All of this is specified in Model 14.13 with the solution shown in Table 14.27.

Table 14.27: Chess Draw (m123 solution)

Unknown    Estimate                  95% Credible Interval
           MAP          Mean         Lower Limit     Upper Limit
c1         −0.00420     −0.00420     −0.00467        −0.00374
c2         −0.235       −0.238       −0.325          −0.149
c3         0.182        0.183        0.0903          0.275
t          0.0460       0.0466       −0.0456         0.138

It is instructive to compare Table 14.27 above to our original solution in Table 14.24.

Page 235: Data, Uncertainty and Inference - causaScientia · 2019-05-31 · For deeds do die, however nobly done, And thoughts of men do as themselves decay, But wise words taught in numbers

CHAPTER 14. HIERARCHICAL MODELS 217

Model 14.13: Predicting a Draw: m123 (logit2)

 1  Constants:
 2    N = 9332;                        // # of games
 3  Data:
 4    Draw[N], delRating[N], hrW[N], QP[N];
 5  Variables:
 6    p, c1, c2, c3, mu[N], t, good_[N], pct_good, i;
 7  Priors:
 8    t ~ Normal(0, 1);
 9    c1 ~ Normal(0, 0.01);
10    c2 ~ Normal(0, 1);
11    c3 ~ Normal(0, 1);
12  Likelihood:
13    for (i, 1:N) {
14      mu[i] = c1*delRating[i] + c2*hrW[i] + c3*QP[i];
15      p = 1 - invLogit(t - mu[i]);   // area above t
16      Draw[i] ~ Bernoulli(p);
17    }
18  Extras:
19    for (i, 1:N) {
20      // two ways to be correct
21      good_[i] = ((Draw[i] == 0) && (mu[i] < t)) ||
22                 ((Draw[i] == 1) && (mu[i] >= t));
23    }
24    pct_good = 100*mean(good_[]);
25  Monitored:
26    c1, c2, c3, t, pct_good;

The covariate coefficients are the same, apart from a little noise, and t is the negative of c0 (due to a change in definition) so everything we said about their values and the quality of the solution, on average, remains the same as before.

Note that, at each parameter state, there is a different value for t so P(Draw) for game[i] changes with each MCMC iteration. There were, in this run, 200,060 iterations and we tallied the percent correct predictions for each as an Extra. This percentage is not really a continuous quantity since the sample of games is finite so we display the range of correct predictions as a frequency histogram (Figure 14.25).

Our model gets the right answer about two-thirds of the time, good but not great. It is clear that, even given valid covariates, there is a lot of residual uncertainty.


Figure 14.25: Success of Model m123 With Training Set

14.12.3 The Alternate Choice
If we set t = 0 as a constant and restore parameter c0 to the model with the same prior as in m0123, we get essentially the same results (Table 14.28) as in our original analysis (Table 14.24).

Table 14.28: Chess Draw (m0123 solution #2)

Unknown    Estimate                  95% Credible Interval
           MAP          Mean         Lower Limit     Upper Limit
c0         −0.0470      −0.0465      −0.138          0.0455
c1         −0.00420     −0.00420     −0.00467        −0.00374
c2         −0.237       −0.238       −0.327          −0.150
c3         0.183        0.183        0.0897          0.275

This is just what we would expect. There is always a little Monte Carlo noise but, in any good run with enough iterations, we should expect that precision (and reproducibility) will be high.

These last two examples illustrate the usual manner in which any discrete binary choice may be modeled. However, many datasets describe situations in which there are multiple discrete choices. The next example will describe how our logit framework can be extended to address that possibility.


14.13 Modeling Several Discrete Choices
Discrete, categorical data are very common, especially in survey data where, for instance, we often find people being asked to choose from a list such as

Strongly disagree, Disagree, Neutral, Agree, Strongly agree

typically recorded as [1, . . . , 5] or [0, . . . , 4] to make them suitable for later calculations. Unfortunately, computers do not understand nominal variables and will assume that the difference between 'Neutral' and 'Strongly agree' is twice as large as between 'Neutral' and 'Agree'. Obviously, this is rarely true so we need some kind of transformation that does make sense. As before, we shall hypothesize that there is a latent variable measuring the quantity of interest, extending the idea illustrated in Figure 14.24 so that there are more than two areas under the Logistic PDF representing probabilities.13 Each area will correspond to one of the discrete categories.

To illustrate this, we shall continue to use our chess dataset but, instead of asking only whether the game will be a draw, we shall try to model the outcome for the player with the white pieces (who moves first). There are three categories, {Loss, Draw, Win}. These are ordinal (intrinsically ordered) rather than nominal (cf. Sect. 3.1). The latent variable, s, is intended to describe the chess prowess of White which we cannot observe. What we can observe is White's result in each game.

The likelihood for this problem is Multinomial(p[], N), here trinomial, where p[] is a vector of three probabilities and N the sample size. Since we observe only one game at a time, N = 1 and we have the special case of a Categorical(p[]) likelihood (see Sect. 6.6.2).

Since there are three categories, there are two thresholds for s as in Figure 14.26. We are trying to model the outcome for White so parameter hrW is not appropriate. Instead, parameter delRating is now signed. Everything else is analogous to what we described in the previous example so our pseudocode for this problem, Model 14.14, should look familiar; a small numerical sketch of the three-way split follows the notes below. Note:

• With both thresholds free parameters, there can be no c0 parameter (lines 3, 16).

• Standard Multinomial variates are 0, 1, 2, . . . so, for consistency, the indices of p[] should be zero-based (lines 17–19) provided the software permits it.
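The Python sketch below (illustrative values only, close to the posterior means reported later) shows how the two thresholds split the Logistic area into the three outcome probabilities, the same computation as in the model's likelihood:

    import numpy as np

    def inv_logit(x):
        return 1.0 / (1.0 + np.exp(-x))

    def outcome_probs(mu, t1, t2):
        # split the Logistic(mu, 1) area at t1 < t2 into P(Loss), P(Draw), P(Win)
        p_loss = inv_logit(t1 - mu)          # area below t1
        p_win  = 1.0 - inv_logit(t2 - mu)    # area above t2
        p_draw = 1.0 - p_loss - p_win        # whatever is left in the middle
        return p_loss, p_draw, p_win

    # e.g., White rated 100 points above Black, illustrative coefficients
    print(outcome_probs(mu=0.00604 * 100, t1=-1.122, t2=0.621))   # roughly (0.15, 0.35, 0.50)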

We shall model the training set with and without parameter QP since its utility in this setting is not certain. Comparative results (with posterior means) are shown in Table 14.29 (cf. Table 14.25).

In this case, adding more information in the form of parameter QP made the model worse! As discussed in the NHL example above, there is a cost associated with prior uncertainty whether this means too many parameters and/or very vague priors. Here, the cost of doubling the dimension of the parameter space was not worth it. The odds of the

13 We could do the same with a Normal distribution.


Figure 14.26: Logistic(µi, 1) with Two Outcome Thresholds, t[]

Table 14.29: Chess Score for White: Model Comparison

Model    Mean Estimate                                     log(MargLik)   Probability
         c1           c2          t[1]        t[2]
m1       0.00604                  −1.122      0.621        −8901.61       0.933
m12      0.00604      −0.0438     −1.138      0.606        −8904.25       0.067

two-parameter model, m12, were 14:1 against. An odds ratio of at least 5:1 is considered meaningful so, for this problem, model m1 would be considered better (more credible).

With this training set, model m1 achieved 52-percent correct predictions. This is worse than the 67 percent that we saw when trying to predict draws but that was a much easier task. Here, there are more ways to go wrong. However, for a real test, we need some new data just as before.

14.13.1 Goodness-of-fit
The literature describes several ways to assess goodness-of-fit for ordered choices. [25][34] This suggests, with some justification, that none of them are completely satisfactory so we shall simply use the mean parameter estimates in Table 14.29 to predict game outcomes for our test dataset (N = 10,207) from Section 14.11.4. The nine tallies are presented in Table 14.30 with marginal totals shown in bold.


Model 14.14: Outcome of a Chess Game (ordered logit)

 1  Constants:
 2    N = 9332;                          // # of games
 3  Data:
 4    delRating[N], QP[N], result[N];
 5  Variables:
 6    p[3], c1, c2, t[2], mu[N], delta, scale, good_[N], pct_good, i;
 7  Priors:
 8    scale ~ Jeffreys(0.1, 10);
 9    delta ~ Exponential(scale);
10    t[1] ~ Normal(0, scale);
11    t[2] = t[1] + delta;               // > t[1]
12    c1 ~ Normal(0, 0.01);
13    c2 ~ Normal(0, 1);
14  Likelihood:
15    for (i, 1:N) {
16      mu[i] = c1*delRating[i] + c2*QP[i];
17      p[0] = invLogit(t[1] - mu[i]);       // area below t[1]
18      p[2] = 1 - invLogit(t[2] - mu[i]);   // area above t[2]
19      p[1] = 1 - p[0] - p[2];
20      result[i] ~ Categorical(p[]);
21    }
22  Extras:
23    for (i, 1:N) {
24      good_[i] = ((result[i] == 0) && (mu[i] < t[1])) ||
25                 ((result[i] == 1) && (mu[i] >= t[1]) && (mu[i] < t[2])) ||
26                 ((result[i] == 2) && (mu[i] >= t[2]));
27    }
28    pct_good = 100*mean(good_[]);
29  Monitored:
30    c1, c2, t[], pct_good;

In this table, correct predictions lie along the top-left to bottom-right diagonal and we have 5412/10207 of those for a score of 53 percent, similar to the training set. More interesting is the fact that the extremes of this array are small. That is, a loss is not often predicted to be a win nor a win predicted to be a loss. We saw much the same when we were modeling draws (Table 14.26). This array also seems to show that our model is less balanced than the observations. Losses are under-predicted and draws over-predicted while wins seem to be predicted fairly well. Since we have a model with only a single explanatory covariate we should, perhaps, not have expected too much.


Table 14.30: Chess Outcome Tallies

                          Predicted
                 Loss     Draw      Win      Total
  Observed
    Loss         1182     1469       330      2981
    Draw          464     2040       877      3381
    Win           181     1474      2190      3845
    Total        1827     4983      3397     10207

In practice, logistic models such as this are not typically used to make predictions. Rather, they are used to assess the relative importance of the covariates. In this case, delRating is important as it should be. That is what chess ratings are for, after all. The covariate QP is not important here presumably because it adds little information compared to the ratings of the two players. In fact, predictions from model m12 were slightly worse.

14.14 Summary

There is a finite amount of information available in any dataset. Given an assumed model, this information gets partitioned into the dimensions of the corresponding hyperspace. The partitioning is almost always uneven because parameters are not equally informative with respect to the data. The final “concentration” of information in a hyperspace is described by the posterior which typically occupies only a tiny volume in the complete hyperspace.14

14 In MCMC, the burn-in phase is used to find this tiny sub-volume.


Chapter 15

Discrete Mixtures

PREVIOUS chapters discuss a wide variety of problems and some models used to analyze them so you might be wondering why discrete mixture models deserve a special chapter of their own. The reason is that some specific difficulties may arise when trying to carry out MCMC with these models.

Discrete mixtures are models in which the likelihood is a weighted combination of two or more distributions. Mixtures are the intuitive solution whenever the data are samples of meaningful subpopulations, e.g., in clustering/classification problems.

Discrete mixtures fall into two classes depending upon whether the components are of the same analytical form, differing only in their parameters, or of distinct forms. We shall term these homogeneous and heterogeneous mixtures, respectively. The former give rise to some very serious identifiability problems. Here, we present examples of both kinds to illustrate what can happen.

15.1 Heterogeneous Mixtures

A common heterogeneous mixture occurs with count data, x, in which some counts have extra-large values due to the nature of the data being sampled. One such mixture is the zero-inflated Poisson (ZIP) distribution (15.1).

ZIP(xi) = wt + (1 − wt) exp(−λ)                ; xi = 0
        = (1 − wt) λ^xi exp(−λ) / xi!          ; xi ≥ 1          (15.1)

In this distribution, there are more zero counts than a simple Poisson would expect.
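A direct transcription of (15.1) is handy for sanity checks. A short Python sketch (the parameter values below are arbitrary, chosen only to show that the PMF behaves as advertised and sums to one):

  import math

  def zip_pmf(x, lam, wt):
      # zero-inflated Poisson PMF of expression (15.1)
      if x == 0:
          return wt + (1.0 - wt) * math.exp(-lam)
      return (1.0 - wt) * lam**x * math.exp(-lam) / math.factorial(x)

  print(zip_pmf(0, 7.6, 0.57))                             # the inflated zero probability
  print(sum(zip_pmf(x, 7.6, 0.57) for x in range(200)))    # ~1.0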

15.1.1 A Fishy Story

N = 250 campers were surveyed to find out how many fish they caught. [33] The results are shown in Figure 15.1.


Figure 15.1: Fish Caught: Raw Data (N = 250)

The reason for the excess in the zero count is that one of the questions asked in the survey was, “Did you go fishing?” and a lot of people replied, “No.” Although this prior information (# non-fishers) is available, we shall pretend that it is not. Our dataset is thus a mixture of two subpopulations but we observe only one undifferentiated union.

The ZIP distribution is not usually included in MCMC software and how one constructs it in the model pseudocode is, as noted in earlier examples, software-dependent. Here, we employ Model 15.1 on the sorted data.

There is nothing sophisticated in this model nor were there any MCMC problems. The prior for lambda is just a guess and the weight factor (fraction of non-fishers, wt) is totally vague. Since the data are sorted, we can split the likelihood into two parts avoiding a Boolean expression. The solution is shown in Table 15.1 and Figure 15.2.

Table 15.1: Fish Catch (solution)

                   Estimate               95% Credible Interval
  Unknown     MAP        Mean         Lower Limit    Upper Limit
  lambda      7.6119     7.6200       7.1036         8.1455
  wt          0.5678     0.5674       0.5070         0.6288

It is worth mentioning that, in discrete mixtures, the weight factor(s) are almost always more poorly determined than other parameters, especially when there are more than two components. We shall see this even more clearly with homogeneous mixtures.


Model 15.1: Fish Catch (zero-inflated Poisson)

 1  Constants:
 2    N = 250,          // # of points
 3    Nzeroes = 142;
 4  Data:
 5    count[N];
 6  Variables:
 7    lambda, wt, temp, i;
 8  Priors:
 9    lambda ~ Exponential(5);
10    wt ~ Uniform(0, 1);
11  Likelihood:
12    for (i, 1:Nzeroes) {
13      temp = wt + (1 - wt)*exp(-lambda);
14      count[i] ~ Generic(log(temp), 0, 100);
15    }
16    for (i, Nzeroes+1:N) {
17      temp = count[i];
18      count[i] ~ Generic(log(1 - wt) + temp*log(lambda)
19                         - logFactorial(temp) - lambda, 0, 100);
20    }
21  Extras:
22  Monitored:
23    lambda, wt;

Figure 15.2: Marginals for ZIP Model


The observed data actually do not fit this ZIP model very well. The latter predicts that all observed zeroes came from non-fishers and this is not true. Moreover, for N = 250, a Poisson(7.62) distribution predicts just one datum = 1 and no data above 16 and this is not true either. Obviously, there is something going on here that the ZIP model does not capture. Still, it does show that one can carry out MCMC for a heterogeneous mixture in a straightforward fashion, usually without any difficulties.
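The numbers behind these statements are easy to check with the posterior means from Table 15.1. A sketch (scipy is used only for the Poisson PMF and tail probability):

  from scipy.stats import poisson

  lam, wt, N = 7.62, 0.567, 250
  print(N * (1 - wt) * poisson.pmf(0, lam))   # ~0.05 expected zeroes from fishers who caught nothing
  print(N * poisson.pmf(1, lam))              # ~0.9 expected counts equal to 1
  print(N * poisson.sf(16, lam))              # ~0.5 expected counts above 16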

15.2 Homogeneous Mixtures

Very often, component distributions of a mixture have the same form. The most common, a Gaussian mixture, is given by

data ∼ Σ (k = 1 to n) wk Normal(µk, σk)          (15.2)

where wk is the weight for (fraction due to) subpopulation k.

With any homogeneous mixture, MCMC becomes extremely difficult for reasons that are best explained using an example.
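Evaluating (15.2) one datum at a time is all that the naive likelihood of Model 15.2 below does. A minimal Python sketch of the same computation (the parameter values are hypothetical, chosen to be roughly in the neighborhood of the M-L solution quoted below in (15.3)):

  import math

  def normal_pdf(x, mu, sigma):
      return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

  def mixture_loglike(data, w, mu, sigma):
      # log-likelihood of data under mixture (15.2); w, mu, sigma are per-component lists
      return sum(math.log(sum(wk * normal_pdf(x, mk, sk)
                              for wk, mk, sk in zip(w, mu, sigma)))
                 for x in data)

  print(mixture_loglike([350, 420, 500], [0.55, 0.45], [370, 482], [53.6, 92.6]))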

15.2.1 Salaries

We shall consider a very simple dataset and model. The data (see Fig. 15.3) are academic salaries (in hundreds of dollars) obtained from a 1994 survey of 1,161 institutions in the United States. [38] This survey covered all professorial ranks and it is evident that the distribution is bimodal. We shall assume no other prior information including the reason for the bimodality although this is not hard to guess. We can readily believe that we are looking at two populations merged into one dataset (possibly more than two).

Since we have no information suggesting an alternative, we shall model these data with a binary mixture of Gaussians. In fact, this dataset has been analyzed before using this mixture and the M-L solution is given on the Regress+ webpage as

salaries ∼ 0.550 Normal(370, 53.6) + 0.450 Normal(482, 92.6) (15.3)

which seems reasonably consistent with the data histogram. If we express this model in pseudocode language in the most obvious way, we get Model 15.2.

Everything in this model should look familiar. The priors warrant little comment. Note that both Normal components have the same priors (lines 9–12). Also, the prior for the weights is vague but not totally so. With mixtures having vague scale priors, it is risky to allow a weight to get close to zero or one. In that case, there is a good chance that one of the components will vanish and all of the data go into another component. This becomes more likely if there is a lot of overlap in the components as there is in this dataset. Finally, with this syntax we again utilize a Generic distribution (line 24) that returns log(pdf).


Figure 15.3: Salaries: Raw Data (N = 1,161)

When we run this model in the usual way, we get a MAP solution of

salaries ∼ 0.449 Normal(483, 91.8) + 0.551 Normal(370, 53.4) (15.4)

This is, for all practical purposes, the same as the M-L solution—not surprising given the vague priors. However, the component labels are switched; component #1 is now #2 and vice versa. How did that happen?

It happened because there is nothing in Model 15.2 to connect either label to the rest of the model. That is, the model includes no reason why, for example, component #1 must “go with” a mean of 370, etc.—a kind of identifiability problem. The value of the posterior will be exactly the same regardless of the labeling. [35] We can actually see this happening if we look at a segment of the trace (Table 15.2).
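The symmetry is easy to verify numerically: swapping the component labels leaves the mixture density, and hence the likelihood and posterior (given exchangeable priors), unchanged. A quick Python check with an arbitrary salary value and the (hypothetical) M-L parameters:

  from scipy.stats import norm

  def mixture_pdf(x, w, mu, sigma):
      return sum(wk * norm.pdf(x, mk, sk) for wk, mk, sk in zip(w, mu, sigma))

  x = 410.0
  print(mixture_pdf(x, [0.55, 0.45], [370, 482], [53.6, 92.6]))
  print(mixture_pdf(x, [0.45, 0.55], [482, 370], [92.6, 53.6]))   # identical value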

Table 15.2: Salaries (model1): Trace Segment

  logPosterior    mean[1]    mean[2]    sigma[1]    sigma[2]
  −6866.26        374.751    507.773    56.2277     85.0466
  −6868.31        489.013    381.279    90.4163     58.4084
    ...             ...        ...        ...         ...
  −6864.78        477.161    366.28     91.8927     52.4031
  −6865.43        371.503    479.088    51.4627     96.2196


Model 15.2: Salaries: Model 1 (naive mixture)

 1  Constants:
 2    N = 1161,               // # of points
 3    sqrt2Pi = 2.506628;
 4  Data:
 5    salary[N];
 6  Variables:
 7    pdf[2][N], mean[2], sigma[2], wt[2], norm[2], i, j;
 8  Priors:
 9    for (j, 1:2) {
10      mean[j] ~ Uniform(200, 600);
11      sigma[j] ~ HalfNormal(50);
12    }
13    wt[1] ~ Uniform(0.25, 0.75);
14    wt[2] = 1 - wt[1];
15  Likelihood:
16    for (j, 1:2) {
17      norm[j] = 1/(sigma[j]*sqrt2Pi);
18    }
19    for (i, 1:N) {
20      for (j, 1:2) {
21        pdf[j][i] =
22          norm[j]*exp(-0.5*((salary[i] - mean[j])/sigma[j])^2);
23      }
24      salary[i] ~ Generic(log(wt[1]*pdf[1][i] + wt[2]*pdf[2][i]), 0, 1);
25    }
26  Extras:
27  Monitored:
28    mean[], sigma[], wt[];


This label switching can happen even with M-L procedures unless the initial parameter values are sufficiently close to a global optimum.

If the MAP solution were the only reason for doing the analysis then this labeling issue would not matter. However, we usually want to know more. Mean parameters and credible intervals are almost always desired. These are obtained from the marginals. For this particular MCMC run, the marginals for mean[] and sigma[] are shown in Figure 15.4.

Figure 15.4: Marginals for mean[] and sigma[] from Model 1

Now the problem is evident. The “marginal” for any component parameter is a union of the marginals obtained over all permutations of the labels contained in the trace. With n components, there are n! possible permutations so the situation gets very bad very quickly.

One need hardly add that any quantity computed by averaging over the trace will be meaningless. Here, for example, we find the mean estimates for component #1 to be mean[1] = 422 and sigma[1] = 71.4, values that do not describe anything about this component. They are simply the average over permutations.

OK. With MCMC, homogeneous mixtures exhibit an identifiability problem. So, how do we fix it?

You will not be surprised to learn that label switching is an active area of research. The literature contains several suggestions. We shall discuss two of them, an up-front solution and a post-processing solution.


15.2.2 Breaking the Symmetry

Since the symmetry of the posterior to label switching causes the problem, one obvious solution is to make sure that the model distinguishes the labels up front so that confusion does not arise. In this case, we could, for instance, insist that mean[2] be larger than mean[1] at every iteration by writing their (Model 2) priors as

mean[1] ∼ Uniform(200, 600); mean[2] ∼ Uniform(mean[1], 600); (15.5)

This change yields the distinct marginals shown in Figure 15.5 (cf. Fig. 15.4). The new solution is shown in Table 15.3. Figure 15.6 shows the data, model and weighted components.

Figure 15.5: Marginals for mean[] and sigma[] from Model 2

This is a good result and suggests a general strategy for mixture models. Unfortunately, prior constraints do not always work. [35] Also, they can induce a bias. If, as here, mean[2] is constrained to be greater than mean[1] at every iteration then, depending on the positions and shapes of the true marginals, the output marginal for mean[2] might not extend to the left as far as it should. Indeed, its marginal above is not quite symmetric (as the Central Limit Theorem would require). All of this suggests that alternate strategies are desirable.

One alternative that comes readily to mind is to make the priors less vague by utilizing better prior information but, in many cases, such information is not available. We are left with trying some theoretical approach that performs some sort of manipulation of the trace in an effort to rearrange it so that labels are consistent. These post-processing procedures are termed relabeling.
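The simplest conceivable relabeling rule is to reorder the components of every trace sample so that, say, the means are increasing. A Python sketch of that idea (this is only an illustration of what relabeling means; it is not the NORMLH algorithm discussed below, and simple ordering rules can fail when components overlap heavily):

  def relabel_by_mean(trace):
      # trace: list of draws; each draw is a dict of per-component parameter lists
      out = []
      for draw in trace:
          order = sorted(range(len(draw["mean"])), key=lambda k: draw["mean"][k])
          out.append({name: [vals[k] for k in order] for name, vals in draw.items()})
      return out

  # two (rounded) draws from Table 15.2, with the labels switched between them
  trace = [{"mean": [374.8, 507.8], "sigma": [56.2, 85.0]},
           {"mean": [489.0, 381.3], "sigma": [90.4, 58.4]}]
  print(relabel_by_mean(trace))   # component #1 is now always the lower-mean component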


Table 15.3: Salaries (model2 solution)

                   Estimate               95% Credible Interval
  Unknown     MAP        Mean         Lower Limit    Upper Limit
  mean[1]     370.1      370.9        361.6          380.2
  mean[2]     483.1      483.3        460.3          506.8
  sigma[1]     53.5       54.0         48.0           60.3
  sigma[2]     91.7       92.1         83.2          100.8
  wt[1]        0.553      0.556        0.439          0.674
  wt[2]        0.447      0.444        0.326          0.561

Figure 15.6: Salaries: Raw Data vs. Model 2 (mean estimates); N = 1,161

15.2.3 Relabeling

Relabeling computations are not part of MCMC. They are carried out later using the trace, and often the model as well, as input. Therefore, they are done separately, most often with completely different software. [51]

We shall illustrate one approach to relabeling using the NORMLH algorithm. [68] NORMLH inputs the trace alone and knows nothing about the model. Nevertheless, it can often provide a good result. Accordingly, this dataset was re-run, changing the likelihood so that it followed expression 15.2 exactly and leaving the priors the same as before (Model 3). The raw marginals were indistinguishable from those shown in Figure 15.4 but, after relabeling, they appeared as shown in Figure 15.7. These are almost the same as those from Model 2 as is the solution (Table 15.4).

Figure 15.7: Marginals for mean[] and sigma[] from Model 3 after Relabeling

Table 15.4: Salaries (model3 solution)

                   Estimate               95% Credible Interval
  Unknown     MAP        Mean         Lower Limit    Upper Limit
  mean[1]     369.9      370.8        361.7          380.1
  mean[2]     481.3      483.1        460.6          507.1
  sigma[1]     53.4       53.9         47.9           60.1
  sigma[2]     91.9       92.1         83.3          100.9
  wt[1]       . . .       0.524       . . .          . . .
  wt[2]       . . .       0.476       . . .          . . .

In Table 15.4, only mean weights are shown because relabeling changes the original weights and, in doing so, computes only mean values. Of course, without relabeling, the weights are meaningless unless the symmetry is broken (as in Model 2). Here, the true answers are unknown so we cannot make that goodness-of-fit comparison. Since the parameters from Model 3 are nearly the same as from Model 2, Figure 15.6 is still our best measure of goodness (and it looks quite acceptable). Finally, it is possible to use the relabeled trace to compute the probability that datum[i] came from subpopulation[k]. Since we do not have ground truth for this example, we shall defer that topic for now.


15.2.4 Seeing Through Fisher’s Irises

We conclude our discussion of discrete mixtures by analyzing a famous dataset, namely Fisher’s iris data. This dataset contains N = 150 vectors of four characteristics of irises of three different species (Table 15.5). There are 50 of each species.

Table 15.5: Partial Iris Data

  Sepal length   Sepal width   Petal length   Petal width   Species
      5.1            3.5           1.4            0.2       I. setosa
      4.9            3.0           1.4            0.2       I. setosa
      4.7            3.2           1.3            0.2       I. setosa
      ...            ...           ...            ...       ...
      7.0            3.2           4.7            1.4       I. versicolor
      6.4            3.2           4.5            1.5       I. versicolor
      6.9            3.1           4.9            1.5       I. versicolor
      ...            ...           ...            ...       ...
      6.3            3.3           6.0            2.5       I. virginica
      5.8            2.7           5.1            1.9       I. virginica
      7.1            3.0           5.9            2.1       I. virginica
      ...            ...           ...            ...       ...

The problem

These data have been used in the literature many times to test classification/clustering algorithms. Since ground truth is known, misclassifications can be readily identified but, for the sake of illustration, we are going to pretend that we know that there are three species but that individual datum assignments are unknown.

Classification algorithms (e.g., k-means) partition the data, seeking only to assign each datapoint to one of the identified clusters/classes. Here, we shall tackle a much harder problem—attempting to find a probability distribution and weight for each cluster (species). Given this much, we can compute the probability that each datum comes from each cluster. Hence, we get the assignments as a bonus, at least for those points that likely belong to just one cluster.

There are four variables in this problem requiring a four-dimensional distribution if we proceed in the most intuitive manner. However, prior information (here, just common sense) tells us that these four characteristics are almost certainly correlated. Therefore, not only are four variables likely superfluous to some degree but their correlation makes the observed variation greater than it “should” be (see Eq. 6.39). It behooves us to try and reduce the dimensionality of this problem if possible. This can be done but the explanation necessitates a slight digression.


Principal components

The total variation in our dataset can be quantified by its covariance matrix, cov. Here,

  cov =  |  0.685694   −0.042434    1.27432     0.516271 |
         | −0.042434    0.189979   −0.329656   −0.121639 |
         |  1.27432    −0.329656    3.11628     1.29561   |
         |  0.516271   −0.121639    1.29561     0.581006  |

There is strong correlation between these four characteristics (relatively large covariances). However, by a judicious rotation in parameter space, we can obtain four alternate variables called eigenvectors. These eigenvectors are termed principal components of the data.

Principal-component analysis is not part of MCMC so we shall omit the details and simply assert that each eigenvector is a fixed linear combination of the original variables. For this dataset, the eigenvectors are

PC1 = −0.361387x1 + 0.0845225x2 − 0.856671x3 − 0.358289x4

PC2 = 0.656589x1 + 0.730161x2 − 0.173373x3 − 0.075481x4

PC3 = −0.58203x1 + 0.597911x2 + 0.0762361x3 + 0.545831x4

PC4 = −0.315487x1 + 0.319723x2 + 0.479839x3 − 0.753657x4

where xk denotes the kth original characteristic.

It is easy to see why these are called principal components. If we use them instead of the original variables, applying the formulas above to the datapoints, then the covariance matrix of the new dataset is

  covEv =  | 4.22824   0          0           0         |
           | 0         0.242671   0           0         |
           | 0         0          0.0782095   0         |
           | 0         0          0           0.0238351 |

That covEv is diagonal means that the principal components are uncorrelated with each other (always true). These diagonal elements are called eigenvalues and, when normalized, they describe the amount of variance “explained” by the respective eigenvectors, in this case {0.924, 0.053, 0.017, 0.005} (to three sig. fig.). Thus, PC1 and PC2 describe almost 98 percent of the observed variation. If we carry out MCMC using these two principal components alone, we should be able to get nearly as good a result as using the original four (somewhat superfluous) variables. This would be an enormous simplification since it cuts the number of variables in half.
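For readers who want to reproduce this digression, the whole computation takes only a few lines of linear algebra. A Python/NumPy sketch (scikit-learn is used here only as a convenient source of the 150 × 4 iris array; eigenvector signs are arbitrary and may differ from those printed above):

  import numpy as np
  from sklearn.datasets import load_iris

  X = load_iris().data                     # 150 rows, 4 characteristics
  cov = np.cov(X, rowvar=False)            # the 4 x 4 covariance matrix above

  eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
  order = np.argsort(eigvals)[::-1]        # largest variance first
  eigvals, eigvecs = eigvals[order], eigvecs[:, order]

  print(eigvals / eigvals.sum())           # ~{0.924, 0.053, 0.017, 0.005}
  pc12 = X @ eigvecs[:, :2]                # the new 2-D dataset (PC1, PC2)
  print(np.cov(pc12, rowvar=False))        # ~diagonal, as claimed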

Figure 15.8 suggests that this might be a good strategy. It shows a plot of the new dataset (now 2-D) color-coded by iris species (since we know the true answer). The three species are not completely distinct in this picture but the overlap is slight and occurs only between two of the species so we can expect that MCMC should be able to find a good three-component mixture of a suitable bivariate distribution.


Figure 15.8: Iris Data vs. First Two Principal Components (color-coded: I. setosa, I. versicolor, I. virginica)

Using the posterior-mean mixture, we shall compare the predicted species for each datapoint to its true species and judge the goodness-of-fit of our model by the number of misclassification errors.

The model

Using PC1 and PC2 as our revised data, we shall begin by assuming that all we know is that there are three bivariate components in our mixture (prior information).

By far, the most common distribution for bivariate data of continuous variables is the BivariateNormal (cf. Sect. 12.9) so, lacking any reason to do otherwise, we shall adopt it for all three likelihood components. The BivariateNormal distribution has five parameters so a likelihood with three of them will have 15 unknowns plus two unknown weights.

With a dataset of N = 150 points and vague priors, we have little hope of getting a good result with MCMC especially since the BivariateNormal itself is extremely flexible and that flexibility needs to be curtailed by information. As a rough rule of thumb, we would like at least 30 points per parameter most of the time and, with mixtures, even this is not nearly enough because weights are not well determined.

In this example, several runs with various fixed, vague priors did, in fact, output marginals that were rather embarrassing. However, recognizing that there should be a lot of commonality in this problem, a hierarchical model seemed worth trying.


For this final attempt, a Gamma(3, 0.5) hyperprior (Fig. 15.9) was used for a family of scale parameters and for estimating the standard error for component-mean hyperpriors. Also, the latter were given means of {−8, −6, −3}. Since these are utilized in hyperpriors, not priors, they are not nearly as informative as they might appear and describe only a rough neighborhood for the component means, not their actual values. The same is true of the Gamma(3, 0.5) hyperprior (Fig. 15.9).

Figure 15.9: Gamma(3, 0.5) Distribution

Obviously, we are taking considerable liberties here with good procedure. Normally, such prior information would have to come from a subsampling of the data and we accept it here only because we are just trying to illustrate MCMC for a complicated mixture.

Given this, we define Model 15.3. The hyperpriors are in lines 11–17. The remaining priors are not as specific as they should be since we have assumed too much already. The likelihood, BivariateNormalMixture(·), is a distribution available with our assumed MacMCMC syntax which triggers (optional) relabeling. Other MCMC software would do this differently (if at all).

Our model shows one of the ways that hyperpriors are used in general. They are meant to describe something about the parameters appearing in the likelihood. In this example, that is exactly what they do via mum1, etc. However, their influence is indirect since they are usually one or more levels removed from yet another distribution (see lines 12–17). In any real analysis, layers of hyperpriors are where domain expertise would be incorporated.


Model 15.3: Iris Mixture (3 bivariate components)

 1  Constants:
 2    N = 150,             // # of points (rows)
 3    nc = 3,              // # of components
 4    sq = sqrt(N/nc);
 5  Data:
 6    pc1pc2[N][2];
 7  Variables:
 8    m1[nc], m2[nc], sig1[nc], sig2[nc], rho[nc], wt[nc],
 9    theta, s, mum1, mum2, mum3, row, i;
10  Priors:
11    s ~ Gamma(3, 0.5);
12    mum1 ~ Normal(-8, s/sq);
13    mum2 ~ Normal(-6, s/sq);
14    mum3 ~ Normal(-3, s/sq);
15    m1[1] ~ Normal(mum1, s);
16    m1[2] ~ Normal(mum2, s);
17    m1[3] ~ Normal(mum3, s);
18    for (i, 1:nc) {
19      m2[i] ~ Normal(5.5, s/sq);
20      sig1[i] = s;
21      sig2[i] = s;
22      rho[i] ~ Uniform(-1, 1);
23    }
24    theta ~ Uniform(0, 1);
25    wt[1] ~ Uniform(0, 0.5);
26    wt[2] = theta*(1 - wt[1]);
27    wt[3] = 1 - wt[1] - wt[2];
28  Likelihood:
29    for (row, 1:N) {
30      pc1pc2[row][] ~ BivariateNormalMixture(m1[], m2[],
31                        sig1[], sig2[], rho[], wt[], nc);
32    }
33  Extras:
34  Monitored:    // wt[] cannot be monitored due to relabeling
35    m1[], m2[], sig1[], sig2[], rho[];

The solution

The solution is shown, component by component, in Table 15.6. Only the common scale factor is listed.


Table 15.6: Fisher’s Irises (solution)

                    Estimate                95% Credible Interval
  Unknown     MAP         Mean          Lower Limit    Upper Limit
  m1[1]       −7.709      −7.690        −7.870         −7.512
  m2[1]        5.460       5.456         5.366          5.563
  rho[1]      −0.6506     −0.6578       −0.7960        −0.5064
  m1[2]       −6.177      −6.195        −6.344         −6.049
  m2[2]        5.272       5.280         5.195          5.383
  rho[2]      −0.6846     −0.68056      −0.8508        −0.4874
  m1[3]       −2.865      −2.856        −2.982         −2.729
  m2[3]        5.508       5.510         5.409          5.611
  rho[3]      −0.6816     −0.6279       −0.8038        −0.4179
  sigmas       0.5022      0.5172        0.4670         0.5701

Relabeling produced new mean weights = {0.3222, 0.3444, 0.3333}.

Figure 15.10 shows the data along with a fixed set of BivariateNormal contours for the three components computed using the posterior means from the table above. Posterior marginals are shown in Figure 15.11 (with same color coding).

Figure 15.10: Iris Solution: Mean Contours (I. setosa, I. versicolor, I. virginica)


Figure 15.11: Iris Marginals (color-coded)

Finally, note that the marginal for the sigmas is different from their Gamma(3, 0.5) prior confirming that the latter was quite vague, as desired.

Goodness-of-fit

As noted in the Salaries example, relabeling usually outputs a class assignment table. Here, algorithm NORMLH afforded such a list, part of which is shown in Table 15.7.

Table 15.7: Iris: Partial Class Assignments

  Datum     P(1)            P(2)            P(3)
  1         5.10589e-21     2.00653e-13     1
  2         3.49418e-20     1.90294e-12     1
  ...
  51        0.0131448       0.986855        6.86438e-14
  52        0.00391125      0.996089        1.09653e-12
  ...
  101       0.999867        0.000133153     3.52881e-22
  102       0.948136        0.0518637       9.86459e-16
  ...


We shall define the classification criterion as being P(k) ≥ 0.5. The original data were sorted by species so this assignment file is easily checked for classification errors. For Model 15.3, the errors for species {virginica, versicolor, setosa} = {4, 2, 0}, respectively. Given the overlap apparent in Figure 15.8 and the fact that we had only 150/17 ≈ 8.8 datapoints per parameter in the likelihood alone, plus several more unknowns besides, we could hardly ask for better than this. It is more than fair to say that we obtained a good fit.
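For completeness, the class probabilities in a table such as 15.7 are nothing more than Bayes’ theorem applied to the fitted mixture: P(k | datum) ∝ wk × BVNk(datum). A Python sketch of that computation (the weights and means below are rounded values from the solution; the covariance matrices are crude placeholders rather than the posterior ones):

  import numpy as np
  from scipy.stats import multivariate_normal

  def membership_probs(x, weights, means, covs):
      dens = np.array([w * multivariate_normal.pdf(x, mean=m, cov=c)
                       for w, m, c in zip(weights, means, covs)])
      return dens / dens.sum()

  weights = [0.32, 0.34, 0.34]
  means = [[-7.7, 5.46], [-6.2, 5.28], [-2.9, 5.51]]
  covs = [np.diag([0.25, 0.25])] * 3      # placeholder covariance matrices
  print(membership_probs([-7.5, 5.4], weights, means, covs))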

15.3 Going Further

Discrete mixtures are a common form of model and we have not exhausted the possible strategies with our three examples. In addition to the various algorithms for relabeling there are some alternatives for the models themselves. We mention very briefly just two of them.

First, some mixture models include a latent indicator variable for each datum, often with a Categorical prior. At each iteration, the current value of this indicator is used as an index designating which subpopulation that datum comes from. Thus, at any iteration, the likelihood computation does not involve a mixture. When there are a lot of points, this can lead to very long runtimes and it is not clear that it is better than relabeling. If subpopulations overlap, the credible intervals for these indicator marginals will, of course, be rather wide. For an example, using a Normal mixture with two components, see the Eyes example in Reference [5, vol. 2].

Second, with some analyses, it is not known in advance how many components there should be. One can do MCMC several times with different numbers of components and check goodness-of-fit but there is also the possibility of bringing this selection process into the model. In that case, one can use RJMCMC (see Sect. 11.3.3). This is a special kind of MCMC and the reader is referred to the cited reference for discussion and examples.

Regardless of the strategy adopted to analyze a mixture, results will probably be worse than typically found with non-mixture models. There is simply too much flexibility in a mixture model for the posterior to be pinned down as much as one would desire unless there is a lot of prior information. In all cases, identifiability issues must be addressed somehow.


Part IV

Case Study


MUCH of this book is devoted to exhibiting the power of Bayesian inference by presenting the details of models for various data analyses. The narrative style has been primarily pedagogical—a necessarily one-sided conversation. It seemed only fair, therefore, to reciprocate somewhat by discussing an example from this author’s own experience. The example in this chapter is the first Bayesian analysis that I performed for a serious purpose (one that I got paid to do). It is also one that made a big impression on those involved in the task and, especially, on me.


Chapter 16

When Data Don’t Exist

THE reason why the solution to this problem made such an impression on those who were involved is that it seemed, at first, impossible. The data we had to analyze did not exist! Therein lies a tale.

16.1 The Problem and the Data

Several years ago, a team was assembled to do some research related to air traffic control. The occasion was prompted by the recent availability of a large database of aircraft tracks, i.e., 3-D positions as a function of time. These tracks were complete from takeoff to touchdown and, in many cases, were thousands of miles long. Given that scale, the database recorded aircraft positions to a precision of about four feet since anything less than that was considered “down in the noise” as engineers like to say, especially given the typical size of a commercial passenger jet.

However, one of the things one learns in the data-analysis business is that, whenever anyone goes to the expense of creating a large database, it is likely that someone else will eventually want to use it for a purpose never envisaged at the time of creation. And so it came to pass that our group was brought together to use this database to analyze aircraft lateral (horizontal) positions as they crossed a runway threshold upon landing. The goal was to define more robust safety criteria by utilizing new and better data.

None of our group had been specifically aware of the four-foot precision when we began our work and did not realize what this would mean when taken together with the fact that a typical runway is only 150 feet wide. A plot of such data (Fig. 16.1) came as a bit of a shock.

Airport radars rotate at about 13 rpm and the vertical lines in this plot correspond to aircraft seen by the radar when they were at the runway threshold. The points in between the lines represent aircraft not near the threshold when observed. These points are the result of interpolation, done by our group and not quantized to four-foot precision.

You can appreciate that a plot like this was not what we had expected to see.


Figure 16.1: Aircraft Positions at a Runway Threshold

We needed good data to perform our task properly and these data, as well as similar data at other runways, were not nearly good enough. Moreover, going back and collecting better data was out of the question since we did not have the money, the authority or the time to do such a massive data collection. We were stuck; the data we required simply did not exist yet we were on the hook to deliver the promised analysis. So, what to do?

After a suitable recess for cogitation, there were three schools of thought:

• (Majority opinion) Assume that the observed quantization did not matter to any significant extent. Just ignore it.

• (Nearly all the rest) Add random jitter to the original datapoints sufficient to obscure the quantization so that lines in the plots are not evident. Analyze the jittered data.

• (Yours truly) Try to find a way to obtain the result that unquantized data would have provided even though we did not have such data. Think Bayes.

16.2 Bayesian Thoughts

It would be convenient if the quantization error were small enough that it really did not matter to the final result but this would have to be proven; it cannot simply be assumed. Therefore, we need a quantization-free result, at least once, as a basis for comparison.

Fortunately, we have some prior information as well as data. Aircraft tracks have long been analyzed in various ways and lateral deviation from an intended centerline, x, has a known distribution. [39] This model boasts a long pedigree and is well validated. The usual form, NL1(x), is shown in (16.1). (Compendium, pg. 93)

NL1(x) = w Normal(µ, σ) + (1 − w) Laplace(µ, λ)          (16.1)

This is a mixture of a Normal and a Laplace distribution with a common mean. It can describe data from a pure Normal (w = 1) to a pure Laplace (w = 0) distribution or anything in between. When this model is used with the data shown in Figure 16.1, the M-L result is {mu = 2.697, sigma = 8.124, lambda = 9.010, w = 0.774} with goodness-of-fit as shown in the probability plot below (Fig. 16.2).1 The overall fit is very good although the effect of quantization is also obvious.

Figure 16.2: Raw Runway Data as NL1

Given this much, the approach will be to assume that NL1 would describe unquantized data and model subsequent quantization directly. The data to be analyzed then consist of the (discrete) frequency counts within the bins defined by the “lines” in the data. The likelihood for these binCounts will naturally be ∼ Multinomial but there are a number of preliminary considerations.

1 This plot was created with Regress+.


• How many bins are there?

• What is the bin width?

• What is the average distribution of points within the bins?

• How should we compute the vector of bin probabilities?

Ideally, mathematical elegance would suggest letting everything be unknown but this does not work. When the bin width is unknown, then so is the number of bins in which case the whole problem becomes ill-defined and MCMC fails (does not converge). Upon reflection, this turned out to be a non-issue. Many of the vertical lines in Figure 16.1 each contained several points with lateral values identical to three or more significant figures. This was, no doubt, an artifact of database creation but, still, these were the actual data. What this implied was that the bin width was not unknown; it was known very well, viz., 3.6458 ft (avg. over several bins). Consequently, the number of bins required was also known. Defining a bin center as the position of a “line”, it takes 21 bins, starting at −43.4963 ft, to span all of the data so, including two empty tails, there are 23 bins. We now can describe the dataset as follows:

Table 16.1: Runway Data: Bin Frequencies

   0    1    0    0    2    3    1    4    8   16   40   63
  79   95   90   69   54   21    6    8    2    2    0

To examine the within-bin distribution, all 23 bins were normalized to [−0.5, 0.5] then superimposed. A histogram of the result is shown in Figure 16.3.

This histogram appears to be a mixture of two Uniform distributions: a narrow, central Uniform on a Uniform background. Since every parameter of this mixture is a constant, the entire PDF is a constant which will be subsumed eventually into the normalization constant of the posterior. For this problem, we do not anticipate any model comparisons so normalization and the marginal likelihood are not needed. Therefore, we can simply ignore this within-bin distribution and focus our attention on the parameters of the NL1 prior.

If we knew the parameters of the NL1 prior, we could obtain each multinomial bin probability by integrating the NL1 PDF across the bin as shown in Figure 16.4. Clearly, we do not know these parameters a priori but the MCMC process will determine them implicitly just as with any unknown prior. It is precisely this unquantized NL1 distribution that is the goal of this analysis.
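Equivalently, each bin probability is a difference of NL1 CDF values at successive bin edges, which is exactly the bookkeeping done in lines 20–28 of Model 16.1. A Python sketch of the same computation (the NL1 parameter values here are hypothetical; scipy supplies the standard Normal CDF):

  import math
  from scipy.stats import norm

  def nl1_cdf(x, mu, sigma, lam, w):
      lap = 0.5*math.exp((x - mu)/lam) if x < mu else 1 - 0.5*math.exp((mu - x)/lam)
      return w*norm.cdf(x, mu, sigma) + (1 - w)*lap

  def bin_probs(lbound, width, nbins, mu, sigma, lam, w):
      edges = [lbound + k*width for k in range(nbins - 1)]    # interior bin edges
      cdfs = [0.0] + [nl1_cdf(e, mu, sigma, lam, w) for e in edges] + [1.0]
      return [cdfs[k + 1] - cdfs[k] for k in range(nbins)]    # includes the two tails

  p = bin_probs(-43.4963, 3.6458, 23, mu=2.7, sigma=8.1, lam=9.0, w=0.77)
  print(sum(p))    # 1.0 by construction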

Putting everything together, we get the DAG shown in Figure 16.5. The pseudocode is presented in Model 16.1.


Figure 16.3: Within-bin Distribution

Figure 16.4: Generic Bin Probability


Figure 16.5: Model for Runway-threshold Positions (DAG with nodes µ, σ, λ, w, NL1, xk, pk and Data)


Model 16.1: Runway Model (latent prior)

 1  Constants:
 2    N = 1,               // # of (multinomial) points
 3    Nbins = 23,          // including empty tails
 4    binTotal = 564,
 5    LBound = -43.4963, binWidth = 3.6458;
 6  Data:
 7    x[N][Nbins];         // bin counts
 8  Variables:
 9    p[Nbins], mu, sigma, lambda, w, Ncdf, Lcdf, ub, sum, i, k;
10  Priors:
11    w ~ Uniform(0, 1);
12    mu ~ Uniform(-30, 30);
13    sigma ~ Jeffreys(1, 50);
14    lambda ~ Jeffreys(1, 50);
15  Likelihood:
16    for (i, 1:N) {
17      // get bin probabilities
18      ub = LBound;
19      sum = 0;
20      for (k, 1:(Nbins - 1)) {
21        Ncdf = phi((ub - mu)/sigma);
22        Lcdf = (ub < mu)*(0.5*exp((ub - mu)/lambda)) +
23               (ub >= mu)*(1 - 0.5*exp((mu - ub)/lambda));
24        p[k] = w*Ncdf + (1 - w)*Lcdf - sum;
25        sum = sum + p[k];
26        ub = ub + binWidth;
27      }
28      p[Nbins] = 1 - sum;
29      // the likelihood
30      x[i][] ~ Multinomial(p[], binTotal);
31    }
32  Extras:
33  Monitored:
34    mu, sigma, lambda, w;    // what we really need!

Here, all of the fixed priors are vague (lines 11–14). In this syntax, the standard Normal CDF, usually symbolized Φ(·), is a built-in function but the Laplace CDF is not. Luckily, the latter has a simple closed form so it can be computed on-the-fly (lines 22–23).

All of this work yields good results for the unquantized NL1 prior. Posterior values are listed in Table 16.2. Marginals are collected in Figure 16.6.2

Table 16.2: Runway NL1 Parameters

                   Estimate               95% Credible Interval
  Unknown     MAP        Mean         Lower Limit    Upper Limit
  mu          2.727      2.711        1.981           3.456
  sigma       8.137      8.277        7.353           9.207
  lambda      8.883      9.496        5.981          14.333
  w           0.775      0.768        0.549           0.966

Figure 16.6: Unquantized NL1 Marginals

There are some large uncertainties here, especially for the weight parameter, w, but then we had only one multinomial datapoint. In the end, the Bayesian MAP parameters for unquantized data were not much different from the quantized M-L vector so quantization did not really ruin the analysis. The majority were right after all. Of course, they didn’t know that for certain until they saw the Bayesian output. Neither did I.

Our final picture (Fig. 16.7) compares the quantized (M-L) and unquantized (MAP) fits along with the data histogram.

2 This heterogeneous NL1 mixture is not subject to identifiability issues (see Chap. 15).


Figure 16.7: NL1 Model With and Without Quantization [curves: (Quantized) M-L; (Latent, unquantized) Bayesian MAP]

Better than a thousand words—and then some.


Part V

Epilogue


The Preface stated the proposition that this book would be an informal, introductory overview of the practice of state-of-the-art data analysis. Having reached the end, it is time to consider the credibility of this proposition.

Probably nobody would dispute that the book is about data analysis and it is certainly informal—perhaps too much so for some readers. On the other hand, a few might quibble with the characterization “introductory” especially when one sees integral signs. However, data analysis is inherently mathematical and that cannot be avoided. Also, Appendix A might seem like it was written in a foreign language. Nevertheless, the examples discussed in Reference [15] and/or the material in any chapter of Reference [4] should be proof enough of the elementary nature of the treatment as presented here. So, if the previous chapters are taken as data, it is probably fair to conclude that this book delivered what was promised.

That promise was somewhat minimal. As stated earlier, this was not meant to be a textbook and it was not designed to make anyone an expert in data analysis. If you are faced with the task of analyzing data, you should now know enough to give you a good idea of what that should involve. In particular, you should appreciate that statistics, albeit still very popular, is not the best approach. You can do better. The case study in the previous chapter should encourage you not to give up too easily. Data analysis is both necessary and extremely important so it deserves a bit of effort. As the saying goes, “Anything worth doing is worth doing well.”

Another proverb reminds us that, “Well begun is half done.” I shall consider this book a success if readers conclude that the material discussed warrants further study.


Part VI

Appendices


Appendix A

Gibbs Sampling Code

This program is a bare-bones implementation of the example discussed in Section 11.3.2.1

Usually, there would be more “bells and whistles” and, certainly, a fair amount of error checking. The intent here is to illustrate just how simple Gibbs sampling can be in the most fortuitous circumstances.

The code shown can be compiled as is with any C++11 compiler on any computer. The data file must be in the same directory/folder as the compiled program.

1 Runtime = 0.35 s on a mid-2011 iMac.


// Gibbs sampling example (requires C++11 or greater)

#include <iostream>
#include <fstream>
#include <random>
#include <cmath>      // sqrt

using namespace std;

default_random_engine prng;    // One PRNG to rule them all
// C++11 distributions use SCALE parameters
normal_distribution<double> NormalD;
gamma_distribution<double> GammaD;

// global variables
double mu, tau, sumY;

// const int N = 65; const double y[N] = {-1.7, ..., 1.5};
#include "tempData.cpp"

double getMu() {
    const double tau_mu = 2.0;    // given prior
    double var = 1/(N*tau + tau_mu), mean = tau*sumY*var;
    normal_distribution<double>::param_type nparam(mean, sqrt(var));

    return NormalD(prng, nparam);
}

double getTau() {
    const double shape = 0.5*N + 2, r = 1.0/3;    // given prior
    double diff, rate, sum = 0;

    for (int i = 0; i < N; i++) {
        diff = y[i] - mu;
        sum += diff*diff;
    }
    rate = r + 0.5*sum;
    gamma_distribution<double>::param_type gparam(shape, 1/rate);

    return GammaD(prng, gparam);
}


int main(int argc, const char * argv[]) {
    int i, j;
    const int Nsamp = 100000, Nburnin = 1000, Nthin = 10;
    ofstream fout("NormalTemp.trace.txt");

    for (i = 0, sumY = 0; i < N; i++)
        sumY += y[i];

    mu = 0.0; tau = 2.0;                    // initial state
    for (i = 0; i < Nburnin; i++) {         // find the posterior
        mu = getMu();
        tau = getTau();
    }
    fout << "mu" << '\t' << "sigma" << endl;
    for (i = 0; i < Nsamp; i++) {           // sample posterior with thinning
        for (j = 0; j < Nthin; j++) {
            mu = getMu();
            tau = getTau();
        }
        fout << mu << '\t' << sqrt(1/tau) << endl;
    }

    return 0;
}


Appendix B

Other Model Syntax

This Appendix presents four more examples of model syntax as required by the software packages described in Section 11.4.

B.1 BUGS (WINBUGS, OpenBUGS)

A linear, least-squares regression (supplied with the package).

Model B.1: A BUGS Model

 1  model line;
 2  const
 3    N = 5;   # number of observations
 4  var
 5    x[N], Y[N], mu[N], alpha, beta, tau, sigma, x.bar;
 6  data x, Y in "line.dat";
 7  inits in "line.in";
 8  {
 9    for (i in 1:N) {
10      mu[i] <- alpha + beta*(x[i] - x.bar);
11      Y[i] ~ dnorm(mu[i],tau);
12    }
13    x.bar <- mean(x[]);
14    alpha ~ dnorm(0.0,1.0E-4);
15    beta ~ dnorm(0.0,1.0E-4);
16    tau ~ dgamma(1.0E-3,1.0E-3);
17    sigma <- 1.0/sqrt(tau);
18  }


B.2 JAGS

The JAGS version of the previous BUGS model (supplied with the package).

Model B.2: A JAGS Model

 1  model {
 2    for (i in 1:N) {
 3      mu[i] <- alpha + beta*(x[i] - x.bar);
 4      Y[i] ~ dnorm(mu[i],tau);
 5    }
 6    x.bar <- mean(x[]);
 7    alpha ~ dnorm(0.0,1.0E-4);
 8    beta ~ dnorm(0.0,1.0E-4);
 9    tau ~ dgamma(1.0E-3,1.0E-3);
10    sigma <- 1.0/sqrt(tau);
11  }


B.3 MCSim

An MCSim model for a linear, least-squares regression (supplied with the package).

Model B.3: An MCSim Model

 1  #-----------------------------------------------------------
 2  # linear.model
 3  #
 4  # Linear Model with random noise added.
 5  # P = A + B * time + Normal (0, SD_true)
 6  # Setting SD_true to zero gives the deterministic version,
 7  # which can be used as a link function in statistical models.
 8  #
 9  # Copyright (c) 1993-2008 Free Software Foundation, Inc.
10  #------------------------------------------------------------
11
12  Outputs = {y};
13
14  # Model Parameters
15  A = 0;
16  B = 1;
17  SD_true = 0;
18
19  CalcOutputs { y = A + B * t + NormalRandom(0,SD_true); }
20
21  End.


B.4 Stan

A Stan model for a Bernoulli sample (supplied with the package).

Model B.4: A Stan Model

 1  data {
 2    int<lower=0> N;
 3    int<lower=0,upper=1> y[N];
 4  }
 5  parameters {
 6    real<lower=0,upper=1> theta;
 7  }
 8  model {
 9    theta ~ beta(1,1);
10    for (n in 1:N)
11      y[n] ~ bernoulli(theta);
12  }

Note: Beta(1, 1) is the same as Uniform(0, 1) but the former is a conjugate prior for the Binomial (hence, Bernoulli) distribution.


Bibliography

[1] ANDREON, S., AND WEAVER, B. Bayesian Methods for the Physical Sciences. Springer, 2015.

[2] ARNOLD, J. R., AND LIBBY, W. F. Age determinations by radiocarbon content: Checks with samples of known age. Science 110 (1949), 678–680.

[3] BEVAN, A. Measuring dust in core-collapse supernovae with a Bayesian approach to line profile modelling. MNRAS 480, 4 (2018), 4659–4674.

[4] BROOKS, S., ET AL., Eds. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, 2001.

[5] BUGS/WINBUGS/OPENBUGS CONSORTIUM. http://www.openbugs.net/w/examples.

[6] CHESSGAMES.COM. http://www.chessgames.com.

[7] CHESSGAMES.COM. http://www.chessgames.com/chessecohelp.html.

[8] CONOVER, E. Even quantum information is physical. Science News 189, 11 (May 2016), 10.

[9] DEVROYE, L. Non-Uniform Random Variate Generation. Springer-Verlag, 1986.

[10] DRYER, I., Ed. Tychonis Brahe Dani Opera Omnia, vol. III. Hauniae In Libraria Gyldendaliana, 1916.

[11] EFRON, B., AND MORRIS, C. Stein's paradox in statistics. Scientific American 236, 5 (1977), 119–127.

[12] EVANS, J. The History and Practice of Ancient Astronomy. Oxford University Press, 1998.

[13] FAMIGHETTI, R., Ed. World Almanac. St. Martins Press, 1995-1997.

[14] FELLER, W. An Introduction to Probability Theory and Its Applications, 3rd ed., vol. 1. John Wiley & Sons, Ltd., 1968.

[15] GELMAN, A., ET AL. Bayesian Data Analysis, 3rd ed. Chapman & Hall/CRC, 2013.


[16] GELMAN, A., AND HILL, J. Data Analysis Using Regression and Multi-level/Hierarchical Models. Cambridge University Press, 2007.

[17] GELMAN, A., AND RUBIN, D. B. Inference from iterative simulation using multiplesequences. Statistical Science 7, 4 (1992), 457–472.

[18] GEMAN, S., AND GEMAN, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 6 (1984), 721–741.

[19] GEORGE, E. I., ET AL. Conjugate likelihood distributions. Scandinavian Journal of Statistics 20, 2 (1992), 147–156.

[20] GILKS, W. R., RICHARDSON, S., AND SPIEGELHALTER, D., Eds. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, 1995.

[21] GOLFCHANNEL.COM. PGA Tour Championship Leaderboard, 2013, September 2013.

[22] GOODMAN, J., AND WEARE, J. Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science 5, 1 (2010), 65–80.

[23] GRAHAM-SMITH, F. Eyes on the Sky. Oxford University Press, 2016.

[24] GREEN, P. J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 4 (December 1995), 711–732.

[25] GREENE, W. H., AND HENSHER, D. A. Modeling Ordered Choices, January 2009.

[26] GREGORY, P. Bayesian Logical Data Analysis for the Physical Sciences. CambridgeUniversity Press, 2010.

[27] GREGORY, P. C. A Bayesian analysis of extrasolar planet data for HD 73526. Astrophys. J. 631, 2 (2005), 1198–1214.

[28] GREGORY, P. C., AND FISCHER, D. A. A Bayesian periodogram finds evidence for three planets in 47 Ursae Majoris. MNRAS 403 (2010), 731.

[29] HAJIAN, A. Efficient cosmological parameter estimation with Hamiltonian Monte Carlo technique. Phys. Rev. D 75 (April 2007).

[30] HAMMING, R. W. The Art of Probability. Addison-Wesley, 1991.

[31] HEINZ, G., ET AL. Exploring relationships in body dimensions. J. Statistics Education 11, 2 (2003).


[32] HOGG, R. V., AND CRAIG, A. T. Introduction to Mathematical Statistics, 5th ed. Prentice-Hall, 1995.

[33] IDRE/UCLA. https://stats.idre.ucla.edu/r/dae/zip/.

[34] JACKMAN, S. Bayesian Analysis for the Social Sciences. John Wiley & Sons, Ltd., 2009.

[35] JASRA, A., HOLMES, C. C., AND STEPHENS, D. A. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modelling. Statistical Science 20, 1 (2005), 50–67.

[36] JAYNES, E. T. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[37] LANDAUER, R. Information is physical. Physics Today 44, 5 (May 1991), 23.

[38] LOCK, R. http://ww2.amstat.org/publications/jse/jse_data_archive.htm. JSE Data Archive.

[39] LOVE, W. D., MCLAUGHLIN, M. P., AND LEJEUNE, R. O. System and method for stochastic aircraft flightpath modeling. U.S. Patent 7,248,949 B2, July 2007.

[40] LUO, X. Constraining the shape of a gravity anomalous body using reversible jump Markov chain Monte Carlo. Geophysical Journal International 180, 3 (2010), 1067–1079.

[41] MAJOR LEAGUE BASEBALL. https://www.baseball-reference.com.

[42] MARTIN, J. http://thebullelephant.com/sat-scores-in-northern-virginia-schools-2014/.

[43] MCCREA, R. S., AND MORGAN, B. T. Analysis of Capture-Recapture Data. Chapman & Hall/CRC Interdisciplinary Statistics, 2014.

[44] MCGRAYNE, S. B. The Theory That Would Not Die. Yale University Press, 2011.

[45] MINARD, C. J. Napoleon: Minard plot, November 1869.

[46] NEUGEBAUER, O. A History of Ancient Mathematical Astronomy. Springer, 1975.

[47] NHL.COM. http://www.nhl.com/stats.

[48] NIST. http://webbook.nist.gov/chemistry.

[49] NIST. Fundamental physical constants, 2018.

[50] NORRIS, J., NIST. Calibration of ozone monitors.


[51] PAPASTAMOULIS, P. label.switching: An R package for dealing with the label switching problem in MCMC outputs. Journal of Statistical Software 69 (February 2016).

[52] RAMOS, E., AND DONOHO, D. http://lib.stat.cmu.edu/datasets/.

[53] RAPER, S. The shock of the mean. Significance 14, 6 (2017), 12–16.

[54] RAUER, H., ET AL. Optical observations of comet Hale-Bopp (C/1995 O1) at large heliocentric distances before perihelion. Science 275 (1997), 1909.

[55] RICHARDSON, S., AND GREEN, P. J. On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B 59, 4 (1997), 731–792.

[56] ROYLE, J. A., AND YOUNG, K. V. A hierarchical model for spatial capture–recapture data. Ecology 89, 8 (2008), 2281–2289.

[57] RUBINSTEIN, R. Y. Simulation and the Monte Carlo Method. John Wiley & Sons, Inc., 1981.

[58] SCHAEFER, B. E. The heliacal rise of Sirius and ancient Egyptian chronology.Journal for the History of Astronomy 31, 2 (May 2000), 149–155.

[59] SEIDELMANN, P. K., Ed. Explanatory Supplement to the Astronomical Almanac.University Science Books, 1992.

[60] SHOEMAKER, A. L. What's Normal? – Temperature, Gender, and Heart Rate. Journal of Statistics Education 4, 2 (1996).

[61] SIVIA, D. S., AND SKILLING, J. Data Analysis: A Bayesian Tutorial, 2nd ed. Oxford University Press, 2006.

[62] SPORTS REFERENCE LLC. http://www.basketball-reference.com.

[63] STAHL, S. The evolution of the normal distribution. Mathematics Magazine 79 (2006), 96–113.

[64] U.S. DEPARTMENT OF ENERGY. http://cdiac.ornl.gov/trends/co2.

[65] U.S. DEPARTMENT OF ENERGY. http://cdiac.ornl.gov/trends/co2/sio-mlo.html.

[66] VERBUNT, F., AND VAN GENT, R. H. Three editions of the star catalog of Tycho Brahe. Astronomy & Astrophysics 516 (June 2010).

[67] WIKIMEDIA. http://commons.wikimedia.org.

[68] YAO, W., AND LINDSAY, B. G. Bayesian mixture labeling by highest posterior density. Journal of the American Statistical Association 104 (2009), 758–767.