event history models: why r? why sabrer? rob crouchley

Collaboratory for Quantitative e-Social Science




Event History Models: Why R? Why SabreR?

Rob Crouchley





Contents

• Some science

• Performance of the available tools for multilevel models

• Breaking the technological barrier to adoption (sabreR)

• Demo

• Performance of parallel sabreR

• Conclusions





Some Science: BHPS Data (small dataset)

• Sample of males who were employed and earning a wage at some point over the period 1991-2003 (13 years)

• Gives a total of 5130 individuals with a sequence of responses that occurred somewhere in the 1991-2003 interval

• At the first sample point of the survey (1991) there were 2316 individuals, of whom 945 had some form of training in the previous 12 months and 106 had been promoted in the previous 12 months

• The mean of the log of their weekly wage (in pounds sterling) was 5.65





What is the Effect of Training & Promotion on Wages?

• Suppose we want to disentangle the dependencies between:

  • Promotion (P=1,0) in the last 12 months (latent var P*)

  • On the job training (T=1,0) in the last 12 months (latent var T*)

  • Current wages (W)

[Path diagram: e_p → P* → P;  e_t → T* → T;  e_w → W]





Correlated Random Effects Model

[Path diagram (layout lost in extraction): u_p → P* → P;  u_t → T* → T;  u_w → W;  with residual errors e*_p, e*_t, e*_w]
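The slide gives only the diagram, so the following is a hedged sketch of how a correlated random effects model of this kind is commonly written. The covariate vector x, the coefficient vectors β and the coefficients γ are assumed notation, not taken from the slides. Including P and T in the wage equation is an assumption too, though it is consistent with the later slide that reports Promo and Train coefficients in the wage equation.

```latex
\begin{align*}
P^*_{it} &= x_{it}'\beta_p + u_{p,i} + e^*_{p,it}, & P_{it} &= \mathbf{1}[P^*_{it} > 0] \\
T^*_{it} &= x_{it}'\beta_t + u_{t,i} + e^*_{t,it}, & T_{it} &= \mathbf{1}[T^*_{it} > 0] \\
W_{it}   &= x_{it}'\beta_w + \gamma_p P_{it} + \gamma_t T_{it} + u_{w,i} + e^*_{w,it} \\
(u_{p,i},\, u_{t,i},\, u_{w,i})' &\sim N(0, \Sigma)
\end{align*}
```

In this sketch, the off-diagonal elements of Σ carry the dependence between the three equations. Setting them to zero would give independent random effects, and dropping the u terms altogether would give a homogeneous model.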





Commercial Software for MGLMMs

• Stata: http://www.stata.com/ Standard/adaptive quadrature, Newton-Raphson. See also Stata MP

• SAS PROC NLMIXED: http://www.sas.com/ Standard/adaptive quadrature and Taylor/Laplace expansions, quasi-Newton. See also SAS PROC MPCONNECT and SAS Grid computing

• Limdep: http://www.limdep.com/ Quadrature, quasi-Newton





MGLMMs: Other Systems

• MLwiN: http://www.cmm.bristol.ac.uk/ Laplace approximation and IRLS (also MCMC)

• gllamm (Stata program): http://www.gllamm.org/ Standard/adaptive quadrature, Newton-Raphson

• aML: http://www.applied-ml.com/





Packages at http://cran.r-project.org/ for GLMMs and MGLMMs

• lmer (http://cran.r-project.org/web/packages/lme4/index.html) Laplace Approx, penalized iteratively reweighted least squares

• npmlreg (http://cran.r-project.org/web/packages/npmlreg/index.html) Quadrature and NPML, EM algorithm





Why Quadrature?

• PQL: Parameter estimates tend to be biased for binary dependent variables with small cluster sizes and high intraclass correlations (e.g. Rodriguez and Goldman, 1995, 2001)

• PQL: does not involve a likelihood, which prohibits the use of likelihood based inference

• Laplace Approximation: The 6th order expansion (Raudenbush et al., 2000) worked as well as 7-point AQ in simulations of a two-level binary dependent variable model

• The precision of GQ and AQ can be increased by simply using more quadrature points

• We cannot increase the order of the Taylor or Laplace expansion beyond the 2, 4 or 6 terms allowed for
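The point about adding quadrature points can be illustrated with a minimal sketch in base R. It is not Sabre's implementation. The helper names `gauss_hermite` and `marginal_prob` and the values eta = 0.5 and sigma = 2 are hypothetical choices for illustration. The sketch approximates the marginal probability of a random-intercept logit model by ordinary Gauss-Hermite quadrature and compares it with R's `integrate()`.

```r
# Gauss-Hermite nodes and weights via the Golub-Welsch eigenvalue method
# (base R only; weight function exp(-x^2)).
gauss_hermite <- function(n) {
  if (n == 1) return(list(nodes = 0, weights = sqrt(pi)))
  k <- seq_len(n - 1)
  J <- matrix(0, n, n)
  J[cbind(k, k + 1)] <- sqrt(k / 2)
  J[cbind(k + 1, k)] <- sqrt(k / 2)
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = sqrt(pi) * e$vectors[1, ]^2)
}

# Marginal P(y = 1) for a random-intercept logit:
# integrate plogis(eta + sigma * u) over u ~ N(0, 1) with n quadrature points.
marginal_prob <- function(eta, sigma, n) {
  gh <- gauss_hermite(n)
  sum(gh$weights * plogis(eta + sigma * sqrt(2) * gh$nodes)) / sqrt(pi)
}

# Reference value from numerical integration (illustrative parameter values)
ref <- integrate(function(u) plogis(0.5 + 2 * u) * dnorm(u), -Inf, Inf)$value

# Print the absolute error for an increasing number of quadrature points
for (n in c(2, 4, 8, 16, 32)) {
  cat(sprintf("n = %2d  abs error = %.2e\n", n,
              abs(marginal_prob(0.5, 2, n) - ref)))
}
```

In this sketch the only tuning knob is `n`; a fixed-order Taylor or Laplace expansion has no comparable way to buy more accuracy.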





Simulation Based Methods

• Computer-intensive alternatives to GQ and AQ include simulation-based approaches such as Markov Chain Monte Carlo (MCMC) (e.g. Gelman et al., 2003) and maximum simulated likelihood (MSL) (Hajivassiliou and Ruud, 1994)

• The hierarchical structure of multilevel models lends itself naturally to MCMC using, for instance, Gibbs sampling. If vague priors are specified, the method essentially yields maximum likelihood estimates

• Unfortunately, a problem with MCMC is how to ensure that a truly stationary distribution has been obtained for MGLMMs, especially when we have a lot of structural and incidental parameters





In tests, serial Sabre outperforms other software

• lmer: GQ and AQ not yet implemented; REML and ML give the Laplace approximation answer

• npmlreg: GQ times reported, as AQ is not available

• Sabre used the Portland Group PGF90 7.1-6 compiler with -fast (level 2 optimization)

• Times are system times (very close to real time in all figures), with very little variation between runs

• R and gllamm are interpreted code; SAS?

Example     Data       Obs    Cases  Vars  Size (tab)  Method         Stata   gllamm    SAS      npmlreg  lmer   Sabre 1
univariate  Wages (W)  31022  5285   74    17.1MB      AQ (12)        15"     22h23'    3'26"    1h28'    2'03"  1'05"
univariate  Train (T)  31022  5285   71    17.1MB      AQ (16)        11'51"  25h32'    7+days   44'39"   5'51"  50"
univariate  Prom (P)   31022  5285   72    17.1MB      AQ (16)        15'08"  25h32'    7+days   58'37"   4'37"  52"
bivariate   T & P      62044  5285   143   34.2MB      AQ (16x16)     na      150+days  30+days  na       nd     1h42'
trivariate  W & T & P  93066  5285   217   51.3MB      AQ (12x16x16)  na      15+yrs    1+yrs    na       nd     115h45'





Other Sabre comparisons: very small to small data sets

• MLwiN (MCMC, IGLS) is 2-25 times slower in univariate 2-level models

• For others see the Sabre site: http://sabre.lancs.ac.uk/





Changes in Substantive Findings Between Models

Covariate coefficients in the wage equation, by model

              Homog       Indep       Dep
Promo         0.09499     0.06103     0.05288
              0.00824     0.00599     0.00611
Train        -0.00683    -0.00865    -0.00864
              0.00526     0.00396     0.00405
Likelihood  -38471.93   -29448.19   -29419.52
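One way to compare the Indep and Dep columns is a likelihood ratio test on the reported likelihood values, sketched here in R. This is a minimal sketch: the degrees of freedom (3, one per random-effect correlation in a trivariate model) are an assumption, not something the slides state, and the reported values are assumed to be log-likelihoods.

```r
# Likelihood values reported on the slide (assumed to be log-likelihoods)
ll_indep <- -29448.19   # independent random effects
ll_dep   <- -29419.52   # correlated (dependent) random effects

# Likelihood ratio statistic for Dep against Indep
lr <- 2 * (ll_dep - ll_indep)
round(lr, 2)            # 57.34

# ASSUMPTION: 3 extra parameters (the three random-effect correlations)
df <- 3
p_value <- pchisq(lr, df = df, lower.tail = FALSE)
p_value
```

Under that assumption the statistic is far in the tail of the chi-squared distribution, which fits the slide's point that allowing dependence changes the fit and the Promo coefficient.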





Breaking the technological barrier to adoption

Previously

• 2x harder to use the NGS than to use your local HPC (private computing facility)

Now

• It is easier to use the NGS (public computing facility) than it is to use your local HPC





Enabling Technology for grid computing

All you need is:

1. An internet connection

2. The installation of our multiR or sabreR packages for R

3. A certificate to identify the client to the host, typically a grid certificate





Also

• Users do not need to install or be familiar with Globus, VDT, gsissh, gsiscp, grid-ftp, grid-proxy tools or any other grid-related software.

• There is very little difference between using the Sabre library from within R on the desktop, and using Sabre for statistical modelling on the grid from within R.





Desktop Vs Grid on the Windows desktop

Serial sabreR

sabre.model.1 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                       first.family = "gaussian", first.mass = 64,
                       first.scale = 0.5)

# display results
sabre.model.1

Parallel sabreR

# load previously saved grid session object
load(file = "ncess.demo.session.R")

sabre.model.2 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                       first.family = "gaussian", first.mass = 64,
                       first.scale = 0.5, session = ncess.demo.session,
                       description = "here ya go !!")

# recover the results and display them
sabre.results(ncess.demo.session, sabre.model.2)





Demo

• rob_sabrer_edit2.mov





Master-Slave (Distributed Memory) Model for MPI as used by Sabre on the NW-Grid

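Slave processes (each handles its own block of cases):

• L_i, H_i, d_i, partial sums a: i = 1,...,1000

• L_i, H_i, d_i, partial sums b: i = 1001,...,2000

• L_i, H_i, d_i, partial sums c: i = 2001,...,3000

• L_i, H_i, d_i, partial sums d: i = 3001,...,4000

Master process:

• a + b + c + d for L, H and d, etc., then NR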

There is no commercial software on the NGS or NW-GRID (licensing and cost issues)
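The master-slave scheme in the diagram can be sketched in base R without MPI. This is an illustrative sketch under stated assumptions, not Sabre's code: L, H and d are assumed to denote the log-likelihood, Hessian and gradient, the model is a plain logistic regression on simulated data, and the workers are simulated with `lapply()` rather than real MPI processes. The names `worker` and `blocks` are hypothetical.

```r
set.seed(1)
n <- 4000
x <- cbind(1, rnorm(n))                       # intercept and one covariate
y <- rbinom(n, 1, plogis(x %*% c(-0.5, 1)))   # simulated binary response

# Split the cases into 4 blocks of 1000, one per (simulated) worker
blocks <- split(seq_len(n), rep(1:4, each = n / 4))

# Worker: log-likelihood L, gradient d and Hessian H for its block only
worker <- function(idx, beta) {
  xb <- x[idx, , drop = FALSE]
  p  <- as.vector(plogis(xb %*% beta))
  list(L = sum(dbinom(y[idx], 1, p, log = TRUE)),
       d = as.vector(crossprod(xb, y[idx] - p)),
       H = -crossprod(xb * (p * (1 - p)), xb))
}

# Master: sum the partial results, then take a Newton-Raphson (NR) step
beta <- c(0, 0)
for (iter in 1:25) {
  parts <- lapply(blocks, worker, beta = beta)  # stands in for the MPI send/receive
  d <- Reduce(`+`, lapply(parts, `[[`, "d"))
  H <- Reduce(`+`, lapply(parts, `[[`, "H"))
  step <- solve(H, d)
  beta <- beta - step
  if (max(abs(step)) < 1e-8) break
}
beta
```

Because the likelihood contributions are sums over cases, only the small summed quantities (a scalar, a vector and a matrix per worker) travel back to the master at each iteration, which is what makes the distributed-memory layout attractive.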





Performance of Parallel Sabre

Relative performance of parallel Sabre compared to serial Sabre (=100) on example datasets

[Chart: performance (0 to 100) against number of processors]

                     Serial  2 proc  4 proc  8 proc
L7 - filled          100.0   54.4    31.9    20.3
L7 - lapsed          100.0   55.0    32.8    21.4
L8 - filled-lapsed   100.0   58.9    34.5    22.0
L9 - Union wage      100.0   50.2    25.5    13.3

In the Wage example, 5 days becomes 2.75 hours on 48 processors
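The figures above can be turned into speedup and parallel efficiency with a few lines of R. This sketch assumes the relative performance values are run times as a percentage of the serial run time, which the slide does not state explicitly.

```r
procs <- c(1, 2, 4, 8)

# Relative run time (serial = 100), as reported on the slide
rel_time <- rbind(
  "L7 - filled"        = c(100.0, 54.4, 31.9, 20.3),
  "L7 - lapsed"        = c(100.0, 55.0, 32.8, 21.4),
  "L8 - filled-lapsed" = c(100.0, 58.9, 34.5, 22.0),
  "L9 - Union wage"    = c(100.0, 50.2, 25.5, 13.3)
)
colnames(rel_time) <- c("Serial", "2 proc", "4 proc", "8 proc")

# Speedup = serial time / parallel time; efficiency = speedup / processors
speedup    <- 100 / rel_time
efficiency <- sweep(speedup, 2, procs, "/")

round(speedup, 2)
round(efficiency, 2)

# Wage example: 5 days (120 hours) becomes 2.75 hours on 48 processors
wage_speedup    <- (5 * 24) / 2.75
wage_efficiency <- wage_speedup / 48
round(c(speedup = wage_speedup, efficiency = wage_efficiency), 2)
```

On this reading, the Union wage example scales best of the four datasets, and the 48-processor Wage run retains roughly 90% parallel efficiency.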





Why R?

• Commercial tools (Stata, SAS) are of limited use on a public grid, e.g. Stata MP cannot have multiple data sets in memory, and neither system provides access to its source code

• There are no plans to install them on the UK National Grid Service (NGS) because of cost/licensing issues

• R is an effective, efficient and easy-to-use tool for statistical modelling

• Many existing tried and tested statistical methods already available for R can easily be modified to exploit the benefits of grid computing

• Work flows to support the modelling process are simple to create

• R is easy to install on most popular operating systems (Windows, Unix, OSX) and can be used directly from a USB memory stick

• R includes a programming environment which, when used in conjunction with our multiR and sabreR packages, automatically provides a data-centric scripting tool for grid computing

• There are no licensing issues





Conclusions

• This approach makes all the grid middleware invisible and thus removes the biggest barrier to take-up

• This approach can provide researchers with more sophisticated statistical modelling tools, help increase their understanding of complex processes, and thus help them to undertake more effective research

• Social researchers do not need to let their large scale science agenda using GLMs be set by the developments of the big statistics software houses, like SAS, Stata etc.




