
REFERENCE No. RES-149-25-0010

Report on Project: An OGSA Component-Based Approach to Middleware for Statistical Modelling

1 Background

The general aim of this project was to demonstrate the effectiveness of an Open Grid Services Architecture (OGSA). In particular, our research sought to show how a component-based approach to middleware could deal with some of the large-scale statistical modelling problems currently confronting e-Social Scientists. We applied this approach to integrate the statistical modelling package Sabre, which is used for analysing work/life history data, into 'R'. Sabre was developed at Lancaster, whereas 'R' is a free-to-use language and environment for statistical computing and graphics (see http://www.r-project.org/ for details).

2 Objectives

The objectives of this project together with the extent of their achievement are as follows:

1. To develop a serial and parallel OGSA implementation of Sabre with extended facilities for multilevel multiprocess modelling. Section 3 shows how this objective has been met.

2. To make the source code of the original and parallel versions of Sabre freely available. Objective 2 was met, see Section 6.

3. To compare the performance of the following packages: (a) the sequential version of Sabre, (b) MLwiN, (c) the Stata program gllamm, and (d) the parallel version of Sabre, all using two substantive pieces of research as illustrations. This objective was partially met, see Section 4. In Section 4 we give the reasons why we limited our evaluation of Sabre 4.0 to directly comparable routines in Stata and to the Stata program gllamm. Also, rather than compare the models in just two empirical settings, we performed the comparison of Sabre with Stata and the Stata program gllamm on 16 different data sets (of varying size and complexity).

4. To make the serial and parallel versions of Sabre freely available as R Objects for use in the R statistics and computing system. This objective was met, see Section 3.

5. To explore the possibility of a Web portal to Sabre and Sabre within R. This objective has been met, see Section 4.

3 Methods

In this section we describe how we got from using Sabre on the desktop to running Sabre on a remote HPC from within R on the user's desktop (see Figures 1 and 2).

[Figure 1. Sabre on the Desktop: Sabre runs locally and reads its data on the user's desktop machine.]

[Figure 2. R, GROWL and Sabre: R on the desktop calls a C client, which communicates via SOAP over HTTP with a C service wrapping Sabre on the HPC.]

Objective 1. To develop a serial and parallel OGSA implementation of Sabre with extended facilities for multilevel multiprocess modelling

For a graphical representation of the provision of a parallel version of Sabre on Lancaster’s HPC, see Figure 3.

[Figure 3. Sabre on the HPC: the parallel version of Sabre deployed on Lancaster's HPC.]

Objective 1 involved both the extension of the statistical models in Sabre and the facility to go parallel at the most computationally demanding parts. Sabre was extended in several ways, as follows:

(a) Sabre’s Univariate Random Effect Models (Gaussian Quadrature)

Let $i$ index an individual and $t$ index time. Then, conditional on a univariate random effect $\nu_i$, the mean of the response $y_{it}$ is $\mu_{it}$:

$$\mu_{it} = g^{-1}(\eta_{it} + \nu_i),$$

where g(.) is the link function, e.g. probit. For Poisson and discrete response models with normally distributed random effects, the marginal likelihood generally does not have a closed form. The standard approach for estimating the parameters in these models involves evaluating the marginal likelihood numerically using Gauss–Hermite quadrature.
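To make the quadrature step concrete, the following is a minimal sketch in R of a Gauss-Hermite approximation to the marginal log-likelihood of a random-effects logit. It is illustrative only, not Sabre's Fortran implementation: the function name gh.loglik, and the layout of the response y, design matrix X and case identifier id, are our own assumptions; gauss.quad comes from the statmod package.

library(statmod)                                  # provides gauss.quad()

gh.loglik <- function(beta, sigma, y, X, id, nq = 12) {
  gq  <- gauss.quad(nq, kind = "hermite")         # nodes z_k and weights w_k
  eta <- as.vector(X %*% beta)                    # linear predictor eta_it
  ll  <- 0
  for (i in unique(id)) {
    rows <- id == i
    # likelihood of case i at each quadrature point nu_k = sqrt(2) * sigma * z_k
    Lk <- sapply(gq$nodes, function(z) {
      p <- plogis(eta[rows] + sqrt(2) * sigma * z)
      prod(ifelse(y[rows] == 1, p, 1 - p))
    })
    # integrate out the random effect: L_i ~ sum_k w_k L_i(nu_k) / sqrt(pi)
    ll <- ll + log(sum(gq$weights * Lk) / sqrt(pi))
  }
  ll
}

Maximising gh.loglik over beta and sigma (e.g. with optim) gives maximum likelihood estimates; the quality of the approximation improves as nq increases.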


The following table compares the univariate random effect models that can be estimated in Sabre 3.1 with those that can be estimated in Sabre 4.0 (the latest release).

Response:    Binary                     Ordered                    Count   Continuous
Link:        probit  logit  c-loglog    probit  logit  c-loglog    log     identity
Sabre 3.1    no      yes    no          no      no     no          yes     no
Sabre 4.0    yes     yes    yes         yes     yes    no          yes     yes

(b) Sabre’s Bivariate and Trivariate Random Effect Models (Gaussian Quadrature)

We now describe which bivariate and trivariate models can be estimated in Sabre 4.0. To do this we use the bivariate model, though the generalisation also holds for three dimensions. Let $y_{it}, y_{jt}$ be the two responses for an individual and let $t$ index time. Then, conditional on the bivariate random effects $\nu_i, \nu_j$, the means of the responses are $\mu_{it}, \mu_{jt}$, where

$$\mu_{it} = g_i^{-1}(\eta_{it} + \nu_i), \qquad \mu_{jt} = g_j^{-1}(\eta_{jt} + \nu_j),$$

and $g_i(\cdot), g_j(\cdot)$ are the link functions of each response. We maximize the integrated likelihood, unconditional on the random effects, and estimate $\sigma_i = \mathrm{var}(\nu_i)$, $\sigma_j = \mathrm{var}(\nu_j)$ and $\sigma_{ij} = \mathrm{cov}(\nu_i, \nu_j)$.

The following Table summarises the bivariate models we can estimate in Sabre 4.0.

                          Response 2:  Binary                     Ordered                    Count   Continuous
Response 1    Link:                    probit  logit  c-loglog    probit  logit  c-loglog    log     identity
Binary        probit                   yes     yes    yes         no      no     no          yes     yes
Binary        logit                    yes     yes    yes         no      no     no          yes     yes
Binary        c-loglog                 yes     yes    yes         no      no     no          yes     yes
Ordered       probit                   no      no     no          no      no     no          no      no
Ordered       logit                    no      no     no          no      no     no          no      no
Ordered       c-loglog                 no      no     no          no      no     no          no      no
Count         log                      yes     yes    yes         no      no     no          yes     yes
Continuous    identity                 yes     yes    yes         no      no     no          yes     yes

The diagonal elements of the table are the cases where the two responses have the same link function; the off-diagonal elements occur when the responses have different link functions. Because these results also apply in three dimensions, we can simultaneously estimate mixed joint models, e.g. those with binary, count and continuous responses, for all points in time in a panel survey.

Sabre uses analytical 1st and 2nd derivatives in estimating model parameters. It would have been too algebraically demanding to go to higher dimensions or more models (e.g. ordered response) with the limited resources we had available to us in this pilot demonstrator project.

Parallel Sabre

The Sabre code has been re-written to call the MPI library to allow it to run in parallel on multiple processors. The particular implementation used at Lancaster on the HPC is MPICH, a free implementation from Argonne National Laboratory. It is fully MPI 1.2 compliant.


Only four routines in the Sabre invariant code (READ, WLS, LSHFAS and LSHACC) have been changed to run in parallel, but the majority of the work was done beforehand in generalising the code to run within the R environment. In particular, approximately 1000 Fortran write statements were converted to function calls. So, although the implementation of Sabre as an R plug-in was superseded by the development of a Sabre-R library (Objective 4), the majority of the work done in the early part of the project was a necessary first step in developing the parallel version. The R library approach is less restrictive and can be generalised to other software, e.g. Stata and SAS.

Fundamental to the development of Sabre is the use of exactly the same invariant code in both the serial and parallel versions. At the time of writing the invariant code (sabre.f90), which contains all of the modelling algorithms, has 19,831 lines, while the implementation-dependent part has 514 lines for the serial version (macdep_serial.f90) and 730 lines for the parallel version (macdep_parallel.f90). The preservation of identical invariant code for both serial and parallel versions was regarded from the outset as an essential requirement for the ongoing development of the modelling capability; maintaining separate serial and parallel versions was seen as ultimately impractical. The calls to the MPI library therefore exist in the serial version, but they are not invoked at run-time if only one processor is used. Dummy MPI routines are provided in the implementation-dependent part of Sabre (macdep_serial.f90) so that Sabre can be compiled and run (serially) on a system which does not have MPI.

After generalising the input-output part of the code, a further prerequisite was to re-write key loops in the optimisation algorithms which, although theoretically suitable for running in parallel, could not in fact do so because of dependencies within the loops. Parallel running requires that the iterations of a loop can be executed in any order, as illustrated below. This re-organisation is therefore present in the serial version too, but does not affect the algorithms in any significant way.
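The following fragment, an illustrative sketch in R rather than Sabre's Fortran, with a made-up contrib function standing in for one case's likelihood contribution, shows the kind of re-organisation involved.

contrib <- function(k) k^2 / 10        # stand-in for one case's contribution

# Original pattern: the loop carries a running total, so iteration k
# cannot begin until iteration k-1 has finished.
total <- 0
for (k in 1:10) total <- total + contrib(k)

# Re-organised pattern: each iteration computes an independent piece and
# the reduction happens once, after the loop, so the iterations can be
# executed in any order (and hence on different processors).
pieces <- vapply(1:10, contrib, numeric(1))
total2 <- sum(pieces)
stopifnot(all.equal(total, total2))    # the same answer either way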

Finally, a strategy for dividing the work between processors in a way which minimised the overhead in calling the MPI routines was devised, and the appropriate parts of the code were then changed in order to make them processor sensitive.

Objective 4. To make the serial and parallel versions of Sabre freely available as R Objects for use in the R statistics and computing system

Sabre 4.0 will run on a desktop or cluster computer and can be invoked from R. It can also be invoked remotely using Grid middleware, such as Grid Resources on a Workstation Library (GROWL) - see http://www.growl.org.uk. We have provided an Appendix which shows how Sabre can be used from the R environment and how to invoke Sabre as a Grid service.


4 Results

Objective 3: To compare the performance of the sequential version of Sabre and equivalent MLwiN and Stata (gllamm) analyses with that of the parallel version of Sabre, using two substantive pieces of research as illustrations

There are a number of other programs that can be used to estimate multiprocess random effects models, e.g. gllamm, MLwiN and HLM. Both MLwiN and HLM use MQL/PQL (Breslow and Clayton, 1993), in which Taylor expansions are used to linearise the relationship between the responses and the linear predictors. Unfortunately, the parameter estimates from PQL tend to be biased for binary dependent variables when sequences are short and intra-class correlations are high (e.g. Rodriguez and Goldman, 1995, 2001). Furthermore, PQL does not involve the explicit calculation of the likelihood, which prevents the researcher from using conventional likelihood ratio tests in model development.

The bias of PQL can be reduced by using a sixth-order Laplace approximation to the marginal likelihood (Raudenbush et al., 2000). However, further improvement of the approximation requires additional algebraic work (Raudenbush et al., 2000) to increase the degree of the Laplace Taylor expansion. The advantage of Gaussian quadrature is that the adequacy of the approximation can easily be improved by using more quadrature points. The need to check whether more points are required to obtain a good approximation to the integral used to be a limiting feature of quadrature methods; the development of parallel implementations of quadrature-based methods means that this limitation has effectively disappeared.
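In practice the check is simple: re-evaluate the log-likelihood with more quadrature points and compare. A hedged illustration using the gh.loglik sketch from Section 3 (beta.hat and sigma.hat stand for fitted values; all names are our own):

ll12 <- gh.loglik(beta.hat, sigma.hat, y, X, id, nq = 12)   # original fit
ll24 <- gh.loglik(beta.hat, sigma.hat, y, X, id, nq = 24)   # doubled points
abs(ll24 - ll12)   # a negligible difference suggests 12 points suffice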

The hierarchical structure of random effects modelling lends itself naturally to MCMC methods, such as Gibbs sampling. Furthermore, if vague priors are specified, the method should essentially yield maximum likelihood estimates. However, there are two problems with this approach: (1) it is difficult to ensure that a truly stationary distribution has been obtained, and (2) different starting values can lead to different apparent stationary distributions. Because of this we do not compare MCMC methods with quadrature-based methods in this report.

Rabe-Hesketh et al. (2004) contains a comparison of gllamm with MLwiN. In the interests of efficiency, and with limited resources, we restricted our evaluation of Sabre 4.0 to the directly comparable routines in Stata and to the Stata program gllamm. Stata is now the statistical software package most widely used by social scientists in the UK and USA. All the Stata and gllamm calculations in this report were performed in Stata 9.

Rather than limit ourselves to the two comparisons we promised in the original project proposal, we preferred to undertake this comparison on 16 data sets (of different sizes and complexity). This contrasts situations where there are advantages to be gained from going parallel with those where there is little to be gained.

In parallel Sabre, the data set is read by the master processor and then sent to all of the slave processors at the start of the calculations; the data is not sent again. At each iteration the master and slave processors each calculate a component of the likelihood, and the slave processors send their components back to the master. The master then combines all the contributions, decides on a set of new parameter values, and so on. The proportion of time spent sending information between master and slaves (rather than deciding on new parameter values) depends on the data set size, the number of parameters, the number of slave processors in use and the nature of the interconnects.
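This master/slave division of labour can be sketched in R with the parallel package. Again this is illustrative, not Sabre's MPI Fortran; it reuses the hypothetical gh.loglik function and data names from Section 3.

library(parallel)
cl <- makeCluster(8)                          # eight "slave" processes
clusterEvalQ(cl, library(statmod))            # slaves also need gauss.quad()

# The master sends the data (and code) to every slave once, at the start.
clusterExport(cl, c("gh.loglik", "y", "X", "id"))
case.blocks <- clusterSplit(cl, unique(id))   # divide the cases between slaves

par.loglik <- function(beta, sigma) {
  # each slave computes the log-likelihood contribution of its block of cases
  parts <- parLapply(cl, case.blocks, function(cases) {
    rows <- id %in% cases
    gh.loglik(beta, sigma, y[rows], X[rows, , drop = FALSE], id[rows])
  })
  Reduce(`+`, parts)                          # the master combines the pieces
}
# At each iteration the optimiser on the master calls par.loglik() and then
# decides on new parameter values; only beta and sigma travel each time.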

Because of the fixed overhead in sending the data set to each slave and the time taken to gather the contributions to the likelihood, parallel jobs can take longer than serial jobs for simple models on small data sets. For more complex models on large data sets there will be a maximum number of processors beyond which the total run time is not decreased by adding more: each slave processor has a decreasing share of the computational work, while the communication overhead per slave remains fixed and the overhead for the master increases in proportion to the number of slaves. In our longer-running tests, however, a 20-fold speed increase is typically achieved before this processor limit is reached.

This section contains two sets of comparisons. The first is for some examples that will be used in short courses on multilevel multiprocess modelling, and the second example is for a large data set containing duration data that is typically used in applied research.

Multilevel Multiprocess Modelling Examples

The results of the comparison between Stata, gllamm and Sabre (by number of processors) in terms of CPU time in hours (hr), minutes (') and seconds (") on 15 example data sets (C = cross-sectional, L = longitudinal) are listed in the table below. Many of the published data sets in this table are quite small and as such are not very computationally demanding for Sabre. All the examples, exercises, data sets, training materials and software can be downloaded from http://sabre.lancs.ac.uk/.

Example  Data              Obs     Vars  Kb       Stata      gllamm              Sabre(1)  Sabre(2)  Sabre(4)  Sabre(8)
C1       hsb               7185    15    1172     01"        20' 51"             06"       04"       03"       02"
C2       hsb               7185    15    1172     00"        25' 59"             03"       02"       02"       02"
C3*      thaieduc1         8582    4     378      11"        4' 52"              01"       01"       01"       01"
         thaieduc2         7516    5     412
C4*      teacher1          661     3     22       n/a*       1' 14"              00"       00"       01"       01"
         teacher2          650     4     28
C5       racd (dvisits)    5190    21    1090     52"        18' 24"             03"       02"       01"       02"
         racd (prescrib)   5190    21    1090     42"        15' 11"             03"       02"       01"       02"
C6       visit-prescribe   10380   26    2717     n/a        45hr 10'            2' 21"    1' 11"    36"       20"
L1       pefr              34      4     2        00"        29"                 00"       00"       01"       01"
L2       nls (wage)        18995   20    3859     03"        2hr 12'             27"       15"       08"       05"
L3       growth            153     8     14       00"        1' 00"              00"       00"       01"       01"
L4       nls (union)       18995   20    3859     2' 02"     30' 04"             05"       03"       02"       02"
L5       schiz             1603    8     140      n/a*       2' 24"              00"       00"       01"       01"
L6       drvisits          2227    10    242      39"        9' 07"              02"       02"       01"       01"
L7       filled            390432  94    367556   59hr 52'   3 months+           34' 38"   18' 51"   11' 03"   7' 01"
         lapsed            390432  94    367556   67hr 31'   3 months+           29' 41"   16' 20"   9' 45"    6' 21"
L8       filled-lapsed     780864  261   2134413  n/a        3 years+            54hr 29'  32hr 5'   18hr 49'  11hr 58'
L9       union-wage        37990   25    9683     n/a        unexpected failure  18' 21"   9' 13"    4' 41"    2' 26"

Key:
C3* and C4*: aggregate timings for both analyses; in each case the second dataset has fewer observations due to missing values in the additional variable.
n/a: Stata 9 cannot estimate bivariate random effects models using quadrature.
n/a*: Stata 9 cannot estimate random effects ordered response models using quadrature.
unexpected failure: the gllamm manual does not rule this bivariate model out, but gllamm crashed just after starting.
time+ (e.g. 3 months+): indicates a lower limit on the CPU time used.

All the computations reported above were performed on Lancaster University's High Performance Computing facility, which consists of an array of 103 dual-processor Sun Blade workstations, each having between 1 and 8 gigabytes of memory. They are connected to a fileserver with 1300 gigabytes of disk storage. Sixteen of the workstations have "Myrinet" cards installed to allow very high speed communication between them, supporting parallel programs which distribute large amounts of data; the non-Myrinet connections run at 100 Mbit/s. Jobs are submitted to the array from the HPC front-end machine through the Sun Grid Engine/Codine queuing system hosted on the file server. This in turn distributes each submitted job to one of the many execution hosts, or holds it until a host becomes available. Users may submit large numbers of individual batch jobs, interactive jobs, or parallel message-passing programs using either MPI or PVM. In all of the computational comparisons we used workstations without the Myrinet cards. The Sun Blade workstations are 64-bit machines with 1 GB+ of RAM, running at 0.9 GHz; the HPC compiler is Sun Fortran 95 8.1. We also ran many of the single-processor examples on a Celeron (2.8 GHz, 256 MB of RAM, Windows XP), on which Sabre was compiled using the Intel Fortran 9.0 compiler. The Celeron generally finishes in half the time taken by the HPC processors. The HPC is now 3 years old and is due to be replaced in April 2006; however, it is the relative timings in the table that are important.

The shaded elements of the Stata column (examples C1, C2, L1, L2 and L3) are for the linear model using the Stata command xtreg, which estimates a random effects linear model by maximum likelihood; the integrated likelihood for the normal linear model with normally distributed random effects has a closed form. There are currently no Stata commands for bivariate random effect models, so no comparison with Stata is possible, hence the cells marked 'not applicable' (n/a). The gllamm timings with a plus sign are estimated lower limits (based on jobs run on a subset of the data), or are taken from the point at which we decided that we could not wait any longer for gllamm to finish.

The non-shaded results we quote are for standard quadrature in gllamm. The same number of quadrature points is used in the Stata, gllamm and Sabre comparisons; the number of quadrature points varies by example, depending on the number needed to give a good approximation to the likelihood. We have not used the gllamm adaptive quadrature algorithm in these comparisons. The argument for adaptive quadrature arises from the fact that Gaussian quadrature can perform badly when there are insufficient locations under the peak of the integrated likelihood; adaptive quadrature shifts the quadrature points in order to improve the adequacy of the approximation, and generally requires fewer points than standard quadrature to achieve the same approximation of the likelihood. However, increasing the number of quadrature points imposes very little computational burden in Sabre.

The timings for examples C3 and C4 are the joint timings for two sets of estimates. In both C3 and C4 the second data set has one more explanatory variable and fewer cases: the additional variable has some missing values, and in both data sets observations with missing values were deleted.

Sabre (1) clearly outperforms gllamm on all the data sets. Stata only outperforms Sabre (1) when it uses the MLE procedure (C1, C2, L1, L2, L3).

Why is Stata slower than Sabre? Stata is distributed in a compiled version that must run on a range of different processors: its developers do not distribute a differently compiled version for every processor, and there is typically just one version for each operating system. Why is gllamm slower than Stata? Gllamm is an add-on written in an interpreted language, i.e. its statements are translated to machine code as they run, and this happens every time a loop is executed. The same algorithm in a compiled language (optimised by the compiler to take advantage of the host processor) will run many times faster. There are also algorithmic differences, e.g. Sabre's analytical rather than numerical calculation of the derivatives.


If seconds matter, there can be some saving from going parallel even on small data sets. However, looking down the table, the first real advantage of using Sabre appears in example C6 (visit-prescribe), which involved the estimation of a bivariate Poisson model. Gllamm takes nearly 2 days, while Sabre (1) takes only 2' 21", which goes down to 20" with 8 processors: Sabre (1) is over a thousand times faster than gllamm in estimating this model (45hr 10' versus 2' 21").

The crucial illustration occurs with respect to examples L7 and L8. These examples are from a study that provides the first estimates of the determinants of employer search in the UK using duration modelling techniques. It involves modelling a job vacancy duration until either it is successfully filled or withdrawn from the market. For further detail see http://www.lancs.ac.uk/staff/ecasb/papers/vacdur_economica.pdf.

In example L7 we treat the 'filled' and 'lapsed' datasets as if they were independent. Both data sets have 390,432 binary observations (at the weekly level) on 12,840 vacancies. For the first risk ('filled') the final response for each vacancy is 1 at the point where the vacancy fills, and similarly for the 'lapsed' risk; at all other weeks the responses are zero. There are 7,234 filled vacancies and 5,606 lapsed vacancies.

The separate 'filled' and 'lapsed' duration models can be fitted in Stata using xtclog, gllamm and Sabre. For each type of risk we used a 26-piece non-parametric baseline hazard with 55 covariates and 12-point Gaussian quadrature. Stata took just under 3 days to estimate each model. After 3 months gllamm had still failed to converge.

The combined dataset (example L8) has 780,864 observations, each of the 12,840 vacancies being represented twice: for the 'filled' risk each sequence of vacancy responses ends in a 1 at the point where the vacancy is filled, with the 'lapsed' risk right-censored at that point, and vice versa for a 'lapsed' risk. Sabre 4.0 and gllamm can estimate a model which allows for a correlation between the risks; this option is not currently available in Stata. Sabre (1) takes about 2.5 days to estimate the correlated risks model, and by going parallel on 8 processors this time can be cut to about half a day. We estimate that gllamm would take over 3 years on our HPC.

Another Example

This illustrative example uses administrative records covering the duration in employment in the workforce of a major Australian state government to investigate the determinants of quits and separations amongst permanent and temporary workers. In the results we report here we combine the exits due to quits (a voluntary exit) and separations (an involuntary exit). For further detail see http://www.lancs.ac.uk/staff/ecasb/papers/bradley_jeea.pdf.

The data set consists of 3,655,704 binary observations (at the weekly level) on 199,881 individuals, with 14,716 non-zero binary outcomes. The results are for the logit model and 12 quadrature points. The computational times from the different ways of estimating the model are summarised in the Table below (CPU time is reported in minutes).

Data   Obs       Vars   Size   Stata    gllamm      Sabre(1)  Sabre(2)  Sabre(4)  Sabre(8)  Sabre(16)
Aus    3665704   53     2 Gb   10183'   6 months+   62'       32'       16'       9'        5'

Stata took just over a week, while Sabre (1) took about an hour; further speed-up was achieved by going parallel, with an almost perfectly linear increase in speed as the number of processors increases. We stopped the gllamm run after the programme had been running for 6 months.

Objective 5. To explore the possibility of a Web portal to Sabre and Sabre within R

The use of the GROWL client-server middleware architecture and API to permit users to exploit an R environment for controlling parallel Sabre running on the National Grid Service or other Grid resources was described under Objective 4 above. The fact that GROWL uses Web services to communicate commands and data between the client (user’s desktop system) and the server means that other interfaces can be produced. Work is ongoing in the JISC-funded Sakai VRE project (http://www.grids.ac.uk/Sakai) to extend portal frameworks to exploit Web service interfaces in a Service Oriented Architecture, as illustrated in Figure 5 below.

The approach to making Sabre accessible from a Web portal is two-fold:

1. JSR-168 compliant portlets can be written as client interfaces to the GROWL services which control Sabre. These can be deployed in any standards-compliant framework such as uPortal (popular for institutional solutions) or StringBeans (as currently used on the NGS).

2. GROWL services can be exported via WSRP, an OASIS standard for Web Services for Remote Portlets. The portlet interface hosted in uPortal can in this way be exposed using a WSRP consumer in the Sakai collaboration and learning environment. The consumer is currently being developed in the JISC-funded Sakai VRE project.

Outcomes of this work will be deployed for use in the ESRC-funded CQeSS project and in the NCeSS e-Infrastructure initiatives.

[Figure 5. Joining it all up: a Service Oriented Architecture in which GROWL Web service interfaces link collaborative tools, alerting (email, RSS), node tools and services, repositories, data archives (SOSIG etc.), text mining, portal components (WSRP), and Grid resources (Globus, Condor, SRB, an Oracle DB on the NGS, Java CoG). Note: P = this could be a combination of WSRP, Web Services, OGSA-DAI, SRW/U, etc.]


5 Activities

The main presentations arising from this project are:

1. Rob Crouchley, (2004), Collaboratory for Quantitative e-Social Science and The Sabre-R Pilot Demonstrator Project, NCeSS Conference, Manchester

2. Rob Crouchley, (2004), What can e-science offer statistical analysis in social research? Royal Statistical Society Annual Conference, Manchester

3. Daniel Grose and Audrienne Cutajar Bezzina, (2005), E-Science and Statistical Modelling in Social Research, First Summer School of the ESRC National Centre for Research Methods, Southampton

4. Rob Crouchley, Daniel Grose, Robert Allan, John Kewley, Mark Hayes, (2005), A VRE Programming Toolkit (GROWL) for quantitative e-Social Science, First International Conference on e-Social Science, NCeSS, Manchester

5. Mark Hayes, Lorna Morris, Rob Crouchley, Daniel Grose, Ties van Ark, Rob Allan and John Kewley, (2005), GROWL: A Lightweight Grid Services Toolkit and Applications http://www.allhands.org.uk/2005/proceedings/papers/460.pdf, EPSRC e-Science AHM, Sept 2005, ISBN 1-904425-53-4, Nottingham

6 Outputs

The main working paper outputs (attached) are:

Andrews, M.J., Bradley, S., Stott, D., and Upward, R., (2005), Successful employer search? An empirical analysis of vacancy duration using micro data, re-submitted to Economica.

Bradley, S., Draca, M., Green, C., Mangan, J., (2005), The effect of relative wages and external shocks on quits and separations from the public sector, for submission to Industrial and Labor Relations Review

Objective 2. To make the source code of the original and parallel versions of Sabre freely available

The source code of Sabre 4.0 for both serial and parallel versions is available for downloading from the Sabre web site http://sabre.lancs.ac.uk/. The "Downloading & Installing Sabre" link leads to a download form where the user is invited to give brief details so that the distribution of the package can be monitored. After completion of the form a link to the source code in ZIP format is then displayed. Other pages within the "Installing Sabre" section describe the installation dependent parts of the code and any changes which might be necessary before compiling.

Typically only the maximum problem size will need to be redefined, by changing one or more of four parameters in a header file. Sample compilation instructions for the Sun Forte f90 compiler and the Intel Fortran 90 compiler for Windows are also given.


7 Impacts

1. The development of the GRID-enabled version of Sabre and associated training materials will add to the 'tool kit' of techniques that practitioners engaged in research can utilise. The impact of this development will extend beyond research in the social sciences to, for example, biomedical statistical research.

2. Empirical research that admits (by using this type of software) the full complexity of social phenomena will be able to provide better evidence-based policy. For instance, findings on labour market transition behaviour will contribute to discussions on the design of training programmes and the provision of advice on job search/vocational guidance for the unemployed.

8 Future Research Priorities

Future priorities occur at two levels: (1) those relating to the development of Sabre, and (2) those that relate to the general context of e-Social Science.

(1) Some developments that could be added to Sabre:

1. Model weights
2. Random covariate coefficients
3. Adaptive quadrature
4. Fixed effects procedures for large samples, using sparse matrix procedures
5. Procedures for handling non-ignorable missing data.

(2) The general e-Social Science context: Sabre only provides a very small part of the toolkit needed by e-Social Science, and putting Sabre into the GROWL middleware library is only the beginning. We urgently need to put a range of database management tools, e.g. SRB and OGSA-DAI, into the GROWL library. We also need to write libraries like the one we wrote for R, so that researchers can use these tools from their preferred desktop application, be it SPSS, Stata or a browser. This step is needed because few social scientists will want to manage data on the NGS by programming in SQL, XML or, for that matter, GT 4+.

The need for database management tools like the SRB and OGSA-DAI arises from the fact that there are often many different sources of data that need to be manipulated in social research. For example, in the analysis of work history data there are 26 different sources of data in the BHPS, representing each annual wave (1991-2004) and the retrospective data obtained in 1992. These multiple sources need to be reconciled in order to produce a coherent life or work history, at the weekly or monthly level, from the point at which the sampled individual enters the labour market; in some cases this goes back 50 or more years from 1991. There is no one best way to do this type of data fusion, so the tools need to be in place for each researcher to do it for themselves.

Furthermore, the work and life history data also have to be supplemented by contextual data, e.g. local labour market conditions such as the local unemployment rate and local job vacancies, at all time and location points of the life and work history. This contextual data has to be obtained from a variety of sources, e.g. Census, NOMIS. Conducting the data integration on a desktop PC in Stata takes approximately 1 week of (CPU) time; similar issues would arise in merging any cohort study data set. The use of the GROWL middleware tools will require either that the user has permission to place their own versions of the data on the National Grid Service (NGS) or, preferably, that the ESDS, MIMAS, etc. have grid-enabled versions of the data that they can make available to the researcher.

Words [4559]


References

Breslow, N.E., and Clayton, D., (1993), Approximate inference in generalised linear mixed models, Journal of the American Statistical Association, 88, 9-25.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B., (2003), Bayesian Data Analysis, 2nd Edition, Chapman and Hall/CRC, Boca Raton, FL.

Rabe-Hesketh, S., Skrondal, A., and Pickles, A., (2004), GLLAMM Manual, U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 160. Downloadable from http://www.gllamm.org/docum.html

Rodriguez, G., and Goldman, N., (1995), An assessment of estimation procedures for multilevel models with binary responses, Journal of the Royal Statistical Society, A, 158, 73-89.

Rodriguez, G., and Goldman, N., (2001), Improved estimation procedures for multilevel models with binary response: a case study, Journal of the Royal Statistical Society, A, 164, 339-355.

Raudenbush, S.W., Yang, M.L., and Yosef, M., (2000), Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation, Journal of Computational and Graphical Statistics, 9, 141-157.


Appendix: GROWL, R and Sabre

The Grid Resources on a Workstation Library (GROWL) offers a generic C++ API for managing and communicating with applications. The design of the API is such that no modification to an existing application's code is usually necessary. To use the API, a developer writes a wrapper interface for the application, which is then used to access the application's functionality as if it were in the same process space.

GROWL can be represented by the following picture:

[Figure 4. Representation of the core GROWL components: a C client on the user's desktop communicates via SOAP over HTTP with a C service on the HPC.]

Additionally, GROWL provides a client-server system in which the server can host arbitrary services that have a SOAP interface. Client access to these services is over a secure (PKI/SSL) connection to a single port on the host system. Clients are authenticated to the server using their distinguished name, extracted from a certificate provided by a trusted certificate authority such as the National Grid Service. Client access to particular services can then be granted or denied based upon client status.

Running Sabre from R on Grid resources has required extensive use of the GROWL developers' API, with parallel Sabre hosted on a resource such as the National Grid Service and the GROWL server acting as an intermediary. The GROWL project was funded as part of the JISC VRE programme; see http://www.jisc.ac.uk/index.cfm?name=vre_growl and http://www.growl.org.uk.

Since the server is provided as a standalone web service, client access to a specific service becomes virtually transparent. Furthermore, significant advantages are gained from having the server as a standalone service. Firstly, the server has persistence (traditional web services typically do not), providing meaningful state for managing clients and services. Secondly, it eliminates any administrative dependency on services such as HTTP(S), allowing many of the difficulties associated with institutional firewalls to be overcome.

In the context of the grid, the architecture allows a developer to create client side interfaces to grid facilities hosted as web services. This is significant since any need for knowledge of grid middleware is devolved to developers providing the grid service. In addition, administration/security issues are devolved from the grid developer to the server administrator.


R

R is a language and environment for data manipulation, statistical computing and graphics. It is a GNU project which is similar to the S language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R provides a mechanism for loading optional code (developed in the R language) and binary modules developed in C or Fortran. R also supports a means of combining additional components, along with documentation and help facilities, into packages, and provides a number of tools for automating this process. There is a web-based repository for R packages, "The Comprehensive R Archive Network" (see http://www.r-project.org/). Installation of a package from a repository or other source is mostly automated within R and is a simple task for an R user.
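For example, installation typically amounts to a single command; the repository URL below is hypothetical (the report's software is distributed from http://sabre.lancs.ac.uk/ rather than CRAN):

> install.packages("sabreR", repos = "http://sabre.lancs.ac.uk/R")  # URL illustrative
> library(sabreR)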

Sabre-R interface

The Sabre extension to R was developed using GROWL components and R scripts combined into an R package (sabreR). In addition, parallel Sabre is hosted on the grid using a GROWL server. The main advantages of adopting this approach are:

1. The Sabre wrapper interface developed using GROWL is identical for both serial and parallel Sabre.

2. The GROWL server provides secure/authenticated access to parallel Sabre on the grid by employing the wrapper interface.

3. The GROWL server exposes Sabre functionality as a web service, thus eliminating many of the problems commonly associated with institutional firewalls and account management.

4. The user does not require an account on the system hosting parallel Sabre.

5. A user can start a grid-hosted Sabre session and then terminate the R session without cancelling the Sabre analysis. They can then recover the session for later use, even on a different client system.

Features

Being able to access Sabre functionality from within R has a number of advantages. In particular, the R user can undertake a Sabre analysis using the native R data structures and commands with which they are familiar. This allows the preparation and analysis of Sabre input and output to be integrated into existing methods and workflows already undertaken within the R environment. In addition, use of the GROWL API and server allows a user to have multiple concurrent Sabre models within a single R session: because the GROWL facilities are multi-threaded, control is returned to the user after each sabre command, even if the command is still being processed. These features allow a user to easily study the effects of modifications to model parameters and/or data by comparing estimates from multiple models within the same R session, as sketched below. The latter two features were not in the original project specification, but are a natural outcome of employing GROWL from related projects in which the team at Lancaster and Daresbury are involved.
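A hypothetical fragment, using the sabreR commands documented in the next section and the trade union data introduced there, illustrates two concurrent models:

> sabre0 <- sabre.session()                            # first model
> sabre1 <- sabre.session()                            # a second, independent model
> sabre.data(sabre0, trade.union)
> sabre.data(sabre1, trade.union)
> sabre.case(sabre0, "case")
> sabre.fit(sabre0, "year", "age", "fnoem", "fsc80")   # returns control immediately
> sabre.lfit(sabre1, "year", "age")                    # runs while sabre0 is fitting
> sabre.display.estimates(sabre0)
> sabre.display.estimates(sabre1)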

Interface

The sabreR commands are designed to be similar, in terms of case, naming convention and argument handling, to the native R commands and to those found in R packages. In addition, all data used in conjunction with a Sabre model is organised using native R data structures. The following demonstrates a typical sabreR session:

> library(sabreR)                                   # load the sabreR library
> sabre0 <- sabre.session()                         # create a new sabre model
> trade.union <- read.table("./TradeUnion.table")   # read the data into a data frame
> names(trade.union)                                # show the variates
 [1] "CASE" "YEAR" "AGE"  "EVNO" "SUPR" "HRS"  "NOEM" "SEX1" "TU"   "PROM"
[11] "SC80"
> sabre.data(sabre0, trade.union)
> sabre.display.variates()

  Name     Levels   Type
  ________________________
  cons       1      X
  case       1      X
  year       1      X
  age        1      X
  evno       1      X
  supr       1      X
  hrs        1      X
  noem       1      X
  sex1       1      X
  tu         1      YVAR
  prom       1      X
  sc80       1      X
  fnoem      5      X
  fsc80      6      X

> plot(trade.union)                       # plot the data
> sabre.y.variate(sabre0, "tu")
> sabre.factor(sabre0, "noem", "fnoem")
> sabre.factor(sabre0, "sc80", "fsc80")
> sabre.display.model()

  X-vars    Y-var
  ______________________
  year      tu
  age
  fnoem
  fsc80

Univariate model Standard logit

Number of observations = 1633

X-var df = 12

Log likelihood = -1073.0110 on 1621 residual degrees of freedom

> sabre.lfit(sabre0, "year", "age", "fnoem", "fsc80")   # linear fit

  Iteration    Log. lik.      Difference
  __________________________________________
      1       -1131.9093
      2       -1073.2798      58.63
      3       -1073.0111      .2687
      4       -1073.0110      0.1204E-03
      5       -1073.0110      0.4700E-09

> sabre.display.estimates(sabre0)

  Parameter      Estimate        Std. Err.
  ___________________________________________________
  year          -0.16136E-01     0.54224E-02
  age            0.32899E-01     0.69741E-02
  fnoem ( 1)    -1.3945           .80939
  fnoem ( 2)     -.78157          .47819
  fnoem ( 3)    -0.35445E-01      .47920
  fnoem ( 4)      .14679          .46976
  fnoem ( 5)     0.48744E-01      .46787
  fsc80 ( 1)      .00000         ALIASED [I]
  fsc80 ( 2)      .39780          .29480
  fsc80 ( 3)     -.17355          .30840
  fsc80 ( 4)      .60508          .28237
  fsc80 ( 5)      .49547          .29331
  fsc80 ( 6)      .51569          .32419

> sabre.case(sabre0, "case")
> sabre.fit(sabre0, "year", "age", "fnoem", "fsc80")
> # NB returns control to user immediately even though
> # analysis is still running
> sabre.display.estimates(sabre0)

*** Sabre analysis still in progress ***

> # ... some time later ...
> sabre.display.iterations(sabre0)

Initial Homogeneous Fit:

  Iteration    Log. lik.      Difference
  __________________________________________
      1       -1131.9093
      2       -1073.2798      58.63
      3       -1073.0111      .2687
      4       -1073.0110      0.1204E-03
      5       -1073.0110      0.4700E-09

  Iteration    Log. lik.      Step      End-points      Orthogonality
                              length    0      1        criterion
  ________________________________________________________________________
      1       -917.86146      1.0000    fixed  fixed     6.5174
      2       -878.33512      1.0000    fixed  fixed    19.983
      3       -868.98256      1.0000    fixed  fixed     4.6722
      4       -867.52529      1.0000    fixed  fixed     5.2621
      5       -867.20337      1.0000    fixed  fixed    11.115

> # ... user can be doing other things within R whilst analysis takes place ...
> sabre.display.estimates(sabre0)

Initial Homogeneous Fit:

  Iteration    Log. lik.      Difference
  __________________________________________
      1       -1131.9093
      2       -1073.2798      58.63
      3       -1073.0111      .2687
      4       -1073.0110      0.1204E-03
      5       -1073.0110      0.4700E-09

  Iteration    Log. lik.      Step      End-points      Orthogonality
                              length    0      1        criterion
  ________________________________________________________________________
      1       -917.86146      1.0000    fixed  fixed     6.5174
      2       -878.33512      1.0000    fixed  fixed    19.983
      3       -868.98256      1.0000    fixed  fixed     4.6722
      4       -867.52529      1.0000    fixed  fixed     5.2621
      5       -867.20337      1.0000    fixed  fixed    11.115
      6       -867.06558      1.0000    fixed  fixed     5.9164
      7       -866.93034      1.0000    fixed  fixed     6.8780
      8       -866.93024      1.0000    fixed  fixed    11.479
      9       -866.93024      1.0000    fixed  fixed

Notice how in the example the first argument to all of the sabre commands is a sabre session; this is how sabreR distinguishes between multiple sabre models within a single R session. Furthermore, notice that the first call to sabre.display.estimates resulted in a warning that the analysis was not yet complete. This demonstrates the multi-threaded nature of the sabreR package.

Finally, sabre.display.iterations allows a user to keep track of each Sabre analysis that is currently running.

How would the above session differ if a parallel Sabre analysis were being executed on a grid resource? The following demonstrates how little additional effort is required:

> nwg <- grid.resource("~/smith.pem", "~/smith.pem",
+                      "growl.lancs.ac.uk:50000", "~/smith.passwd")
> sabre0 <- sabre.session(nwg)   # this time create a sabre session with a grid resource
> # ..... continue as before

In this example, a Grid resource is acquired by passing the grid.resource function the location of the user's certificates, a file containing the user's password and the name of the system hosting the GROWL server. Sabre sessions are created as in the previous example, except that the Grid resource is passed to the sabre.session function. Any ensuing Sabre commands are identical to those used with a local (serial) session of Sabre.

If a user leaves the R session while a grid Sabre session is active, they can return to it later. The sabreR package offers two simple functions for retrieving grid Sabre sessions. These are outlined in the following example.

> library(sabreR)
> nwg <- grid.resource("~/smith.pem", "~/smith.pem",
+                      "growl.lancs.ac.uk:50000", "~/smith.passwd")
> sabre.current.sessions(nwg)
      started             last command
  1   02/01/2006 13:01    02/01/2006 14:27
  2   02/01/2006 13:07    02/01/2006 13:54
  3   02/01/2006 13:08    02/01/2006 13:57
  4   02/01/2006 13:11    02/01/2006 14:03

> sabre0 <- sabre.recover.session(nwg, 3)   # recover session 3
