


Future Generation Computer Systems 29 (2013) 549–559

Contents lists available at SciVerse ScienceDirect

Future Generation Computer Systems

journal homepage: www.elsevier.com/locate/fgcs

Towards a virtual research environment for language and literature researchers

Muhammad S. Sarwar a,∗, T. Doherty b, J. Watt b, Richard O. Sinnott c

a Room 246C, Kelvin Building, National e-Science Centre, University of Glasgow, Glasgow, UK
b National e-Science Centre, University of Glasgow, Glasgow, UK
c Department of Computing and Information Systems, University of Melbourne, Melbourne, Australia

Article info

Article history:
Received 16 March 2011
Received in revised form 1 March 2012
Accepted 22 March 2012
Available online 1 April 2012

Keywords:
Humanities
Language and literature
MapReduce
HPC
Grid
ENROLLER

Abstract

Language and literature researchers often use a variety of data resources in order to conduct their day-to-day research. Such resources include dictionaries, thesauri, corpora, images, audio and video collections. These resources are typically distributed, and comprise non-interoperable repositories of data that are often licence protected. In this context, researchers typically conduct their research through direct access to individual web-based resources. This form of research is non-scalable, time consuming and often frustrating to the researchers. The JISC funded project Enhancing Repositories for Language and Literature Researchers (ENROLLER, http://www.gla.ac.uk/enroller/) aims to address this by provision of an interactive research infrastructure providing seamless access to a range of major language and literature repositories. This paper describes this infrastructure and the services that have been developed to overcome the issues in access and use of digital resources in the humanities. In particular, we describe how high performance computing facilities including the UK e-Science National Grid Service (NGS, http://www.ngs.ac.uk) have been exploited to support advanced, bulk search capabilities, implemented using Google's MapReduce algorithm. We also describe our experiences in the use of the resource brokering Workload Management System (WMS) and the Virtual Organization Membership Service (VOMS) solutions in this space. Finally we outline the experiences of the arts and humanities community in using this infrastructure.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Consider a scenario where a humanist wishes to search for a word, say 'canny', in a dictionary to find its meaning; in a thesaurus to look up the associated concepts and categories it is found and used in; and in a corpus of work to find the documents containing it. Researchers may also want to see the concordances (the context in which the term was used) and determine the frequency of occurrence of the word in each found document as a basis for further analysis. The ability to save the different result sets, and the analysis of those results, for later comparison between many different resultant data sets and with different researchers is compelling to the humanities community (and indeed is a challenge faced by many other research domains). This scenario becomes especially interesting and challenging when multiple dictionaries, thesauri and text corpora need to be cross-searched or differences between the textual resources exist. For example, searching for the word 'canny' in the Oxford English

∗ Corresponding author. Tel.: +44 141 330 2958.
E-mail addresses: [email protected], [email protected], [email protected] (M.S. Sarwar), [email protected] (R.O. Sinnott).

0167-739X/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2012.03.015

Dictionary (OED) [1], Scottish National Dictionary (SND) [2] and Dictionary of Older Scottish Tongue (DOST) [3] at the same time will give slightly different results on the definition of the term. When compared with other resources such as the Historical Thesaurus of English (HTE) [4] to look up the related concepts and categories, and/or the Scottish Corpus of Text and Speech (SCOTS) [5] and/or the Newcastle Electronic Corpus of Tyneside English (NECTE) [6], the multitude of definitions and their historical context is especially challenging to establish. The problem scales further if the researcher decides to search for multiple, possibly hundreds, of words at once and perform all of the mentioned tasks. Currently, language and literature scholars use multiple independent web-based resources to achieve these tasks. Licenced access to multiple resources is commonly required, and the end user researchers are left with traditional Internet-hopping based research. An interactive research infrastructure that brings together all of the different providers' data sets in a seamless and secure environment is thus highly desirable, and has been the focus of the JISC funded ENROLLER project [7].

The ENROLLER project began in April 2009 and is tasked with providing such a capability through the establishment and support of a targeted Virtual Research Environment (VRE) implementing a secure and seamless data integration and information retrieval system for language and literature scholars.



1.1. Related work

VRE-SDM [8], TextGrid [9], TEXTvre [10] and gMan [11] are some of the VRE systems that exist in the arts and humanities domain. VRE-SDM focused on the development of services for sharing and annotating manuscripts [8]. TextGrid involves providing tools and services for the analysis of textual data and support for data curation over the Grid [9]. TEXTvre builds upon the success of TextGrid and provides tools for TEI-based resource creation [10]. The gMan VRE is targeted at Classics and Ancient History researchers and aims to be a general purpose VRE for arts and humanities researchers [11]. While all of these projects are aimed at arts and humanities researchers, they are not particularly targeted at language and literature researchers. Furthermore, their focus is not on supporting federated data access models where data providers are autonomous, e.g. as is the case with the Oxford English Dictionary, but rather on the amalgamation of tools used for data processing and analysis associated with humanities research and/or the establishment of data warehouses where data sets are imported from archives, as is the case with gMan.

ENROLLER aims to build a sustainable e-infrastructure for language and literature researchers. Through the ENROLLER work, researchers in the language and literature domain will have access to large amounts of language and literature data from a single, easy-to-use portal; membership of an international network of scholars; increased knowledge of digital resources, and direct access to a portfolio of analysis tools. ENROLLER will also raise awareness and understanding of e-Science and establish a focal point for research for the humanities community. It will allow a community of researchers with related aims to collaborate more easily, and already funded data sets to be used in new combinations that could result in heuristic discoveries. The wider humanities community will benefit directly from the models developed here. The resulting knowledge transfer will be of benefit to both the humanities and the e-Science communities, as well as to the wider community such as publishers, dictionary creators and national services.

The rest of this paper describes the challenges in implementing the VRE for language and literature researchers and the solutions put together thus far. Section 2 describes the background and data collections involved in the project. Section 3 describes the VRE and its overarching requirements. Section 4 describes the design of the various components that make up the ENROLLER VRE. Section 5 explains the implementation details and outlines the problems faced and solutions implemented during the course of the work. Section 6 presents typical use cases in accessing and using the system. Section 7 highlights the feedback on the work collected from the language and literature community. Finally Section 8 draws conclusions on the work as a whole and areas of future work.

2. Data sets and formats

The ENROLLER project is currently working with numerous major data sets from a variety of data providers. These include:

2.1. The Historical Thesaurus of English (HTE, http://libra.englang.arts.gla.ac.uk/historicalthesaurus/aboutproject.html)

The HTE contains more than 750,000 words from Old English (c700 A.D.) to the present. The HTE has been published by Oxford University Press since 2009 and represents a new and significant development in historical language studies. HTE data is currently available in XML format.

2.2. Scottish Corpus of Text and Speech (SCOTS—www.scottishcorpus.ac.uk)

The Engineering and Physical Sciences Research Council (EPSRC, www.epsrc.ac.uk) and the Arts and Humanities Research Council (AHRC, www.ahrc.ac.uk) funded SCOTS resource offers a collection of text and audio files covering the period from 1945 to the present. The SCOTS corpus is currently available in a Text Encoding Initiative (TEI, www.tei-c.org) compliant XML format. Data can also be made available through a PostgreSQL relational database. The SCOTS corpus contains over 4 million words of running text.

2.3. Dictionary of Scots Language (DSL—www.dsl.ac.uk/dsl)

The AHRC funded DSL resource encompasses two major Scottish language dictionaries: The Scottish National Dictionary (SND) and The Dictionary of Older Scottish Tongue (DOST). DSL data is currently available in XML format. Scottish Language Dictionaries (SLD) hosts the data on their servers in Edinburgh.

2.4. Newcastle Electronic Corpus of Tyneside English (NECTE—www.ncl.ac.uk/necte)

The AHRC funded NECTE is a corpus of dialect speech from Tyneside in Northeast England. The corpus aggregates the work of two existing corpora: the Tyneside Linguistic Survey (TLS), created in the late 1960s, and the Phonological Variation and Change in Contemporary Spoken English (PVC), created in 1994. The NECTE corpus is encoded in TEI-compliant XML format. The encoded data is available in four different formats: audio, orthographic text, phonetic, and parts-of-speech tagged. The NECTE corpus contains over 500,000 words of running text.

2.5. Corpus of Modern Scottish Writing (CMSW—www.scottishcorpus.ac.uk/cmsw/)

The EPSRC and AHRC funded CMSW is a collection of letters (mostly texts and images) from the period 1700 A.D. to 1945 A.D. (This is regarded as 'modern' writing by the language and literature community.)

2.6. Oxford English Dictionary (OED—www.oed.com)

The Oxford English Dictionary (OED—www.oed.com) is a commercial resource published by Oxford University Press and is widely regarded as the primary authority on the current and historical vocabulary of the English language.

2.7. The Hansard Collection

The Hansard Collection is a collection of transcribed texts of the UK's parliamentary speeches from the period 1803 to 2005. The Hansard data is available in XML format. The Hansard Collection contains over 7.5 million XML documents.

All of these data resources collectively represent significant independent investments and efforts in capturing and cataloguing the history of the English and Scots languages. It is to be noted that the ENROLLER project does not maintain any of the data sets provided by the project collaborators. Oxford University Press (OUP) maintains the OED, and the DSL is maintained by SLD. Access to the OED is made through an OED SRU service (http://www.oed.com/public/sruservice), while the DSL is accessed using a secure web service.



3. A virtual research environment

A VRE is generally regarded as an online environment offering a set of tools aimed at providing a collaborative research environment for researchers that may be geographically dispersed. Typically this involves offering a portfolio of services and data sets through a single, uniform web portal. Successful VREs will typically minimize end user researchers' exposure to the underlying technologies and to the distribution and heterogeneity of the resources involved.

The requirements of the VRE for this project can be broadly divided into four major categories:

3.1. Computational requirements

3.1.1. Parsing and indexing

Before any knowledge can be drawn from a given data collection, where feasible, the data needs to be extracted from the collection. Extracted data will typically need to be parsed and indexed. Since every collection follows a different approach to encoding and structuring its data, different approaches for data extraction, including parsing and indexing, are required. These need to be targeted to the remote data providers' data models and aligned with a unified and consistent model.

The VRE has been designed to be extensible and to allow incorporation of other data sets, and indeed for individual researchers to upload their own data sets of interest to the community. Processes to automate the indexing of uploaded collections are thus highly desirable.
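To make the per-collection extraction step above concrete, the sketch below uses the StAX streaming API (which the implementation section later names as the project's XML parser) to pull out the text of every `word` element from an XML fragment. The element name and class name are illustrative assumptions, not taken from the ENROLLER code; each collection's real schema differs, which is precisely why per-collection parsers are needed.

```java
import javax.xml.stream.*;
import java.io.StringReader;
import java.util.*;

// Illustrative StAX extractor: stream an XML fragment and collect
// the text of every <word> element. The element name is invented
// for this sketch; real collections each have their own schema.
public class StaxWordExtractor {

    static List<String> extractWords(String xml) {
        List<String> words = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "word".equals(reader.getLocalName())) {
                    // getElementText() consumes the element's text node.
                    words.add(reader.getElementText().trim());
                }
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException("Malformed XML input", e);
        }
        return words;
    }

    public static void main(String[] args) {
        String xml = "<entry><word>canny</word><word>bairn</word></entry>";
        System.out.println(extractWords(xml));   // [canny, bairn]
    }
}
```

The extracted terms would then feed an indexing stage; streaming parsing keeps memory use flat even for very large collections such as Hansard.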

3.1.2. Simple and cross-collection searches

Executing simple word searches, multiple word searches and phrase queries on indexed collections is a basic requirement of both the project and the community at large. Queries should be executable against any number of available and selected collections. Furthermore, support for cross-collection searches on indexed collections is an essential requirement for this community, since this is one of the primary benefits of having a VRE.

Cross-collection search refers to situations where users wish to undertake a search over data existing in multiple different collections. For example, a user might want to search for a set of 200 words against the thesaurus (HTE) and corpora (SCOTS and NECTE). In this case (as currently implemented in ENROLLER), the user will typically upload a Comma Separated Value (CSV) file containing the list of words to be searched and select the particular collections to be searched over.
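The fan-out described above can be sketched as follows. This is an illustrative toy, not the ENROLLER source: the collection names echo those in the paper, but the in-memory sets stand in for what would really be Lucene indexes and remote services such as the OED SRU interface.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of cross-collection search: parse a CSV line of terms,
// then report which selected collections contain each term.
public class CrossCollectionSearch {

    // Toy stand-ins for indexed collections (real ones are Lucene
    // indexes and remote web services).
    static final Map<String, Set<String>> COLLECTIONS = Map.of(
        "HTE",   Set.of("canny", "timid", "bold"),
        "SCOTS", Set.of("canny", "bairn"),
        "NECTE", Set.of("gan", "canny"));

    // Parse one CSV line of search terms, e.g. "canny, bairn".
    static List<String> parseTerms(String csvLine) {
        return Arrays.stream(csvLine.split(","))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    // For each term, list (alphabetically) the selected collections
    // that contain it.
    static Map<String, List<String>> search(List<String> terms,
                                            Set<String> selected) {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        for (String term : terms) {
            List<String> found = COLLECTIONS.entrySet().stream()
                .filter(e -> selected.contains(e.getKey()))
                .filter(e -> e.getValue().contains(term))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
            hits.put(term, found);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = parseTerms("canny, bairn");
        System.out.println(search(terms, Set.of("HTE", "SCOTS", "NECTE")));
    }
}
```

In the real system the per-collection lookups would be independent queries, so they parallelize naturally; that observation is what motivates the Grid-based bulk search of the next section.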

3.1.3. Bulk searching over the Grid

Bulk search refers to situations where a researcher wishes to search multiple, possibly hundreds of, words simultaneously over any of the data collections. In this case, it is essential to exploit high performance computing resources. Ideally this should be completely transparent to the end user humanities community. It is also desired that the system be able to execute complex and computationally intensive linguistic interactions. To support this, two services are thus highly desirable:

3.1.4. Workflow execution

Workflow execution refers to the situation where a user wishes to perform a series of searches as part of one larger search. In this case, a user will typically input either a single word, or upload a file containing multiple words/phrases to be searched (for example, search for the term 'canny' in the Historical Thesaurus). The user may also want to search the results of this query against one or more corpora, e.g. SCOTS. Based upon the results of this, the user may also wish to find concordance samples of the word in use, with ten or so words of context for each of these words, e.g. the sentence in which it was used, and lastly display the thesaurus-search results, corpus-search results, and concordances against each of the words, before finally saving or downloading all or any of these results. Fig. 1 shows an example of a workflow.

3.1.5. Linguistic analysis tools

Enabling researchers to perform linguistic analysis on search results obtained from multiple providers requires the development and deployment of linguistic analysis tools such as concordance, frequency analysis and collocation clouds into the VRE, i.e. offering a one stop shop for the language and literature research community.
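Two of the analyses named above, keyword-in-context (KWIC) concordance and term frequency, can be sketched in a few lines. The class and method names, and the context window of "ten or so words" reduced here to a parameter, are illustrative assumptions rather than ENROLLER code.

```java
import java.util.*;

// Minimal sketch of concordance and frequency analysis on a text.
public class Concordance {

    // Return each occurrence of `term` with up to `window` words of
    // context on either side (cf. the "ten or so words" requirement).
    static List<String> kwic(String text, String term, int window) {
        String[] words = text.split("\\s+");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equalsIgnoreCase(term)) {
                int from = Math.max(0, i - window);
                int to = Math.min(words.length, i + window + 1);
                lines.add(String.join(" ",
                    Arrays.copyOfRange(words, from, to)));
            }
        }
        return lines;
    }

    // Case-insensitive frequency of `term` in `text`.
    static long frequency(String text, String term) {
        return Arrays.stream(text.split("\\s+"))
                     .filter(w -> w.equalsIgnoreCase(term))
                     .count();
    }

    public static void main(String[] args) {
        String text = "a canny lad and a canny lass walked by";
        System.out.println(kwic(text, "canny", 2));
        System.out.println(frequency(text, "canny"));  // 2
    }
}
```

A production tool would additionally handle punctuation, tokenization of TEI markup, and very large texts, but the core sliding-window idea is the same.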

3.2. Security requirements

3.2.1. Authentication and authorization

The VRE should support seamless access to multiple data resources. One way this can be achieved is through the e-Science/e-Research single sign-on paradigm and the exploitation of Grid technologies. Single sign-on should overcome the need for creating multiple data provider specific username and password combinations. Furthermore, it is essential that, where possible, individual users should have minimal exposure (ideally none!) to the associated underlying Grid technologies, e.g. having them acquire and maintain their own X.509 certificates should be avoided.

3.2.2. Communication

Secure communication channels and security as a whole are very important to many stakeholders in this community, since they are often dealing with data sets that have been collected over many decades and/or have considerable intellectual property or commercial value. Secure communication channels should prevent unwanted eavesdropping and avoid the transfer of confidential information such as usernames and passwords to and from data providers.

3.3. Data deposition and automated indexing

The ENROLLER VRE is required to provide services for the deposition of language and literature data collections. Deposited data collections need to be indexed automatically once they have been uploaded and subsequently be available for search operations.

3.4. Usability requirements

A well-designed, easy-to-use search system providing secure and seamless access to the distributed data collections is essential. The user interface of the search system is key. Complex interfaces that require degrees of IT 'savviness' were to be avoided. The interface itself should ideally provide intuitive options and engage the community directly. In this regard, personalization is an important feature for this community. Thus users of the VRE should be able to personalize their home pages and be able to perform collaborative research by sharing the results of their individual research.

All of these factors have been taken into account in the design and development of the ENROLLER VRE and associated software platform.



Fig. 1. Example of a workflow (using term ‘timid’).

Fig. 2. ENROLLER system architecture.

4. Design

4.1. System architecture

The system adopts an n-tier architecture, broadly divided into four distinct tiers as shown in Fig. 2. At the top is the presentation layer, which provides the user interface to the system. The second tier is the messaging layer, which implements methods to communicate with business logic and data access components. Business logic and data access components form the third tier of the system and implement the processes and workflow activities of the system. Data access components interact with underlying persistent data stores. The third tier also contains a set of web and Grid services that interact with distributed data and computational resources, such as the Oxford English Dictionary (OED) and the UK e-Science National Grid Service (NGS, www.ngs.ac.uk). Persistent data stores form the fourth layer of the system.

The business logic and data access components are responsible for data extraction, parsing and indexing. Information retrieval, transformation and the application of linguistic analysis algorithms are also performed by these components. Fig. 3 shows the flow of activities of the parsing and indexing components. Fig. 4 shows the flow of activities of the information retrieval, transformation and language analysis components that the ENROLLER system currently supports.

It is noted that the business logic and data access components have been implemented as standalone plug-and-play software components to increase the reusability and simplify the maintenance of the system.

4.2. Authentication and authorization

The Internet2 Shibboleth [12] framework has been used to provide user-oriented secure access to the portal. This eliminates the need for users to create their own usernames and passwords to

Fig. 3. Parsing and indexing components.



Fig. 4. Information retrieval, analysis and transformation.

Fig. 5. Shibboleth-based-log-in to the portal.

login to the portal. Instead users are redirected to their institutional homes for authentication. Once authenticated, a digitally signed assertion is returned to the ENROLLER portal, including any attributes that they may have, which are subsequently used to restrict access to the portal features. This security-oriented personalization of portal contents exploits software capabilities from the SPAM-GP project and is described in [13]. Fig. 5 shows the flow of activities of the whole process.

It is worth noting here that the signed assertion that is sent back from the Identity Provider (IdP) includes encrypted information that is subsequently used as part of the process of creating a secure session in the portal. Ideally this would include sufficient information to dynamically create an X509 proxy credential for the particular user. This extra information is not typically made available through the UK Access Management Federation (www.ukfederation.org.uk) however. We are working closely with the UK Federation and the NGS in this regard to explore solutions that allow direct translation of SAML assertions to create the associated proxy credentials—building on results of the SARoNGS project (http://cts.ngs.ac.uk). In the meantime, a targeted ENROLLER IdP has been established at the National e-Science Centre for use in the ENROLLER project.

4.3. Bulk searching over the Grid

To support larger scale searches, where several thousand query terms are possible and need to be searched across multiple large scale data resources (although we note that this does not include licence protected data resources), the project is exploiting the NGS. In particular, the project is exploiting the Virtual Organization Membership Service (VOMS) solution [14] in accessing the NGS, where pooled ENROLLER accounts are used by researchers accessing these resources through a targeted project portal. This includes use and exploitation of the Workload Management System (WMS) [15] to provide resource broking-based job



Fig. 6. Grid-based job submission process.

scheduling across all of the NGS nodes. This job scheduling is currently targeted at supporting large-scale searching based upon Google's MapReduce [16] application. A set of job-submission and status-monitoring services support job submission, status monitoring and output retrieval. All of these capabilities have been iteratively developed in close co-ordination with the language and literature community.

4.3.1. Job submission

A job submission service provides the facility for job submission directly from the ENROLLER portal. Once the user submits a Grid-search request, the Grid-job-submission-service is invoked and parameters for the Grid-search are provided. Parameters for the Grid-search include the search terms themselves (or a file including these search terms), the user_id, and an encrypted version of the MyProxy username and password. The Grid-job-submission-service decrypts this username and password and subsequently contacts the MyProxy [17] server to retrieve the necessary proxy certificate of the user who initiated the job submission process, using the provided username/password information. It is worth noting that the returned credentials already include the VOMS attribute certificate extensions (stating what role and privileges the end user has as part of the ENROLLER VO) as part of the X509 proxy certificate.

A job-submission-script has been written to launch the MapReduce Java application on the Grid. To support this, the Grid job submission service generates the required Job Specification Description Language (JSDL) [18] for the job. The job submission script itself is also staged to the Grid. Once all of the configurations are completed successfully, the NGS WMS service is contacted and a request for job submission is made. The WMS automatically matches the job requirements against the pool of resources available to the ENROLLER VO and schedules the job for execution accordingly. Upon successful scheduling of a job, a job-id is returned. This job-id is then stored in a database for general job tracking and updates. It is noted that this job-id can have numerous sub-jobs associated with it, i.e. when bulk jobs are supported. In this case the WMS service can schedule jobs across multiple distributed NGS resources. Fig. 6 shows the job submission process.

4.3.1.1. Realization of Grid-based job submission. When a job is successfully scheduled for execution by the WMS, the staged job submission script is initiated. This script sets up the necessary paths to the indexes and other required libraries. After setting up these paths, the multi-threaded distributed MapReduce service, which itself implements Google's MapReduce algorithm, is started. Upon successful completion, the application outputs a file containing the search results. Once the job has finished execution, the WMS clears the job from memory and makes the output available to the portal (or directly to the user).
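The map/reduce idea behind the search service can be sketched in plain Java: a map phase counts occurrences of each search term per document in parallel worker threads, and a reduce phase merges the per-document counts. This is an illustrative in-memory miniature only; the deployed multi-threaded service staged onto NGS nodes is a separate application, and all names here are assumptions.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

// In-memory sketch of MapReduce-style bulk term search.
public class BulkSearch {

    // Map phase: per-document counts of the requested terms.
    static Map<String, Long> mapPhase(String doc, Set<String> terms) {
        return Arrays.stream(doc.toLowerCase().split("\\W+"))
                     .filter(terms::contains)
                     .collect(Collectors.groupingBy(
                         w -> w, Collectors.counting()));
    }

    // Reduce phase: merge the partial counts into one sorted total.
    static Map<String, Long> reducePhase(List<Map<String, Long>> parts) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> part : parts)
            part.forEach((k, v) -> total.merge(k, v, Long::sum));
        return total;
    }

    // Run the map phase in a thread pool, then reduce.
    static Map<String, Long> search(List<String> docs, Set<String> terms) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, Math.min(4, docs.size())));
        try {
            List<Future<Map<String, Long>>> futures = new ArrayList<>();
            for (String d : docs)
                futures.add(pool.submit(() -> mapPhase(d, terms)));
            List<Map<String, Long>> parts = new ArrayList<>();
            for (Future<Map<String, Long>> f : futures) {
                try {
                    parts.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return reducePhase(parts);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "The canny lad was canny indeed",
            "A timid reply to a canny question");
        System.out.println(search(docs, Set.of("canny", "timid")));
        // {canny=3, timid=1}
    }
}
```

The Grid version distributes the map tasks across NGS nodes via the WMS rather than across local threads, but the split-count-merge structure is the same.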

4.3.2. Job status monitoring and output retrieval

The job-status-monitoring-service is started as soon as a job is submitted to the Grid. The job-status-monitoring-service fetches the job-id of the job from a database and continually monitors the status of the job. When the job has finished executing, the output is copied back to the local server and the status of the job is updated in the database. The location of the output files is also inserted in the database. Once the output of a job is ready, a download link is provided for the user to download their job output. The job output itself is based on an XML format. Fig. 7 shows the flow of activities through the system.
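The monitoring loop just described can be sketched as follows. The real service reads job-ids from a database and queries the WMS; here a supplied function stands in for the WMS status call, and all names and states are illustrative assumptions.

```java
import java.util.*;
import java.util.function.*;

// Sketch of the polling loop behind a job-status-monitoring service.
public class JobMonitor {

    enum Status { SUBMITTED, RUNNING, DONE, FAILED }

    // Poll until the job reaches a terminal state or the attempt
    // budget runs out; returns the last status observed.
    static Status waitFor(String jobId,
                          Function<String, Status> wmsQuery,
                          int maxPolls) {
        Status s = Status.SUBMITTED;
        for (int i = 0; i < maxPolls; i++) {
            s = wmsQuery.apply(jobId);
            if (s == Status.DONE || s == Status.FAILED)
                return s;   // terminal: copy output, update database
            // In production: sleep between polls and persist status.
        }
        return s;
    }

    public static void main(String[] args) {
        // Simulated WMS: the job finishes on the third poll.
        Iterator<Status> script = List.of(
            Status.SUBMITTED, Status.RUNNING, Status.DONE).iterator();
        System.out.println(waitFor("job-42", id -> script.next(), 10));
    }
}
```

On a DONE result the real service records the output file location in the database so the portal can expose a download link, as described above.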

4.4. Issues

In the realization of this system, numerous issues and challenges have arisen, which we describe here. VOMS proxy credentials are generated and typically remain valid for a period of 12 h. This means that if a job is going to take more than 12 h to run, the proxy credentials will expire and the job will not complete. One solution to this problem is to regenerate the proxy credentials if they are close to expiring while the job is still running. In order to implement this solution, the MyProxy username and password of the user need to be saved in an encrypted format in a database for the subsequent regeneration of the proxy credentials. An alternative to this, of course, is to create a proxy credential with a much longer lifetime than required. This is a security danger, however, and has not been adopted. It is noted that software from the Proxy Credential Auditing project (PCA, www.nesc.ac.uk/hub/projects/pca) is exploiting case studies from the ENROLLER project to address precisely these kinds of issues.
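The renewal decision described above reduces to simple time arithmetic: regenerate when a still-running job's credential is within some safety margin of its 12-hour expiry. The 30-minute margin and all names below are assumptions for this sketch; the actual regeneration would go back through MyProxy, which is out of scope here.

```java
import java.time.*;

// Illustrative check for the proxy-renewal strategy: renew when a
// running job's credential is within a safety margin of expiry.
public class ProxyRenewal {

    static final Duration LIFETIME = Duration.ofHours(12);   // VOMS default
    static final Duration MARGIN = Duration.ofMinutes(30);   // assumed margin

    // True when the credential issued at `issuedAt` should be renewed
    // as of `now`, because the job is still running near expiry.
    static boolean needsRenewal(Instant issuedAt, Instant now,
                                boolean jobStillRunning) {
        Instant expiry = issuedAt.plus(LIFETIME);
        return jobStillRunning && now.isAfter(expiry.minus(MARGIN));
    }

    public static void main(String[] args) {
        Instant issued = Instant.parse("2012-01-01T00:00:00Z");
        // 11 h 45 min after issue: inside the 30-minute margin.
        System.out.println(needsRenewal(
            issued, issued.plus(Duration.ofMinutes(705)), true));  // true
    }
}
```

A monitoring thread could evaluate this predicate on each poll of a long-running job and trigger the MyProxy round-trip only when it fires.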

5. Implementation

In the course of the ENROLLER project, we have largely adopted an agile approach to software development based on rapid

Page 7: Towards a virtual research environment for language and literature researchers

M.S. Sarwar et al. / Future Generation Computer Systems 29 (2013) 549–559 555

Fig. 7. Grid job status monitoring and output retrieval process.

Fig. 8. Basic search interface.

prototyping and component-based software engineering for the current project artefacts. Maven has been used to manage the lifecycle of the project. For the user interface and VRE itself we have adopted the Liferay portal [19]. Liferay provides a platform for creating both the user interface components of the VRE and tools to support the back-end provisioning and support of services. Liferay itself is a JSR286 [20] compliant platform, which makes it a strong candidate for creating and deploying JSR286-compliant portlets. Ajax [21] has also been used to develop interactive Web 2.0 compliant user interfaces. All of the communication is done over HTTPS to keep the flow of information secure. Fig. 8 shows the current (basic) user interface of the search system. A more advanced Grid search interface is also available that supports larger scale searches, including uploading of files with relevant search terms.

In this interface, the design has deliberately been to offer a Google-like look and feel. The users simply enter the terms they are interested in and the resources they wish the search to run over (through selecting the appropriate boxes). More complex interfaces have also been developed, e.g. to narrow the time period over which the user is interested: only searching for terms from the 18th century, for example. The majority of the end users exploit the basic search capability, however.

As mentioned previously, the participating data collections are heterogeneous in nature; therefore devising an identical parsing and indexing algorithm for all resources is not possible. The StAX API [22] has been used to parse the XML documents. The JDBC API [23] is used to interact with relational databases. The Lucene API [24] has also been chosen, to index the parsed data. Lucene was selected due to its flexible and powerful indexing and searching



Table 1
Time to completion using variable numbers of threads and search terms.

Number of threads    Time to completion (s)    Number of search terms
1                    93                        25
4                    24.5                      25
1                    159.5                     100
1                    1814                      1000
2                    419.444                   1000
4                    245.5                     1000
8                    154                       1000
16                   177.22                    1000
32                   177.4                     1000
64                   171.23                    1000
128                  132.44                    1000

Fig. 9. Performance analysis.

capabilities. Moreover, since the advanced Grid-based searches require data to be placed over the Grid, Lucene-based indexes can be archived and copied over the Grid easily using GridFTP [25]. This practice results in using the same index on local servers and on the Grid, and produces identical search results. When simple searches are performed, indexes placed on local servers are searched, and when a Grid-based search or workflow execution is invoked, indexes distributed over the Grid are used.
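The parsing stage mentioned above can be sketched with the JDK's own StAX implementation. This is a toy example: the `<entry>`/`<headword>` element names are invented for illustration, whereas each real collection has its own schema and therefore its own parser.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of StAX-based parsing, pulling headwords out of a
// dictionary-like XML document. Element names are hypothetical.
public class DictionaryParser {
    public static List<String> headwords(String xml) {
        List<String> words = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && r.getLocalName().equals("headword")) {
                    // getElementText() returns the element's text content.
                    words.add(r.getElementText().trim());
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return words;
    }
}
```

The extracted fields would then be written to a Lucene index; the streaming (pull) style of StAX keeps memory use low even for large corpus files.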

A distributed search application based upon Google's MapReduce algorithm has been written and exploited over the Grid for larger scale searches. This is a multi-threaded application and is responsible for carrying out searches across the data collections available on the Grid. Search results are saved in a file in XML format. The actual job submission service itself uses the Globus Toolkit (GT) [26], a necessary requirement when interacting with facilities such as the NGS.
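The core of such a multi-threaded MapReduce-style search can be sketched with a standard thread pool: the map phase partitions the search terms across worker threads, each of which looks its terms up against an index, and the reduce phase merges the per-worker results. In this sketch the Lucene index is replaced by a simple in-memory map, and all names are illustrative rather than the actual ENROLLER code.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a MapReduce-style bulk search: map = per-chunk term lookup
// on worker threads; reduce = merging the partial result maps.
public class BulkSearch {
    public static Map<String, List<String>> search(
            Map<String, List<String>> index, List<String> terms, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, List<String>>>> partials = new ArrayList<>();
            int chunk = (terms.size() + threads - 1) / threads;
            for (int i = 0; i < terms.size(); i += chunk) {
                final List<String> part =
                        terms.subList(i, Math.min(i + chunk, terms.size()));
                partials.add(pool.submit(() -> {        // map task
                    Map<String, List<String>> hits = new HashMap<>();
                    for (String t : part) {
                        hits.put(t, index.getOrDefault(t, List.of()));
                    }
                    return hits;
                }));
            }
            Map<String, List<String>> merged = new HashMap<>();  // reduce
            for (Future<Map<String, List<String>>> f : partials) {
                merged.putAll(f.get());
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In the deployed application each map task searches a distributed Lucene index rather than a map, and the merged results are serialized to the XML output file mentioned above.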

Employing MapReduce showed significant improvements in the search application's performance. Performance tests were conducted using the ScotGrid [27] high performance computing facility of the NGS, focusing on assessment of scalability and throughput for a variable number of threads and search terms. Table 1 and Fig. 9 show the time taken to perform the search using a variable number of threads for 1000 search terms. It was noted that a 1000-word serial search took 1814 s to complete, while the same search took 132.44 s when using 128 threads, a speed-up of approximately 13.7 times. A detailed account of the performance results is given in [28].

Fig. 10 shows the results of executing a search for the term ''canny'' and the associated results returned from the SCOTS, HTE, NECTE and DOST resources.

The results from the HTE are shown in more detail in Fig. 11. In particular, this shows how multiple variant meanings of the term blue have been used throughout the centuries (as a colour descriptor, for drunkenness, for sadness, etc.) along with the periods of usage for each variant of the term. The synonyms of blue when used to mean drunk are shown in Fig. 11 along with the category, subcategory, time and period of usage.

When a user submits a Grid-based search request (through the advanced search portlet), the search terms are extracted from an uploaded CSV file. A Globus-based Grid-job-submission service is then initiated to run the job on the Grid. The JSDL contents are generated using functions from the jLite API [29] library. The interface to the advanced Grid search is very similar to the basic search given above, with the additional option to upload a file of terms. It is noted that although it was a clear requirement that end users should not have to deal with the intricacies of running jobs on the Grid, they are interested in seeing how their searches run on multiple distributed resources. A job status and monitoring portlet has been developed for this purpose, as shown in Fig. 12.
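To give a flavour of what the job description looks like, the fragment below hand-builds a schematic JSDL document of the kind jLite produces on the project's behalf. The executable path and argument are invented placeholders; only the JSDL and JSDL-POSIX namespaces are taken from the GFD.56 specification [18].

```java
// Schematic generation of a JSDL job description. ENROLLER generates
// this through the jLite API; this hand-built fragment only
// illustrates the shape of the document. Paths are hypothetical.
public class JsdlBuilder {
    public static String describeJob(String executable, String termsFile) {
        return "<jsdl:JobDefinition"
             + " xmlns:jsdl=\"http://schemas.ggf.org/jsdl/2005/11/jsdl\""
             + " xmlns:posix=\"http://schemas.ggf.org/jsdl/2005/11/jsdl-posix\">"
             + "<jsdl:JobDescription><jsdl:Application>"
             + "<posix:POSIXApplication>"
             + "<posix:Executable>" + executable + "</posix:Executable>"
             + "<posix:Argument>" + termsFile + "</posix:Argument>"
             + "</posix:POSIXApplication>"
             + "</jsdl:Application></jsdl:JobDescription>"
             + "</jsdl:JobDefinition>";
    }
}
```

The real descriptions additionally carry data-staging elements for the term file and the XML output, which the WMS uses when scheduling the job.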

6. Use cases

This section presents some typical scenarios to illustrate the interaction between the various kinds of services and how the end users have been using the system.

6.1. Login

A user who wants to use the system accesses the ENROLLER portal via a web browser. Upon reaching the portal, they are redirected to the UK Access Management Federation Where-Are-You-From (WAYF) service, where they are prompted for their home institution. Once the user has selected their home institution, they are redirected to the institution's login page. After successfully providing their username and password for authentication, they are redirected to the ENROLLER portal, where a signed Security Assertion Markup Language (SAML) assertion is used to allow them access and to build up the portal session. At this point the user's authorization attributes (encoded as part of the eduPersonEntitlement attribute) are loaded into the portal and used to configure the portal contents, i.e. the portlets they are allowed to see/invoke.
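The last step, configuring the portal from the user's entitlements, amounts to a lookup from attribute values to permitted portlets. The sketch below illustrates the idea; the entitlement URNs and portlet names are entirely invented, not the values used by the ENROLLER federation.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of attribute-driven portal configuration: the portlets shown
// depend on the eduPersonEntitlement values in the SAML assertion.
// All URNs and portlet names here are hypothetical.
public class PortletAuthorizer {
    private static final Map<String, List<String>> PORTLETS_BY_ENTITLEMENT =
            Map.of(
                "urn:mace:example:enroller:basic",
                    List.of("basic-search"),
                "urn:mace:example:enroller:grid",
                    List.of("advanced-search", "job-monitor"));

    public static Set<String> visiblePortlets(Set<String> entitlements) {
        Set<String> visible = new TreeSet<>();
        for (String e : entitlements) {
            visible.addAll(PORTLETS_BY_ENTITLEMENT.getOrDefault(e, List.of()));
        }
        return visible;
    }
}
```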

6.2. Simple and cross-collection searches

Users exploit the ENROLLER portal's basic search interface to perform single-word, multiple-word and phrase searches across any number of available collections. The search queries are made against the indexed data collections and the results ultimately returned to the portal. Users are able to download the results as CSV files for their own local use.

The results of a simple search can be further used as input to cross-collection searches. For example, searching for the word 'excellence' in the SND and in the HTE produces lists of synonyms. These lists of synonyms can be further used as input to searches of the SCOTS corpus, the NECTE corpus and the Hansard collection. This is the typical scenario used to feed the bulk search service. As before, data can be downloaded locally by researchers and used with their own local analysis tools, e.g. tools used for variant word spellings.

6.3. Bulk search

If a researcher wants to search for tens or hundreds of words at once, instead of typing all the words into the query box they can upload a file of search terms. At present this has to be in CSV format. The system will automatically extract the words from this file and search them against the selected collections. Bulk searches are supported over the Grid. In this case, the indexed data are distributed over the Grid for rapid searching. As noted, only non-licenced data sets can be distributed and used in this way.
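The term-extraction step can be sketched in a few lines. This is a minimal illustration; the real upload tool's exact CSV conventions (quoting, encodings) may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of extracting search terms from an uploaded CSV file before
// the bulk search is submitted to the Grid. Conventions assumed:
// comma-separated fields, one or more terms per line, no quoting.
public class TermExtractor {
    public static List<String> extract(String csv) {
        List<String> terms = new ArrayList<>();
        for (String line : csv.split("\\R")) {   // \R matches any line break
            for (String field : line.split(",")) {
                String term = field.trim();
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
        }
        return terms;
    }
}
```

The resulting term list is what the MapReduce search application partitions across its worker threads.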



Fig. 10. Results from basic searching using ENROLLER VRE.

Fig. 11. HTE results for the term blue when used to describe drunkenness. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)



Fig. 12. Job status tracking.

6.4. Workflows

At present the ENROLLER system provides a workbench where workflows can be manually driven. To explain this, consider the typical scenario where a researcher decides to search for all the relevant entries of the word 'timid' in the HTE and then wants to cross-search the results from the HTE in the Scottish Corpus to find all the documents that match against each of the input words. They may also wish to find all the concordances for each of the words. Augmenting this scenario to incorporate cycles of interactions, where thesaurus-search results, corpus-search results and concordances for the words are used to define and shape future searches, is a key scenario. At each stage the user is able to download the data locally, manipulate it and/or process it to help shape their future searches.

It may well be the case that the definition and enactment of such scenarios could exploit established workflow environments and associated tools. However, at present the user community has adopted the solutions put forward and is not yet requesting such enhanced capabilities.

6.5. Data deposition and automated indexing

The ENROLLER VRE facilitates the deposition of language and literature data collections from individuals or organizations. Currently, users of the VRE can upload data in plain text format of up to 50 MB in size by using the ENROLLER portal's upload tool. Upon successful upload, the collection is marked as available for indexing by the automated indexing engine, which is currently under development.

7. Feedback from the community

The project has been specifically organized to be community driven. Email lists for networks of scholars in this field have been established and are used for updates and community feedback. The project also developed and rolled out a wiki as part of the VRE; however, we have found this has made less of an impact than originally hoped/expected.

As part of the work itself, two colloquia have been organized, one in April 2010 and the other in February 2011. At each of these, over 30 academics and researchers from various institutions around the UK and Europe participated, were shown the system and were subsequently allowed to drive the system according to their own research needs and requirements. The overall response from the community has been extremely encouraging, and all users were able to run large-scale searches and undertake research that they could not easily do otherwise, i.e. without Internet-based hopping from resource to resource. Participants gave numerous useful comments and suggestions for the further development of services and the sustainability of this infrastructure in the longer run. We note that this user community also included the data providers themselves. Among those data providers, VARIENG, the Finnish Center of Excellence for the Study of Variation, Contacts and Change in English (http://www.helsinki.fi/varieng/), expressed their desire to deposit the Helsinki Corpus [30] into the ENROLLER project. We believe that the success of such efforts demands an inclusive model to help shape the resources and capabilities.

8. Conclusions and future work

Through the ENROLLER work, researchers in the language and literature domain now have access to large amounts of language and literature data from a single, easy-to-use portal; membership of an international network of scholars; increased knowledge of digital resources; and direct access to a portfolio of analysis tools.

The work is very much ongoing, however, and numerous other challenges remain to be addressed. These include the development of enhanced data playgrounds where researchers can run queries and generate results that can subsequently be used by others as part of their own research, or kept longer term for future usage. Data provenance is a key requirement that this community is keen on: knowing that they are dealing with accurate historical resources and results from those resources.

Automated indexing of deposited data collections is a further item of work that we are currently looking to support. We note that there are a huge number of researchers who have historically significant digital resources with no place to maintain them long term. The Helsinki Corpus, which is maintained by the University of Helsinki, is one such example; they have expressed their desire to include this resource in the ENROLLER pool of searchable resources. Further work has also recently been funded (by JISC) looking at extensions to the existing system to include extended versions of the Hansard Parliamentary speeches, Scottish words and place-names.

More information on the project as a whole is available at www.gla.ac.uk/enroller, with the VRE itself available at https://enroller.nesc.gla.ac.uk.

Acknowledgements

We gratefully acknowledge funding from JISC for the work described in this paper. We also wish to thank all project partners for their input, including Jean Anderson, Johanna Green and Marc Alexander. We also thank the network of scholars for their help and support in shaping the ENROLLER VRE.

References

[1] Oxford English Dictionary. Available http://www.oed.com.
[2] Scottish National Dictionary. Available http://www.dsl.ac.uk.
[3] Dictionary of the Older Scottish Tongue. Available http://www.celtscot.ed.ac.uk/dost/.
[4] The Historical Thesaurus of English. Available http://libra.englang.arts.gla.ac.uk/WebThesHTML/homepage.html.
[5] The Scottish Corpus. Available http://www.scottishcorpus.ac.uk/.
[6] Newcastle Electronic Corpus of Tyneside English. Available http://research.ncl.ac.uk/necte/.
[7] ENROLLER project. http://www.gla.ac.uk/enroller/.
[8] VRE-SDM project. http://bvreh.humanities.ox.ac.uk/VRE-SDM.
[9] Heike Neuroth, Felix Lohmeier, Kathleen Marie Smith, TextGrid—virtual research environment for the humanities, The International Journal of Digital Curation 6 (2) (2011) (Proceedings of the 6th International Digital Curation Conference, Chicago, USA, December 2010).
[10] T. Blanke, M. Hedges, Humanities e-science: from systematic investigations to institutional infrastructures, in: 2010 IEEE Sixth International Conference on e-Science, 7–10 December 2010, pp. 25–32. http://dx.doi.org/10.1109/eScience.2010.34.
[11] T. Blanke, L. Candela, M. Hedges, M. Priddy, F. Simeoni, Deploying general-purpose virtual research environments for humanities research, Philosophical Transactions of the Royal Society A 368 (2010) 3813–3828.
[12] The Internet2 Shibboleth framework. http://shibboleth.internet2.edu.
[13] J. Watt, R.O. Sinnott, T. Doherty, J. Jiang, Portal-based access to advanced security infrastructures, in: UK e-Science All Hands Meeting Conference, Edinburgh, September 2008.
[14] R. Alfieri, R. Cecchini, V. Ciaschini, L. dell'Agnello, A. Frohner, A. Gianoli, K. Lörentey, F. Spataro, VOMS, an authorization system for virtual organizations, in: Lecture Notes in Computer Science, vol. 2970, 2004, pp. 33–40. http://dx.doi.org/10.1007/978-3-540-24689-3_5.
[15] Workload Management System. http://glite.web.cern.ch/glite/wms/.
[16] Jeffrey Dean, Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (2008) 107–113.
[17] MyProxy. Available http://grid.ncsa.illinois.edu/myproxy/.
[18] Ali Anjomshoaa, Fred Brisard, Michel Drescher, Donal Fellows, An Ly, Stephen McGough, Darren Pulsipher, Andreas Savva, Job Submission Description Language (JSDL) specification, version 1.0, 2005. Available http://www.ogf.org/documents/GFD.56.pdf.
[19] Liferay portal. Available http://www.liferay.com/.
[20] JSR286. Available http://jcp.org/en/jsr/detail?id=286.
[21] Ajax. Available http://www.ajax.org/#home.
[22] StAX API. Available http://stax.codehaus.org/Home.
[23] Java Database Connectivity (JDBC) API. Available http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136101.html.
[24] Lucene API. Available http://lucene.apache.org/java/docs/index.html.
[25] I. Mandrichenko, W. Allcock, T. Perelmutov, GridFTP v2 protocol description, 2005. www.ogf.org/documents/GFD.47.pdf.
[26] Globus toolkit. Available http://www.globus.org/toolkit.
[27] Scottish grid service. http://www.scotgrid.ac.uk/overview.html.
[28] Muhammad S. Sarwar, M. Alexander, J. Anderson, J. Green, Richard O. Sinnott, Implementing MapReduce over language and literature data over the UK National Grid Service, in: 2011 7th International Conference on Emerging Technologies (ICET), 5–6 September 2011, pp. 1–6. http://dx.doi.org/10.1109/ICET.2011.6048475.
[29] jLite. Available http://code.google.com/p/jlite/.
[30] Helsinki Corpus. Available http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/.

Muhammad S. Sarwar (Sulman) is a Research Associate at the UK National e-Science Centre (NeSC) at the University of Glasgow. He is involved with the ENROLLER project. Earlier he worked for Carrier Telephone Industries Pvt. Ltd., KPSoft Ltd. and ITI Life Sciences as a Developer and Software Engineer. He holds an M.Sc. in High Performance Computing from the University of Edinburgh and an M.Sc. in Computer Sciences from Quaid-i-Azam University, Pakistan. His interests are HPC, Grid computing, parallel and distributed software development, information retrieval and related technologies.

T. Doherty is a Research Assistant at the National e-Science Centre at the University of Glasgow. He has specialized in fine-grained privilege management solutions for Grid middleware and Grid portlets, including semantic technologies.

He has developed a suite of Grid portlets to provide a data analysis environment for social science research. He is currently providing a complete e-infrastructure solution for a social population simulator to run on the UK National Grid Service. His other work includes ATLAS code distribution and TAG Skimming (used for event-metadata distributed analysis), web application development for AMI (ATLAS Metadata Interface) and development for ATLFAST (ATLAS Fast Simulator).

J. Watt is a Research Associate and Technical Director at the National e-Science Centre at the University of Glasgow. Since 2002 he has worked on many projects on access control, user management and security for UK e-Research, including DyVOSE, nanoCMOS, EuroDSD and GLASS, specializing in implementing Privilege Management Infrastructures using software such as PERMIS and Shibboleth. John has also authored web portlets to streamline the operation of these infrastructures in the OMII-UK SPAM-GP project. He also helped create the first Masters-level courses in Grid Computing available at a UK university in 2004, in collaboration with Glasgow's Department of Computing Science.

Richard O. Sinnott is the eResearch Director at the University of Melbourne. He has a Ph.D. in Computing Science, an M.Sc. in Software Engineering and a B.Sc. in Theoretical Physics. His research interests are broad, and he has published over 140 papers across a wide array of computer science research areas. Of late he has focused on the challenges in supporting collaborations, especially those where finer-grained security is required. He has been a principal/co-investigator on an extensive portfolio of research projects in a wide variety of research domains, from the clinical, biomedical, biological, engineering, social and geospatial sciences through to the arts and humanities.