P2P-based resource discovery in dynamic grids allowing multi-attribute and range queries



Parallel Computing 39 (2013) 615–637


0167-8191/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.parco.2013.08.003

⇑ Corresponding author. Address: ETSI Informática, UNED, C/Juan del Rosal, 16, 28040 Madrid, Spain. Tel.: +34 91 398 9468; fax: +34 91 398 9383.
E-mail addresses: [email protected] (A.C. Caminero), [email protected] (A. Robles-Gómez), [email protected] (S. Ros), roberto@scc.uned.es (R. Hernández), [email protected] (L. Tobarra).

Agustín C. Caminero ⇑, Antonio Robles-Gómez, Salvador Ros, Roberto Hernández, Llanos Tobarra
Dpto. de Sistemas de Comunicación y Control, Universidad Nacional de Educación a Distancia, Madrid, Spain

Article info

Article history:
Received 9 November 2012
Received in revised form 17 June 2013
Accepted 6 August 2013
Available online 23 August 2013

Keywords:
Resource discovery
Grids
Summarization
Routing indices
Scalability

Abstract

A key point for the efficient use of large grid systems is the discovery of resources, and this task becomes more complicated as the size of the system grows. In this case, large amounts of information on the available resources must be stored and kept up-to-date throughout the system so that it can be queried by users to find resources meeting specific requirements (e.g. a given operating system or available memory). Thus, three tasks must be performed: (1) information on resources must be gathered and processed, (2) such processed information has to be disseminated over the system, and (3) upon users' requests, the system must be able to discover resources meeting some requirements using the processed information. This paper presents a new technique for the discovery of resources in grids which can be used in the case of multi-attribute (e.g. {OS = Linux & memory = 4 GB}) and range queries (e.g. {50 GB < disk-space < 100 GB}). This technique relies on content summarization to perform the first task and addresses the main drawback found in proposals from the literature that use summarization. This drawback is related to scalability, and is tackled by using Peer-to-Peer (P2P) techniques, namely Routing Indices (RIs), to perform the second and third tasks.

Another contribution of this work is a performance evaluation conducted by means of simulations of the EU DataGRID Testbed, which shows the usefulness of this approach compared to other proposals from the literature. More specifically, the technique presented in this paper improves scalability and produces good performance. In addition, the parameters involved in the summary creation have been tuned and the most suitable values for the presented test case have been found.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Grids are highly heterogeneous and variable systems which allow the coordinated use of computer resources located all over the world [1]. One of their main characteristics is their large size: examples include the Worldwide Large Hadron Collider Computing Grid [2], which has around 170 computing centers in 36 countries, and the European Grid Community (EGI) [3], with 330 resource centers and 399,300 cores in more than 56 countries. Their large size is one of their main advantages (since they provide a huge computing power that is of real interest for the research community), but it also makes such large, distributed systems difficult to manage.


The size of the information needed to describe a real-world grid system such as the aforementioned ones is difficult to obtain without direct access to the actual grid. The reason is that this size depends on several aspects of the actual implementation of the grid, among others: the way machines are grouped (e.g. stand-alone, or forming clusters of machines), the software used to monitor the resources (e.g. Ganglia [4], Nagios [5]), and the software used to implement the grid (e.g. Globus [6], Advanced Resource Connector (ARC) [7]). For example, the size of the information of one desktop computer with two cores used in the experiments presented in [8] (which is running Globus and Ganglia) is around 11 KB, and it is the same as for the cluster with 88 cores used in the same work. Only the figures would differ: for example, the parameter FreeCPUs would show 2 for the desktop computer and 88 for the cluster.

Since some of this information is not publicly available (especially the number of clusters or machines the grid is made of), only rough estimates can be calculated. Considering this, if we assume that the aforementioned EGI uses machines with 2 cores each and the same software configuration as [8], the size of the information would be ca. 2 GB. If this information has to be stored at each of the 330 resource centers EGI is made of (similarly to [9]), then the total size of the information in the EGI would be ca. 724 GB. This is only an estimate and may differ from the actual figure.
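The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes decimal units (1 GB = 10^6 KB), which the text does not state, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the size estimate given in the text.
# Assumption: decimal units (1 GB = 10**6 KB); the paper does not say
# which convention it uses.
KB_PER_GB = 10**6

cores = 399_300          # EGI cores, from the Introduction
cores_per_machine = 2    # assumption stated in the text
kb_per_machine = 11      # description size of one machine, as in [8]
resource_centers = 330   # EGI resource centers

machines = cores // cores_per_machine
single_copy_gb = machines * kb_per_machine / KB_PER_GB
replicated_gb = single_copy_gb * resource_centers

print(f"machines: {machines}")
print(f"one copy: {single_copy_gb:.2f} GB")
print(f"one copy at every resource center: {replicated_gb:.1f} GB")
```

Under these assumptions one copy comes to roughly 2.2 GB and the fully replicated total to roughly 724.7 GB, which is in line with the "ca. 2 GB" and "ca. 724 GB" figures quoted above.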

One of the tasks directly affected by the size of the system is the discovery of resources, which deals with finding computing resources meeting a set of requirements, both static and dynamic, such as a given operating system or available amount of memory. For high numbers of available resources (as in the aforementioned examples), the information on resources may be so large that specific techniques must be developed to accomplish this task.

Several resource discovery techniques such as [10,9] use content summarization in order to reduce the amount of information on resources (but not its quality) used in the discovery process. In them, all the domains in the grid share their local summaries with each other (that is, a copy of the summary of a domain is kept at all the other domains in the system), which may yield scalability issues related to the amount of information to keep at each domain and the effort required to search it, along with the costs of storing, disseminating and updating this information. For example, in [9], the summaries of all the domains in the system add up to 120 GB, and they have to be spread over the 1000 domains their simulated system is made of (which increases the total amount of information existing in the system), kept up-to-date and processed in the search for resources. Moreover, all the domains of the system must be updated every time the summary of one of them changes, and this must be done in a timely manner; otherwise discovery decisions would be made based on obsolete information and would yield wrong results. Our technique significantly reduces the amount of information spread through the system and the number of domains affected by changes in order to improve the scalability of the system.

In this paper we propose a new technique to perform resource discovery in grids based on Peer-to-Peer (P2P). This technique can perform multi-attribute queries and range queries for numerical attributes. Recall that a multi-attribute query is a query asking for resources with more than one ⟨attribute, value⟩ pair, for example {OS = Linux & memory = 4 GB}, and range queries are queries asking for resources whose features are in a range of values (e.g. {50 GB < disk-space < 100 GB}).
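To make the two query types concrete, the following is a minimal sketch of how such a query could be represented and checked against a single resource description. The `Query` class, its field names and the attribute names are assumptions made for illustration; they are not the data structures used in the paper, and the sketch combines the two example queries into one.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple, Union

Value = Union[str, float]


@dataclass
class Query:
    """A conjunction of equality constraints and open-interval ranges."""
    equals: Dict[str, Value] = field(default_factory=dict)
    # attribute -> (lower, upper); None means unbounded on that side
    ranges: Dict[str, Tuple[Optional[float], Optional[float]]] = field(
        default_factory=dict
    )

    def matches(self, resource: Dict[str, Value]) -> bool:
        # Multi-attribute part: every <attribute, value> pair must hold.
        for attr, wanted in self.equals.items():
            if resource.get(attr) != wanted:
                return False
        # Range part: strict bounds, as in {50 GB < disk-space < 100 GB}.
        for attr, (low, high) in self.ranges.items():
            value = resource.get(attr)
            if not isinstance(value, (int, float)):
                return False
            if low is not None and not value > low:
                return False
            if high is not None and not value < high:
                return False
        return True


query = Query(
    equals={"OS": "Linux", "memory_gb": 4},
    ranges={"disk_space_gb": (50, 100)},
)
node_a = {"OS": "Linux", "memory_gb": 4, "disk_space_gb": 80}
node_b = {"OS": "Linux", "memory_gb": 4, "disk_space_gb": 120}
print(query.matches(node_a), query.matches(node_b))
```

Here `node_a` satisfies both parts of the query, while `node_b` fails the upper bound on disk space.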

This technique uses information summarization, and extends proposals from the literature using summarization, such as [10,9], because this work (1) provides an efficient way to disseminate and query summarized information over the system based on a P2P technique, namely Routing Indices (RIs) [11], (2) adapts the summarization technique to the RIs by creating different types of summaries (called n-level summaries), and (3) presents a metric (called the goodness function) needed by RIs to guide the query process. Furthermore, this paper presents a performance evaluation based on the EU DataGRID Testbed [12] that shows the better scalability and good performance of the proposed technique compared to proposals from the literature. The EU DataGRID Testbed has been used in this evaluation because it is a real testbed widely used in research, including simulations [13,10,14–20].

This paper is structured as follows: Section 2 reviews related work in the fields of resource discovery in grids and content summarization. Section 3 describes a scenario for resource discovery in a grid system. Section 4 presents our resource discovery proposal, including the summarization technique, the dissemination and search technique, and the way they are combined. Section 5 presents the simulation experiments based on the EU DataGRID Testbed [12] conducted to show the usefulness of our work. Finally, Section 6 concludes the paper and presents guidelines for future work.

2. Related work

Several systems have been developed for information discovery in grids over the years; some of them are reviewed in [21]. One of the most popular is the Globus Monitoring and Discovery System (MDS) [22]. MDS allows users to discover what resources are considered part of a Virtual Organization (VO) and to monitor those resources. MDS has scalability issues, which this proposal tackles by (1) using summaries rather than actual information to reduce the amount of information sent through the system, and (2) using RIs to organize the dissemination of summaries and the search for information.

Several techniques use bitmap trees to represent the availability of resources and to perform the resource discovery, among others [23,24]. In these techniques, domains in the grid keep several bitmaps that summarize the features of resources in the local domain and in the children nodes of the tree. In order to use the system, users' queries must be translated into query bitmaps, which are sent between the nodes of the system in the search for resources. This approach requires all the domains to share the same bitmap structure in order to work properly, which means that all the domains must share the same attributes (such as operating system or available memory), the same values for each attribute (such as Linux, XP, Vista, MacOS), and the same positions in bitmaps for each value (for instance, the first position in the operating system bitmap means Linux, the second means XP, and so on). Besides, each attribute has an independent bitmap, which makes multi-attribute queries more complicated. Our technique, presented in Section 4, does not require such tight coordination between domains, because each domain can decide which parameters it wants to propagate to the other domains. Besides, this technique can handle multi-attribute queries easily since all the attributes are handled together.

Sun et al. [25] present a Resource Category Tree (RCT) to perform discovery of resources based on numerical primary attributes (PAs) of resources, where each node of the tree represents a specific range of PA values. Similarly, Lee et al. [26] present a technique to handle multi-core machines in desktop grids. This technique handles resource availability based on a limited set of numerical parameters, such as CPU speed, number of cores, and memory availability. As opposed to these works, our proposal has been designed to perform searches based on any kind of attributes (numerical and text), and range queries for numerical attributes.

Kim et al. [27] present a technique that allows searches for numerical and text attributes including range queries, and is based on a Content Addressable Network (CAN). As in our proposal, each peer does not forward its information to all the peers, only to a subset of them. Its main drawback is that it does not allow both lower and upper bounds on range queries; only lower bounds are allowed. For example, a query like {50 GB < disk-space < 100 GB} is not allowed, although a query like {50 GB < disk-space} is. Our work extends this by providing a technique that allows such queries, so that the requirements of jobs can be defined more precisely. This in turn leads to better utilization of resources and more versatility for the job specifications.

Some approaches to resource discovery are based on a domain ontology in order to find concepts that are related to each other and separated by a semantic distance (e.g. a user searching for a computing resource whose operating system is Linux may also be interested in a computing resource whose operating system is Unix). Examples include [28–30].

The work presented here has two main building blocks, namely summarization and RIs. The use of summaries has already been studied for resource discovery in grids in [10]. That work presents the use of summaries based on Cobweb [31] for resource discovery in grids, and has a clear drawback: all the domains in the system must have a copy of the summaries of all the other domains in order to proceed with the discovery process, thus limiting the scalability of the proposal. This work has been extended in [9], which includes network-aware algorithms to perform resource discovery, but the drawback remains. If all the domains in the system must access the summaries of all the other domains in the system, this may affect scalability when the number of domains in the system becomes too large or when information on resources changes, even if summarized information is used. Our technique tackles this issue by (1) using RIs to organize the way summaries are spread through the system and how the search process is conducted, (2) creating different types of summaries (called n-level summaries), and (3) using our goodness function to guide the search process.

Several approaches have been proposed for the management of summaries in P2P systems. The technique proposed by Hayek et al. [32] is also based on Cobweb, but the summarization used in our work (presented in [10,9]) performs a more in-depth summarization with three different steps (pruning, leafing and filtering, as explained in Section 4.1). Michel et al. [33] propose a technique based on hash sketches in which each (different) key is summarized independently, as opposed to ours, which aggregates all the keys together. Each key also has a different directory peer entrusted with the aggregation of all the possible values for that key across the whole network, which leads to fault tolerance and performance problems, along with difficulties in multi-attribute queries. Like Michel et al., Cardosa et al. [34] create resource bundles that aggregate each key independently. These are based on statistical data, rather than current data. Considering statistical information to create summaries is also an interesting direction for future work.

A number of approaches for summarization using ontologies have been proposed, among others [35–37], which is not the case for our proposal. Apart from information discovery in distributed systems, the use of summaries has been studied in other fields of research, for instance, database management [38], video coding [39–41], or visualization of web pages [42]. These applications demonstrate the usefulness of reducing the amount of information (but not the quality of such information) when performing several tasks.

RIs are a widely used P2P technique, initially presented in [11] and used to improve different aspects of grid computing in [43,44,14], among others. In [43] a simple grid information service is implemented, but this only keeps information on the number of processors and amount of memory of the computing resources at each domain. In [44], the authors propose a technique for resource discovery based on RIs that use bit vectors to summarize the information of resources. In contrast, we use summaries with semantics, which can be easily modified with parameters. In [14], a grid meta-scheduler is presented that takes into account computing and network capabilities to make meta-scheduling decisions, but again it only considers numerical attributes. The work presented in this paper improves on them by allowing the discovery of any kind of information, both numerical and text. Also, our system supports multi-attribute and range queries, which makes it suitable for real grids.

Apart from RIs, another widely used technique for information dissemination and search in P2P is Distributed Hash Tables (DHT) [45], but the structure of a DHT is very tight, which is not suitable for dynamic data [44].

3. Scenario for resource discovery in grids

The technique presented in this work provides resource discovery in grids by means of applying P2P techniques. In a grid, a number of administrative domains are connected with each other through the Internet. Each administrative domain (for instance, a university or research center) has several entities (as depicted in Fig. 1 [14]), namely users, computing resources, a broker (such as [8]), a Grid Information Service or GIS (for instance MDS [22]), and a resource monitor (such as Ganglia [4]). The relations between domains in the grid are guided by the P2P technique, namely Routing Indices.

The resource discovery process works as follows. When a user needs a computing resource meeting a set of requirements to conduct a research experiment (also called a job), he or she queries the broker (step 1 in Fig. 1). Then the broker proceeds with a selection procedure involving the local GIS and resource monitor. First, a list of the available resources is retrieved from the GIS (step 2). Then, the current features of these resources are retrieved from the resource monitor (step 3). If there is a suitable resource for this job in the local domain, the job is allocated to that resource. Otherwise, a resource in another domain may be required. In this case, the broker in the local domain has to start a search procedure to determine which domain the job should be forwarded to (step 4). For this, the broker uses summaries of the features of the computing resources in other domains, which are organized using RIs. These summaries have been created and propagated by the broker of each domain.
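The four steps above can be sketched as a single broker function. This is only an illustration of the control flow under simplifying assumptions: the function name `discover`, the dictionary-based GIS and monitor, and the `neighbor_scores` argument (a stand-in for whatever the summaries held in the RIs would yield) are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Optional

Resource = Dict[str, object]
Predicate = Callable[[Resource], bool]


def discover(requirement: Predicate,
             gis_resources: List[str],
             monitor: Dict[str, Resource],
             neighbor_scores: Dict[str, float]) -> Optional[str]:
    """Return 'local:<resource>', 'forward:<domain>', or None."""
    # Step 1 is the user's call to the broker, i.e. the call to this function.
    # Step 2: the GIS lists the resources registered in the local domain.
    for name in gis_resources:
        # Step 3: the resource monitor supplies their current features.
        features = monitor.get(name)
        if features is not None and requirement(features):
            return f"local:{name}"
    # Step 4: no local match, so choose the most promising other domain.
    # neighbor_scores is a hypothetical stand-in for the information
    # derived from the summaries organized in the RIs.
    if neighbor_scores:
        best = max(neighbor_scores, key=lambda d: neighbor_scores[d])
        if neighbor_scores[best] > 0:
            return f"forward:{best}"
    return None


monitor = {"node1": {"OS": "Windows", "free_memory_gb": 2}}


def wants_linux(resource: Resource) -> bool:
    return resource.get("OS") == "Linux"


decision = discover(wants_linux, ["node1"], monitor, {"d2": 0.4, "d3": 0.1})
print(decision)
```

In this toy run the only local node runs Windows, so the job is forwarded to the domain with the highest (hypothetical) score.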

In order to provide a fault-tolerant system, the broker should be replicated so as to avoid the single point of failure problem. Besides, if the number of resources in a domain grows too high, they can be split into two or more subdomains, each one with the structure mentioned above, where communications between subdomains are performed in the same way as inter-domain communications (which is the topic of interest of this paper). This way, a fully scalable system could be constructed. Moreover, in the case of failures in brokers, reconfiguration strategies such as [46] can be used in order to reduce the effects of failures on the availability of the system.

For this technique to work efficiently, one assumption must hold: the resource monitor should provide exactly the same measurements in all the domains. Otherwise, no comparison between resources available within different domains can be made.

Furthermore, several points must be taken into account because this architecture involves the coordinated use of different administrative domains, some of these being related to authentication, authorization, and accounting. For these, techniques used to provide security to P2P systems, such as [47], can be used. Moreover, trust and reputation must be dealt with when resources are accessed from a third party and also to ensure that the information shared by a domain reflects its current capabilities. For this, techniques such as [48,49] may be of interest.

On top of this scenario, we present a new technique to perform resource discovery, which is detailed in the next section.

4. Proposal for scalable resource discovery in grids

The technique for efficient resource discovery in grids presented in this paper uses two concepts. First, summaries of resource features are used in order to reduce the amount of information sent through the system. Other proposals from the literature [10,9] also use summaries to perform information discovery in grid systems, but they share a drawback, which is the fact that all the domains in the system must access the summaries of all the other domains. To tackle this limitation, this work proposes the use of Routing Indices (RIs) to organize the way summaries are propagated and how the search is conducted. Thanks to the combination of these two techniques, the amount of information shared through the system and the number of domains affected by changes are reduced, thus improving the scalability of the proposal. The technique presented here allows multi-attribute and range queries, which is also detailed in this section.

Fig. 1. Resource discovery process.

4.1. Summarization technique

The summarization technique is based on a clustering algorithm called Cobweb [31], an incremental approach for hierarchical conceptual clustering. Cobweb processes data tuples and creates a tree in which each node is called a category. Each data tuple is assigned to a category, thus a category contains information on the data tuples assigned to it and its children. This information depends on the type of the attributes, which can be text or numerical. For text attributes, such as operating system, a category keeps the aggregated probability for each attribute to have each possible value (such as Windows, Linux, or Mac) for all the data tuples assigned to it and its children. For numerical attributes, such as available memory, a category keeps the aggregated mean and standard deviation for all the data tuples assigned to it and its children.
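As an illustration of what a single category stores, the sketch below aggregates a set of tuples into per-value probabilities for text attributes and a mean and standard deviation for numerical ones. It does not implement Cobweb itself (no tree is built), the `memory_gb` values are invented, and the use of the population standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_category(tuples, text_attrs, numeric_attrs):
    """Aggregate the tuples assigned to one category (and its children)."""
    stats = {}
    for attr in text_attrs:
        counts = Counter(t[attr] for t in tuples)
        # Probability of each value among the tuples in this category.
        stats[attr] = {value: n / len(tuples) for value, n in counts.items()}
    for attr in numeric_attrs:
        values = [t[attr] for t in tuples]
        # Assumption: population standard deviation.
        stats[attr] = {"mean": mean(values), "std": pstdev(values)}
    return stats


# OS-name values follow Table 1; memory_gb values are hypothetical.
tuples = [
    {"OS-name": "MAC", "memory_gb": 2},
    {"OS-name": "Windows", "memory_gb": 4},
    {"OS-name": "Linux", "memory_gb": 4},
    {"OS-name": "Solaris", "memory_gb": 6},
]
top_category = summarize_category(tuples, ["OS-name"], ["memory_gb"])
print(top_category)
```

With these four tuples every operating system gets probability 0.25 in the top category, which matches the value of 0.25 used for MAC in the error example later in this section.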

As a result of applying the Cobweb clustering, a category tree is created that describes the database, which may be similar to the tree presented in Fig. 2 (created based on the data depicted in Table 1). Based on this tree, a summary of the database can be created by applying two steps: pruning and leafing. In the first step, branches are pruned at a given depth, so there will be no branch deeper than the chosen depth. Therefore, the leaves of the pruned tree will be the leaves of the original tree that lie above the chosen depth, along with the categories of the original tree at the chosen depth. After that, the second step is performed, which consists of choosing the leaves of the pruned tree as the summary of the tree. The lower the selected depth, the more concise the summary is. Additionally, since categories are made of the probabilities for each attribute to take each value, the probabilities under a given threshold can be filtered out. This third step (named filtering) creates a more summarized version of the database. Once the summary has been created, the next step is the propagation of the summary to other domains, so that they can perform the resource discovery efficiently.
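The three steps can be sketched on a small hand-built tree. The tree below is hypothetical (it is not the tree of Fig. 2), the `Category` class and function names are illustrative, and only text attributes are modeled to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Category:
    name: str
    # attribute -> {value: probability}, text attributes only
    probs: Dict[str, Dict[str, float]]
    children: List["Category"] = field(default_factory=list)


def prune_and_leaf(root: Category, depth: int) -> List[Category]:
    """Prune the tree at `depth` and return the leaves of the pruned tree."""
    if depth == 0 or not root.children:
        return [root]
    leaves: List[Category] = []
    for child in root.children:
        leaves.extend(prune_and_leaf(child, depth - 1))
    return leaves


def filter_probs(category: Category, threshold: float) -> Category:
    """Filtering: drop attribute-value pairs with probability < threshold."""
    kept = {
        attr: {v: p for v, p in values.items() if p >= threshold}
        for attr, values in category.probs.items()
    }
    return Category(category.name, kept)


def summarize(root: Category, depth: int, threshold: float) -> List[Category]:
    return [filter_probs(c, threshold) for c in prune_and_leaf(root, depth)]


# Hypothetical tree: C0 -> (C1, C2), C2 -> (C3, C4).
c3 = Category("C3", {"OS-name": {"Windows": 1.0}})
c4 = Category("C4", {"OS-name": {"MAC": 1.0}})
c2 = Category("C2", {"OS-name": {"Windows": 0.8, "MAC": 0.2}}, [c3, c4])
c1 = Category("C1", {"OS-name": {"Linux": 1.0}})
root = Category(
    "C0", {"OS-name": {"Linux": 0.5, "Windows": 0.4, "MAC": 0.1}}, [c1, c2]
)

print([c.name for c in prune_and_leaf(root, 1)])
print(summarize(root, 0, 0.2)[0].probs)
```

Pruning at depth 1 keeps C1 (an original leaf above the cut) and C2 (a category at the cut), while a depth of 0 with a threshold of 0.2 leaves only the top category with the low-probability MAC entry removed.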

When using summaries to perform resource discovery, some errors may occur. For example, consider the resources depicted in Table 1 and the hierarchy tree depicted in Fig. 2, for the case when the local domain (let us call it $d_1$) only shares the top category (depth 0) with the other domains, and no probabilities are filtered out. In this case, if another domain (let us call it $d_2$) receives a query asking for $\{\text{OS-name} = \text{MAC}\}$ and $\{\text{VO} = \text{HEP}\}$, $d_2$ may decide that the probability for $d_1$ to have resources meeting the given requirements is $P(\text{OS-name} = \text{MAC} \mid C_0) \times P(\text{VO} = \text{HEP} \mid C_0) = 0.0625$ (the value $P(\text{OS-name} = \text{MAC} \mid C_0) = 0.25$ does not appear in the figure for the sake of clarity). If $d_2$ eventually chooses $d_1$ to forward the query to, this decision would be inaccurate, since $d_1$ has no resources with $\{\text{OS-name} = \text{MAC}\}$ and $\{\text{VO} = \text{HEP}\}$ (as can be seen in Table 1). This problem can be tackled by merging attributes that are highly correlated (e.g. attributes that frequently appear together in queries), which is left as future work.
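The mismatch between the estimate and the actual content of the domain can be checked directly against the OS-name and VO columns of Table 1. The helper name `prob` is illustrative.

```python
# OS-name and VO columns of Table 1.
TABLE_1 = [
    {"OS-name": "MAC", "VO": "Space"},
    {"OS-name": "Windows", "VO": "HEP"},
    {"OS-name": "Linux", "VO": "Air"},
    {"OS-name": "Solaris", "VO": "Sea"},
]


def prob(rows, attr, value):
    """Fraction of rows whose attribute takes the given value."""
    return sum(r[attr] == value for r in rows) / len(rows)


# What a remote domain infers from the top category, treating the
# two attributes as independent.
estimate = prob(TABLE_1, "OS-name", "MAC") * prob(TABLE_1, "VO", "HEP")
# What the local domain actually holds.
actual = sum(r["OS-name"] == "MAC" and r["VO"] == "HEP" for r in TABLE_1)
print(estimate, actual)  # 0.0625 0
```

The estimate is non-zero because the top category keeps each attribute's distribution separately, so the information that MAC and HEP never occur on the same resource is lost.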

Creating the Cobweb tree, building and propagating the summary, receiving the summaries from other domains, and discovering resources are all performed by a broker present at each domain. For further details on the creation of summaries, please see [10,9].

Each domain may share its summary with all the domains in the system (as in [10,9]), or with a subset of them. Determining the size of this set of domains is important since it affects the discovery process and the scalability of the approach. Having local information on all the domains may improve the discovery process but may yield scalability issues, since the amount of information being forwarded through the system may be rather large (even if summaries are used), and keeping such information up-to-date may be difficult, since updates must be forwarded all across the system.

Fig. 2. Example of a Cobweb tree.


Table 1. Sample resource description.

OS-name   VO     CPU model  LRMS  Arch.
MAC       Space  Opteron    PBS   x64
Windows   HEP    Pentium3   PBS   x64
Linux     Air    Corei5i7   SGE   i386
Solaris   Sea    Corei5i7   SGE   i386

Fig. 3. Peer-to-peer relations between several administrative domains.


To tackle these issues, we propose the use of Routing Indices (RIs) to guide the dissemination of summaries over the system and the resource discovery process, which are explained in the next section.

4.2. Routing Indices (RI)

The concept of Routing Indices (RI) [11] is used in order to organize summaries of domains and to forward queries to neighbors that are more likely to have the required resources. Forwarding decisions use the local RI value of neighboring domains, rather than selecting neighbors at random or by flooding the network by forwarding the query to all neighbors.

Routing Indices (RI) were initially developed for document discovery in P2P systems [11], and have also been used to implement a grid meta-scheduler service in [14] or a simple grid information service in [43], as mentioned in Section 2. The goal of RIs is to help users find documents with content of interest across potential P2P sources efficiently.

The RI represents the availability of data of a specific type in the neighbor's information base. A version of RI called Hop-count Routing Index (HRI) [11] is used, which considers the number of hops needed to reach a datum. This implementation of HRI calculates the aggregate quality of a neighbor domain, based on a summary of the features of the resources existing in that domain.


Fig. 4. HRI for peer P1.


HRIs are used as described in [11]: in each peer, the HRI is represented as an $M \times N$ table, where $M$ is the number of neighbors and $N$ is the horizon (maximum number of hops) of the index. The $n$th position in the $m$th row keeps the summaries of the domains that can be reached going through neighbor $m$ within $n$ hops. As domains get farther away from the local peer, their summaries become more concise, that is, smaller. This means that each domain only keeps precise information on the domains which are closer in the topology.
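A minimal sketch of this table as a data structure follows. The class `HopCountRI` and its methods are hypothetical, and the sketch reads "within n hops" as "at hop n" (one cell per hop count), which is an interpretation rather than something the text states. Only the neighbor relations mentioned in the text (P2 is a neighbor of P1, and P4 is a neighbor of P2) are used, since the topology of Fig. 3 is not reproduced here.

```python
from typing import Dict, List


class HopCountRI:
    """rows[m][n - 1] holds the summaries of the domains reached through
    neighbor m at hop n, for n up to the horizon."""

    def __init__(self, neighbors: List[str], horizon: int):
        self.horizon = horizon
        self.rows: Dict[str, List[Dict[str, object]]] = {
            m: [dict() for _ in range(horizon)] for m in neighbors
        }

    def store(self, neighbor: str, hops: int, domain: str,
              summary: object) -> None:
        if not 1 <= hops <= self.horizon:
            return  # beyond the horizon: nothing is kept
        self.rows[neighbor][hops - 1][domain] = summary

    def reachable(self, neighbor: str, hops: int) -> List[str]:
        return sorted(self.rows[neighbor][hops - 1])


# HRI kept at P1 with a horizon of 2 hops (other neighbors omitted).
ri = HopCountRI(["P2"], horizon=2)
ri.store("P2", 1, "P2", "S(P2)")       # level 1 summary of the neighbor
ri.store("P2", 2, "P4", "S(S(P4))")    # level 2 summary, two hops away
ri.store("P2", 3, "P9", "ignored")     # hypothetical peer beyond the horizon
print(ri.reachable("P2", 1), ri.reachable("P2", 2))
```

The strings stand in for summaries; in practice each cell would hold summary objects whose level grows with the hop count, which is what keeps distant entries small.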

In order to adapt summaries to RIs and create an operational resource discovery technique, a number of developments have been made, namely n-level summaries, range queries, and a goodness function, which are presented next.

4.2.1. N-level summaries

In order to adapt summaries to be used within RIs, we have developed what we call n-level summaries. In n-level summaries, the size and precision of the i-level summary of a domain are higher than those of the j-level summary if i < j. As presented in Section 4.1, summaries are created in three different steps, namely pruning, leafing, and filtering. This is how the level 1 summaries are created. Summaries of level i (for i ≥ 2) are created based on the level i − 1 summary, by filtering out the attribute–value pairs whose probability is below a threshold. This threshold must be higher than the threshold used for the creation of the level i − 1 summary; otherwise there would be no difference between summaries of different levels for the same local domain.

The level 1 summary of a domain is created locally, whilst the summaries of level 2 onward are created on a neighbor domain. This is why summaries of level 2 onward cannot be created like level 1 summaries: the Cobweb tree of a domain is not sent anywhere, so pruning or leafing can only be done locally.

As an example, the HRI of peer P1 is presented in Fig. 4 (for the topology depicted in Fig. 3). This figure shows S(p_i), which is the level 1 summary of peer p_i, whilst S(S(p_i)) is the level 2 summary of peer p_i, and so on. Fig. 5 shows how summaries are created and propagated. Each domain creates its level 1 summary and propagates it to its direct neighbors. On the reception of this level 1 summary, the neighbors create the level 2 summary and propagate it to their respective neighbors. In Fig. 5, the level 1 summary of peer P4 is created in P4 itself and propagated to P2; the level 2 summary of P4 is created in P2 and propagated to P1 and P5.
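The derivation of a higher-level summary by filtering alone can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a category is modelled as a dictionary from (attribute, value) pairs to probabilities, numerical attributes (mean and standard deviation) are left out, and the function name `next_level_summary` and the threshold values are illustrative only.

```python
# A category maps (attribute, value) pairs to their probability (assumed layout).
Category = dict[tuple[str, str], float]
Summary = list[Category]


def next_level_summary(summary: Summary, prev_threshold: float, threshold: float) -> Summary:
    """Derive the level i summary from the level i-1 summary by filtering only."""
    if threshold <= prev_threshold:
        raise ValueError("each level needs a higher threshold than the previous one")
    # Pairs whose probability is below the threshold are filtered out.
    return [{pair: p for pair, p in cat.items() if p >= threshold} for cat in summary]


# Illustrative level 1 summary (threshold 0.25) and its level 2 version (threshold 0.35).
level1 = [{("VO", "Air"): 0.25, ("VO", "Space"): 0.75}]
level2 = next_level_summary(level1, prev_threshold=0.25, threshold=0.35)
```

Because only filtering is applied, the neighbor that receives a level 1 summary can build the level 2 summary without access to the Cobweb tree of the originating domain.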

A key point in resource discovery is range queries, which our proposal handles as explained in the next section.


Fig. 5. Propagation of summaries.


4.2.2. Range queries

An important feature of a resource discovery proposal is its ability to perform range queries. Examples of range queries are {1 GB < available-memory} and {50 GB < disk-space < 100 GB}. This type of query can be performed on numerical attributes easily, but requires the implementation of an ordering feature to be used on other types of attributes.

We implement range queries as follows. Recall that the categories created using the summarization technique keep the aggregated mean and standard deviation of numerical attributes. Thus, on the reception of a range query, a broker checks the overlap between the range appearing in the query and the range created using the mean and standard deviation appearing in each category of each summary. For example, if the query includes the term {1 GB < available-memory < 5 GB}, the range of this query is (1, 5). Regarding the category range, if the category shows {available-memory, mean = 5 GB, sd = 1 GB}, the category range would be (4, 6), which is calculated by subtracting the standard deviation from the mean and adding it to the mean. So, there is no need to manually partition any attribute range, since the limit values for ranges are automatically generated. This is another advantage of our technique.

When calculating the overlap, the higher it is, the more likely it is that the checked domain has a resource with the required feature. This is done for each category of each summary. This process would be similar for non-numerical attributes, but in this case an ordering feature would be necessary in order to define the proper order of values for these attributes (e.g. for the operating system attribute, Windows 7 would be higher than Windows 98).

Algorithms 1 and 2 present how this is done. Using the result of these algorithms for a multi-attribute query is straightforward: the category quality is obtained by multiplying the overlap of each attribute (if there is more than one attribute range in the query) or the probability of the single-value attributes of the query.

Algorithm 1. Range queries algorithm: Summary check

Let q = a query for a resource meeting some set of requirements
Let S_d = summary of domain d
Let cat = a category in S_d
Let catQ = the quality of a category as returned by Algorithm 2
Let maxCat = the maximum value returned by Algorithm 2
Input: q, S_d
Output: maxCat
maxCat := 0.0
for all cat in S_d do
    {Check each category, done as explained in Algorithm 2}
    catQ := CheckCategory(cat, q)
    if (catQ > maxCat) then
        maxCat := catQ
    end if
end for
return maxCat



Algorithm 2. Range queries algorithm: Category check

Let q = a query for a resource meeting some set of requirements
Let q_range = the range included in the query, defined by qLowLimit, qHighLimit
Let q_range_length = the length of q_range, defined as qHighLimit - qLowLimit
Let cat_range = the range present in the category, defined by catLowLimit, catHighLimit
Let cat_range_length = the length of cat_range, defined as catHighLimit - catLowLimit
Let mean = the mean of a numerical attribute in a category
Let sd = the standard deviation of a numerical attribute in a category
Let cat = a category in the summary of domain d
Input: q, cat
Output: the quality of cat with regard to the requirements of q
catLowLimit = mean - sd
catHighLimit = mean + sd
if ((catLowLimit >= qLowLimit) && (catHighLimit <= qHighLimit)) then
    {The category range is totally contained in the query range}
    return 1
else if ((catHighLimit < qLowLimit) || (catLowLimit > qHighLimit)) then
    {The category range and the query range are not overlapped at all}
    return 0
else if ((catLowLimit <= qLowLimit) && (catHighLimit >= qHighLimit)) then
    {The query range is totally contained in the category range}
    return (q_range_length / cat_range_length)
else if ((catLowLimit <= qLowLimit) && (qLowLimit <= catHighLimit <= qHighLimit)) then
    {The category range is partially overlapped with the query range}
    return ((catHighLimit - qLowLimit) / q_range_length)
else if ((qLowLimit <= catLowLimit <= qHighLimit) && (catHighLimit >= qHighLimit)) then
    {The category range is partially overlapped with the query range}
    return ((qHighLimit - catLowLimit) / q_range_length)
end if
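Algorithms 1 and 2 can be transcribed into Python roughly as follows. This is a sketch for a single numerical attribute, assuming a bounded query range with q_low < q_high; the names `NumericCategory`, `check_category` and `check_summary` are chosen here for illustration and do not come from the original implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericCategory:
    """Mean and standard deviation of one numerical attribute in a category."""

    mean: float
    sd: float


def check_category(cat: NumericCategory, q_low: float, q_high: float) -> float:
    """Quality of a category for the query range (q_low, q_high), as in Algorithm 2."""
    if q_high <= q_low:
        raise ValueError("this sketch assumes q_low < q_high")
    cat_low, cat_high = cat.mean - cat.sd, cat.mean + cat.sd
    q_len = q_high - q_low
    if cat_low >= q_low and cat_high <= q_high:
        return 1.0  # category range totally contained in the query range
    if cat_high < q_low or cat_low > q_high:
        return 0.0  # no overlap at all
    if cat_low <= q_low and cat_high >= q_high:
        return q_len / (cat_high - cat_low)  # query range contained in the category range
    if cat_low <= q_low:
        return (cat_high - q_low) / q_len  # category overlaps the lower end of the query
    return (q_high - cat_low) / q_len  # category overlaps the upper end of the query


def check_summary(summary: list[NumericCategory], q_low: float, q_high: float) -> float:
    """Best category quality found in a summary, as in Algorithm 1."""
    return max((check_category(c, q_low, q_high) for c in summary), default=0.0)
```

For the example given in the text, a query range of (1, 5) and a category with mean 5 and standard deviation 1 (category range (4, 6)) fall into the last case, so `check_category(NumericCategory(5, 1), 1, 5)` evaluates (5 - 4) / 4.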

The final key point of our technique is the goodness function [11]. The next section explains the goodness function developed for this work.

4.2.3. Goodness function

The goodness function [11] is needed to decide which of the neighbors of a domain is more likely to have resources meeting some set of requirements. This function is calculated for each neighbor domain using the summaries stored in the local HRI, and represents the probability that each neighbor domain has a resource with the needed requirements.

The function developed for this work is:

$$\mathrm{goodness}(p_i) = \sum_{j=1}^{H} \mathrm{Prob}(\mathrm{HRI}_{i,j}, \mathrm{Req}) \tag{1}$$

where p_i is the neighbor domain being checked; H is the HRI horizon and provides an upper bound on the distance (number of hops) searched [11] (it would be 3 for the HRI in Fig. 4); HRI_{i,j} refers to the summaries of the domains that can be reached through the neighbor peer p_i at j hops; Req is the set of requirements that must be met; and Prob refers to the probability that the summaries contained in the {i, j} position of the HRI meet the set of requirements Req. Prob is calculated as the maximum probability for Req of all the categories stored in this cell of the HRI, and includes both single-value attributes (e.g. {OS-name = MAC}) and range attributes (e.g. {50 GB < disk-space < 100 GB}).

Once the goodness function is calculated for all the neighbor domains, the local domain chooses the neighbor domain whose goodness function yields the highest value. So, for the HRI depicted in Fig. 4, P1 should calculate the goodness function for P2 and P3.

4.2.4. Example of use

In order to better illustrate the use of the technique presented in this paper, this section presents an example in which the broker in domain P1 in Fig. 3 receives a query from a user searching for a computing resource whose virtual organization (VO) is Air. No local resource matches this requirement, so the broker in P1 must choose between P2 and P3 to forward the query to. For this example, the HRI has a horizon of 2, and summaries are created using depth 0. The threshold for the level 1 summary is 0.25, which rises to 0.35 for the level 2 summary.


Table 2. Resource description for the example of use.

| OS-name | VO | CPU model | LRMS | Arch. |
| --- | --- | --- | --- | --- |
| Linux | Air | Opteron | PBS | x64 |
| Windows | Space | Pentium3 | PBS | x64 |
| Solaris | Sea | Corei5i7 | SGE | i386 |
| MAC | Air | Corei5i7 | SGE | i386 |

Table 3. Calculation of HRIs.

| Peer | P2 | P3 |
| --- | --- | --- |
| 1 Hop | Prob(HRI_{2,1}, VO = Air) = Max(Prob(VO = Air ∣ P2)) = 0.25 | Prob(HRI_{3,1}, VO = Air) = Max(Prob(VO = Air ∣ P3)) = 0.5 |
| 2 Hops | Prob(HRI_{2,2}, VO = Air) = Max(Prob(VO = Air ∣ P4), Prob(VO = Air ∣ P5)) = Max(0, 0.5) = 0.5 | Prob(HRI_{3,2}, VO = Air) = Max(Prob(VO = Air ∣ P6), Prob(VO = Air ∣ P7)) = Max(0, 0.5) = 0.5 |

Fig. 6. EU DataGRID Testbed 1.

Table 4. Calculating goodness function.

goodness(P2) = 0.25 + 0.5 = 0.75
goodness(P3) = 0.5 + 0.5 = 1


Table 1 presents the features of the resources in domains P2, P4 and P6, and Table 2 presents the features of the resources in domains P3, P5, and P7. Considering these data, the summaries of P2, P4 and P6 consist of one category each, which yields P(VO = Air) = 0.25. For the other peers, their summaries consist of one category each, which yields P(VO = Air) = 0.5. For peers P4, P5, P6, and P7, the summaries stored at P1 are their level 2 summaries, whose threshold is 0.35. So, the P(VO = Air) for P4 and P6, which originally was 0.25, is filtered out. Considering this, Table 3 shows the calculations needed and the final results obtained, whilst Table 4 shows the process of calculating the goodness function for P2 and P3. It can be seen that P1 will choose peer P3 to forward the query to, because it has more resources with VO = Air than P2.

This example shows a query containing one attribute, but it can be seen that the technique also works for multi-attribute queries and range queries.
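The calculation in Tables 3 and 4 can be reproduced with a short Python sketch of Eq. (1). The data layout (each HRI cell as a list of categories, each category as a dictionary from a requirement to its probability) and the function names `prob` and `goodness` are assumptions made for illustration.

```python
Requirement = tuple[str, str]
Category = dict[Requirement, float]


def prob(cell: list[Category], req: Requirement) -> float:
    """Maximum probability of the requirement over the categories stored in one HRI cell."""
    return max((cat.get(req, 0.0) for cat in cell), default=0.0)


def goodness(row: list[list[Category]], req: Requirement) -> float:
    """Eq. (1): sum of Prob over the hops 1..H of one neighbor's HRI row."""
    return sum(prob(cell, req) for cell in row)


AIR = ("VO", "Air")
# HRI of P1 with horizon 2; P(VO = Air) for P4 and P6 was filtered out at level 2.
hri_p1 = {
    "P2": [[{AIR: 0.25}], [{}, {AIR: 0.5}]],  # 1 hop: P2; 2 hops: P4, P5
    "P3": [[{AIR: 0.5}], [{}, {AIR: 0.5}]],   # 1 hop: P3; 2 hops: P6, P7
}
scores = {peer: goodness(row, AIR) for peer, row in hri_p1.items()}
best = max(scores, key=scores.get)
print(scores, best)  # {'P2': 0.75, 'P3': 1.0} P3
```

For a multi-attribute query, `prob` would instead combine the per-attribute values of each category before taking the maximum, as described in Section 4.2.2.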

Table 5. Domains specifications.

| Peer ID | Adm. Domain (Location) | # Resources | Depth | # Users |
| --- | --- | --- | --- | --- |
| 0 | RAL (UK) | 135 | 5 | 12 |
| 1 | Imp. College (UK) | 170 | 7 | 16 |
| 2 | NorduGrid (Norway) | 56 | 4 | 4 |
| 3 | NIKHEF (Netherlands) | 64 | 5 | 8 |
| 4 | Lyon (France) | 41 | 4 | 12 |
| 5 | CERN (Switzerland) | 187 | 5 | 24 |
| 6 | Milano (Italy) | 16 | 3 | 4 |
| 7 | Torino (Italy) | 10 | 2 | 2 |
| 8 | Rome (Italy) | 20 | 3 | 4 |
| 9 | Padova (Italy) | 4 | 1 | 2 |
| 10 | Bologna (Italy) | 217 | 6 | 12 |

Fig. 7. Peer-to-peer topology.

Another important issue of RIs is the search technique, which essentially works as follows. When a user wants to run a job, he or she submits a job request to the local broker. If there is no resource meeting the requirements of the job in the local domain (let us call it Pi), the local broker starts a discovery process. First, the local broker will calculate the goodness function of all the direct neighbors of the local domain using its HRI, and will forward the query to the chosen neighbor peer. The neighbor peer (let us call it Pj) will in turn check its local resources. In the case that there are suitable resources, the broker of Pj domain will return the resource chosen to the broker of Pi, which in turn will forward this information to the user. If there are no suitable resources in Pj domain, the broker of Pj will calculate the goodness function for its direct neighbors (excluding Pi since it is the source of the query), and will proceed in the same way as Pi. If Pj does not have suitable resources and it has no other neighbor except Pi (or all the neighbors have been checked and the query has been bounced back to Pj), the broker in Pj will bounce the query back to Pi, which in turn will forward it to the second best neighbor peer. For more details on the search technique, please see [14].
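The search technique described in this section, with forwarding to the best neighbor and bounce-back to the next best one, can be sketched in Python as follows. This is a simplified sketch under several assumptions: the topology is a tree (so no cycle detection is needed), only forwarding steps are counted as hops against the TTL, and the function `discover`, the goodness scores and the location of the suitable resource are all hypothetical.

```python
from typing import Callable, Optional


def discover(
    start: str,
    topology: dict[str, list[str]],
    has_resource: Callable[[str], bool],
    goodness: Callable[[str, str], float],
    ttl: int,
) -> tuple[Optional[str], int]:
    """Return (domain holding a suitable resource or None, forwarding hops used)."""
    hops = 0

    def visit(domain: str, came_from: Optional[str]) -> Optional[str]:
        nonlocal hops
        if has_resource(domain):
            return domain
        # Rank the direct neighbors by goodness, excluding the source of the query.
        candidates = sorted(
            (n for n in topology[domain] if n != came_from),
            key=lambda n: goodness(domain, n),
            reverse=True,
        )
        for neighbor in candidates:
            if hops >= ttl:
                return None  # TTL exhausted: the query is dropped
            hops += 1  # forward the query to the next best neighbor
            found = visit(neighbor, domain)
            if found is not None:
                return found
            # Otherwise the query was bounced back; try the next candidate.
        return None

    return visit(start, None), hops


# Hypothetical 7-peer tree and goodness scores; only P5 holds a suitable resource.
topology = {
    "P1": ["P2", "P3"], "P2": ["P1", "P4", "P5"], "P3": ["P1", "P6", "P7"],
    "P4": ["P2"], "P5": ["P2"], "P6": ["P3"], "P7": ["P3"],
}
scores = {("P1", "P2"): 0.75, ("P1", "P3"): 1.0, ("P2", "P4"): 0.0,
          ("P2", "P5"): 0.5, ("P3", "P6"): 0.0, ("P3", "P7"): 0.5}
result = discover("P1", topology, lambda d: d == "P5",
                  lambda d, n: scores.get((d, n), 0.0), ttl=11)
print(result)  # ('P5', 5)
```

In this run the query first explores the branch through P3 (the neighbor with the highest goodness), is bounced back to P1, and is then forwarded through P2 to P5.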

5. Experimental results

In order to illustrate the usefulness of this work, a performance evaluation has been conducted in which our proposed technique is compared with the technique presented in [10,9], where a copy of all the summaries is stored in all the domains in the system. Furthermore, a scenario where no summaries are created and all the information is stored in all the domains in the system (i.e. no summarization or intelligent data dissemination and query is performed) is also presented.

This evaluation has been conducted by means of simulations. In order to perform experiments efficiently, these must be repeatable and controllable. So, we chose simulations as a way to achieve this repeatability and controllability. We have simulated a grid system, and the simulation tool used is GridSim [13]. A scenario based on the EU DataGRID Testbed has been created, as shown in Fig. 6 [12]. EU DataGRID Testbed has been used in this evaluation because it is a real testbed widely used in research, including simulations [13,10,14–20].

The topology shows eleven administrative domains across Europe, each of them with the structure shown in Fig. 1. Each domain in the system has a number of computing resources (adding up to 922 resources) and users (adding up to 100 users), which are presented in Table 5. This table also shows the depth of the Cobweb tree for each domain, with values between 1 and 7 (the work in [9] uses depths between 1 and 4, so our resource configurations are more heterogeneous).

Fig. 8. Number of completed jobs for 2 requirements per job: (a) HRI; (b) NO HRI.

Fig. 9. Number of completed jobs for 4 requirements per job: (a) HRI; (b) NO HRI.

Fig. 10. Number of hops for completed jobs for 2 requirements per job: (a) HRI; (b) NO HRI.

These simulation features are similar to the ones used in [10], where the summarization technique was originally presented. Each user has 15 jobs, which add up to 1500 jobs, similarly to [14] (and 50% more jobs than in [9], where the summarization technique is extended). Each job has a number of requirements, from 2 to 4, chosen at random (similarly to [10,9]). Finally, each resource has a number of features, such as operating system or available memory, which are chosen at random (similarly to [10,9]).

We have performed the evaluation presented in this work using random features in order to present higher variability in the types of jobs and resources that are employed, similarly to [10,9]. By doing so, we perform a thorough evaluation which shows the better performance of our approach. The main reason for this is the fact that real resources and jobs have similar features, e.g. researchers submit sets of experiments with the same requirements (for instance, the same experiment with different input parameters). Similarly, resources also have similar features (e.g. all the machines in a cluster share the same operating system, CPU, memory, disk, ...). So, using random features allows us to create a simulation environment which is tougher than an evaluation based on actual features.

It can be seen that our simulation setup has been designed to be similar to the scenarios where the summarization technique was originally presented [10,9], or in the cases where there are differences, our scenario imposes heavier restrictions than the scenarios used in those works.

Boundaries between administrative domains are shown in circles in Fig. 6. Hence, the connectivity structure leads to the P2P topology depicted in Fig. 7. In this work the physical topology is used as the P2P topology (similarly to [14]), which may not be totally efficient. A key point in the use of RIs is constructing the topology of the system, as stated in [11]. In the case that the topology of the system is not efficiently created, low performance in terms of too many messages forwarded through the network (which leads to an increase in the latencies when performing searches) can be obtained. In [11] several topologies are studied, namely power law [50], tree and tree with cycles, where power law is defined as a network where few nodes have a significantly higher connectivity than the rest. In this work, a tree topology is used, where links between peers correspond to physical links between domains. A better way of creating the topology of the system should take into account resource features in order to group together domains whose resources share similar features, and this is a guideline for future research.

Depth and threshold parameters for summary creation for this scenario are tuned in this paper. Moreover, a scenario where summaries and HRIs are used (labeled as HRI in figures) is compared to a scenario in which summaries are copied in all the domains in the system (labeled as NO HRI). The NO HRI scenario is similar to the technique presented in [10,9]. Furthermore, a scenario where no summaries are created, and all the information is copied in all the domains (labeled as DATA) is also presented. Besides, different numbers of requirements (from 2 to 4) for each job request have been tested, although results for 3 requirements are not presented since they are essentially similar to those for 2 requirements.

Fig. 11. Number of hops for completed jobs for 4 requirements per job: (a) HRI; (b) NO HRI.

Fig. 12. Number of queries for completed jobs for 2 requirements per job: (a) HRI; (b) NO HRI.

Fig. 13. Number of queries for completed jobs for 4 requirements per job: (a) HRI; (b) NO HRI.

In order to avoid jobs being forwarded between domains without end and consuming resources (e.g. network bandwidth when they are transmitted, computing power when they are processed) in the case that resources are not efficiently discovered, a time-to-live (TTL) is used, which has been set to the number of domains in the system (which is 11, similarly to [14]). This value for the TTL is not a limitation of this approach: we decided to use TTL = 11 in the experiments in order to leave enough room for the HRI and NO HRI techniques to work properly, and avoid jobs being rejected because of a poorly chosen TTL, which would have jeopardized the results of this work. The TTL can be set to any value, without heavily affecting the performance of the technique presented in this work. As we will discuss later, this fact is supported by Figs. 10 and 11, showing the number of hops for completed jobs for 2 and 4 requirements per job. It can be seen that for both cases the number of hops is clearly lower than the TTL, so in a production environment a lower TTL, independent from the number of domains in the system, can be used. The use of a TTL to limit the lifetime of queries going through the system is similar to the approach used by the Routing Information Protocol (RIP) [51], which uses a TTL of 15.

For HRI, a 2-hop horizon is used, and the level 2 summary is created by increasing the threshold used in the level 1 summary by 0.1. So, if the level 1 summary of a domain is created using a 0.1 threshold, the level 2 summary of that domain is created using a 0.2 threshold. This 0.1 value has been chosen because it is high enough to allow differences in size between the summaries of each level, but low enough to allow summaries to stay meaningful.

A number of statistics are presented to illustrate the usefulness of this work, including the number of completed jobs, the number of queries per completed job, the size of the summaries kept across the system, and the number of domains affected by a variation in a resource, among others. These are divided into two groups for better explanation, and are presented in the next sections.

5.1. System performance

First, results regarding system performance are presented. This includes statistics such as number of completed jobs, number of hops needed to find resources, and number of queries per completed job, among others.

The first statistic to be presented is the number of completed jobs (out of the total 1500 jobs) when varying the number of requirements per job, and the depth and thresholds of summaries, for HRI and NO HRI. These results are presented in Figs. 8 and 9. This statistic is not presented for the DATA experiments because it yields 100% of completed jobs. The reason for this is that all the domains in the system have all the information, that is, no summarization is performed. So, when a broker receives a job having requirements that cannot be fulfilled by a resource in the local domain, a domain having such a resource is found with no possibility of error.

Fig. 14. Precision for 2 requirements per job: (a) HRI; (b) NO HRI.

As Figs. 8 and 9 depict, the number of completed jobs clearly depends on the depth and thresholds of summaries, that is, it depends on how summarized the information on resources is. The more summarized the information on resources is (with high threshold and low depth), the fewer completed jobs there are.

Comparing HRI and NO HRI, it can be seen that they present similar results for 2 requirements per job (Fig. 8). As the number of requirements per job is increased, differences between them also increase to some extent. The largest difference happens when jobs have 4 requirements and depth 1 and threshold 0.1 are used (Fig. 9). In this case, NO HRI outperforms HRI by 400 jobs. The reason for this behavior is that for NO HRI all the domains can access a copy of the summaries of all the other domains in the system, which negatively affects its scalability. So, when a domain receives a job requesting resources which are not available in the local domain, this job can be efficiently forwarded to another domain having such a resource. HRI, in contrast, treats scalability as a major concern, so each domain only has summarized information on a subset of the domains in the system (not all of them), and forwarding jobs to another domain having the necessary resources becomes a more difficult task.

Figs. 10 and 11 present the number of hops needed to find a resource for the jobs that have been successfully executed. Again, this statistic is not presented for DATA since a suitable resource is found for all the jobs within 1 hop, because all the domains have a copy of the full information of all the resources in the system. It can be seen that when jobs have 2 requirements, neither depth, threshold, nor the information dissemination and search policy (HRI or NO HRI) affects this statistic, as Fig. 10 depicts. This is because jobs have only two requirements, so there are many suitable resources.

Fig. 15. Precision for 4 requirements per job: (a) HRI; (b) NO HRI.

As the number of requirements is increased, more hops are needed to find suitable resources. For a given number of requirements, the threshold affects the number of hops, but only when depth is low. When depth is increased (close to 4), thresholds have little influence on the hops needed to find resources. This is because for low depths, categories in the summaries have many {attribute, value, probability} tuples having probabilities under the threshold, and these tuples are filtered out during the summarization process. When the depth is increased, tuples have higher values for the probabilities (as can be seen in Fig. 2), so the threshold does not have a noticeable impact on the summary creation. When depth is lower than 3, the variation of thresholds is more important than for higher depths, and for depth 4 the threshold does not affect this statistic. For 4 requirements per job (presented in Fig. 11), slight differences can be seen. This means that resources can be found with limited effort for both HRI and NO HRI.

It can also be seen that for 2 and 4 requirements per job, the number of hops needed per completed job is never higher than 2.5, clearly lower than the TTL which has been set to 11, the number of domains in the system. Hence, a lower TTL (independent from the number of domains in the system) could be used in a production environment at no performance cost.

Figs. 12 and 13 present the average number of queries needed to find a resource. This statistic has been calculated by dividing the total number of queries used in the system (including those used for the jobs that were completed, and those used for the jobs that were rejected) by the number of jobs that were actually completed. This statistic is useful because all the queries in the system had to be transmitted over the network and analyzed by brokers, and these processes consumed resources of the system (e.g. bandwidth, CPU time). Again, for depths lower than 3, the variation of thresholds is more important than for higher depths, and for depth 4 the threshold does not affect this statistic. It can also be seen that, for all the cases, NO HRI performs slightly better than HRI. The most noticeable difference between HRI and NO HRI is around 0.6 when jobs have 4 requirements and a depth of 4 is used for any threshold, as depicted in Fig. 13.

Fig. 16. Comparing summary sizes with non summarized data size: (a) HRI; (b) NO HRI.

The next statistic to be presented is the precision. In this work, precision refers to the ability of a domain to forward a job to another domain that actually has a suitable resource, and is calculated as follows:

$$\mathrm{precision} = \frac{\mathit{Executed}}{\mathit{Arriving}} \tag{2}$$

where Executed is the number of jobs arriving from another domain that are executed in the local domain, and Arriving is the number of jobs arriving from another domain. Jobs being bounced back from another domain are not taken into account in this statistic. When the precision is high, this means that jobs are forwarded to the right domain for their execution. Figs. 14 and 15 present the precision, when varying the number of requirements per job, and the depth and thresholds of summaries, for HRI and NO HRI. For 2 requirements, neither depth nor threshold clearly affects the precision, but for 4 requirements and depths lower than 3, the variation of thresholds is more important than for higher depths, and for depth 4 the threshold does not affect the precision.
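The two ratio statistics defined in this section, precision (Eq. (2)) and queries per completed job, reduce to simple arithmetic. The Python sketch below uses hypothetical function names, and the handling of a zero denominator is an assumption, since the text does not say how such cases are treated.

```python
def precision(executed: int, arriving: int) -> float:
    """Eq. (2): share of jobs arriving from other domains that run locally."""
    if arriving == 0:
        return 0.0  # assumption: a domain that received no jobs is given precision 0
    return executed / arriving


def queries_per_completed_job(total_queries: int, completed_jobs: int) -> float:
    """All queries in the system (for completed and rejected jobs) per completed job."""
    if completed_jobs == 0:
        raise ValueError("no completed jobs")
    return total_queries / completed_jobs
```

For instance, with made-up counts, a domain that executes 30 of the 40 jobs forwarded to it has a precision of 0.75, and 3000 queries for 1200 completed jobs give 2.5 queries per completed job.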

Fig. 17. Comparing category sizes with non summarized data size: (a) level 1 categories (HRI and NO HRI); (b) level 2 categories (HRI).

In this statistic, NO HRI presents better precision than HRI. This is a direct consequence of the aforementioned limitation of HRI with regard to the number of neighbor domains on which each domain keeps information.

Considering the results presented in this section, HRI is outperformed by the other proposals. Still, HRI yields reasonable results and presents better scalability than the other proposals, as the next section details.

5.2. Scalability

In this section, statistics regarding the scalability of the approach are presented. These include, among other statistics, the size of summaries and the domains affected by changes.

The first statistic in this group is the amount of information existing in the system for resource discovery purposes, and is depicted in Fig. 16. This is calculated by adding the size of the information that all the domains in the system store. In the case of HRI, this is calculated by adding the size of the HRI for all the domains in the system. Then, the size of the HRI of one domain is calculated as the addition of the sizes of the summaries the HRI of this domain keeps. For example, for the HRI of domain p1 depicted in Fig. 4, the size of this HRI is calculated by adding the sizes of the level 1 summaries of p2 and p3, the level 2 summaries of p4, p5, p6, and p7, and so on. In the case of NO HRI, all the domains have the level 1 summaries (the only type of summary existing in this scenario) of all the other domains in the system. In the case of DATA, all the domains have the full information (not summarized) of all the other domains.

Fig. 18. Bytes per resource ratio: (a) HRI; (b) NO HRI.

As Fig. 16 depicts, the higher the depth is, the larger summaries are, but in any case they are a fraction of the actual information not summarized at all (the DATA line in the figure). Furthermore, HRI outperforms NO HRI in all the cases, and differences increase as depth increases. The largest difference is around 30%, achieved for depth 4. For HRI, depth and threshold have less importance than for NO HRI, meaning that differences between summary sizes for different depths and thresholds are lower for HRI.

Fig. 17 presents a comparison between the smallest information unit sent through the system in all the scenarios. This is the smallest piece of information that brokers at each domain will have to forward to the other domains in the system in order to keep them up-to-date. This unit refers to the size of categories in the case of HRI and NO HRI, or the size of the information of a single resource in the case of DATA. For HRI there are two types of categories, level 1 and level 2 categories, since we have a 2-hop HRI.

It can be seen that for some parameters, a category may be larger than actual data on a resource: this happens for depth 1 and threshold 0.1 for both HRI and NO HRI. The reason is that a category contains the probability for each attribute of a resource to have each of the possible values. Thus, in the case that the threshold is low, there will be many tuples {attribute, value, probability} whose probability is close to 0, which increase the size of the summary. Furthermore, a category includes information on more than one resource, which makes it larger and more meaningful.

Besides, for threshold 0.5, there is little difference between the level 1 and level 2 summaries. Recall that a level 2 summary is created based on the level 1 summary by increasing the threshold by 0.1. Thus, if the level 1 summary was created using a 0.5 threshold, the level 2 summary is created by applying a 0.6 filter over the level 1 summary. When a 0.5 threshold is in place, only tuples with high probability remain, most of which still remain in the level 2 summary. It can also be seen that thresholds have a noticeable influence on the size of categories for depths lower than 3, but for depth 4 their influence becomes almost negligible.


Fig. 19. Number of domains each domain must keep up-to-date.


Fig. 18 presents the ratio of bytes per resource in the system. This statistic is calculated by adding the sizes of all the summaries and dividing the total by the number of resources in the system (recall that there are 922 resources). In other words, it is the number of bytes needed to represent one resource. As with the other statistics presented here, thresholds only have an effect for depths lower than 3; for depth 4, threshold variations become almost negligible. It can also be seen that when HRI is used, this ratio is lower than for NO HRI. Thus, with HRI a category represents information on several resources in the system, and so it carries more meaning.
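The bytes-per-resource statistic can be written as a short function. This is only a sketch of the calculation described above: the name `bytes_per_resource` and the sample summary sizes are hypothetical, and only the resource count of 922 comes from the text.

```python
from typing import Iterable


def bytes_per_resource(summary_sizes: Iterable[int], num_resources: int) -> float:
    """Total size of all summaries (in bytes) divided by the number of resources."""
    if num_resources <= 0:
        raise ValueError("num_resources must be positive")
    return sum(summary_sizes) / num_resources


# Hypothetical summary sizes in bytes; 922 is the resource count stated in the text.
example_sizes = [46_100, 23_050, 23_050]
print(bytes_per_resource(example_sizes, 922))
```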

When the information of a resource changes, this change must be propagated to a subset of the other domains in the system in order to keep them up-to-date – this task being performed by the broker of the local domain. Fig. 19 presents the number of domains that must be updated in the event of a change in one domain when using HRI. For NO HRI and DATA, all the domains in the system must be updated, so they are not represented in the figure. It can be seen that, in the worst case, a change in NG_Broker (representing the NorduGrid node in the topology shown in Fig. 7) affects 9 domains in the system. On the other hand, a change in CE_Broker (representing the CERN node) only affects 2 other domains in the system. On average, a change in one domain affects 5.09 domains, which is 46.28% of the domains affected by changes using the other techniques.
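As a quick consistency check on these figures (an inference from the reported numbers, not a value quoted from the evaluation), dividing the average number of affected domains by the reported percentage gives the number of domains that the other techniques would have to update per change:

```latex
\[
\frac{5.09}{0.4628} \approx 11.0
\]
```

Under this reading, NO HRI and DATA would update roughly 11 domains per change, against an average of about 5 for HRI.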

Considering the results presented in this section, several conclusions can be drawn. First, threshold variations affect statistics for depths lower than 3, and become almost negligible for higher depths. So, a depth of 3 can be concluded to be a reasonable value for this parameter. With regard to the threshold, the value 0.1 yields more completed jobs, similar summary sizes and higher precision than the other values, though more hops are needed to find resources. On the other hand, the value 0.25 yields fewer completed jobs and lower precision than 0.1 but smaller categories, which improves the scalability of the proposal.

Finally, these statistics indicate that HRI is more scalable than the other proposals, since less information is sent through the system and fewer domains must be kept up-to-date.

6. Conclusions

We propose a new technique to perform resource discovery in grids which uses information summarization, and extends proposals from the literature using summarization, such as [10,9], because this work (1) uses an efficient and scalable way to disseminate and query summarized information over the system based on a peer-to-peer (P2P) technique, namely Routing Indices (RIs) [11], (2) adapts the summarization technique to the RIs by creating different types of summaries (called n-level summaries), (3) presents a way to conduct multi-attribute and range queries based on summarized data, and (4) presents a metric (called goodness function) needed by RIs to guide the query process. Moreover, this paper presents a performance evaluation based on the EU DataGRID Testbed [12] that shows the better scalability and good performance of the proposed technique compared to proposals from the literature.

Considering the results presented in this paper, a combination of the HRI and NO HRI proposals would be of interest in real systems, such as National Grid Initiatives (NGIs) [52]. That is, within an NGI, information can be disseminated according to the NO HRI technique, and relations between different NGIs can be implemented using HRI. This way, a scalable resource discovery service could be implemented. This would be similar to Internet routing, where there are intra-autonomous system (intra-AS) routing algorithms such as the Routing Information Protocol (RIP) [51], and inter-autonomous system (inter-AS) algorithms such as the Border Gateway Protocol (BGP) [53].
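The combined scheme suggested above can be sketched as a simple selection rule. This is an illustration only, assuming that each domain belongs to exactly one NGI; the function name `dissemination_technique` and the domain and NGI identifiers are hypothetical.

```python
from typing import Dict


def dissemination_technique(source: str, target: str, ngi_of: Dict[str, str]) -> str:
    """Pick NO HRI for two domains in the same NGI and HRI for domains in different NGIs."""
    return "NO HRI" if ngi_of[source] == ngi_of[target] else "HRI"


# Hypothetical domain-to-NGI mapping, for illustration only.
ngi_of = {"domain_a": "NGI_1", "domain_b": "NGI_1", "domain_c": "NGI_2"}

print(dissemination_technique("domain_a", "domain_b", ngi_of))  # same NGI
print(dissemination_technique("domain_a", "domain_c", ngi_of))  # different NGIs
```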

Regarding future work, the use of statistical data rather than actual data for the creation of summaries is going to be studied. This way, tendencies can be captured more efficiently, which may be of interest for dynamic attributes such as available memory. Besides, studying the influence of merging attributes that are highly correlated (e.g. attributes that frequently appear together in queries) is also among the future work. In this work the physical topology is used as the P2P topology, which may not be totally efficient. Work on developing algorithms to dynamically create topologies based on the summaries of domains is among the future research.

Another guideline for future work is related to the storage of raw (that is, not summarized at all) static information combined with the use of summarized information for dynamic attributes, all of it disseminated using RIs. In this work we have considered that static data should be summarized along with dynamic data. The reason for this is that if static data are not summarized, the size of the static information that should be stored all over the system may be too large when the system grows in terms of domains, computing resources or static attributes. This is true even if RIs are used to guide the information dissemination and even if this information never becomes obsolete. Thus, having static information stored over the system may incur high latencies when searching through it. The use of summarized information disseminated using RIs implies that (1) the amount of information stored is lower, and (2) the resources needed to store, search, and transfer such information are reduced.

In any case, having raw static information stored over the system combined with the use of summarized information for dynamic attributes is an interesting guideline for future research. Considering this, the technique we propose can be used alongside other techniques that handle static information separately. This way, users can combine both approaches: another technique for static attributes and ours for dynamic ones. We thus allow users to choose the most suitable technique for their needs, which increases the versatility of our proposal.

Acknowledgments

The authors would like to acknowledge the support of the following European Union projects: RIPLECS (517836-LLP-1-2011-1-ES-ERASMUS-ESMO), PAC (517742-LLP-1-2011-1-BG-ERASMUS-ECUE), EMTM (2011-1-PL1-LEO05-19883), MUREE (530332-TEMPUS-1-2012-1-JO-TEMPUS-JPCR), and Go-Lab (FP7-ICT-2011-8/317601). Furthermore, we also thank the Spanish Ministry of Science and Innovation for the Project TIN2008-06083-C03/TSI, and the Community of Madrid for the support of the E-Madrid Network of Excellence (S2009/TIC-1650).

References

[1] I. Foster, What is the Grid? – a three point checklist, GRIDtoday 1 (6) (2002).
[2] LCG (LHC Computing Grid) Project, Web page at <http://lcg.web.cern.ch/LCG>, Date of last access: August 20, 2013.
[3] European Grid Infrastructure (EGI), Web page at <http://www.egi.eu/>, Date of last access: August 20, 2013.
[4] M.L. Massie, B.N. Chun, D.E. Culler, The Ganglia distributed monitoring system: design, implementation, and experience, Parallel Computing 30 (5–6) (2004) 817–840.
[5] W. Miah, Monitoring scientific computing infrastructure using Nagios, Tech. Rep. RAL-TR-2010-002, STFC Rutherford Appleton Laboratory, 2010.
[6] I.T. Foster, Globus Toolkit Version 4: Software for Service-Oriented Systems, in: Proc. of the Intl. Conference on Network and Parallel Computing (NPC), Beijing, China, 2005.
[7] M. Ellert, M. Grønager, A. Konstantinov, B. Kónya, J. Lindemann, I. Livenson, J.L. Nielsen, M. Niinimäki, O. Smirnova, A. Wäänänen, Advanced resource connector middleware for lightweight computational grids, Future Generation Computer Systems 23 (2) (2007) 219–240.
[8] L. Tomás, A.C. Caminero, O. Rana, C. Carrión, B. Caminero, A gridway-based autonomic network-aware metascheduler, Future Generation Computer Systems 28 (7) (2012) 1058–1069.
[9] R. Brunner, A.C. Caminero, O.F. Rana, F. Freitag, L. Navarro, Network-aware summarisation for resource discovery in P2P-content networks, Future Generation Computer Systems 28 (3) (2012) 563–572.
[10] A.C. Caminero, E. Huedo, O. Rana, I.M. Llorente, B. Caminero, C. Carrión, Summary creation for information discovery in distributed systems, in: Proc. of the 19th Intl. Conference on Parallel, Distributed and Network-based Processing (PDP), Ayia Napa, Cyprus, 2011.
[11] A. Crespo, H. Garcia-Molina, Routing indices for peer-to-peer systems, in: Proc. of the Intl. Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2002.
[12] W. Hoschek, F.J. Janez, A. Samar, H. Stockinger, K. Stockinger, Data management in an international data grid project, in: Proc. of the First Intl. Workshop on Grid Computing, Bangalore, India, 2000.
[13] A. Sulistio, U. Cibej, S. Venugopal, B. Robic, R. Buyya, A toolkit for modelling and simulating data grids: an extension to GridSim, Concurrency and Computation: Practice and Experience 20 (13) (2008) 1591–1609.
[14] A. Caminero, O. Rana, B. Caminero, C. Carrión, Network-aware heuristics for inter-domain meta-scheduling in Grids, Journal of Computer and System Sciences 77 (2) (2011) 262–281.
[15] H. Stockinger, F. Donno, E. Laure, S. Muzaffar, P. Kunszt, Grid data management in action: experience in running and supporting data management services in the EU DataGrid project, in: Proc. of the Computing in High Energy Physics (CHEP), La Jolla, USA, 2003.
[16] P. Kunszt, E. Laure, H. Stockinger, K. Stockinger, Advanced replica management with reptor, in: Proc. of the Intl. Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, Czestochowa, Poland, 2004.
[17] G. Avellino, S.B.B. Cantalupo, F. Pacini, A. Terracina, A. Maraschini, D. Colling, S. Monforte, M. Pappalardo, L. Salconi, F. Giacomini, E. Ronchieri, D. Kouril, A. Krenek, L. Matyska, M. Mulac, J. Pospisil, M. Ruda, Z. Salvet, J. Sitera, M. Vocu, M. Mezzadri, F. Prelz, A. Gianelle, R. Peluso, M. Sgaravatto, S. Barale, A. Guarise, A. Werbrouck, The first deployment of workload management services on the EU DataGrid Testbed: feedback on design and implementation, in: Proc. of the Computing in High Energy Physics (CHEP), La Jolla, USA, 2003.
[18] W.H. Bell, D.G. Cameron, A.P. Millar, L. Capozza, K. Stockinger, F. Zini, Optorsim: a grid simulator for studying dynamic data replication strategies, International Journal of High Performance Computing Applications 17 (4) (2003) 403–416.
[19] A. Caminero, O. Rana, B. Caminero, C. Carrión, Improving grid inter-domain scheduling with P2P Techniques: a performance evaluation, in: Proc. of the Seventh Intl. Conference on Grid and Cooperative Computing (GCC), Shenzhen, China, 2008.
[20] A.C. Caminero, O.F. Rana, B. Caminero, C. Carrión, Network-aware peer-to-peer based grid inter-domain scheduling, in: Proc. of the Intl. Conference on Grid Computing & Applications (GCA), Las Vegas, USA, 2008.
[21] P. Trunfio, D. Talia, H. Papadakis, P. Fragopoulou, M. Mordacchini, M. Pennanen, K. Popov, V. Vlassov, S. Haridi, Peer-to-peer resource discovery in grids: models and systems, Future Generation Computer Systems 23 (7) (2007) 864–878.
[22] K. Czajkowski, C. Kesselman, S. Fitzgerald, I.T. Foster, Grid information services for distributed resource sharing, in: Proc. of Intl. Symposium on High Performance Distributed Computing (HPDC), San Francisco, USA, 2001.
[23] L.M. Khanli, S. Kargar, FRDT: footprint resource discovery tree for grids, Future Generation Computer Systems 27 (2) (2011) 148–156.
[24] R.-S. Chang, M.-S. Hu, A resource discovery tree using bitmap for grids, Future Generation Computer Systems 26 (1) (2010) 29–37.


[25] H. Sun, J. Huai, Y. Liu, R. Buyya, RCT: a distributed tree for supporting efficient range and multi-attribute queries in grid computing, Future Generation Computer Systems 24 (7) (2008) 631–643.
[26] J. Lee, P.J. Keleher, A. Sussman, Decentralized resource management for multi-core desktop grids, in: Proc. of Intl. Parallel & Distributed Processing Symposium (IPDPS), Atlanta, USA, 2010.
[27] J.-S. Kim, B. Nam, M.A. Marsh, P.J. Keleher, B. Bhattacharjee, A. Sussman, Integrating categorical resource types into a P2P desktop grid system, in: Proc. of Intl. Conference on Grid Computing (Grid), Tsukuba, Japan, 2008.
[28] F. Heine, M. Hovestadt, O. Kao, Towards ontology-driven P2P grid resource discovery, in: Proc. of the Fifth Intl. Workshop on Grid Computing (GRID), Pittsburgh, USA, 2004.
[29] S. Andreozzi, S. Burke, F. Ehm, L. Field, G. Galang, B. Konya, P.M. Maarten Litmaath, J. Navarro, GLUE Specification v. 2.0, Tech. Rep., 2009, <http://forge.ogf.org/sf/projects/glue-wg>.
[30] J. Li, Grid resource discovery based on semantically linked virtual organizations, Future Generation Computer Systems 26 (3) (2010) 361–373.
[31] D.H. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2 (1987) 139.
[32] R. Hayek, G. Raschia, P. Valduriez, N. Mouaddib, Summary management in P2P systems, in: Proc. of the Intl. Conference on Extending Database Technology (EDBT), Nantes, France, 2008.
[33] S. Michel, M. Bender, N. Ntarmos, P. Triantafillou, G. Weikum, C. Zimmer, Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices, in: Proc. of the Intl. Conference on Information and Knowledge Management (CIKM), Arlington, USA, 2006.
[34] M. Cardosa, A. Chandra, Resource bundles: using aggregation for statistical wide-area resource discovery and allocation, in: Proc. of the 28th Intl. Conference on Distributed Computing Systems (ICDCS), Beijing, China, 2008.
[35] L. Hennig, W. Umbrath, R. Wetzker, An ontology-based approach to text summarization, in: Proc. of the Intl. Conference on Web Intelligence and Intelligent Agent Technology Workshops (WI-IAT), Sydney, Australia, 2008.
[36] I. Yoo, X. Hu, I.-Y. Song, A coherent biomedical literature clustering and summarization approach through ontology-enriched graphical representations, in: Proc. of the Eighth Intl. Conference on Data Warehousing and Knowledge Discovery (DaWaK), Krakow, Poland, 2006.
[37] L.F.F. Garcia, J.V. de Lima, S. Loh, J.P.M. de Oliveira, Using ontological modeling in a context-aware summarization system to adapt text for mobile devices, in: Proc. of the First Active Conceptual Modeling of Learning (ACM-L) Workshop, Tucson, USA, 2006.
[38] R. Saint-Paul, G. Raschia, N. Mouaddib, Database summarization: the SaintEtiQ system, in: Proc. of the 23rd Intl. Conference on Data Engineering (ICDE), Istanbul, Turkey, 2007.
[39] G. Evangelopoulos, A. Zlatintsi, G. Skoumas, K. Rapantzikos, A. Potamianos, P. Maragos, Y. Avrithis, Video event detection and summarization using audio, visual and text saliency, in: Proc. of the Intl. Conference on Acoustics, Speech and Signal Processing (ICASSP), Washington, DC, USA, 2009.
[40] S. Liu, M.X. Zhou, S. Pan, W. Qian, W. Cai, X. Lian, Interactive, topic-based visual text summarization and analysis, in: Proc. of the 18th Conference on Information and Knowledge Management (CIKM), Hong Kong, China, 2009.
[41] D. Simakov, Y. Caspi, E. Shechtman, M. Irani, Summarizing visual data using bidirectional similarity, in: Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, USA, 2008.
[42] B. Jiao, L. Yang, J. Xu, F. Wu, Visual summarization of web pages, in: Proc. of the 33rd Intl. Conference on Research and Development in Information Retrieval, Geneva, Switzerland, 2010.
[43] D. Puppin, S. Moncelli, R. Baraglia, N. Tonellotto, F. Silvestri, A grid information service based on Peer-to-peer, in: Proc. of the 11th Intl. Euro-Par Conference, Lisbon, Portugal, 2005.
[44] M. Marzolla, M. Mordacchini, S. Orlando, Peer-to-peer systems for discovering resources in a dynamic grid, Parallel Computing 33 (4–5) (2007) 339–358.
[45] F.A. Memon, D. Tiebler, F. Dürr, K. Rothermel, Optimized information discovery using self-adapting indices over distributed hash tables, in: Proc. of the 29th Intl. Performance Computing and Communications Conference (IPCCC), Albuquerque, USA, 2010.
[46] A. Robles-Gómez, A. Bermúdez, R. Casado, Efficient network management applied to source routed networks, Parallel Computing 37 (3) (2011) 137–156.
[47] A. Gupta, L.K. Awasthi, A containment-based security model for cycle-stealing P2P applications, Information Security Journal: A Global Perspective 19 (4) (2010) 191–203.
[48] P. Wu, G. Wu, A reputation-based trust model for P2P systems, in: Proc. of Intl. Conference on Computational Intelligence and Security (CIS), Beijing, China, 2009.
[49] C. Gu, S. Zhang, H. Feng, Y. Sun, A novel trust management model for P2P network with reputation and risk evaluation, in: Proc. of Intl. Conference on E-Business and E-Government (ICEE), Guangzhou, China, 2010.
[50] L.A. Adamic, R.M. Lukose, A.R. Puniyani, B.A. Huberman, Search in power-law networks, Tech. Rep., HP Labs, 2001.
[51] C. Hedrick, Routing information protocol, Internet request for comments 1058, 1988.
[52] National Grid Initiatives (NGI), Web page at <http://www.egi.eu/about/ngis/>, Date of last access: August 20, 2013.
[53] Y. Rekhter, T. Li, S. Hares, A border gateway protocol 4 (BGP-4), Internet request for comments 4271, 2006.