
IP-FP6-015964

AEOLUS

Algorithmic Principles for Building Efficient Overlay Computers

Deliverable D3.1.1

Distributed data management:

State-of-the-art survey and algorithmic solutions

Responsible Partner: Max-Planck Institute for Informatics (D)

Report Preparation Date: September 2006

Contract Start Date: 01/09/05    Duration: 48 months

Project Co-ordinator: University of Patras (EL)


Contents

1 Introduction

2 Application Areas and Requirements
  2.1 Applications
  2.2 Load Characteristics and Tradeoffs
  2.3 Quality-of-Service Requirements

3 Data Placement
  3.1 Introduction
  3.2 Structured Overlay Networks
  3.3 DHTs for Global Storage
  3.4 DHTs and SONs for Web Search
  3.5 Initial AEOLUS Results: MINERVA∞ for Scalable P2P Web Search

4 Caching and Replication
  4.1 Data Caching
  4.2 Dynamic Replication
  4.3 Proactive Dissemination

5 Self-Tuning Configuration and Adaptation
  5.1 Initial AEOLUS Results: Practical Proactive Replication

6 Conclusion and Outlook


1 Introduction

This report gives a survey of the state of the art in distributed data management for global computing. It reviews prior work, puts it into perspective, and provides a systematic categorization of technical issues and major research directions.

Distributed data management in global computing spans many different facets with a wealth of existing research results. The work can be differentiated along three dimensions:

• System architectures: The global network and the local node architectures can vary widely, from carefully planned and administered Grid architectures to self-organizing peer-to-peer systems, from file sharing to service-oriented computing, or from distributed database management to Web content management and content delivery networks.

• Data models: The data itself can exhibit different degrees of structure, uniformity or diversity, semantic annotations, and data-quality properties. Data models that capture these aspects to different extents range from text documents and Web pages, over semistructured data with metadata attributes and annotations using XML or RDF formats, all the way to rigorously typed data like relational databases or schematic XML data.

• Application areas: There are numerous applications for global computing, both existing and anticipated ones. The various application areas exhibit fairly different characteristics and thus pose different requirements. Important areas are Web search, Web archiving, publish-subscribe services, scientific workbenches and collaboration, sensor networks, and the management of personal data spaces and social networks.

It is impossible to cover this huge variety of architectures, data models, and application areas, and do justice to the diversity of approaches. Therefore, this report concentrates on a particular kind of system architecture, namely, peer-to-peer (P2P) systems [144], which are characterized by the absence of central planning and administration and instead rely on self-organization principles. They offer the potential for being scaled up to millions of computers, conceptually forming a single virtual overlay computer. On the other hand, they face the challenge of the high dynamics of P2P networks, caused by evolving characteristics of nodes, by node failures, or by so-called churn, the phenomenon that nodes join and leave the network at high rates without giving prior notice. Our overview and discussion of distributed data management will be driven by the goals of achieving ultra-scalability and robustness to high dynamics in a P2P setting.

In terms of data models, we will try to abstract from this aspect as much as possible, using primarily the simplest Web-page model for illustration. As for application areas, we will discuss their requirements in the following section, and will later focus on four major areas as drivers for the technical discussion of data management methods.


A key theme that we consider in this report is the need for quality-of-service (QoS) guarantees, at both system-internal and application-perceived levels. Building a global overlay computer based on the P2P paradigm that merely delivers “best-effort” behavior is not an easy task, but it would fall short of the high ambitions of the AEOLUS project. A best-effort global computer could exhibit frequent outages, lose computational results or even primary data, and may, despite good typical performance, occasionally degrade below acceptable levels in terms of application-perceived delays or throughput. The AEOLUS goal is to understand what guarantees can be given for a P2P-based global computer, and which mechanisms and strategies one should employ in order to provide guaranteed QoS. The work towards this ambitious goal will utilize analytic modeling as well as experimental methods like simulations and measurements.

The rest of this report is organized as follows. Section 2 discusses requirements and the need for quality-of-service guarantees for different application areas. Section 3 discusses data placement without redundancy, with emphasis on structured forms of global overlay networks. Section 4 introduces data redundancy and discusses the issues and approaches for caching and replication. Section 5 discusses how to make data management in a global overlay computer self-tuning and adaptive with respect to evolving conditions; we particularly consider the tuning of the degree of replication and the placement of replicas. The report concludes with an outlook on the next steps in this WP of the AEOLUS project.

2 Application Areas and Requirements

2.1 Applications

As already mentioned in the introduction of this report, we pursue ways towards a P2P-based global overlay computer that supports a variety of data-centric application areas such as:

• Web search,

• Web archiving,

• publish-subscribe services,

• sensor networks,

• scientific workbenches and collaboration,

• personal data spaces and social networks.

In the following we briefly discuss each of these areas, its workload characteristics, and its requirements.

Web search: A conceivable, very intriguing application for a P2P-based global overlay computer is Web search [144, 38, 94, 149, 115, 3, 157, 7, 15, 16, 165, 93, 81, 113, 116, 125, 17, 77]. The functionality would include search for names and simple attributes of files, but also Google-style keyword or even richer XML-oriented search capabilities. It is important to point out that Web search is not simply keyword filtering, but involves relevance assessment and ranking of search results. We envision an architecture where each peer has a full-fledged search engine, with a focused crawler, an index manager, and a top-k query processor. Each peer can compile its data at its discretion, according to the user’s personal interests and data production activities (e.g., publications, blogs, news gathered from different feeds, Web pages collected by a thematically focused crawl). Queries can be executed locally on the small-to-medium personalized corpus, but they can also be forwarded to other, appropriately selected, peers for additional or better search results.

For this application, the P2P paradigm has a number of potential advantages over centralized search engines with very large server farms: 1) The load per peer is orders of magnitude lower than the load per computer in a server farm, so that the P2P-based global computer could afford much richer data representations, e.g., utilizing natural-language processing and statistical learning models such as named entity recognition and relation learning. 2) The local search engine of each peer is a natural way of personalizing search results, by learning from the user’s explicit or implicit feedback given in the form of query logs, click streams, bookmarks, etc. In contrast, personalization in a centralized search engine would face the inherent problem of threatening privacy by aggregating enormous amounts of sensitive personal data. 3) The P2P network is the natural habitat for collaborative search, leveraging the behavior and recommendations of entire user communities in a social network. A key point is that each user has full and direct control over which aspects of her behavior are shared with others, which ones are anonymized, and which ones are kept private.

Web archiving: Today, virtually all Web repositories, including digital libraries and the major Web search engines, capture only current information. But the history of the Web, over its lifetime of the last 15 years and many years to come, is an even richer source of information and latent knowledge. It captures the evolution of digitally born content and also reflects the near-term history of our society, economy, and science [56, 101, 20, 19]. Web archiving is done by the Internet Archive [13], with a current corpus of more than 2 Petabytes and a Terabyte of daily growth, and, to a smaller extent, by some national libraries in Europe. These archives have tremendous latent value for scholars, journalists, and other professional analysts who want to study sociological, political, media-usage, or business trends, and for many other applications such as issues of intellectual property rights. However, they provide only very limited ways of searching timelines and snapshots of historical information. A conceivable “killer application” for P2P-based global computing would be to implement comprehensive Web archiving in a completely distributed manner, utilizing the aggregated resources of millions of decentralized computers, and to provide expressive and efficient “time-travel” querying capabilities for both temporal snapshot search and the analysis of timelines of specific topics.


Publish-subscribe services: In addition to search on demand, where an explicitly issued query returns qualifying data items, many modern applications require that users be automatically alerted about newly published data that matches the user’s implicitly or explicitly specified thematic interests. This is known as the publish-subscribe (PubSub) paradigm [51]. In the database system research literature, subscriptions are also referred to as continuous queries [53, 45]: conceptually, queries are continuously repeated to find new matches as quickly as possible, but efficient implementations use clever ways of query-predicate indexing so that new data items can identify the queries for which they qualify. In information retrieval, PubSub is also known as information filtering [160, 152], and the distributed-systems and middleware research community uses the term event posting [134, 87, 139] for the Pub part and considers simpler forms of queries for the Sub part. PubSub would be a perfect candidate for a P2P-based distributed approach, as both subscriptions and newly posted data naturally arise in a decentralized manner [150, 65, 32, 9, 8, 155]. Nevertheless, today’s successful PubSub applications are typically implemented in a server-oriented, largely centralized manner, especially for data items like news (feeds) or blogs.
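
As an illustration of the query-predicate indexing idea, the following minimal Python sketch (our own toy model, with a subscription reduced to a set of required keywords, not any particular system's design) indexes subscriptions once and lets each newly published item identify the queries it satisfies:

    from collections import defaultdict

    subs_by_keyword = defaultdict(set)   # keyword -> ids of subscriptions requiring it
    required = {}                        # subscription id -> full set of required keywords

    def subscribe(sub_id, keywords):
        """Index the subscription under each of its keywords."""
        required[sub_id] = set(keywords)
        for kw in keywords:
            subs_by_keyword[kw].add(sub_id)

    def publish(item_keywords):
        """Return the ids of all subscriptions matched by the new item.
        Only subscriptions sharing at least one keyword with the item are
        examined, instead of re-running every continuous query."""
        item = set(item_keywords)
        candidates = set()
        for kw in item:
            candidates |= subs_by_keyword.get(kw, set())
        return {s for s in candidates if required[s] <= item}

    subscribe("s1", ["p2p", "search"])
    subscribe("s2", ["sensor"])
    assert publish(["p2p", "search", "web"]) == {"s1"}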

Sensor networks: Such networks combine small devices that measure and monitor real-world phenomena, such as temperature or people in office buildings, cars on highways, water levels or pollution indicators in rivers and lakes, or the avalanche danger level on mountain slopes, to give a few prominent examples. In terms of scale, the biggest current example is probably the monitoring and real-time analysis of IP packets in Internet routers. Sensors can be stationary or mobile, or even part of mobile components that form an ad-hoc network without pre-configured infrastructure; cars on highways are an example of the latter. In addition to sensors, some devices may also serve as actuators as part of feedback loops or for other control purposes. Many applications of sensor networks require the aggregation of values that are reported by individual sensors, in order to monitor danger levels and other thresholds [42, 96, 75]. Again, this is a perfectly decentralized setting for which a P2P-based global-computing approach seems to be the natural method of choice.

Scientific workbenches and collaboration: In e-science, the management of large repositories of scientific data has received most attention in the last few years. Additional issues in this context are data integration, data-quality assurance, and data curation. Typically, server-oriented or carefully configured Grid-oriented architectures are being pursued for these purposes. From the viewpoint of the individual scientist, however, the big e-science repositories are only the starting point; scientists often want to extract appropriately chosen data fragments into their workspaces or electronic workbenches. The workspaces then form the basis for long-lived computations, hypothesis testing, simulations, and other ways of data mining and knowledge discovery. Typically, all these computations produce new, derived data, which needs to be managed, too [147, 148, 49, 50, 62, 61, 156, 119, 86, 24, 71, 67]. Often, the scientific work proceeds through a number of analysis stages and constitutes entire workflows. In some cases, the computations are coupled with real-life experiments in the physical world, and the lab journals should then be kept in electronic form, too. Moreover, many leading projects in natural and other sciences involve geographically distributed teams and call for collaborative work and exchange of measurements, lab journals, and derived data. All this requires corresponding support by the underlying computer infrastructure. Given the scale of millions of scientists on this planet, with certainly more than a hundred thousand projects and highly dynamic collaborations, this would be an ideal application for a P2P-based global overlay computer.

Personal data spaces and social networks: Scientists are one class of people who often maintain extensive data on their personal computers or notebooks. But there are many other categories like journalists, marketing and financial analysts, consultants, etc., and even the “common Internet user” at least manages significant amounts of email data [109]. While simple email management is a common service today, a better service would also consider the data to which email refers and thus integrate email attachments, file versions, and many other elements of the users’ electronic desktops. A truly compelling and comprehensive service would go even further by automatically classifying and organizing all relevant data items and automating many aspects of the users’ work processes. This vision is sometimes referred to as a semantic desktop or personal information manager [46, 110, 67, 31, 11, 48]. The most promising architectural paradigm for a comprehensive solution, with ultra-high scalability, reliability, and availability, would be the P2P-based global computer. Needless to say, strong security and privacy must be prime concerns in such a setting. But compared to server-based central approaches, P2P-based overlay computers have the potential for being much less vulnerable to load bursts, attacks, and sabotage.

The above is already challenging when we limit the means for data sharing among users to explicit email. But users often want to share more information and knowledge, they want to collaboratively create added value, and they generally aim for dynamic knowledge fusion. These processes are typically embedded in social networks like groups of mutually trusted users, but these social groups may quickly grow and are highly dynamic in their memberships and interaction patterns. In applications like Web search, users may harvest the recommendations, user-provided annotations, and query and click histories of such groups to obtain better search results. All this needs to be managed, and flexible control over privacy, with strict enforcement of personal policies, is a vital element of such a data-sharing and knowledge-fusion vision. Fundamental research in P2P-based global computing seems to be the most promising route towards this vision.

In the rest of this report, we will focus on four of the above six application areas, namely, Web search, Web archiving, scientific collaboration, and personal data spaces.


2.2 Load Characteristics and Tradeoffs

The various application areas introduced above exhibit very different workload characteristics. This leads to trade-offs that need to be considered in the design of the P2P-based global overlay computer. For example, it may be impossible to support high update rates in a network with millions of computers and high dynamics if replicas of data items are required to be perfectly consistent at all times; but if the update rate is low enough, this may be feasible even under extremely high query rates. A fundamental discussion of how to deal with these trade-offs is beyond the scope of this state-of-the-art report. In the course of the Aeolus project we aim at a deeper understanding of these trade-offs. In particular, we want to devise ways of supporting different application areas, with divergent characteristics and requirements, on the same global overlay computer, possibly by means of dynamic and automatic reconfigurations. In the following we discuss the most important workload properties that will feed into these considerations at a later stage of the project.

Network size: Networks for the introduced applications can range from thousands of nodes to many millions. Obviously, the goal of Aeolus is to provide unlimited scalability, but in the presence of trade-offs some designs may appeal only to small-to-medium networks and serve their purpose for applications of that scale. Among our example applications, scientific collaboration may often involve only hundreds or thousands of nodes. On the other hand, an entire e-science network may involve millions of scientists and students, with the smaller-scale shared workbenches dynamically created and embedded into the larger network. Web search and Web archiving naturally call for very large networks. For personal data spaces and social networks, the scale may be even larger, as such networks include masses of individual users, whereas Web archiving can be seen as a collaborative endeavor of organizations rather than individuals.

Network dynamics: Dynamics includes the evolving behavior of individual nodes, but also the fact that in large systems component failures occur frequently. Failures can be transient, leading to outages of nodes but with eventual recovery, or permanent, with the risk of losing data. In addition, P2P networks may also exhibit high churn, in the sense that nodes can join and leave the network at very high rates, without explicit notice. Among our application areas, Web archiving and scientific collaboration would typically involve less chaotic dynamics than Web search and, especially, social networks.

Data volume: A commonality among all four application areas is that they involve very large amounts of data, certainly in the Terabyte to Petabyte range, with strong trends of further growth. Web search seems to be the least demanding application in this regard, as it only stores the current Web content. Web archiving clearly requires orders of magnitude higher storage capacity, as it also indexes the history of Web pages. Finally, for scientific workbenches and personal workspaces, the individual user’s storage requirements may be relatively low, but the aggregated data volume over all participating nodes may even dwarf the size of a comprehensive Web archive.


Retrieval vs. update rates: Read-only requests are obviously easier to deal with than updates to data. For example, replication may be simpler and much more aggressive in a read-only setting. Thus, the fraction of updates and their arrival rate are measures of complexity for the global overlay computer. Among our four application areas, Web search has only batch updates and is almost read-only. For Web archiving, the updates take the specific form of inserting new (or newly found) versions of crawled or posted data items; however, the rate of these operations may be very high. Finally, for scientific collaboration and personal data spaces, new data may be produced at high rates, with significant fractions being made sharable among users. For scientific collaboration, updates may even involve sophisticated consistency checking, so it is probably the most demanding application with respect to updates.

Variability and burstiness: Workloads can also be characterized in terms of their variability. This refers to properties such as CPU time or memory demand per work request, and it should also capture the arrival patterns of work requests. The easy case is when requests exhibit relatively uniform properties, with little variation. High variance, on the other hand, imposes much more demanding conditions on the underlying computer infrastructure. In particular, the variance of arrival patterns and extreme forms of burstiness are difficult to cope with. In these regards, Web search and scientific collaboration can be expected to exhibit the highest degrees of variability, whereas Web archiving is a process that could, in principle, be carefully planned for load smoothing, and personal workspaces should benefit from the spatial and temporal aggregation, and corresponding smoothing, of many users each of whom incurs relatively little work.

Degrees of trust and misbehavior: A last criterion that discriminates the application areas in their complexity is the degree of trust among peers and, conversely, the expected degree of misbehavior. Trust may range from unconstrained sharing through cleverly designed reputation assessment to bilateral or small-group service contracts. Misbehavior may range from egoistic “free-rider” behavior through different forms of cheating all the way to full-fledged attacks and attempts at large-scale manipulation. In this WP, we will initially disregard these issues, as they are the primary subject of other SPs. We expect to use results of the other SPs at a later project stage.

Table 1 summarizes the trends from the above discussion, showing the importance of the different aspects for our four application areas.

                      Network  Network    Data       Update     Varia-  Trust and
                      size     dynamics   volume     rate       bility  misbehavior
  Web search          high     high       high       low        high    low
  Web archiving       high     medium     very high  medium     low     medium
  Scientific collab.  medium   low        very high  very high  medium  low
  Personal data       high     very high  high       high       medium  high

Table 1: Properties of Application Areas


2.3 Quality-of-Service Requirements

Quality of Service (QoS) is a notion that is used in widely varying contexts and with varying meanings. The origin of the term QoS is in multimedia systems, regarding the guaranteed delivery of media streams such that they can be displayed at the receiving client(s) in a glitch-free manner. However, the notion has been broadened and now includes aspects that would classically fall into the areas of performance modeling and fault-tolerant systems, such as throughput or availability [68, 105, 23, 102]. In this subsection, we therefore systematically discuss what aspects we want to include under the general term of guaranteed QoS for P2P-based global computing.

Two overriding properties that all good solutions should provide are scalability and efficiency. We define these as follows.

Scalability: A system design should scale up in the following sense. When the load of the overall system increases by a factor of N, it should be possible to grow the system by increasing its resources by a factor of O(N) and sustain the increased load while providing the same or only moderately degraded QoS (in terms of response time, availability, etc.) as the original system. Moderate degradation may be unavoidable, for example, for network latency and response time of network-dominated operations, but should be limited to a factor of O(log N) or less. In the context of P2P-based global computing, the resource that we would consider increasing is the number of nodes in the network (as opposed to increasing the processing power, memory, etc., of each node, which is an option that we would also leverage, but is not in the focus of global computing).

Efficiency: Scalability does not, in full generality, imply efficiency. We can build a scalable system that, for a given design point with load X, is less efficient than another system that performs much better for load X but does not scale to, say, 1000X. Efficiency measures the total resource consumption of the entire system for a given load, taking into consideration CPU time, disk I/O, memory consumption, and network traffic. Typically, the throughput of a system is inversely proportional to its resource consumption. Efficiency is often measured at the level of system-internal components, either per node or restricted to a particular algorithmic method within the system.

More specific notions of QoS, in terms of which we would like to specify and provide guaranteed system behavior, fall into the following four areas:

• Dependability,

• Performance,

• Result quality,

• Robustness.


Dependability:

This includes the proven correctness of the deployed algorithms and software, which is beyond the scope of Aeolus, and also non-functional properties like reliability and availability.

Reliability refers to the time until a system exhibits a failure that prevents it from working further. In data management, this is the time until some data items become unrecoverably lost or damaged, e.g., because of the simultaneous failure of several disks whose data contents cannot be recovered from other storage media. Reliability can be modeled as a random variable; we can analyze it in its stationary behavior, e.g., in terms of the mean-time-to-data-loss (MTTDL), or in its transient behavior, e.g., in terms of the probability that we will lose data in the next hour. We can consider the reliability of a component, e.g., a node, a link, a disk, an index manager, or a replica of a data item, or the reliability of the entire system. By having redundancy and replacing permanently failed components, the entire system can achieve much higher reliability than its components.

Availability is the probability that a component or a data item is accessible at a randomly chosen timepoint. Like reliability, it is naturally modeled, given the stochastic nature of failures, as a random variable. We can consider stationary availability, which is essentially a limit for an infinite time horizon, or transient availability, which considers a finite time horizon. Stationary availability is often given by the “number of nines”; for example, five nines denote an availability of 99.999 percent, which translates into an expected downtime of about 5 minutes per year. For analyzing and assessing availability we should also consider transient component failures, that is, failures that lead to temporary outages of a component but can be repaired so that the component is brought up again to its normal operation. Repair would often involve recovery of data from redundant storage. For the analysis, we need to consider the time-to-failure and the time-to-repair, or their stationary expectations, mean-time-to-failure (MTTF) and mean-time-to-repair (MTTR). Key to high availability are reliable components for high time-to-failure, fast recovery for short time-to-repair, and redundancy of components and data items by means of replication.
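
As an illustration, the standard stationary model (a textbook relationship, not a formula specific to this report) ties these quantities together:

    A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}

For example, A = 0.99999 (five nines) yields an expected downtime of (1 − A) · 525,600 ≈ 5.3 minutes per year, matching the figure quoted above.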

Performance:

This includes resource consumption as well as latency measures. Both can refer to system-internal operations or to user-perceived operations and events. Examples of the former are network bandwidth consumption and round-trip time for messages at the IP or TCP level; examples of the latter are total CPU time for entire business processes and response times of interactive steps such as Web search operations.

Throughput denotes the number of successfully completed operations per time unit and is typically inversely proportional to resource consumption, at least for simple operations that compete for only one resource type. For complex operations and multiple resource types, throughput is a function of the resource consumption of the bottleneck resource. System-internally, throughput can refer to operations such as disk accesses per second or messages per second; user-oriented throughput measures refer to operations such as queries per second.

Response time denotes the time span between the beginning and the end of a requested operation. For user-perceived operations this covers the period between the user issuing the request, e.g., submitting a query to a search engine via HTTP GET or POST, and the user seeing the result on the screen or on her mobile phone. When the system or some system component cannot execute all requests immediately, in order to avoid unlimited resource congestion, response times include waiting times, so-called queueing delays. Once a request is processed by the system, the execution time, also known as service time, is proportional to the resource consumption of the requested operation (for non-preemptable operations). Thus, response time is the sum of queueing time and service time. Under high resource utilization, queueing delays dominate response time. This is why load balancing is so crucial: it keeps the utilization of the “hottest” resource as low as possible. Response time is a random variable that exhibits natural fluctuation. Often, performance reports highlight mean response time, but its variance and higher moments are equally important. In carefully tuned and administered server systems, service level agreements refer to quantiles of the response time distribution.
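
The dominance of queueing delays under high utilization can be made concrete with the classic M/M/1 queue (a textbook illustration of the general point, not a model adopted in this report). With mean service time S, request arrival rate λ, and utilization ρ = λS, the mean response time is

    R = W + S = \frac{S}{1 - \rho}

so the queueing delay W, and with it R, grows without bound as the utilization ρ of the bottleneck resource approaches 1; this is precisely why keeping the “hottest” resource lightly utilized is so important.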

Performability combines the aspects of availability and performance by considering the impact of temporary outages on throughput and response time [137]. Even though a system may have enough redundancy to be virtually 100 percent available (or very close to 100 percent), the fact that some components exhibit transient failures and undergo recovery all the time does reduce the total resources of the system that are available at a randomly chosen timepoint, and thus adversely affects the system’s performance capacity. Performability captures performance measures conditioned on the probability distribution of the resource availability.

Result quality:

Like many areas of computer science, distributed data management incurs trade-offs between performance and the result quality of the system’s operations. In optimization algorithms we may trade off the quality of approximations for faster execution, and likewise, search and other data-centric operations can be accelerated if we tolerate some (hopefully minor) degradation of the quality of the operations’ results.

Data recall denotes the fraction of (relevant) results that a search or lookup operation returns, relative to the total results that it could return if it ran exhaustively. In many P2P applications, such as file sharing, users are willing to accept less than one hundred percent recall; consequently, many P2P systems employ methods such as bounded flooding that cannot guarantee the best possible recall.
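
In standard information-retrieval notation (a textbook definition, added here only for concreteness), with Ret the set of results actually returned and Rel the full set an exhaustive run could return:

    \mathrm{recall} = \frac{|Ret \cap Rel|}{|Rel|}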

Data freshness measures to what extent returned results are guaranteed to be up-to-date or could possibly be stale. This aspect is a particular concern in distributed systems that make intensive use of caching and replication, so that search and other operations may often encounter out-of-date copies. Detecting these situations and ensuring perfect freshness may incur an undue cost in large-scale P2P systems, and this calls for intelligently designed compromises, ideally with quantifiable guarantees.

There are many other facets of data and result quality, such as the precision of search results, which may include false positives (e.g., irrelevant results for a query); the accuracy of data aggregations that utilize sampling or approximations, or that have numerical errors; the authenticity and provenance of information, which may be genuine or copied, and possibly distorted, from other sources; the authority of the data source, e.g., the Web site or organization that provides the data; and so on.

Robustness:

This refers to the system’s ability to cope with dynamics, misbehavior, and attacks. The measures of interest include various kinds of confidence in the correctness, recall, or accuracy of the result returned by an operation, trust between two peers and network-wide reputation of a given peer, resilience to exceptional behavior, etc. Some of these are well understood, whereas others, most notably trust and reputation, lack good theories but are crucially important for self-organizing P2P systems, especially for social networks [63, 100, 43]. Among the well-understood measures is resilience to failures, but it is used with different connotations. One is the maximum number of simultaneously failing components (e.g., replicas of a data item) that the system can sustain without malfunctioning, or the minimum number of properly behaving (e.g., non-cheating) components required by Byzantine agreement protocols. Another one is the maximum failure rate of components that the system can sustain without losing data, becoming (temporarily) unavailable, or becoming inconsistent. Here failures can refer to transient component failures or to peers spontaneously leaving the network without notice.

In this WP, the emphasis will be on dependability, performance, and result-quality properties. Robustness properties, particularly trust and reputation, are addressed in other SPs.

The mechanisms for achieving high QoS guarantees are the same for all four aspects (dependability, performance, result quality, and robustness): redundancy of components and data. This redundancy can be provided in two major ways:

• replication and caching,

• overlay network topologies.

Replication and caching refer to entire nodes, thus providing redundancy of their storage, computational, and communication resources, or to replicas of data items. Typically we speak of replication when data items are assigned to a node for a longer time period, and of caching when we keep an item at a node for a shorter time period and may discard the item anytime. However, conceptually there is no real difference between dynamic replication and caching. Both can choose an item to be replicated at a node either on demand, when it is clear that the node benefits from having the item locally available, or proactively and speculatively. The latter corresponds to the notion of data dissemination in a global overlay network.

When we consider network links as the unit of redundancy, we arrive at overlay network topologies. Conceptually, one can view every overlay network as a bare-minimum basic network that merely ensures connectivity among nodes, plus an additional network that improves performance. The latter provides redundancy of links. Examples would be ring networks that use additional routing entries for message acceleration, connecting nodes by multiple overlay networks with different topologies, or the additional clustering of nodes into neighborhoods of “semantic overlay networks” that provide fast routes for closely related and thus frequently communicating nodes.

In this WP we emphasize node and data redundancy, and will devote less discussion to the issue of overlay network topologies, as this is the subject of other SPs.

3 Data Placement

3.1 Introduction

In this section we consider the issue of data placement without redundancy: given a P2P-based global overlay computer and a (possibly evolving) set of single-copy data items, on which peers should we place these data items so as to balance the access load, provide scalable throughput, and ensure good response time? This model disregards caching and replication, which are the topic of the next section.

The model here could also consider dynamic migration of data items. For network dynamics (peers joining and leaving), migration is unavoidable and will be addressed. For evolving workloads, such as changing popularities of data items, we postpone this discussion until the next section, where we have the additional mechanism of dynamic replication and can discuss much more powerful solutions. The approaches discussed here are not tied to a particular data model. We will mostly assume that each data item has a unique key and is located by exact-match search with regard to this key only.

In P2P networks we have two fundamental choices for data placement, depending on the autonomy of peers:

• In networks with autonomous peers, every peer can compile its own data, and we assume that the global computer does not have control over these peer-specific choices. Thus, data is placed in an arbitrary bottom-up manner. As we disregard caching and replication in this section, there is nothing else to say on this paradigm. Note, however, that this architecture is quite natural for application areas like P2P Web search with personalized contents and search strategies for individual users. Further note that such systems are likely to use additional data structures for directories, summaries, statistics, etc., and these can, in principle, be freely placed, effectively being treated as data items in an overlay of the second kind introduced next.

• In networks in which peers make resource commitments, they become storage nodes within a global overlay computer, and the topology of the global overlay influences the placement of data items. These architectures come in two flavors, with unstructured overlays and with structured overlays. We will briefly consider the first class, and will then focus on the second class for the rest of this report.

For P2P-based overlay networks with peers making resource commitments, the efficient lookup of data items is a fundamental problem that has been tackled from various directions. Early (but nevertheless popular) systems like Gnutella rely on unstructured networks in which a peer forwards messages to all known neighbors. Typically, these messages include a time-to-live (TTL) tag that is decreased whenever the message is forwarded to another peer. Even though studies show that this message flooding (or gossiping) works remarkably well in most cases, there is no guarantee that all relevant nodes will eventually be reached. Additionally, the fact that numerous unnecessary messages are sent interferes with our goal of an efficient and scalable architecture. We will subsequently concentrate on structured networks. Note that this does not rule out randomized techniques, but we prefer algorithmically founded random placement schemes over bottom-up placements that do not follow any scheme at all.
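
The following Python sketch illustrates such TTL-bounded flooding (a minimal model under our own simplifying assumptions: an explicit neighbor graph, duplicate suppression via a visited set, and no real message passing; this is not Gnutella's actual wire protocol):

    class Peer:
        """A node in an unstructured overlay: some neighbors, some items."""
        def __init__(self, pid):
            self.pid = pid
            self.neighbors = []   # known peers
            self.items = set()    # locally stored data items

    def flood_lookup(start, key, ttl):
        """Forward the query to all neighbors, decrementing the TTL per hop.
        Returns the peers where the key was found; once the TTL runs out,
        coverage (and thus recall) is no longer guaranteed."""
        hits = set()
        seen = {start.pid}                 # suppress duplicate deliveries
        frontier = [(start, ttl)]
        while frontier:
            peer, t = frontier.pop()
            if key in peer.items:
                hits.add(peer.pid)
            if t == 0:
                continue                   # TTL exhausted: the query dies here
            for nb in peer.neighbors:
                if nb.pid not in seen:
                    seen.add(nb.pid)
                    frontier.append((nb, t - 1))
        return hits

    a, b, c = Peer("a"), Peer("b"), Peer("c")
    a.neighbors, b.neighbors = [b], [c]
    c.items.add("song.mp3")
    assert flood_lookup(a, "song.mp3", ttl=2) == {"c"}
    assert flood_lookup(a, "song.mp3", ttl=1) == set()   # TTL too small to reach c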

3.2 Structured Overlay Networks

All structured overlay networks are based on the principle of resource virtualization: they map resource identifiers, like keys of data items or node addresses, onto a virtual address space and then allocate virtual ids onto peers. This way the storage management and search algorithms can be implemented on top of a structured overlay network without having to know about physical network properties. The virtualization infrastructure can also take care of re-mappings when peers join or leave the network.

Structured overlay networks have been discussed in the literature in three generations:

• the first generation, with basic overlays that support exact-match key lookups and a scalable virtualization infrastructure,

• the second generation, with additional features regarding faster routing or fault tolerance, and

• the third generation, which supports also advanced operations such as range queries or string matching operations.

The first generation of structured overlay networks is mostly based on distributed hash tables (DHTs) and related techniques. Chord [145] uses consistent hashing for mapping nodes as well as data items onto a virtual ring, and then adds a logarithmic number of routing-table entries to each peer for network efficiency. Consistent hashing [78, 79] provides efficient incremental re-hashing when the target domain of the hash function changes, for example, when nodes fail or leave or when new nodes join the network. Pastry [133] and Tapestry [166] are based on Plaxton trees [123]: nodes are assigned random ids, and a constant number of neighbor links are created for each node based on common prefixes of their ids, effectively constituting an embedding of randomized trees in the network structure. CAN [128] uses a d-dimensional partitioning of the virtual id space and organizes links between neighbors according to a d-dimensional torus topology. Highly related to all these approaches is also the earlier work on scalable distributed data structures (SDDSs), such as LH∗ [92] or Snowball [153], but that work did not consider the problem of heavy churn (and rather focused on scalability with regard to network growth). All of the above-mentioned methods provide fast and scalable lookup of data items and localization of nodes, either in time O(log n) or O(n^(1/d)), where n is the number of nodes in the network; and no peer needs to maintain routing information that requires space larger than O(log n). Good reference points for the first generation of structured overlay networks are Chord and Pastry; their prototype software has been widely adopted in the research community. We describe both approaches in more detail below.

The second generation considered a much wider variety of network topologies including butterfly, hypercube, and various kinds of trees and tries [6, 4, 5, 99, 103, 140, 76, 127, 74, 112, 91]. Moreover, and more importantly, it added deeper considerations on fault tolerance, churn handling, latency issues, and interoperability among multiple, possibly heterogeneous, P2P networks. For fault tolerance, systematic replication or error-correction coding were added and woven into the overlay network itself [114]. For example, for Chord, a simple but effective method is to replicate the data items of a node on its successor or successors in the virtual ring structure; the hash function ensures that no load imbalances are created and that failure modes of successive nodes are largely independent. For churn handling, adaptive message timeout policies and strategies for judicious choice of neighboring peers have been studied [130, 89, 131, 90]. For low latency of request routing, routing tables of Chord-style overlays are enhanced by nodes that exhibit a recent history of short IP round-trip times; these additional neighbor links are dynamically adjusted as the network characteristics evolve over time [39]. Finally, for interoperability, several papers proposed steps towards reference architectures and their alignment with the emerging standards for P2P infrastructure [1, 7, 131], most notably the JXTA framework [66]. A good reference point for the second generation of structured overlay networks is P-Grid [4], discussed in more detail below.

The third generation of structured overlay networks has been aiming to provide efficient support for more versatile and complete operations on top of, or as an integrated part of, the basic overlay infrastructure. The main motivation has been to support much richer applications beyond the classical file-sharing case, for example, database-system functionalities [22, 23, 25, 121, 72, 81]. An operation that has received significant attention is range queries [82, 136, 135, 40, 30, 122]. This is of importance not just for database systems, but for all applications that refer to time attributes, for example, Web archiving and time-travel Web search. The approaches advocated in the literature typically suggest DHT variants based on order-preserving hash functions. This goes a long way, but has limitations in reconciling load balancing with (zero-tuning) self-organization. Another class of operations that researchers have started to investigate in the context of P2P systems are string operations like prefix, suffix, and substring matching [8]. It seems generally fair to say that this current generation of P2P data management is an ongoing endeavor, likely to see more variations and new attempts on the above and further operations in the next few years.

Example Chord

Chord [145, 146] is a distributed lookup protocol for efficient localization of virtual resources. It provides the functionality of a distributed hash table (DHT) by supporting the following lookup operation: given a key, it maps the key onto a node. For this purpose, Chord uses consistent hashing [78, 79]. Consistent hashing tends to balance load, since each node receives roughly the same number of keys. Moreover, this load balancing works even in the presence of a dynamically changing hash range, i.e., when nodes fail or leave the system or when new nodes join.

Chord not only guarantees to find the node responsible for a given key, but also can do this very efficiently: in an N-node steady-state system, each node maintains information about only O(log N) other nodes, and resolves all lookups via O(log N) messages to other nodes. These properties offer the potential for efficient large-scale systems.

The intuitive concept behind Chord is as follows: all nodes p_i and all keys k_i are mapped onto the same cyclic ID space. In the following, we use keys and peer numbers as if the hash function had already been applied, but we do not explicitly show the hash function, for simpler presentation. Every key k_i is assigned to its closest successor p_i in the ID space, i.e., every node is responsible for all keys with identifiers between the ID of its predecessor node and its own ID. For example, consider Figure 1. Ten nodes are distributed across the ID space. Key k_54, for example, is assigned to node p_56 as its closest successor node.

Figure 1: Chord Architecture

A naive approach to locating the peer responsible for a key is also illustrated: since every peer knows how to contact its current successor on the ID circle, a query for k_54 initiated by peer p_8 is passed around the circle until it encounters a pair of nodes that straddle the desired identifier; the second node in the pair (p_56) is the node that is responsible for the key. This lookup process closely resembles searching a linear list and needs an expected O(N) hops to find a target node, while requiring only O(1) information about other nodes.
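
The following Python sketch models this ring and the naive linear lookup (illustrative only; ids are assumed to be already hashed, and the ten node ids beyond the p_8, p_56, and k_54 mentioned in the text are our own assumption):

    NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # hashed node ids on a 2^6 ring
    RING = 2 ** 6

    def next_on_ring(nodes, n):
        """The immediate successor node of node n on the ID circle."""
        bigger = [m for m in sorted(nodes) if m > n]
        return bigger[0] if bigger else min(nodes)

    def owns(nodes, n, key):
        """Node n is responsible for key iff key lies in (predecessor(n), n]."""
        pred = max([m for m in nodes if m < n], default=max(nodes))
        if pred < n:
            return pred < key <= n
        return key > pred or key <= n        # interval wraps across 0

    def naive_lookup(nodes, start, key):
        """Pass the query around the circle via successor pointers:
        O(N) hops, but only O(1) routing state per node."""
        cur, hops = start, 0
        while not owns(nodes, cur, key):
            cur = next_on_ring(nodes, cur)
            hops += 1
        return cur, hops

    assert naive_lookup(NODES, start=8, key=54) == (56, 8)   # k_54 is stored at p_56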

To accelerate lookups, Chord maintains additional routing information: each peer p_i maintains a routing table called a finger table. The m-th entry in the table of node p_i contains a pointer to the first node p_j that succeeds p_i by at least 2^(m-1) on the identifier circle. This scheme has two important characteristics. First, each node stores information about only a small number of other nodes, and knows more about nodes closely following it on the identifier circle than about nodes farther away. Second, a node's finger table does not necessarily contain enough information to directly determine the node responsible for an arbitrary key k_i. However, since each peer has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between itself and the target node. This property is illustrated in Figure 2 for node p_8. It follows that the number of nodes to be contacted (and, thus, the number of messages to be sent) to find a target node in an N-node system is O(log N).

Figure 2: Scalable Lookups Using Finger Tables
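
Continuing the sketch above (same toy ring and helper functions; again an illustration, not Chord's actual implementation), the finger table and the resulting O(log N) lookup can be modeled as follows:

    def successor_node(nodes, ident):
        """First node whose id is at or after ident (wrapping around)."""
        at_or_after = [m for m in sorted(nodes) if m >= ident]
        return at_or_after[0] if at_or_after else min(nodes)

    def dist(a, b):
        return (b - a) % RING                 # clockwise ring distance

    def finger_table(nodes, p, id_bits=6):
        """The m-th finger points to the first node succeeding p by >= 2^(m-1)."""
        return [successor_node(nodes, (p + 2 ** (m - 1)) % RING)
                for m in range(1, id_bits + 1)]

    def finger_lookup(nodes, start, key):
        """Forward to the farthest finger not past the key; every hop at
        least halves the remaining ring distance, giving O(log N) hops."""
        cur, hops = start, 0
        while not owns(nodes, cur, key):
            best = next_on_ring(nodes, cur)   # fallback: plain successor
            for f in finger_table(nodes, cur):
                if dist(cur, f) <= dist(cur, key) and dist(cur, f) > dist(cur, best):
                    best = f
            cur, hops = best, hops + 1
        return cur, hops

    assert finger_lookup(NODES, start=8, key=54) == (56, 3)   # vs. 8 naive hops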

Chord implements a stabilization protocol that each peer runs periodically in the background and that updates Chord’s finger tables and successor pointers in order to ensure that lookups execute correctly as the set of participating peers changes. But even when routing information becomes stale, system performance degrades gracefully.

Chord can provide lookup services for various applications, such as distributed file systems or cooperative mirroring. However, Chord by itself is not a full-fledged global storage system, and it is not a search engine either, as it only supports single-term exact-match queries and does not support any form of ranking.

Example Pastry

Pastry [133] is a self-organizing structured overlay network that uses a routing scheme based on prefix matching. Each node is assigned a globally unique 128-bit identifier from the domain 0..2^128 − 1, in the form of a sequence of digits with base 2^b, where b is a configuration parameter with typical value 4. Like Chord, Pastry offers a simple routing method that efficiently determines the node that is numerically closest to a given key, i.e., the node that is currently responsible for maintaining that key.

To enable efficient routing in an N-node network, each peer maintains a routing table that consists of ⌈log_{2^b} N⌉ rows with 2^b − 1 entries each, where each entry consists of a Pastry identifier and the contact information (IP address, port) of the numerically closest node currently responsible for that key. All 2^b − 1 entries in row n represent nodes with a Pastry identifier that shares the first n digits with the current node’s identifier, but each with a different (n+1)-st digit (2^b − 1 possible values).

The prefix routing now works as follows. For a given key, the current node forwards the request to the node from its routing table whose identifier has the longest common prefix with the key. Intuitively, each routing hop fixes at least one additional digit toward the desired key. Thus, in a network of N nodes, Pastry can route a message to a currently responsible node with fewer than ⌈log_{2^b} N⌉ message hops.
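
The following sketch illustrates prefix routing in this style (our own simplified Python model: ids are strings of base-2^b digits with b = 2, and each routing table is reduced to a map from (row, digit) to a node id; leaf sets, locality heuristics, and numeric-closeness tie-breaking are omitted):

    def common_prefix_len(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def route(tables, start, key):
        """Forward toward nodes sharing ever longer prefixes with the key.
        Each hop fixes at least one more digit, so an N-node network needs
        at most about log_{2^b}(N) hops."""
        cur, hops = start, 0
        while True:
            p = common_prefix_len(cur, key)
            if p == len(key):                     # reached the exact id
                return cur, hops
            nxt = tables[cur].get((p, key[p]))    # row p, column: next digit of key
            if nxt is None:                       # simplified: real Pastry would
                return cur, hops                  # fall back to its leaf set here
            cur, hops = nxt, hops + 1

    # Toy network of 4-digit ids (base 4): route from 0000 toward 3231.
    tables = {
        "0000": {(0, "3"): "3012"},               # fixes digit 1
        "3012": {(1, "2"): "3201"},               # fixes digit 2
        "3201": {(2, "3"): "3230"},               # fixes digit 3
        "3230": {(3, "1"): "3231"},               # fixes digit 4
        "3231": {},
    }
    assert route(tables, "0000", "3231") == ("3231", 4)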

Example P-Grid

P-Grid [4] is a peer-to-peer lookup system based on a virtual, distributed search tree. Each peer stores a partition of the overall tree. A peer’s position is determined by a binary bit string (called its path) representing the subset of keys that the peer is responsible for.

P-Grid’s query routing approach is as follows. For each bit in its path, a peer stores a reference to at least one other peer that is responsible for the other side of the binary tree at that level. Thus, if a peer receives a request regarding a key it is not responsible for, it forwards the request to a peer that is “closer” to the given key. This process closely resembles the prefix-based routing approach taken by Pastry.


The peer paths are not determined a priori but are acquired and changed dynamically through negotiation with other peers. In the worst case, for degenerate data key distributions, the tree shape no longer provides an upper bound for the search cost; rather, the search cost may grow up to a depth that is linear in the network size. However, theoretical analysis shows that for a (sufficiently) randomized selection of links to other peers in the routing tables, the search cost in terms of messages probabilistically remains logarithmic, independently of the lengths of the paths occurring in the virtual tree.

3.3 DHTs for Global Storage

From the viewpoint of the overlay infrastructure, global storage can be seen as an application layered on top of a DHT or other structured overlay network. However, it is a generic and highly versatile application that itself deserves prime attention. Several proposals have been made in the literature for building global file systems on top of a P2P overlay network; the most prominent examples, which have also led to major prototyping and significant experimental work, are Oceanstore [132, 84] and Total Recall [21].

Oceanstore (actually, its prototype implementation coined Pond) [132, 84, 129] is built on top of Tapestry [166]. It virtualizes file ids (or file names) and assigns them to network nodes in a randomized manner. For efficient lookup, Plaxton trees are the mechanism that Tapestry provides in the overlay infrastructure. As an additional lookup accelerator, Oceanstore gives each node a staged set of Bloom filters, one filter for each distance level, for efficient probabilistic location of files that reside at topologically nearby nodes. For fault tolerance, error-correcting code (ECC) blocks are computed and stored at separate nodes; more specifically, Reed-Solomon codes are used to this end. As the reconstruction of corrupted file blocks is an expensive operation with ECCs alone, full-content blocks are additionally cached/replicated at additional nodes. Updates are handled by a no-overwrite versioning approach for all files, and concurrent updates are handled by a conflict resolution method that can be made application-driven by appropriate hooks into Oceanstore. For example, latest-update-wins could be a conflict-resolution policy, but more sophisticated predicate-based policies are supported as well. All aspects of the conflict resolution for updates and the fault tolerance by ECCs are managed by a specifically trusted core set of nodes, the so-called “inner ring” of Oceanstore. This resembles the super-peer architecture that most commercial P2P systems have adopted for MP3 and other file sharing. Strictly speaking, these are not perfectly scalable and completely self-organizing architectures, as super-peers are different from normal peers and are assumed to be more carefully administered than the average personal computer on the Internet.
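
As an illustration of the kind of synopsis involved, the following is a minimal Bloom filter sketch in Python (our own toy implementation of the generic data structure; the parameters and the SHA-1-based hashing are assumptions, not Oceanstore's actual code):

    import hashlib

    class BloomFilter:
        """A compact, probabilistic set-membership test: false positives
        are possible, false negatives are not."""
        def __init__(self, m_bits=1024, k_hashes=4):
            self.m, self.k = m_bits, k_hashes
            self.bits = 0                          # bit vector as a Python int

        def _positions(self, key):
            for i in range(self.k):                # k independent hash positions
                h = hashlib.sha1(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def might_contain(self, key):
            return all(self.bits & (1 << pos) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("fileid-42")
    assert bf.might_contain("fileid-42")           # always true once added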

Oceanstore is a landmark piece of research in the field of P2P systems, but its functionality is still very basic relative to the much richer application areas that we consider for global computing. Moreover, Oceanstore’s reliance on a kind of super-peer paradigm makes it less attractive for the AEOLUS vision of ultra-scalable and completely self-organizing solutions. Recently, various kinds of higher-level data managers have been proposed in a P2P setting, most notably with database system and search engine functionalities. In the first line, the best examples are PIER [72], ObjectGlobe [22], and DBGlobe [121]. All three support relational data and the key set of relational database operations, including joins and aggregation queries. Experimentation with PIER has shown reasonable performance, but it is fair to say that none of these systems has undergone real stress testing with regard to scalability and resilience to dynamics. In the second line, search engine functionality, recent and ongoing work includes a wider variety of approaches; we discuss these in the next subsection.

3.4 DHTs and SONs for Web Search

Peer-to-peer (P2P) networks potentially offer major advantages for large-scale decentralized search of information such as Web search [144]. The thought-provoking position paper [88] has argued that Google-style Web search does not scale in a decentralized setting and can thus not be efficiently implemented in a P2P network. However, this paper made several crude assumptions that biased the discussion (for the sake of the position statement). Moreover, the paper assumed that the full contents of an inverted index would be distributed across the nodes of a P2P network. In contrast, a much more viable approach is to limit indexing at the network level to the coarser granularity of entire peers [38, 15, 16]. For each keyword or other global search feature, an entry about a peer’s quality would be indexed in the P2P overlay network, whereas information about an individual Web page would be indexed only locally by the peers that hold the page. Search requests issued by a peer can first be executed locally, on the peer’s locally indexed content (possibly with methods for personalization and other advanced IR techniques). In cases where the recall of the local search result is unsatisfactory, the query can be forwarded to a small set of other peers that are expected to provide thematically relevant, high-quality, and previously unseen results. Deciding on this set of target peers is the query routing problem in a P2P search network (aka collection selection).

A practically viable query routing strategy needs to consider the similarity of peers in terms of their thematic profiles, the overlap of their contents, the potential relevance of a target peer for the given query, and also the costs of network communication. Many proposals have been made in the literature, for example: globally available term statistics about the peers’ contents [26, 58, 106, 14], epidemic routing using gossiping strategies [77], routing indices with peer summaries from local neighborhoods [36, 93], statistical synopses such as Bloom filters or hash sketches maintained in a directory based on distributed hash tables (DHTs) [15, 107, 124], randomized expander graphs with low-diameter guarantees [97, 98] and randomized rendezvous [118], clustering of thematically related peers [47], superpeer-based hierarchical networks [94, 95], cost/benefit optimization based on coarse-grained global knowledge [115, 116], and many more.

Typically, query routing decisions would be made at query run-time, when the query is issued at some peer. But many of the above methods involve directory lookups, statistical computations, and multi-hop messages; so it is desirable to precompute some basic elements of query routing decisions and amortize this information over many queries. A technique for doing this is to encode a similarity-based precomputed binary relation among peers into a so-called semantic overlay network (SON) [37, 2]. Each peer P thus becomes directly connected to a small number of peers that are likely to be good routing targets for many of P’s queries. Then, at query run-time, the query router would consider only the SON neighbors of the query initiator and select a subset of these based on a more detailed analysis of similarity, overlap, networking costs, etc.

All of the above approaches fall into the class of bottom-up P2P systems with emphasis on node autonomy. This necessitates dynamic self-organization, but at the same time it shows that the approaches are, in principle, susceptible to load imbalance: bottlenecks could arise that adversely affect the system’s scalability. For example, the data held by a particular peer may be so popular that this peer (and the network paths in its immediate neighborhood) becomes overloaded and forms an inherent bottleneck. In contrast, a top-down system could be designed to break unduly “hot” data units into sub-units so as to improve load balance and completely eliminate bottlenecks. This may come at the expense of higher cost for certain simple operations, but in some situations scalability may be more important than the best possible efficiency of resource usage. In the next subsection, we present a recently developed top-down design with this desired property of unlimited, virtually perfect scalability but suboptimal efficiency.

3.5 Initial AEOLUS Results: MINERVA∞ for Scalable P2P Web Search

In joint work of the Max-Planck Institute for Informatics and the University of Patras, the MINERVA∞ architecture has been designed and prototyped in a simulation testbed [108, 18]. This work is related to the MINERVA prototyping [16], carried out in the context of the EU DELIS project, but significantly differs in its design paradigm and specific suitability for a global overlay computer. This will be explained in the following.

MINERVA∞ peers are assumed to be members of G, a global DHT overlay network. MINERVA∞ is designed with the goal of facilitating ultra scalability. For this reason, the fundamental distinguishing feature of MINERVA∞ is its high distribution in both the data and computational dimensions. In this sense, it goes far beyond MINERVA and the state of the art in distributed top-k query processing algorithms, which are based on nodes storing complete index lists for terms and performing coordinator-based top-k algorithms over these nodes accessing their local index lists. MINERVA∞ involves sophisticated distributed query execution, engaging a large number of peers, which collectively store the accessed portions of a queried index list. To achieve ultra scalability, the key computations (such as the maintenance and retrieval of the data items) engage several different nodes, with each node having to perform only small (sub)tasks.

Our approach to materializing this design relies on the novel notion of Term Index Networks (TINs). A TIN can be conceptualized as a virtual node storing a virtually global index list for a term, constructed by the sorted merging of the separate complete index lists for the term computed at different nodes. TINs serve two roles: first, as an abstraction encapsulating the information specific to a term of interest, and second, as a physical manifestation of a distributed repository of the term-specific data items, facilitating their efficient and scalable retrieval during top-k query processing. TINs are comprised of nodes that collectively store different horizontal partitions of this global index list. During the execution of a top-k query involving r terms, the query-initiator node (and any other node) never needs to simultaneously communicate with more than r other nodes. Furthermore, as the top-k algorithm processes different data items for each query term, it gradually involves different nodes from each TIN, producing a highly distributed, scalable solution.

In general, TINs can form separate overlay networks, coexisting with the global overlay G. In practice, it may not always be necessary or advisable to form full-fledged separate overlays for TINs; instead, TINs may be formed as straightforward extensions of G: when a node n of G joins a TIN, only two additional links are added to the state of n, linking it to its successor and predecessor nodes in the TIN. In this case, a TIN is simply a doubly-linked list of nodes.

Distributed Query Processing

The design of MINERVA∞, and in particular the placement of data and nodes, is heavily influenced by the way the well-known, efficient top-k query processing algorithms (e.g., [54]) operate, looking for docIDs within certain ranges of score values. Correspondingly, the network’s lookup(s) function will be called with scores s as input, to locate the nodes storing docIDs associated with scores s.

For any highly distributed solution to be efficient, it is crucial to keep the time and bandwidth overheads as low as possible. To achieve this, MINERVA∞ follows the principles put forward by top-performing, resource-efficient top-k query processing algorithms in traditional environments. Specifically, the principle of favoring sequential index-list accesses over random accesses (in order to avoid random disk IOs) has been adapted in our distributed algorithms to ensure, first, that sequential accesses dominate and, second, that they require at most one-hop communication between nodes. In addition, random accesses require at most O(log N) messages. To ensure the at-most-one-hop communication requirement for successive sequential accesses, MINERVA∞ utilizes an order-preserving hash function hop(), which has the property that for any two values v1, v2, if v1 > v2 then hop(v1) > hop(v2). This guarantees that data items corresponding to successive score values of a term t are placed either at the same or at neighboring nodes of the TIN for I(t). Similar functionality could be provided by employing the SkipNet overlay for each I(t).
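
As a minimal illustration, the following sketch shows one possible order-preserving mapping from scores to DHT identifiers; the linear construction and the constants are our own assumptions, as the concrete hop() function of MINERVA∞ is not spelled out here.

```python
# A minimal order-preserving hash from scores in [0, 1] to a circular
# DHT identifier space (our own toy construction, not MINERVA-infinity's hop()).

M = 2 ** 32  # assumed size of the identifier space

def hop(score: float) -> int:
    """Map a score to an ID such that s1 > s2 implies hop(s1) > hop(s2)."""
    assert 0.0 <= score <= 1.0
    return int(score * (M - 1))

# Successive score values land on the same or neighboring nodes, so scanning
# an index list in score order costs at most one overlay hop per step:
assert hop(0.91) > hop(0.90) > hop(0.89)
```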

Putting Everything Together

In summary, the MINERVA∞ design and processing are based on the following pillars.


1. Data items are (term, docID, score) triplets, posted to the underlying DHT network, using an order-preserving hash function on the score value to identify the node which will store this item. This node then becomes a member of the TIN for the index list for the named term, using special gateway nodes (which are randomly selected for each TIN from the N nodes). This results in successive nodes of a TIN storing items with successive scores.

2. The gateway nodes are easily identifiable since they are made to store dummy predefined score values. Hashing for one of these predefined score values yields the ID of a gateway node.

3. Once TINs are populated, queries are executed by having the query initiator node of G send, for each query term, a message to the node responsible for storing the highest score value (e.g., the value 1). This is achieved by hashing for the pair (term, 1) using the order-preserving hash function. In this way, the “top” nodes of each relevant TIN are identified and the query is sent to them.

4. Query processing is batched; it proceeds in communication phases between the initiator and the TIN nodes, with each phase collecting a certain portion (batch size) of the index list stored in each TIN. This in essence creates a pipeline, defined by the TIN nodes, which collaborate to collect the batch of index list entries for the current phase.

5. Communication between any two nodes in a TIN during this process requires one hop at a time; a consequence of order-preserving placement.

6. The initiator collects a batch-size worth of index list entries from every TIN and then locally runs a top-k algorithm.

7. This process continues with the initiator collecting more batches of data from the TINs (accessing nodes further “down” in the TIN) until the top-k result can be computed.

The MINERVA∞ design can leverage DHT technology to facilitate efficiency and scalability in key aspects of the system’s operation. Specifically, posting (and deleting) data items for a term from any node can be done in O(log N) time, in terms of the number of messages. Similarly, during top-k query processing, the TINs of the terms in the query can also be reached by the initiator in O(log N) messages. Furthermore, no single node is over-burdened with tasks which can either require more resources than available, or exhaust its resources, or even stress its resources for longer periods of time. This follows since (i) no node needs to communicate with more than r other nodes in an r-term query (regardless of the number of peers that crawled the Web and constructed index lists for these terms) and (ii) the ’pipelined’ processing within each TIN in essence facilitates the pooling of nodes’ resources to create ’virtual’ nodes of a capacity close to the sum of the individual node capacities.
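
The following self-contained toy program renders the batched interaction of pillars 3-7 in Python; the in-memory TIN dictionary and function names are our own stand-ins for the distributed structures, and the early-termination logic of real top-k algorithms such as [54] is omitted for brevity.

```python
# Toy rendering of the batched query processing described above. The TIN
# dictionary stands in for distributed Term Index Networks; real executions
# would also terminate early once the top-k result is provably complete.
import heapq

# Per term, a score-descending index list of (docID, score) entries,
# conceptually partitioned over successive TIN nodes.
TIN = {
    "overlay": [("d1", 0.9), ("d3", 0.7), ("d2", 0.4)],
    "search":  [("d3", 0.8), ("d1", 0.5), ("d4", 0.2)],
}

def fetch_batch(term, offset, batch_size):
    """One communication phase: pull the next batch from a term's TIN."""
    return TIN[term][offset:offset + batch_size]

def p2p_topk(terms, k, batch_size=2):
    offsets = {t: 0 for t in terms}        # how far "down" each TIN we are
    scores = {}                            # partial score sums per docID
    exhausted = set()
    while len(exhausted) < len(terms):
        for t in terms:                    # one batch per TIN and phase
            batch = fetch_batch(t, offsets[t], batch_size)
            offsets[t] += len(batch)
            if not batch:
                exhausted.add(t)
            for doc, s in batch:
                scores[doc] = scores.get(doc, 0.0) + s
    # the initiator runs a local top-k algorithm over the collected entries
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(p2p_topk(["overlay", "search"], k=2))  # e.g. [('d3', 1.5...), ('d1', 1.4)]
```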


4 Caching and Replication

4.1 Data Caching

Caching of data items is a universal technique for reducing the latency and access cost of slower storage or network paths. We assume that every data item has a unique home where its original copy resides; the node that holds this original copy is referred to as the owner of the item. When designing a cache manager, a number of issues arise:

• the system architecture, which can be centralized as in a server-attached storage system with disks and memory, or hierarchical as in Internet caching with HTTP servers, proxy caches, and browser caches, or completely decentralized with data items being cached in arbitrary network nodes;

• the possible paths for obtaining a data item that is not yet in the requestor’s local cache, which can be either from the item’s home, a generally available and consistency-wise safe choice, or from any other node that currently has a cached copy, an approach known as “cooperative caching”;

• the granularity of caching, for example, fixed-size units such as disk blocks or variable-size units such as Web pages;

• the nature of the caching granularities, ranging from device-oriented units like blocks to application-oriented units like query results in DB or IR systems;

• the storage model and its implications on the access costs of a cache miss, varying from uniform storage where every access has the same cost to highly non-uniform network storage where diverse bandwidths, complex topologies, and routing protocols incur different costs for different data items.

Traditionally, caching has been space-restricted. So when a cache is full and a new data item is fetched, the cache manager needs to identify a replacement victim. The classical cache replacement method of choice is LRU (least recently used) for fixed-size data items, but more recent methods like LRU-k [117] or ARC [104] combine access recency and frequency information in a more intelligent manner (with aging for the frequency part). For variable-size data items, the Greedy-Dual method [27] and the closely related Landlord algorithm [163] are the best known solutions. They are based on estimating the benefit of caching a data item, i.e., the expected savings in access costs compared to fetching the item from its storage home. These methods are also highly related to the LRU-k method and its variants.
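
As an illustration of this benefit-based eviction principle, the following is a compact sketch of the Landlord algorithm [163] for variable-size items; the class layout is our own, and the cost and size values are supplied by the caller.

```python
# Compact sketch of the Landlord algorithm for variable-size caching:
# each item holds "credit" (initially its fetch cost); to make room, all
# items are charged rent proportional to size until one becomes evictable.

class Landlord:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.items = {}   # key -> [credit, size, cost]

    def access(self, key, size, cost):
        """Returns True on a hit; on a miss, evicts and admits the item."""
        if key in self.items:
            self.items[key][0] = self.items[key][2]   # hit: refresh credit
            return True
        assert size <= self.capacity
        while self.used + size > self.capacity:
            delta = min(c / s for c, s, _ in self.items.values())
            for entry in self.items.values():
                entry[0] -= delta * entry[1]          # charge rent
            for k in [k for k, v in self.items.items() if v[0] <= 1e-12]:
                self.used -= self.items[k][1]
                del self.items[k]                     # evict broke tenants
        self.items[key] = [cost, size, cost]
        self.used += size
        return False
```

Items with a high fetch cost per byte thus survive longer, which approximates the expected savings in access costs discussed above.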

In distributed Internet caching [126, 83, 142, 164, 28], on the other hand, cache space is of less concern, as even disk-resident caches are beneficial in terms of networking costs. In these settings, cache replacement is driven more by the requirements of data freshness and consistency. The perfect solution would be global cache coherence, ensuring that every access to a cached copy sees the most recent global state of the corresponding data item. Often, weaker notions of consistency are accepted, too, as discussed below. Moreover, most architectures for distributed caching make the restrictive assumption that updates to a data item take place only at the item’s home. For Web caching, this is a natural assumption; for broader classes of global-computing applications it may be too restrictive. Typically, this assumption is not made for replication architectures, which will be discussed in the next subsection.

The approaches for cache freshness and consistency can be categorized into two broad classes [59, 162, 154, 161]:

• pull-based, with responsibility for consistency delegated to the node that currently holds a cached copy, or

• push-based with responsibility for consistency staying with the owner of a data item.

The simplest technique for pull-oriented methods is based on time-to-live (TTL) specifications. When a node fetches a data item from the item’s home into its cache, the item comes with a TTL and is considered invalid and then purged after the TTL expires. In cooperative caching, where the item may be received from another copy holder rather than the item’s owner, the TTL is not re-initialized, so that only the remaining time to expiration counts. TTL-based methods do not have any freshness or consistency guarantees, for it is possible that the item is updated many times at its home before the TTL of a copy holder expires. A simple technique for avoiding this situation is to require the copy holder to check back with the item’s owner upon each access to the cached copy. This is used in Web caching by means of the HTTP get-if-modified-since method. The owner sends a fresh copy of the item only if there has indeed been an update; otherwise only a short acknowledgement is sent. This technique is efficient in terms of bandwidth consumption, but it does require extensive short messages and pays the price of the corresponding latencies. Thus, it may be prohibitive in some application areas of global overlay computers.
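
A minimal sketch of this pull-based scheme follows; the origin object with fetch and modified_since methods is a hypothetical stand-in for the item's owner and the HTTP get-if-modified-since exchange.

```python
# Toy TTL cache with conditional revalidation (get-if-modified-since style).
import time

class TTLCache:
    def __init__(self, origin):
        self.origin = origin      # owner: fetch(key) -> (value, version),
                                  # modified_since(key, version) -> bool
        self.entries = {}         # key -> (value, version, expires_at)

    def get(self, key, ttl=60.0):
        now = time.time()
        entry = self.entries.get(key)
        if entry and now < entry[2]:
            return entry[0]       # fresh by TTL: no message to the owner
        if entry:
            # expired: cheap conditional check instead of a full transfer
            if not self.origin.modified_since(key, entry[1]):
                self.entries[key] = (entry[0], entry[1], now + ttl)
                return entry[0]   # owner sent only a short acknowledgement
        value, version = self.origin.fetch(key)   # miss or stale copy
        self.entries[key] = (value, version, now + ttl)
        return value
```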

In push-oriented methods, on the other hand, the owner of a data item remembers which nodes have obtained a cached copy. Upon updates (or after a certain number of updates or large changes in the item’s value), the owner notifies the copy holders and invalidates the copies or sends fresh versions. A popular hybrid method uses leases [59], which guarantee push-based invalidation only for a limited time; after this time, the owner has no obligations anymore, and it is up to the copy holders whether they want to switch to a pull-based mode or live with potentially stale copies.

In cooperative caching, another critical issue is locating cached copies, in order to reduce the necessity of contacting the owner of a data item. The Summary-Cache technique [55] is a widely cited solution; it uses lazily maintained Bloom filters at every node as approximate synopses of other nodes’ cache contents. An excellent example of a full-fledged P2P-based cooperative caching system is the Squirrel system [73], which is described next.


Example: Squirrel

Squirrel [73] is a decentralized peer-to-peer (P2P) Web cache. It is scalable, self-organizing, and resilient to peer failures. Without the need for additional hardware or administration, it is able to achieve the functionality and the performance of a traditional centralized Web cache. It is proposed to run in a corporate LAN type environment, located, e.g., in a building or a single geographical region.

Squirrel is based on the P2P overlay network Pastry [133], an object location and routing protocol for large-scale P2P systems, which provides the mentioned features and is also described in this survey. The goals and the motivation for Web caching are the reduction of load on external Web servers and corporate routers, and of course of external traffic (traffic between the corporate LAN and the Internet), which is expensive, especially for large organizations. Squirrel is a way to achieve these goals without the use of a centralized Web cache or even clusters of Web caches. The key idea is to enable Web browsers on desktop machines to share their local caches, to form an efficient and scalable Web cache, without the need for dedicated hardware and the associated administrative cost.

When a client requests an object, it first sends a request to the Squirrel proxy running on the client’s machine. If the object is uncacheable, the proxy forwards the request directly to the origin Web server. Otherwise it checks the local cache. If a fresh copy of the object is not found in this cache, then Squirrel tries to locate one on some other node. To do so, it uses the distributed hash table and the routing functionalities provided by Pastry. First, the URL of the object is hashed to give a 128-bit Object-ID in a circular identifier space. Then the routing procedure of Pastry forwards the request to the node with the Node-ID (assigned randomly by Pastry to a participating node) numerically closest to the Object-ID. This node then becomes the home node for this object. Squirrel supports two modes for freshness and invalidation of copies: the home-store model and the directory model.
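
The mapping from URLs to home nodes can be sketched as follows; MD5 serves here merely as a convenient 128-bit stand-in hash, and real Squirrel delegates the closest-node routing to Pastry rather than computing it from a global node list.

```python
# Toy illustration of Squirrel's home-node mapping on a 128-bit ring.
import hashlib

RING = 2 ** 128

def object_id(url: str) -> int:
    """Hash a URL into the circular 128-bit identifier space."""
    return int.from_bytes(hashlib.md5(url.encode()).digest(), "big")

def home_node(url: str, node_ids: list) -> int:
    """Pick the node whose ID is numerically closest to the Object-ID
    (with wrap-around on the ring); Pastry routes there hop by hop."""
    oid = object_id(url)
    return min(node_ids, key=lambda n: min((n - oid) % RING, (oid - n) % RING))
```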

Home-Store Model: In the home-store model, requested objects are also stored at their home nodes. If a node in the network requests an object and does not have a fresh copy in its local cache, it sends a request to the home node of the object. The home node can be found by simply mapping the object to its Object-ID using the hash function and forwarding the request towards the node closest to this Object-ID. The node that cannot forward the request to another node whose Node-ID is closer to the Object-ID than its own will eventually notice that it is the home node for this object. The home node checks whether it already has a fresh copy of the requested object in its local cache. If it does, the home node sends the object directly to the node that initiated the request, or returns a not-modified message, depending on which action is appropriate. The requesting node saves the retrieved object in its cache and returns it to the user. The case where the home node has no copy of the requested object, or only a stale copy in its local cache, is handled differently: the home node issues an HTTP get or get-if-modified-since request to the origin Web server hosting the object. The home node then receives either a cacheable fresh copy of the object or a not-modified message, and takes the appropriate action with respect to the client, sending a copy of the object or a not-modified message to the requesting client.

Directory Model: In the directory model, the home node for an object maintains a small directory of pointers to nodes that have recently accessed the object. Subsequent requests for this object are redirected to a randomly chosen node from this directory (called the delegate), in the expectation that it has a locally cached copy of the object. The home node does not locally store the objects it is responsible for; it only keeps meta-data such as the fetch time, last-modified time, or explicit time-to-live. With this meta-data, the home node is able to apply the expiration policies of the Web browsers without storing the object itself. The home node is therefore able to invalidate, at once, all nodes in the directory that hold a cached copy, if the object has changed. Since the home node does not send requests to the origin Web server, it receives the meta-data for the object from the delegates which requested the object. A delegate that receives a forwarded request from the home node has to check the freshness of its cached copy before it sends the object to the requesting client. If the object has changed, the delegate informs the home node so that it can update the directory for the object.

4.2 Dynamic Replication

Replication is traditionally understood as a static configuration for the placement of copies of data items, for the purpose of increased reliability and availability as well as better load sharing. In large-scale distributed systems that rely on self-organization rather than carefully planned administration, replication is seen as a dynamic mechanism: a new copy may be created when an existing copy fails (transiently or permanently) or when some node becomes overloaded, copies may be migrated, or replicas may simply be the result of cached copies being kept at nodes for a longer time period. Thus, there is no conceptual difference between caching and dynamic replication. Methods for dynamic replication may also be used for cooperative caching and vice versa. However, there may be pragmatic differences due to the different time scales that are typically associated with caching and replication: caching is more short-term, whereas replication is long-term. Furthermore, in replication schemes, the assumption that only the owner of a data item can perform updates is often (but not necessarily) dropped.

The critical issues in dynamic replication are

• determining the number of replicas that we want to have for a given data item, based on goals for reliability, availability, and performance;

• determining on which nodes we should place these replicas;

• designing a strategy for adjusting the replica placement upon certain events such as node failures or load bursts;

• designing a mechanism and a strategy for keeping replicas fresh and consistent, relative to the desired consistency or relaxed-consistency model.


The first two items are parameter tuning issues that will be addressed in the next section. As for consistency, the situation is similar to the one for caching, but the literature on replication has proposed a wider variety of approaches.

In the synchronous or eager replication model, replication control protocols aim for perfect consistency in the sense that every read access sees the globally most recent version of the accessed data item. Protocols for this purpose include primary-copy methods and the quorum consensus family. In transactional information systems such as distributed database systems, the consistency model is even stronger in that it also considers logical relationships among different data items and combines accesses to such interrelated items into an atomic transaction [158]. This guarantees consistent data from a business-logic viewpoint and ensures a consistent view of the global data when reading replicas of multiple data items. In this WP we will not consider transactional replication, and rather focus solely on per-data-item consistency.

It has been argued that synchronous replication does not scale well to large networks with potentially high degrees of replication [60]. This argument has triggered work on improved protocols, leveraging properties of the underlying communication infrastructure [120, 80, 159] or the logical relationships among data items [12], but it has also spawned work on relaxed consistency properties [64, 10]. In particular, the family of asynchronous or lazy or optimistic replication protocols has emerged in different research communities [138]. Under this paradigm, updates are performed immediately only on one replica, and there is an asynchronous process, along with some additional synchronization efforts, to maintain the other replicas. The most prominent member of this family is the epidemic replication approach by [41], which led to substantial follow-up work [52]. All these methods pursue a correctness notion that is known as eventual consistency: when all update activity stops and the replication protocol continues to run, all replicas of the same data item will eventually converge to the same value. During periods with update activity, the replicas of the same item may diverge, but additional techniques may ensure that there is only bounded divergence, either in terms of the number of updates that are missing in some replica or in terms of the value differences. However, such additional techniques have higher messaging costs and may incur waiting situations for synchronization. All protocols that allow at least temporary divergence of the same item’s replicas encounter the problem of conflicting versions when reconciling multiple replicas. This entails the need for conflict detection and conflict resolution policies. The literature contains many elaborated proposals for these aspects. In practice, at least conflict resolution may be highly application-dependent and would best be exposed as a programmable plug-in (e.g., to code heuristics based on timestamps, importance of updaters, etc.).
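
The following self-contained sketch conveys the eventual-consistency idea behind epidemic replication [41]; the last-writer-wins rule via (timestamp, writer) versions is our simplification, whereas real systems employ the richer conflict detection and resolution policies discussed above.

```python
# Toy anti-entropy gossip: replicas repeatedly merge state with random
# peers; once updates stop, all replicas converge (eventual consistency).
import random

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}                         # key -> (version, value)

    def write(self, key, value, clock):
        self.store[key] = ((clock, self.name), value)

    def merge_from(self, other):
        """Pull-style anti-entropy: adopt any newer versions the peer holds."""
        for key, (ver, val) in other.store.items():
            if key not in self.store or ver > self.store[key][0]:
                self.store[key] = (ver, val)

def gossip_round(replicas):
    for r in replicas:
        r.merge_from(random.choice(replicas))   # sync with one random peer
```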

Asynchronous replication methods typically utilize policies for owner-based copy invalidation or refreshing, which were already discussed for distributed caching. Practically popular techniques are TTL-based; generally, techniques can be push-oriented, pull-oriented, or hybrid. As the techniques are more or less the same as for caching, we do not repeat the earlier discussion. An interesting variation arises when the owner of a data item does not necessarily know all the copy holders. This situation may indeed arise in a highly self-organizing P2P setting. Recent results by [85] have shown that an agnostic-probabilistic approach can achieve bounded copy divergence with very high probability.

4.3 Proactive Dissemination

There are several potential reasons for disseminating data in a proactive manner, that is, without an explicit request for the data item. This kind of dissemination effectively creates additional replicas. The dissemination protocol itself may be based on an epidemic algorithm, piggyback on payload messages that are sent in the overlay network, or simply choose target nodes in a random manner. Proactive dissemination is of interest for the following reasons:

• To improve the response time of search requests and file downloads, by means of additional replicas. In “blind search” situations with limited request flooding in unstructured networks, the additional replicas may even be needed to improve the probability of a successful search.

• To improve the load balance in the network and thus increase the overall throughput of the entire system. This assumes that additional replicas can effectively be considered in the request routing; most routing methods with randomized sub-decisions have this property.

• To improve the availability of data items, in the presence of frequent node outagesand churn.

• To improve the reliability of the system, in the sense of guaranteeing a higher probability of data durability, i.e., not losing a data item regardless of what permanent node failures may occur.

For the two performance reasons, random allocation of additional replicas is a simple and extremely effective mechanism. The rich literature on distributed systems has proposed many sophisticated load balancing algorithms, based on tracking and extrapolating the load levels of the nodes in the network [151]. These methods allocate or migrate load (or data that incurs load) by approximately solving some form of combinatorial optimization problem. However, in a highly dynamic large-scale network like a P2P system, the monitoring and load assessment is a very treacherous if not infeasible issue, as the system may never exhibit sufficiently steady-state behavior. Therefore, simpler, randomized techniques are preferable in a global overlay computer. This does not rule out load assessment within a small, localized environment. For example, the celebrated “power of two choices” method [111] considers two randomly chosen target nodes and chooses the one with the currently lighter load.
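
A sketch of this method, under the assumption of a simple per-node load counter:

```python
# "Power of two choices": probe two random nodes, place the replica or task
# on the less loaded one. For n items on n nodes this shrinks the maximum
# load from Theta(log n / log log n) to Theta(log log n).
import random

def place_two_choices(load):
    """load: per-node load counters; returns the index of the chosen node."""
    a, b = random.sample(range(len(load)), 2)  # two independent random probes
    target = a if load[a] <= load[b] else b    # purely localized assessment
    load[target] += 1
    return target
```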

A key issue in all these replication-based methods is how many replicas the system should have for each data item. This may be a global configuration parameter, i.e., the same degree of replication for all data items, or even an item-specific tuning issue. We will discuss this difficult issue in the next section.

For the availability and reliability reasons, the creation of an additional replica is driven by observing transient or permanent node failures. One difficulty here is that the observer, e.g., any random node that happens to monitor or simply interact with a node suspected to have failed, cannot distinguish between a transient and a permanent failure. A typical heuristic is to assume permanent failure after the outage exceeds a certain time limit. The prevalent proactive replication methods thus create a new replica whenever one of the current copy holders of a data item has failed (in the above sense). This strategy is used in DHash [39] and Total Recall [21]. It is criticized in [33] for its excessive resource usage, particularly its network bandwidth consumption for transferring the data item whenever a new replica is created. The strategy seems adequate for very high availability in systems with extremely high churn, where failures may occur in bursts and are often permanent because nodes leave the network without notice. However, in less chaotic environments, and especially when transiently failed nodes perform recovery steps and come back with all data intact, the eager creation of new replicas is an overkill and wastes resources. Therefore, [143] suggests a resource-limited variation of proactive replication, and [33] develops an adaptive method for conservative creation of replicas: the Carbonite design.

Carbonite aims to provide high guarantees for data reliability (durability); replication for availability is viewed as an orthogonal aspect and even considered to be less important in typical P2P applications [34]. The key idea is to determine a desired number of replicas, r, and to take action whenever the number of available replicas in the network drops below r; this is done for each data item. The action consists of creating a new replica whenever the failure is potentially permanent. In addition, failed nodes are kept in the bookkeeping and monitored as to whether they recover. Thus, transiently failed and fully recovered nodes are re-integrated. This way, the frequency of having to create a new replica is much lower compared to a system that forgets nodes when they fail. The analysis in [33] shows that the number of replicas is upper-bounded by r/a with high probability, where a is the average availability of a node. This degree of replication is thus a bit higher than the bare minimum r, but a modest price for reliability and typically much lower than the degrees of previous methods. For choosing the value of the configuration parameter r itself, [33] offers heuristic considerations, particularly choosing r slightly higher than the maximum number of failures in a burst (e.g., when many nodes leave the system without notice). The justification is that the system must not lose its last copy of the data item, and thus must have more replicas or create them faster than the death rate in a bursty situation.
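
The maintenance rule can be sketched as follows; the function signature and the is_up/make_replica helpers are hypothetical, and the point is only that failed holders are remembered rather than forgotten.

```python
# Sketch of Carbonite-style maintenance: repair only when fewer than r
# replicas are reachable, and keep bookkeeping for unreachable holders so
# that recovered nodes are re-integrated without any data transfer.

def maintain(item, holders, is_up, make_replica, r):
    """holders: every node ever given a replica of `item` (never forgotten);
    is_up(node): current reachability; make_replica(...): places a new copy."""
    available = [n for n in holders if is_up(n)]
    while len(available) < r:
        new_node = make_replica(item, exclude=holders)
        holders.append(new_node)     # remembered from now on
        available.append(new_node)
    return holders
```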

This choice of r, albeit practically appealing, is not fully satisfactory, as it depends on an observed measure (failures in bursts) that is neither stable nor derivable from first-order principles. A principled method for tuning r, and possibly even choosing item-specific values of r, is an open issue for future research.


5 Self-Tuning Configuration and Adaptation

Data placement in a global overlay computer should be completely self-organizing. This encompasses automated self-tuning of configuration parameters and other options, and the ability to self-adapt and dynamically adjust the parameters to changing conditions such as evolving workloads, network growth, bursts of load- or churn-related events, and so on. These capabilities are also considered in the strategic direction of autonomic computing [70, 29], but global computing faces the extra complexity of huge scale and dynamics.

Despite some very promising approaches based on queueing models (e.g., [57, 102, 141]) and feedback control (e.g., [69, 44]), research is still far from solving the problem in its full generality. In this report, we therefore focus on one particular tuning issue that is of utmost importance for data placement in a global overlay computer, namely, choosing the degree of replication. A very general and elegant analysis in this direction is given in [35], for an abstract computational model with certain assumptions. It is based on an unstructured network with Gnutella-style message flooding, abstracted into routing requests to a certain number of randomly chosen nodes. Requests are lookups of data items by some key such as a file name. Data items are replicated on a certain number of random nodes to shorten the search time (i.e., the number of messages for finding one copy of the requested item) and to improve the probability of successful lookups.

The system model of [35] consists of n nodes, m data items, and a total budget of R ≥ m replicas overall. Each data item i has a relative request frequency qi and receives a fraction pi of the total budget R. Two simple strategies are uniform allocation, with the same pi for all items, and proportional allocation, with pi proportional to qi. One way of achieving the latter is to create a replica for an item upon each successful lookup of the item, with the target node chosen randomly, and a background process of occasional removals of replicas (e.g., based on staleness). It is shown that both strategies have the same expected time for successful lookups, and that the uniform method performs better on unsuccessful lookups with a bounded number of messages (i.e., trials of randomly chosen nodes). Interestingly, there are better strategies with pi values chosen to lie between those of the uniform and proportional methods. It is shown in [35] that the square-root allocation, with pi proportional to √qi, is optimal with respect to the expected time for successful lookups. For optimizing a combination of successful and resource-bounded unsuccessful lookups, the paper provides further considerations.

A decentralized way of implementing the square-root strategy is the following. When a lookup request terminates successfully after having visited C nodes, the requestor creates C new replicas on randomly chosen nodes. (It is assumed that the background process for replica removal guarantees a steady-state situation.) This method, which can be implemented in a localized manner within arbitrarily large distributed systems, is steady-state equivalent to the square-root allocation, without any need for tracking and estimating item-specific request rates.
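
A toy model of this rule is given below; the representation of nodes as replica sets is ours, and the background removal process assumed by [35] is elided.

```python
# Toy of the decentralized square-root rule: a successful lookup that
# probed C nodes creates C new replicas at randomly chosen nodes.
import random

def lookup_and_replicate(item, nodes, max_probes):
    """nodes: list of sets, each being the replica store of one node."""
    probed = 0
    for node in random.sample(nodes, min(max_probes, len(nodes))):
        probed += 1
        if item in node:
            # success after C = probed trials: spawn C replicas, so more
            # popular items acquire replicas at a correspondingly higher rate
            for target in random.choices(nodes, k=probed):  # targets may repeat
                target.add(item)
            return probed
    return None   # resource-bounded search failed
```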

Notwithstanding the technical depth and elegance of these results, there is still a significant gap between the paper’s computational model and advanced forms of real-life overlay networks. The paper states that its results can be extended to heterogeneous nodes (with different storage or load capacities), but real systems have sophisticated network topologies (with links that exhibit very different bandwidth and latency properties) and routing algorithms. Generalizing the results of [35] to a full-fledged, realistic network, and also to different kinds of structured overlay networks, would be an intriguing open problem for future research.

5.1 Initial AEOLUS Results: Practical Proactive Replication

As a novel research result in AEOLUS, the University of Ioannina has developed a practically viable extension of the square-root replication technique, published in [85].

To achieve a square-root degree of replication, the requestor of a data item that retrieves a replica needs to create a number of replicas equal to the number of nodes that were probed during the search [35]. In the original work of [35], it was suggested to choose the nodes for these additional replicas by random selection, e.g., by a random walk along one path. In real systems, this is often not a practical implementation. Rather, [85] has developed a flexible framework that combines a pull and a push phase for the replica-creation step. The pull phase retrieves a replica of the data item and collects information about nodes and their interconnection topology. The subsequent push phase then contacts the necessary number of nodes by an epidemic request. In contrast to [35], this would typically be performed by a branching walk, using randomization as needed.
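
A rough sketch of the two phases is given below; the Node structure, the fan-out, and the TTL handling are our own assumptions and are not taken from [85].

```python
# Rough sketch of a pull phase (random-walk search that records topology)
# followed by a push phase (bounded branching walk creating replicas).
import random

class Node:
    def __init__(self):
        self.neighbors, self.store = [], set()

def pull_phase(start, item, ttl):
    """Walk until the item is found, collecting the nodes seen on the way."""
    node, seen = start, []
    for _ in range(ttl):
        seen.append(node)
        if item in node.store:
            return node, seen           # replica found + gathered topology
        node = random.choice(node.neighbors)
    return None, seen

def push_phase(origin, item, budget, push_ttl):
    """Epidemic branching walk placing up to `budget` new replicas."""
    frontier, placed = [origin], 0
    for _ in range(push_ttl):
        nxt = []
        for node in frontier:
            for nbr in random.sample(node.neighbors, min(2, len(node.neighbors))):
                if placed < budget and item not in nbr.store:
                    nbr.store.add(item)  # one of the C new replicas
                    placed += 1
                    nxt.append(nbr)
        frontier = nxt or frontier
    return placed
```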

The parameters of the epidemic request in the push phase can, and typically should, be set in a specific manner. For example, if it uses a TTL parameter for terminating after a certain number of hops, this TTL could be set smaller than the TTL of the preceding pull phase. Furthermore, the push phase may possibly bias the random walk based on information gathered in the pull phase. For example, it could prefer nearby nodes in terms of hops, latency, or bandwidth. Alternatively, for better failure independence among replica holders, the push phase may choose to explore slightly longer paths in order to find more reliable or further-separated nodes. The framework offers significant flexibility to trade off different quality measures. The tuning of the corresponding parameters and the provisioning of QoS guarantees in such a setting is left for future work within AEOLUS.

The framework of [85] also supports flexible update strategies for replica maintenance, even in the presence of not fully cooperative nodes and without any node that has complete knowledge about all replicas. This method again uses a hybrid pull-push strategy for propagating updates. The node that performs an update serves as an agnostic copy owner and pushes the update to other nodes in an epidemic manner. A copy holder that is concerned about replica freshness and consistency would additionally contact other nodes in pull mode, in a periodic manner with adaptively chosen TTL parameter and periodicity. Finally, not fully cooperative nodes that hold copies but do not really feel responsible for their freshness do not need to take any special actions, but benefit with a certain probability from the replica-maintenance activities of other nodes. [85] presents specific choices of parameter settings, and experimentally studies the probability of reaching all or a certain quorum of copy holders.

These recent results of the AEOLUS project have also been presented in Deliverable D.2.1.1 on resource discovery, as they are versatile building blocks for several aspects of global computing. Deliverable D.2.1.1 provides additional details. For data management in a global overlay computer, the choice of strategies and tuning parameters within the developed framework will be a specific issue of our future research.

6 Conclusion and Outlook

This report has given a state-of-the-art survey and an AEOLUS-specific research perspective on distributed data management. We have emphasized the need for quality-of-service guarantees and have therefore largely concentrated on caching and replication as the main mechanisms to this end. More specifically, we have reviewed algorithms and a number of prototype implementations that either provide or could possibly be enhanced to provide guarantees on reliability, availability, throughput, response time, data recall, and data freshness.

As distributed data management has high inherent complexity and the quest for quality-of-service guarantees faces many trade-offs, we have decided to pursue solutions for a number of specific but archetypical application areas, rather than striving for perfectly general-purpose methods. In the further work of this WP we will largely concentrate on applications like Web search, Web archiving, scientific collaboration, and personal and social data spaces.

We have identified the objective of quality-of-service guarantees for P2P-based global overlay computers as a common theme that is shared between this WP 3.1 and WP 3.4. Also, the approach of studying the mechanism of dynamic replication for the different application areas is a methodological commonality with WP 3.4. We therefore plan to merge these two WPs in the implementation plan for the coming period from month 13 to 30. The merged WP will be titled “Data Services with Quality-of-Service Guarantees”. The technical research work will explore the issue of how to optimize degrees of replication and how to dynamically adjust the number and placement of replicas in response to failures, churn, and evolving workloads.

References

[1] Karl Aberer, Luc Onana Alima, Ali Ghodsi, Sarunas Girdzijauskas, Seif Haridi, and Manfred Hauswirth. The essence of P2P: A reference architecture for overlay networks. In Peer-to-Peer Computing, pages 11–20, 2005.


[2] Karl Aberer and Philippe Cudré-Mauroux. Semantic overlay networks. In VLDB, page 1367, 2005.

[3] Karl Aberer, Philippe Cudré-Mauroux, Manfred Hauswirth, and Tim Van Pelt. GridVine: Building Internet-scale semantic overlay networks. In International Semantic Web Conference, pages 107–121, 2004.

[4] Karl Aberer, Anwitaman Datta, and Manfred Hauswirth. P-Grid: Dynamics of self-organizing processes in structured peer-to-peer systems. In Peer-to-Peer Systems and Applications, pages 137–153, 2005.

[5] Karl Aberer, Anwitaman Datta, Manfred Hauswirth, and Roman Schmidt. Indexing data-oriented overlay networks. In VLDB, pages 685–696, 2005.

[6] Karl Aberer, Magdalena Punceva, Manfred Hauswirth, and Roman Schmidt. Improving data access in P2P systems. IEEE Internet Computing, 6(1):58–67, 2002.

[7] Karl Aberer and Jie Wu. Towards a common framework for peer-to-peer Web retrieval. In From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, pages 138–151, 2005.

[8] Ioannis Aekaterinidis and Peter Triantafillou. Internet scale string attribute publish/subscribe data networks. In CIKM, pages 44–51, 2005.

[9] Ioannis Aekaterinidis and Peter Triantafillou. PastryStrings: A comprehensive content-based publish/subscribe DHT network. 2006.

[10] Fuat Akal, Can Türker, Hans-Jörg Schek, Yuri Breitbart, Torsten Grabs, and Lourens Veen. Fine-grained replication and scheduling with freshness and correctness guarantees. In VLDB, pages 565–576, 2005.

[11] Boanerges Aleman-Meza, Meenakshi Nagarajan, Cartic Ramakrishnan, Li Ding, Pranam Kolari, Amit P. Sheth, Ismailcem Budak Arpinar, Anupam Joshi, and Tim Finin. Semantic analytics on social networks: Experiences in addressing the problem of conflict of interest detection. In WWW, pages 407–416, 2006.

[12] Todd A. Anderson, Yuri Breitbart, Henry F. Korth, and Avishai Wool. Replication, consistency, and practicality: Are these mutually exclusive? In SIGMOD Conference, pages 484–495, 1998.

[13] Internet Archive. http://www.archive.org.

[14] Wolf-Tilo Balke, Wolfgang Nejdl, Wolf Siberski, and Uwe Thaden. DL meets P2P - distributed document retrieval based on classification and content. In ECDL, pages 379–390, 2005.


[15] Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer. Improving collection selection with overlap awareness in P2P search engines. In SIGIR, pages 67–74, 2005.

[16] Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer. MINERVA: Collaborative P2P search. In VLDB, pages 1263–1266, 2005.

[17] Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer. P2P Web search: Give the Web back to the people. In 5th International Workshop on Peer-to-Peer Systems (IPTPS), Santa Barbara, 2006.

[18] Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer. “To infinity and beyond”: P2P Web search with MINERVA and MINERVA∞. In Roberto Baldoni, G. Cortese, and F. Davide, editors, Global Data Management, pages 301–323. IOS Press, 2006.

[19] Klaus Berberich, Srikanta J. Bedathur, Michalis Vazirgiannis, and Gerhard Weikum. BuzzRank ... and the trend is your friend. In WWW, pages 937–938, 2006.

[20] Klaus Berberich, Michalis Vazirgiannis, and Gerhard Weikum. T-Rank: Time-aware authority ranking. In WAW, pages 131–142, 2004.

[21] Ranjita Bhagwan, Kiran Tati, Yu-Chung Cheng, Stefan Savage, and Geoffrey M. Voelker. Total Recall: System support for automated availability management. In NSDI, pages 337–350, 2004.

[22] Reinhard Braumandl, Markus Keidl, Alfons Kemper, Donald Kossmann, Stefan Seltzsam, and Konrad Stocker. ObjectGlobe: Open distributed query processing services on the Internet. IEEE Data Eng. Bull., 24(1):64–70, 2001.

[23] Reinhard Braumandl, Alfons Kemper, and Donald Kossmann. Quality of service in an information economy. ACM Trans. Internet Techn., 3(4):291–333, 2003.

[24] Laura Bright and David Maier. Deriving and managing data products in an environmental observation and forecasting system. In CIDR, pages 162–173, 2005.

[25] Ingo Brunkhorst, Hadhami Dhraief, Alfons Kemper, Wolfgang Nejdl, and Christian Wiesner. Distributed queries and query optimization in schema-based P2P systems. In DBISP2P, pages 184–199, 2003.

[26] James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In SIGIR, pages 21–28, 1995.

[27] Pei Cao and Sandy Irani. Cost-aware WWW proxy caching algorithms. In USENIX Symposium on Internet Technologies and Systems, 1997.


[28] Jim Challenger, Paul Dantzig, Arun Iyengar, Mark S. Squillante, and Li Zhang. Efficiently serving dynamic data at highly accessed Web sites. IEEE/ACM Trans. Netw., 12(2):233–246, 2004.

[29] Surajit Chaudhuri and Gerhard Weikum. Foundations of automated database tuning. In ICDE, page 104, 2006.

[30] Liping Chen, K. Selçuk Candan, Jun’ichi Tatemura, Divyakant Agrawal, and Dirceu Cavendish. On overlay schemes to support point-in-range queries for scalable grid resource discovery. In Peer-to-Peer Computing, pages 23–30, 2005.

[31] Paul-Alexandru Chirita, Stefania Costache, Wolfgang Nejdl, and Raluca Paiu. Beagle++: Semantically enhanced searching and ranking on the desktop. In ESWC, pages 348–362, 2006.

[32] Paul-Alexandru Chirita, Stratos Idreos, Manolis Koubarakis, and Wolfgang Nejdl. Publish/subscribe for RDF-based P2P networks. In ESWS, pages 182–197, 2004.

[33] Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, Frans Kaashoek, John Kubiatowicz, and Robert Morris. Efficient replica maintenance for distributed storage systems. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation (NSDI ’06), San Jose, CA, May 2006.

[34] Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris. Efficient replica maintenance for distributed storage systems. In NSDI, pages 45–58, 2006.

[35] Edith Cohen and Scott Shenker. Replication strategies in unstructured peer-to-peer networks. In SIGCOMM, pages 177–190, 2002.

[36] Arturo Crespo and Hector Garcia-Molina. Routing indices for peer-to-peer systems. In ICDCS, pages 23–33, 2002.

[37] Arturo Crespo and Hector Garcia-Molina. Semantic overlay networks for P2P systems. In AP2PC, pages 1–13, 2004.

[38] Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P. Martin, and Thu D. Nguyen. PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities. In HPDC, pages 236–249, 2003.

[39] Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, and Robert Morris. Designing a DHT for low latency and high throughput. In NSDI, pages 85–98, 2004.


[40] Anwitaman Datta, Manfred Hauswirth, Renault John, Roman Schmidt, and Karl Aberer. Range queries in trie-structured overlays. In Peer-to-Peer Computing, pages 57–66, 2005.

[41] Alan J. Demers, Daniel H. Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard E. Sturgis, Daniel C. Swinehart, and Douglas B. Terry. Epidemic algorithms for replicated database maintenance. Operating Systems Review, 22(1):8–32, 1988.

[42] Amol Deshpande, Carlos Guestrin, Samuel Madden, Joseph M. Hellerstein, and Wei Hong. Model-based approximate querying in sensor networks. VLDB J., 14(4):417–443, 2005.

[43] Zoran Despotovic and Karl Aberer. P2P reputation management: Probabilistic estimation vs. social networks. Computer Networks, 50(4):485–500, 2006.

[44] Yixin Diao, Joseph L. Hellerstein, Sujay S. Parekh, Rean Griffith, Gail E. Kaiser, and Dan B. Phung. Self-managing systems: A control theory foundation. In 12th IEEE International Conference on the Engineering of Computer-Based Systems (ECBS 2005), pages 441–448, 2005.

[45] Jens-Peter Dittrich, Peter M. Fischer, and Donald Kossmann. AGILE: Adaptive indexing for context-aware information filters. In SIGMOD Conference, pages 215–226, 2005.

[46] Xin Dong and Alon Y. Halevy. A platform for personal information management and integration. In CIDR, pages 119–130, 2005.

[47] C. Doulkeridis, K. Nørvåg, and M. Vazirgiannis. DESENT: Decentralized and distributed semantic overlay generation in P2P networks. Special Issue on Peer-to-Peer Communications and Applications, IEEE Journal on Selected Areas in Communications (to appear).

[48] Micah Dubinko, Ravi Kumar, Joseph Magnani, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. Visualizing tags over time. In WWW, pages 193–202, 2006.

[49] Jignesh M. Patel (editor). Special issue on querying biological sequences. IEEE Data Eng. Bull., 27(3), 2004.

[50] Z. Meral Özsoyoğlu (editor). Special issue on database support for the sciences. IEEE Data Eng. Bull., 27(4), 2004.

[51] Patrick Th. Eugster, Pascal Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. The many faces of publish/subscribe. ACM Comput. Surv., 35(2):114–131, 2003.


[52] Patrick Th. Eugster, Rachid Guerraoui, Anne-Marie Kermarrec, and Laurent Mas-soulie. Epidemic information dissemination in distributed systems. IEEE Computer,37(5):60–67, 2004.

[53] Francoise Fabret, Hans-Arno Jacobsen, Francois Llirbat, Joao Pereira, Kenneth A. Ross, and Dennis Shasha. Filtering algorithms and implementation for very fast publish/subscribe. In SIGMOD Conference, pages 115–126, 2001.

[54] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. In PODS, 2001.

[55] Li Fan, Pei Cao, Jussara M. Almeida, and Andrei Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3):281–293, 2000.

[56] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. Softw., Pract. Exper., 34(2):213–237, 2004.

[57] Michael Gillmann, Gerhard Weikum, and Wolfgang Wonner. Workflow management with service quality guarantees. In SIGMOD Conference, pages 228–239, 2002.

[58] Luis Gravano, Hector Garcia-Molina, and Anthony Tomasic. Gloss: Text-source discovery over the internet. ACM Trans. Database Syst., 24(2):229–264, 1999.

[59] Cary G. Gray and David R. Cheriton. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In SOSP, pages 202–210, 1989.

[60] Jim Gray, Pat Helland, Patrick E. O'Neil, and Dennis Shasha. The dangers of replication and a solution. In SIGMOD Conference, pages 173–182, 1996.

[61] Jim Gray, David T. Liu, María A. Nieto-Santisteban, Alex Szalay, David J. DeWitt, and Gerd Heber. Scientific data management in the coming decade. SIGMOD Record, 34(4):34–41, 2005.

[62] Jim Gray and Alexander S. Szalay. The world wide telescope: An archetype for online science. CoRR, cs.DB/0403018, 2004.

[63] R. Guha, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. Propagation of trust and distrust. In WWW, pages 403–412, 2004.

[64] Hongfei Guo, Per-Ake Larson, and Raghu Ramakrishnan. Caching with 'good enough' currency, consistency, and completeness. In VLDB, pages 457–468, 2005.

[65] Abhishek Gupta, Ozgur D. Sahin, Divyakant Agrawal, and Amr El Abbadi. Meghdoot: Content-based publish/subscribe over p2p networks. In Middleware, pages 254–273, 2004.

[66] Emir Halepovic and Ralph Deters. The jxta performance model and evaluation. Future Generation Comp. Syst., 21(3):377–390, 2005.

[67] Alon Y. Halevy, Michael J. Franklin, and David Maier. Principles of dataspace systems. In PODS, pages 1–9, 2006.

[68] Boudewijn R. Haverkort. Performance of Computer Communication Systems. Wiley, 1998.

[69] Thomas Heinis, Cesare Pautasso, and Gustavo Alonso. Design and evaluation of an autonomic workflow engine. In ICAC, pages 27–38, 2005.

[70] Lorraine Herger, Kazuo Iwano, Pratap Pattnaik, Alfred G. Davis, and John R. Ritsko (Editors). Special issue on autonomic computing. IBM Systems Journal, 42(1), 2003.

[71] Bill Howe and David Maier. Algebraic manipulation of scientific datasets. VLDB J., 14(4):397–416, 2005.

[72] Ryan Huebsch, Brent N. Chun, Joseph M. Hellerstein, Boon Thau Loo, Petros Maniatis, Timothy Roscoe, Scott Shenker, Ion Stoica, and Aydan R. Yumerefendi. The architecture of pier: an internet-scale query processor. In CIDR, pages 28–43, 2005.

[73] Sitaram Iyer, Antony I. T. Rowstron, and Peter Druschel. Squirrel: a decentralized peer-to-peer web cache. In PODC, pages 213–222, 2002.

[74] H. V. Jagadish, Beng Chin Ooi, and Quang Hieu Vu. Baton: A balanced tree structure for peer-to-peer networks. In VLDB, pages 661–672, 2005.

[75] Shawn R. Jeffery, Gustavo Alonso, Michael J. Franklin, Wei Hong, and Jennifer Widom. Declarative support for sensor data cleaning. In Pervasive, pages 83–100, 2006.

[76] M. Frans Kaashoek and David R. Karger. Koorde: A simple degree-optimal distributed hash table. In IPTPS, pages 98–107, 2003.

[77] Panos Kalnis, Wee Siong Ng, Beng Chin Ooi, and Kian-Lee Tan. Answering similarity queries in peer-to-peer networks. Inf. Syst., 31(1):57–72, 2006.

[78] David R. Karger, Eric Lehman, Frank Thomson Leighton, Rina Panigrahy, Matthew S. Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In STOC, pages 654–663, 1997.

[79] David R. Karger, Alex Sherman, Andy Berkheimer, Bill Bogstad, Rizwan Dhanidina, Ken Iwamoto, Brian Kim, Luke Matkins, and Yoav Yerushalmi. Web caching with consistent hashing. Computer Networks, 31(11-16):1203–1213, 1999.

[80] Bettina Kemme and Gustavo Alonso. A new approach to developing and implementing eager database replication protocols. ACM Trans. Database Syst., 25(3):333–379, 2000.

[81] Georgia Koloniari and Evaggelia Pitoura. Peer-to-peer management of xml data: issues and research challenges. SIGMOD Record, 34(2):6–17, 2005.

[82] Anshul Kothari, Divyakant Agrawal, Abhishek Gupta, and Subhash Suri. Range addressable network: A p2p cache architecture for data ranges. In Peer-to-Peer Computing, pages 14–22, 2003.

[83] B. Krishnamurthy and J. Rexford. Web Protocols and Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement. Addison-Wesley, 2001.

[84] John Kubiatowicz, David Bindel, Yan Chen, Steven E. Czerwinski, Patrick R. Eaton, Dennis Geels, Ramakrishna Gummadi, Sean C. Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Y. Zhao. Oceanstore: An architecture for global-scale persistent storage. In ASPLOS, pages 190–201, 2000.

[85] Elias Leontiadis, Vassilios V. Dimakopoulos, and Evaggelia Pitoura. Creating and maintaining replicas in unstructured peer-to-peer systems. In 12th International Euro-Par Conference on Parallel Processing, 2006.

[86] Ulf Leser. A query language for biological networks. 3rd European Conference on Computational Biology, Madrid, Spain, 2005.

[87] Guoli Li and Hans-Arno Jacobsen. Composite subscriptions in content-based publish/subscribe systems. In Middleware, pages 249–269, 2005.

[88] Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, David R. Karger, and Robert Morris. On the feasibility of peer-to-peer web indexing and search. In IPTPS, pages 207–215, 2003.

[89] Jinyang Li, Jeremy Stribling, Thomer M. Gil, Robert Morris, and M. Frans Kaashoek. Comparing the performance of distributed hash tables under churn. In IPTPS, pages 87–99, 2004.

[90] David Liben-Nowell, Hari Balakrishnan, and David R. Karger. Analysis of the evolution of peer-to-peer systems. In PODC, pages 233–242, 2002.

[91] Prakash Linga, Adina Crainiceanu, Johannes Gehrke, and Jayavel Shanmugasundaram. Guaranteeing correctness and availability in p2p range indices. In SIGMOD Conference, pages 323–334, 2005.

[92] Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider. Lh* - a scalable, distributed data structure. ACM Trans. Database Syst., 21(4):480–525, 1996.

[93] Alexander Loser, Christoph Tempich, Bastian Quilitz, Wolf-Tilo Balke, Steffen Staab, and Wolfgang Nejdl. Searching dynamic communities with personal indexes. In International Semantic Web Conference, pages 491–505, 2005.

[94] Jie Lu and James P. Callan. Content-based retrieval in hybrid peer-to-peer networks. In CIKM, pages 199–206, 2003.

[95] Jie Lu and Jamie Callan. Federated search of text-based digital libraries in hierarchical peer-to-peer networks. In ECIR, pages 52–66, 2005.

[96] Samuel Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. Tinydb: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, 2005.

[97] Peter Mahlmann and Christian Schindelhauer. Peer-to-peer networks based on random transformations of connected regular undirected graphs. In SPAA, pages 155–164, 2005.

[98] Peter Mahlmann and Christian Schindelhauer. Distributed random digraph transformations for peer-to-peer networks. In Proceedings of the 18th ACM Symposium on Parallelism in Algorithms and Architectures, 2006.

[99] Dahlia Malkhi, Moni Naor, and David Ratajczak. Viceroy: a scalable and dynamic emulation of the butterfly. In PODC, pages 183–192, 2002.

[100] Sergio Marti and Hector Garcia-Molina. Taxonomy of trust: Categorizing p2p reputation systems. Computer Networks, 50(4):472–484, 2006.

[101] Julien Masanes and Andreas Rauber, editors. 5th International Web Archiving Workshop (IWAW05), 2005.

[102] Laurent Massoulie and Milan Vojnovic. Coupon replication systems. In SIGMETRICS, pages 2–13, 2005.

[103] Petar Maymounkov and David Mazieres. Kademlia: A peer-to-peer information system based on the xor metric. In IPTPS, pages 53–65, 2002.

[104] Nimrod Megiddo and Dharmendra S. Modha. Arc: A self-tuning, low overhead replacement cache. In FAST, 2003.

[105] Daniel A. Menasce and Virgilio A. F. Almeida. Capacity Planning for Web Performance. Prentice Hall, 2001.

[106] Weiyi Meng, Clement T. Yu, and King-Lup Liu. Building efficient and effective metasearch engines. ACM Comput. Surv., 34(1):48–89, 2002.

[107] Sebastian Michel, Matthias Bender, Peter Triantafillou, and Gerhard Weikum. Iqn routing: Integrating quality and novelty in p2p querying and ranking. In EDBT, pages 149–166, 2006.

[108] Sebastian Michel, Peter Triantafillou, and Gerhard Weikum. Minervainfinity: A scalable efficient peer-to-peer search engine. In Middleware, pages 60–81, 2005.

[109] Alan Mislove, Andreas Haeberlen, Ansley Post, and Peter Druschel. epost. In Peer-to-Peer Systems and Applications, pages 171–192, 2005.

[110] Tom Mitchell. Computer workstations as intelligent agents. Keynote, SIGMOD 2005.

[111] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Trans. Parallel Distrib. Syst., 12(10):1094–1104, 2001.

[112] Alberto Montresor, Mark Jelasity, and Ozalp Babaoglu. Chord on demand. In Peer-to-Peer Computing, pages 87–94, 2005.

[113] Wolfgang Mueller, Martin Eisenhardt, and Andreas Henrich. Scalable summary based retrieval in p2p networks. In ACM CIKM International Conference on Information and Knowledge Management, pages 586–593, 2005.

[114] Moni Naor and Udi Wieder. A simple fault tolerant distributed hash table. In IPTPS, pages 88–97, 2003.

[115] Henrik Nottelmann and Norbert Fuhr. Combining cori and the decision-theoretic approach for advanced resource selection. In ECIR, pages 138–153, 2004.

[116] Henrik Nottelmann and Norbert Fuhr. Comparing different architectures for query routing in peer-to-peer networks. In ECIR, pages 253–264, 2006.

[117] Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. An optimality proof of the lru-k page replacement algorithm. J. ACM, 46(1):92–112, 1999.

[118] Josiane Xavier Parreira, Sebastian Michel, and Gerhard Weikum. p2pdating: Real life inspired semantic overlay networks for web search. In Proceedings of the Workshop on Heterogeneous and Distributed Information Retrieval, Salvador Bahia, Brazil, 2005.

[119] Jignesh M. Patel, Donald P. Huddler, and Laurie Hammel. Declarative and efficient querying on protein secondary structures. In Wang et al. [156], pages 243–273.

[120] Fernando Pedone, Matthias Wiesmann, Andre Schiper, Bettina Kemme, and Gustavo Alonso. Understanding replication in databases and distributed systems. In ICDCS, pages 464–474, 2000.

[121] Evaggelia Pitoura, Serge Abiteboul, Dieter Pfoser, George Samaras, and Michalis Vazirgiannis. Dbglobe: a service-oriented p2p system for global computing. SIGMOD Record, 32(3):77–82, 2003.

[122] Theoni Pitoura, Nikos Ntarmos, and Peter Triantafillou. Replication, load balancing and efficient range query processing in dhts. In EDBT, pages 131–148, 2006.

[123] C. Greg Plaxton, Rajmohan Rajaraman, and Andrea W. Richa. Accessing nearby copies of replicated objects in a distributed environment. In SPAA, pages 311–320, 1997.

[124] I. Podnar, M. Rajman, T. Luu, F. Klemm, and K. Aberer. Beyond term indexing: A p2p framework for web information retrieval. Informatica (to appear), 2006.

[125] Ivana Podnar, Toan Luu, Martin Rajman, Fabius Klemm, and Karl Aberer. A peer-to-peer architecture for information retrieval across digital library collections. In ECDL, 2006.

[126] Michael Rabinovich and Oliver Spatschek. Web caching and replication. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[127] Sriram Ramabhadran, Sylvia Ratnasamy, Joseph M. Hellerstein, and Scott Shenker. Brief announcement: prefix hash tree. In PODC, page 368, 2004.

[128] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, and Scott Shenker. A scalable content-addressable network. In SIGCOMM, pages 161–172, 2001.

[129] Sean C. Rhea, Patrick R. Eaton, Dennis Geels, Hakim Weatherspoon, Ben Y. Zhao, and John Kubiatowicz. Pond: The oceanstore prototype. In FAST, 2003.

[130] Sean C. Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz. Handling churn in a dht. In USENIX Annual Technical Conference, General Track, pages 127–140, 2004.

[131] Sean C. Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy, Scott Shenker, Ion Stoica, and Harlan Yu. Opendht: a public dht service and its uses. In SIGCOMM, pages 73–84, 2005.

[132] Sean C. Rhea, Chris Wells, Patrick R. Eaton, Dennis Geels, Ben Y. Zhao, Hakim Weatherspoon, and John Kubiatowicz. Maintenance-free global data storage. IEEE Internet Computing, 5(5):40–49, 2001.

[133] Antony I. T. Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, pages 329–350, 2001.

[134] Antony I. T. Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel. Scribe: The design of a large-scale event notification infrastructure. In Networked Group Communication, pages 30–43, 2001.

[135] Ozgur D. Sahin, S. Antony, Divyakant Agrawal, and Amr El Abbadi. Probe: Multi-dimensional range queries in p2p networks. In WISE, pages 332–346, 2005.

[136] Ozgur D. Sahin, Abhishek Gupta, Divyakant Agrawal, and Amr El Abbadi. A peer-to-peer framework for caching range queries. In ICDE, pages 165–176, 2004.

[137] Robin A. Sahner, Kishor S. Trivedi, and Antonio Puliafito. Performance and Reliability Analysis of Computer Systems. Kluwer, 1996.

[138] Yasushi Saito and Marc Shapiro. Optimistic replication. ACM Comput. Surv., 37(1):42–81, 2005.

[139] Daniel Sandler, Alan Mislove, Ansley Post, and Peter Druschel. Feedtree: Sharing web micronews with peer-to-peer event notification. In IPTPS, pages 141–151, 2005.

[140] Mario T. Schlosser, Michael Sintek, Stefan Decker, and Wolfgang Nejdl. Hypercup - hypercubes, ontologies, and efficient search on peer-to-peer networks. In AP2PC, pages 112–124, 2002.

[141] Bianca Schroeder, Mor Harchol-Balter, Arun Iyengar, Erich M. Nahum, and Adam Wierman. How to determine a good multi-programming level for external scheduling. In 22nd International Conference on Data Engineering (ICDE 2006), 2006.

[142] Junho Shim, Peter Scheuermann, and Radek Vingralek. Proxy cache algorithms: Design, implementation, and performance. IEEE Trans. Knowl. Data Eng., 11(4):549–562, 1999.

[143] E. Sit, A. Haeberlen, F. Dabek, B. Chun, H. Weatherspoon, R. Morris, M. Kaashoek, and J. Kubiatowicz. Proactive replication for data durability. In IPTPS, 2006.

[144] Ralf Steinmetz and Klaus Wehrle, editors. Peer-to-Peer Systems and Applications, volume 3485 of Lecture Notes in Computer Science. Springer, 2005.

[145] Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, pages 149–160, 2001.

[146] Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw., 11(1):17–32, 2003.

[147] Etzard Stolte and Gustavo Alonso. Approximated trial and error analysis in scientific databases. Inf. Syst., 28(1-2):137–157, 2003.

[148] Etzard Stolte, Christoph von Praun, Gustavo Alonso, and Thomas R. Gross. Scientific data repositories: Designing for a moving target. In SIGMOD Conference, pages 349–360, 2003.

[149] Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, and Kulesh Shanmugasundaram. Odissea: A peer-to-peer architecture for scalable web search and information retrieval. In 6th International Workshop on Web and Databases (WebDB 2003), pages 67–72, 2003.

[150] David Tam, Reza Azimi, and Hans-Arno Jacobsen. Building content-based publish/subscribe systems with distributed hash tables. In DBISP2P, pages 138–152, 2003.

[151] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2001.

[152] Christos Tryfonopoulos, Stratos Idreos, and Manolis Koubarakis. Publish/subscribe functionality in ir environments using structured overlay networks. In SIGIR, pages 322–329, 2005.

[153] Radek Vingralek, Yuri Breitbart, and Gerhard Weikum. Snowball: Scalable storage on networks of workstations with balanced load. Distributed and Parallel Databases, 6(2):117–156, 1998.

[154] Radek Vingralek, Mehmet Sayal, Yuri Breitbart, and Peter Scheuermann. Web++ architecture, design and performance. World Wide Web, 3(2):65–77, 2000.

[155] Spyros Voulgaris, Etienne Riviere, Anne-Marie Kermarrec, and Maarten van Steen. Sub-2-sub: Self-organizing content-based publish subscribe for dynamic large scale collaborative networks. In IPTPS, 2006.

[156] Jason Tsong-Li Wang, Mohammed Javeed Zaki, Hannu Toivonen, and Dennis Shasha, editors. Data Mining in Bioinformatics. Springer, 2005.

[157] Yuan Wang and David J. DeWitt. Computing pagerank in a distributed internet search engine system. In VLDB, pages 420–431, 2004.

[158] Gerhard Weikum and Gottfried Vossen. Transactional information systems: theory, algorithms, and the practice of concurrency control and recovery. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.

[159] Matthias Wiesmann, Andre Schiper, Fernando Pedone, Bettina Kemme, and Gustavo Alonso. Database replication techniques: A three parameter classification. In SRDS, pages 206–215, 2000.

[160] Tak W. Yan and Hector Garcia-Molina. The sift information dissemination system. ACM Trans. Database Syst., 24(4):529–565, 1999.

[161] Jian Yin, Lorenzo Alvisi, Michael Dahlin, and Arun Iyengar. Engineering web cache consistency. ACM Trans. Internet Techn., 2(3):224–259, 2002.

[162] Jian Yin, Lorenzo Alvisi, Michael Dahlin, and Calvin Lin. Volume leases for consistency in large-scale systems. IEEE Trans. Knowl. Data Eng., 11(4):563–576, 1999.

[163] Neal E. Young. On-line file caching. In SODA, pages 82–86, 1998.

[164] Haobo Yu, Lee Breslau, and Scott Shenker. A scalable web cache consistency architecture. In SIGCOMM, pages 163–174, 1999.

[165] Jiangong Zhang and Torsten Suel. Efficient query evaluation on large textual collections in a peer-to-peer environment. In Peer-to-Peer Computing, pages 225–233, 2005.

[166] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, April 2001.
