High-throughput Computing and Opportunistic Computing for Matchmaking Processes and Indexing Processes

Thesis transcript

University of Calabria
Bachelor thesis in Computer Engineering

Supervisor: Ing. Carlo Mastroianni
Candidate: Silvio Sangineto
Matriculation number: 83879
Academic year 2007-2008
Contents
• Introduction to the Thesis
• Introduction to Distributed Systems
• Introduction to the Grid, High-throughput Computing and Opportunistic Computing
• Condor
• Why Condor?
• Introduction to the Prototype Architecture
• Centralized Prototype Architecture
• Centralized Scorer
• Results Achieved
• A Possible Solution: the Distributed Scorer
• Distributed Scorer
• New Results Achieved
• From the “local” business case to the big business case…
Introduction to the Thesis
Creation of a distributed Web-Spider, with particular attention to efficiency, scalability, energy saving and costs.
Description: The goal of this project is to recover the URLs of Italian companies. This recovery is possible because we can use a customer database with general information such as VAT number, phone, e-mails, etc. This information can be matched against Web-site contents, so we can find the official Web-site of each company.
Why: Knowing the official Web-site is very important because you can quickly learn:
• its contacts and e-mail addresses;
• updates and news previews;
• descriptions of the company's activities;
• other information (e.g. its history).
Currently no complete list of the Italian companies that have a Web-site exists in Italy!
Introduction to the Thesis
Boundary conditions for my thesis:
• It is difficult to estimate how many companies have a Web-site (coverage level);
• Web-site structures can have many non-standard parts (some Web-sites may not contain the VAT number, e-mail address, etc.);
• Updating the database that contains the URLs must make it possible to catch the Web-site of a new company and the new Web-site of an old company;
• There are some privacy issues (e.g. e-mail addresses).
Relevant problems for my thesis: load balancing, efficient resource utilization; scalability; costs; energy saving.
Usually, when the Web-Spiders that exist on the Web (e.g. Google's) need more computational power, the company buys more servers to provide it (the general solution)!
Introduction to the Thesis
We want to answer the relevant problems in the “local” business case so that we can reuse these solutions in the “big” business case!
Introduction to Distributed Systems
Advantages of a distributed system: reliability; sharing of resources; aggregate computing power; scalability.
Definition:A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility.
In our case we use a distributed system to have more computational power…
Grid Computing, High-throughput computing and opportunistic computing
Grid Computing: Grids are intrinsically distributed and heterogeneous but must be viewed by the user (whether an individual or another computer) as a virtual environment with uniform access to resources. Much of Grid software technology addresses the issues of resource scheduling, quality of service, fault tolerance, decentralized control and security and so on, which enable the Grid to be perceived as a single virtual platform by the user.
High-throughput computing: The goal of a high-throughput computing environment is to provide large amounts of fault-tolerant computational power over prolonged periods of time by effectively utilizing all resources available to the network.
Opportunistic computing: The goal of opportunistic computing is the ability to utilize resources whenever they are available, without requiring 100% availability.
The two goals are naturally coupled: high-throughput computing is most easily achieved through opportunistic means.
Condor
“Modern processing environments that consist of large collections of workstations interconnected by high-capacity networks raise the following challenging question: can we satisfy the needs of users who need extra capacity without lowering the quality of service experienced by the owners of under-utilized workstations? … The Condor scheduling system is our answer to this question.”
At the University of Wisconsin, Miron Livny combined his 1983 doctoral thesis on cooperative processing with the powerful Crystal Multicomputer designed by DeWitt, Finkel, and Solomon and the novel Remote UNIX software designed by Litzkow. The result was Condor, a new system for distributed computing. The goal of the Condor Project is to develop, implement, deploy, and evaluate mechanisms and policies that support high-throughput computing and opportunistic computing on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Condor Team has been building software tools that enable scientists and engineers to increase their computing throughput. Condor is a middleware that allows users to join and use distributed resources.
Condor
Condor is a specialized job and resource management system (RMS) for compute-intensive jobs. Like other full-featured systems, Condor provides a job management mechanism, scheduling policy, priority scheme, resource monitoring and resource management. Users submit their jobs to Condor, and Condor subsequently chooses when and where to run them based upon a policy, monitors their progress, and ultimately informs the user upon completion.
Two very important mechanisms:
ClassAds: The ClassAd mechanism in Condor provides an extremely flexible and expressive framework for matching resource requests (e.g. jobs) with resource offers (e.g. machines).
Remote System Calls: When running jobs on remote machines, Condor can often preserve the local execution environment via remote system calls. Remote system calls are one of Condor's mobile sandbox mechanisms for redirecting all of a job's I/O-related system calls back to the machine that submitted the job. Therefore, users do not need to make data files available on remote workstations before Condor executes their programs there, even in the absence of a shared file system.
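As an illustration of the ClassAd idea, here is a hedged sketch of a job ad and a machine ad that would match; the attribute names follow Condor's conventions, but the concrete values and the owner name are invented for this example:

```text
// Job ClassAd (what the agent advertises)
MyType       = "Job"
Owner        = "sangineto"                 // invented value
Requirements = (OpSys == "LINUX" && Memory >= 512)

// Machine ClassAd (what the resource advertises)
MyType       = "Machine"
OpSys        = "LINUX"
Memory       = 2048
Requirements = (LoadAvg < 0.3)             // only serve jobs when nearly idle
```

The matchmaker pairs the two ads because each side's Requirements expression evaluates to true against the other side's attributes.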
Condor
How does Condor work?
This is an example [an agent (A) executes a job on a resource (R) with the help of a matchmaker (M)]:
Step 1: The agent and the resource advertise themselves to the matchmaker. Step 2: The matchmaker informs the two parties that they are potentially compatible. Step 3: The agent contacts the resource and executes the job.
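These three steps can be sketched as a toy matchmaker in Java; the attribute maps and the single-requirement check are simplifications for illustration, not Condor's real ClassAd evaluation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of Condor-style matchmaking: an agent ad and a resource ad
// are matched when the job's requirement is satisfied by the machine.
public class Matchmaker {

    // Step 1: both parties "advertise" themselves as attribute maps.
    static Map<String, String> agentAd() {
        Map<String, String> ad = new HashMap<>();
        ad.put("Type", "Job");
        ad.put("RequiredOpSys", "LINUX");   // the job's requirement
        return ad;
    }

    static Map<String, String> resourceAd() {
        Map<String, String> ad = new HashMap<>();
        ad.put("Type", "Machine");
        ad.put("OpSys", "LINUX");           // the machine's property
        return ad;
    }

    // Step 2: the matchmaker checks whether the parties are compatible.
    static boolean match(Map<String, String> job, Map<String, String> machine) {
        return job.get("RequiredOpSys").equals(machine.get("OpSys"));
    }

    public static void main(String[] args) {
        // Step 3 (contacting the resource and running the job) is out of
        // scope here; we only report whether the parties are compatible.
        System.out.println(match(agentAd(), resourceAd()) ? "compatible" : "no match");
    }
}
```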
This figure shows the major processes in a Condor system
Condor
What happens when you have more Condor pools?
This is an example [an agent (A) executes a job on a resource (R) via direct flocking]:
Step 1: The agent and the resource advertise themselves locally. Step 2: The agent is unsatisfied, so it also advertises itself to Condor Pool B. Step 3: The matchmaker (M) informs the two parties that they are potentially compatible. Step 4: The agent contacts the resource and executes the job.
Condor
Condor universes: Condor offers several runtime environments (each called a universe) from which to choose. The Java universe was the best for this first version of our project, because I could take advantage of portability (heterogeneous systems) and it suited the “local” business case. A universe for Java programs was added to Condor in late 2001, due to a growing community of scientific users who wished to perform simulations and other work in Java. Although such programs might run slower than native code, the losses were offset by faster development times and access to larger numbers of machines.
Why Condor?
• We used Condor for the following reasons (among others):
1) Efficient resource management (opportunistic computing and high-throughput computing, ClassAds, etc.);
2) It is a middleware for heterogeneous distributed systems (e.g. we can use different operating systems);
3) It is an open-source project and is used as a batch system in many projects around the world;
4) Flexibility.
Introduction to Centralized Prototype Architecture
[Figure: centralized prototype architecture. A Crawler visits company Web-sites and builds an Index (MakeIndex). The Scorer runs queries against the Index, using the identifying information stored in the Customer Data-Base, and produces candidate matches. A Manual Validator checks the candidates before they enter the URL Data-Base, and an Updater keeps that database current with new companies and new Web-sites.]
Introduction to Centralized Prototype Architecture
Crawler: The prototype Web-Spider must have a Crawler that builds an Index of the companies' Web-sites (e.g. UbiCrawler). We can hire a Crawler, or build a new one on the basis of several ready-made products (Nutch, Heritrix, JSpider, etc.). In this business case we used the data extracted by UbiCrawler. For the indexing processes we used Managing Gigabytes for Java (MG4J).
Customer Data-Base: This database contains the identifying information about the companies: VAT number, phone, e-mails, company name, sign, etc.
Scorer: In this step several queries and many matchmaking processes are executed to find the right “match” between the identifying information and the companies' Web-sites. Each match receives a score.
Centralized Scorer
Class diagram: these are the most important classes, showing the principal processes of the Web-Spider (together with the indexing processes).
Centralized Scorer
The Centralized Scorer performs the following activities:
• Score
• Query over the phone and VAT number
• Query over the address
• Query over the other fields
• Check over the URL name
• Check over the page type
All these activities take about 5 seconds per company on average, so you must wait that long to complete the analysis of one company. If you have to analyse 56,000 companies, you have to wait about 280,000 seconds!
There is a big problem: the number of companies can be very high!
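A hedged sketch of how such per-company checks might be combined into a score, assuming additive weights; the weights and method below are illustrative, not the thesis's actual AssociaDomini logic:

```java
// Hypothetical sketch of the Scorer's per-company scoring: each
// query/check that succeeds contributes a weight, and the candidate
// site with the highest total score wins. Weights are invented.
public class ScorerSketch {

    static int score(boolean vatFound, boolean phoneFound,
                     boolean addressFound, boolean urlNameMatches) {
        int s = 0;
        if (vatFound)       s += 55;  // VAT number: a very reliable signal
        if (phoneFound)     s += 25;  // phone: good coverage and precision
        if (addressFound)   s += 10;  // weaker field (illustrative weight)
        if (urlNameMatches) s += 10;  // URL contains the company name
        return s;
    }

    public static void main(String[] args) {
        // A site matching the VAT number and the phone outranks one that
        // only matches the company name in its URL.
        System.out.println(score(true, true, false, false));   // 80
        System.out.println(score(false, false, false, true));  // 10
    }
}
```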
Centralized Scorer
We can glance at the Java code that implements some functions:
AssociaDomini constructor: this class mainly implements the logic that performs the “match” between the identifying information and the companies' Web-sites.
Centralized Scorer
[Code listing: associa() method, parts 1/3 and 2/3]
Centralized Scorer
[Code listing, part 3/3: how the associa() method records the results in a log file]
We preferred to use Hibernate because it is an open-source Java persistence framework that performs powerful object-relational mapping.
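As a sketch of how such a mapping might look, here is a hypothetical Hibernate XML mapping for a company entity; the class, table and column names are invented for this illustration and are not the thesis's actual schema:

```xml
<!-- Illustrative Hibernate mapping for a company entity
     (class/table/column names are invented for this example) -->
<hibernate-mapping>
  <class name="it.unical.spider.Company" table="COMPANIES">
    <id name="id" column="ID">
      <generator class="native"/>
    </id>
    <property name="vatNumber"   column="VAT_NUMBER"/>
    <property name="phone"       column="PHONE"/>
    <property name="companyName" column="COMPANY_NAME"/>
    <property name="officialUrl" column="OFFICIAL_URL"/>
  </class>
</hibernate-mapping>
```

With a mapping like this, the scorer can persist each match result as a plain Java object instead of writing SQL by hand.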
Results Achieved
On a sample of 56,000 companies:

Query           Coverage (#)   Coverage (%)
Sign                   2,747          4.43%
Phone                 25,715         41.47%
VAT Number             4,369          7.05%
Company Name          27,487         44.33%
How many companies can you cover with these queries? What precision can you achieve?
Phone and VAT number: these types of query are very good for both coverage and reliability.
Sign: low coverage.
Company name: very good coverage but low precision.
Query           Precision (%)
Sign                       1%
Phone                     25%
VAT Number                55%
Company Name               3%
Results Achieved
[Chart “Trend (s)”: total computation time in seconds (scale 0–300,000) for samples of 500, 1,000, 10,000 and 56,000 companies.]
For a sample of 56,000 companies, one personal computer works for 77 hours (for this computation alone).
Personal computer used: desktop, Intel Dual Core 2.4 GHz, 2 GB of RAM and a 1 TB hard disk.
Possible problems: the personal computer may go down; there may be new companies (updates) or some Web-sites may change, in which case the computation must continue…
The matchmaking and indexing processes are frequent over time!
For 1,000,000 companies to analyse, one personal computer works for about 1,389 hours (nearly 58 days), and that is the ideal case…
This is not a scalable solution!
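The scaling above can be checked with a few lines of Java; the 5-seconds-per-company figure is the measured average from the previous slides, and the rest is simple arithmetic:

```java
// Back-of-the-envelope check of the centralized Scorer's running time:
// ~5 seconds per company, processed sequentially on one machine.
public class TimeEstimate {

    static long totalSeconds(long companies, long secondsPerCompany) {
        return companies * secondsPerCompany;
    }

    public static void main(String[] args) {
        long small = totalSeconds(56_000, 5);     // 280,000 s, about 77-78 h
        long big   = totalSeconds(1_000_000, 5);  // 5,000,000 s, about 1,389 h
        System.out.println("56,000 companies:    " + small + " s");
        System.out.println("1,000,000 companies: " + big + " s");
    }
}
```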
A possible solution: Distributed Scorer
We want to build a scalable solution for our Web-Spider.
There are some important constraints that we have to respect:
1) Energy saving;
2) Efficient resource management and utilization;
3) Cost cutting;
4) Analysing more companies in the same amount of time.
We can submit each set of queries to a different computer!
Distributed Scorer
We built a Distributed Scorer using the Condor middleware.
This is a possible architecture on which to execute our Distributed Scorer.
Example of an architecture used by the National Institute of Nuclear Physics.
Distributed Scorer
We used both vertical and horizontal distribution.
We built a wrapper class to prepare the work environment on Condor. This class implements the logical connection between the application and Condor, and it runs on the server (the Central Manager).
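A minimal sketch of what such a wrapper might look like, assuming one Java-universe job per company; the class, method and file names here are invented for illustration and are not the thesis's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the wrapper ("Builder") class: for each company
// it writes a Condor Submit Description File and would then hand it to
// condor_submit on the Central Manager.
public class JobBuilder {

    // Builds the text of a Submit Description File for one company's
    // Scorer job (Java universe). All names are illustrative.
    static String makeSubmitFile(String companyId) {
        return String.join("\n",
            "universe   = java",
            "executable = Scorer.class",
            "arguments  = Scorer " + companyId,
            "output     = scorer_" + companyId + ".out",
            "error      = scorer_" + companyId + ".err",
            "log        = scorer_" + companyId + ".log",
            "queue");
    }

    public static void main(String[] args) throws IOException {
        Path sub = Path.of("scorer_83879.sub");
        Files.writeString(sub, makeSubmitFile("83879"));
        // On the Central Manager the wrapper would now invoke:
        //   new ProcessBuilder("condor_submit", sub.toString()).start();
        System.out.println("wrote " + sub);
    }
}
```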
Distributed Scorer
[Figure: on the Central Manager, the Builder calls makeFile(impresa) and Execute() to generate one Job per company and submit the Jobs to the pool.]
Distributed Scorer
We can look at some tests of our application on Condor:
Some examples of Submit Description Files, which Condor uses for the matchmaking processes between the resources and the jobs.
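For reference, here is a hedged example of what a Submit Description File for one Scorer job could look like; the file names, arguments and requirements expression are invented for this illustration:

```text
# Illustrative Condor Submit Description File for one Scorer job
# (file names, arguments and requirements are invented for this example)
universe                = java
executable              = Scorer.class
jar_files               = scorer.jar
arguments               = Scorer company_83879
requirements            = (OpSys == "LINUX") || (OpSys == "WINDOWS")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = scorer.out
error                   = scorer.err
log                     = scorer.log
queue
```

The requirements expression feeds the ClassAd matchmaking described earlier, so a heterogeneous pool can run the same Java job on different operating systems.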
This is our Condor Pool during the tests
Distributed Scorer
Our application submits the jobs… We can check the status of our jobs…
If we have more jobs, we can also check the status of our resources…
Now we have to check which results we achieved with this Distributed Scorer! Which is better?
New Results Achieved
(1) We have excellent work-load balancing and efficient resource utilization…
[Chart: number of computations (scale 0–7,000,000) completed in 10, 50 and 500 hours by 1 PC, 50 PCs and 500 PCs.]
(2) We can see how it is possible to increase the number of computations in a given period of time (using high-throughput computing). It works even better with a larger sample of companies. (+ scalability!)
[Chart: seconds (scale 0–3,000,000) needed to analyse 56,000, 250,000 and 500,000 companies with 1 PC, 50 PCs and 500 PCs, showing the marginal gain of adding machines.]
New Results Achieved
We can think of using Internet users' machines when they are idle… or we can use the companies' machines, because they can use our Web-Spider for direct marketing…
Make profit with your idle CPU cycles!
[Chart: yearly energy cost (scale €0–€1,200,000) for 1, 50 and 500 machines, comparing the company's yearly energy cost with its own machines against the increase in energy cost when users' machines are employed.]
You can save a lot of money and a lot of energy!
From “local” business case to the big business case…
The Googleplex is the corporate headquarters complex of Google, Inc., located at 1600 Amphitheatre Parkway in Mountain View, Santa Clara County, California, near San Jose.Google purchased some of Silicon Graphics' properties, including the Googleplex, for $319 million.In late 2006 and early 2007 the company installed a series of solar panels, capable of producing 1.6 megawatts of electricity. At the time, it was believed to be the largest corporate installation in the United States. About 30 percent of the Googleplex's electricity needs will be fulfilled by this project, with the remainder being purchased.