High-throughput Computing and Opportunistic Computing for Matchmaking Processes and Indexing Processes

Thesis transcript

University of Calabria
Bachelor thesis in Computer Engineering

Supervisor: Ing. Carlo Mastroianni
Candidate: Silvio Sangineto
Matriculation number: 83879
Academic year 2007-2008
Contents
• Introduction to the Thesis
• Introduction to Distributed Systems
• Introduction to the Grid, High-throughput Computing and Opportunistic Computing
• Condor
• Why Condor?
• Introduction to the Prototype Architecture
• Centralized Prototype Architecture
• Centralized Scorer
• Results Achieved
• A Possible Solution: the Distributed Scorer
• Distributed Scorer
• New Results Achieved
• From the “local” business case to the big business case…
Introduction to the Thesis
Creation of a distributed Web-Spider, with particular attention to efficiency, scalability, energy saving and costs.
Description: The goal of this project is to recover the URLs of Italian companies. This recovery is possible because we can use a customer database with general information such as VAT number, phone, e-mails, etc. This information can be matched against Web-site contents, so we can find the official Web-site of each company.
Why: Knowing the official Web-site is very important because you can quickly learn:
• its contacts and e-mail addresses;
• updates and news previews;
• descriptions of the company's activities;
• other information (e.g. its history).
Currently no complete list of the Italian companies that have a Web-site exists in Italy!
Introduction to the Thesis
Boundary conditions for my thesis:
• It is difficult to estimate how many companies have a Web-site (coverage level);
• Web-site structures can have many non-standard parts (some Web-sites may not contain the VAT number, e-mail address, etc.);
• Updating the database that contains the URLs must make it possible to catch the Web-site of a new company and the new Web-site of an old company;
• There are some privacy issues (e.g. e-mail addresses).
Relevant problems for my thesis: load balancing, efficient resource utilization; scalability; costs; energy saving.
Usually, when the Web-Spiders that exist on the Web (e.g. Google's) need more computational power, the company buys more servers to provide it (the general solution)!
Introduction to the Thesis
We want to answer the relevant problems in the “local” business case so that we can reuse these solutions in the “big” business case!
Introduction to Distributed Systems
Advantages of a distributed system: reliability; sharing of resources; aggregate computing power; scalability.
Definition:A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility.
In our case we use a distributed system to have more computational power…
Grid Computing, High-throughput computing and opportunistic computing
Grid Computing: Grids are intrinsically distributed and heterogeneous but must be viewed by the user (whether an individual or another computer) as a virtual environment with uniform access to resources. Much of Grid software technology addresses the issues of resource scheduling, quality of service, fault tolerance, decentralized control and security and so on, which enable the Grid to be perceived as a single virtual platform by the user.
High-throughput computing: The goal of a high-throughput computing environment is to provide large amounts of fault-tolerant computational power over prolonged periods of time by effectively utilizing all resources available to the network.
Opportunistic computing: The goal of opportunistic computing is the ability to utilize resources whenever they are available, without requiring 100% availability.
The two goals are naturally coupled: high-throughput computing is most easily achieved through opportunistic means.
Condor
“Modern processing environments that consist of large collections of workstations interconnected by high-capacity networks raise the following challenging question: can we satisfy the needs of users who need extra capacity without lowering the quality of service experienced by the owners of under-utilized workstations? … The Condor scheduling system is our answer to this question.”
At the University of Wisconsin, Miron Livny combined his 1983 doctoral thesis on cooperative processing with the powerful Crystal Multicomputer designed by DeWitt, Finkel, and Solomon and the novel Remote UNIX software designed by Litzkow. The result was Condor, a new system for distributed computing. The goal of the Condor Project is to develop, implement, deploy, and evaluate mechanisms and policies that support high-throughput computing and opportunistic computing on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Condor Team has been building software tools that enable scientists and engineers to increase their computing throughput. Condor is a middleware that allows users to join and use distributed resources.
Condor
Condor is a specialized job and resource management system (RMS) for compute-intensive jobs. Like other full-featured systems, Condor provides a job management mechanism, scheduling policy, priority scheme, resource monitoring and resource management. Users submit their jobs to Condor, and Condor subsequently chooses when and where to run them based upon a policy, monitors their progress, and ultimately informs the user upon completion.
Two very important mechanisms:
ClassAds: The ClassAd mechanism in Condor provides an extremely flexible and expressive framework for matching resource requests (e.g. jobs) with resource offers (e.g. machines).
Remote System Calls: When running jobs on remote machines, Condor can often preserve the local execution environment via remote system calls. Remote system calls are one of Condor's mobile sandbox mechanisms for redirecting all of a job's I/O-related system calls back to the machine that submitted the job. Therefore, users do not need to make data files available on remote workstations before Condor executes their programs there, even in the absence of a shared file system.
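As an illustration of the ClassAd idea, here is a hedged sketch of a job ad and a machine ad that would match; the attribute names follow Condor's conventions, but the concrete values and the owner name are invented for this example:

```text
// Job ClassAd (what the agent advertises)
MyType       = "Job"
Owner        = "sangineto"                 // invented value
Requirements = (OpSys == "LINUX" && Memory >= 512)

// Machine ClassAd (what the resource advertises)
MyType       = "Machine"
OpSys        = "LINUX"
Memory       = 2048
Requirements = (LoadAvg < 0.3)             // only serve jobs when nearly idle
```

The matchmaker pairs the two ads because each side's Requirements expression evaluates to true against the other side's attributes.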
Condor
How does Condor work?
This is an example [an agent (A) executes a job on a resource (R) with the help of a matchmaker (M)]:
Step 1: The agent and the resource advertise themselves to the matchmaker. Step 2: The matchmaker informs the two parties that they are potentially compatible. Step 3: The agent contacts the resource and executes the job.
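These three steps can be sketched as a toy matchmaker in Java; the attribute maps and the single-requirement check are simplifications for illustration, not Condor's real ClassAd evaluation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of Condor-style matchmaking: an agent ad and a resource ad
// are matched when the job's requirement is satisfied by the machine.
public class Matchmaker {

    // Step 1: both parties "advertise" themselves as attribute maps.
    static Map<String, String> agentAd() {
        Map<String, String> ad = new HashMap<>();
        ad.put("Type", "Job");
        ad.put("RequiredOpSys", "LINUX");   // the job's requirement
        return ad;
    }

    static Map<String, String> resourceAd() {
        Map<String, String> ad = new HashMap<>();
        ad.put("Type", "Machine");
        ad.put("OpSys", "LINUX");           // the machine's property
        return ad;
    }

    // Step 2: the matchmaker checks whether the parties are compatible.
    static boolean match(Map<String, String> job, Map<String, String> machine) {
        return job.get("RequiredOpSys").equals(machine.get("OpSys"));
    }

    public static void main(String[] args) {
        // Step 3 (contacting the resource and running the job) is out of
        // scope here; we only report whether the parties are compatible.
        System.out.println(match(agentAd(), resourceAd()) ? "compatible" : "no match");
    }
}
```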
This figure shows the major processes in a Condor system
Condor
What happens when you have more Condor pools?
This is an example [an agent (A) executes a job on a resource (R) via direct flocking]:
Step 1: The agent and the resource advertise themselves locally. Step 2: The agent is unsatisfied, so it also advertises itself to Condor Pool B. Step 3: The matchmaker (M) informs the two parties that they are potentially compatible. Step 4: The agent contacts the resource and executes the job.
Condor
Condor universes: Condor offers several runtime environments (each called a universe) from which to choose. The Java universe was the best for this first version of our project, because I could take advantage of portability (heterogeneous systems) and it suited the “local” business case. A universe for Java programs was added to Condor in late 2001, due to a growing community of scientific users who wished to perform simulations and other work in Java. Although such programs might run slower than native code, the losses were offset by faster development times and access to larger numbers of machines.
Why Condor?
• We used Condor for the following reasons (among others):
1) Efficient resource management (opportunistic computing and high-throughput computing, ClassAds, etc.);
2) It is a middleware for heterogeneous distributed systems (e.g. we can use different operating systems);
3) It is an open-source project and is used as a batch system in many projects around the world;
4) Flexibility.
Introduction to Centralized Prototype Architecture
[Figure: centralized prototype architecture. A Crawler visits company Web-sites and builds an Index (MakeIndex). The Scorer runs queries against the Index, using the identifying information stored in the Customer Data-Base, and produces candidate matches. A Manual Validator checks the candidates before they enter the URL Data-Base, and an Updater keeps that database current with new companies and new Web-sites.]
Introduction to Centralized Prototype Architecture
Crawler: The prototype Web-Spider must have a Crawler that builds an Index of the companies' Web-sites (e.g. UbiCrawler). We can hire a Crawler, or build a new one on the basis of several ready-made products (Nutch, Heritrix, JSpider, etc.). In this business case we used the data extracted by UbiCrawler. For the indexing processes we used Managing Gigabytes for Java (MG4J).
Customer Data-Base: This database contains the identifying information about the companies: VAT number, phone, e-mails, company name, sign, etc.
Scorer: In this step several queries and many matchmaking processes are executed to find the right “match” between the identifying information and the companies' Web-sites. Each match receives a score.
Centralized Scorer
Class diagram: these are the most important classes, showing the principal processes of the Web-Spider (together with the indexing processes).
Centralized Scorer
The Centralized Scorer performs the following activities:
• Score
• Query over the phone and VAT number
• Query over the address
• Query over the other fields
• Check over the URL name
• Check over the page type
All these activities take about 5 seconds per company on average, so you must wait that long to complete the analysis of one company. If you have to analyse 56,000 companies, you have to wait about 280,000 seconds!
There is a big problem: the number of companies can be very high!
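A hedged sketch of how such per-company checks might be combined into a score, assuming additive weights; the weights and method below are illustrative, not the thesis's actual AssociaDomini logic:

```java
// Hypothetical sketch of the Scorer's per-company scoring: each
// query/check that succeeds contributes a weight, and the candidate
// site with the highest total score wins. Weights are invented.
public class ScorerSketch {

    static int score(boolean vatFound, boolean phoneFound,
                     boolean addressFound, boolean urlNameMatches) {
        int s = 0;
        if (vatFound)       s += 55;  // VAT number: a very reliable signal
        if (phoneFound)     s += 25;  // phone: good coverage and precision
        if (addressFound)   s += 10;  // weaker field (illustrative weight)
        if (urlNameMatches) s += 10;  // URL contains the company name
        return s;
    }

    public static void main(String[] args) {
        // A site matching the VAT number and the phone outranks one that
        // only matches the company name in its URL.
        System.out.println(score(true, true, false, false));   // 80
        System.out.println(score(false, false, false, true));  // 10
    }
}
```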
Centralized Scorer
We can glance at the Java code that implements some functions:
AssociaDomini constructor: this class mainly implements the logic that performs the “match” between the identifying information and the companies' Web-sites.
Centralized Scorer
[Code listing: associa() method, parts 1/3 and 2/3]
Centralized Scorer
[Code listing, part 3/3: how the associa() method records the results in a log file]
We preferred to use Hibernate because it is an open-source Java persistence framework that performs powerful object-relational mapping.
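As a sketch of how such a mapping might look, here is a hypothetical Hibernate XML mapping for a company entity; the class, table and column names are invented for this illustration and are not the thesis's actual schema:

```xml
<!-- Illustrative Hibernate mapping for a company entity
     (class/table/column names are invented for this example) -->
<hibernate-mapping>
  <class name="it.unical.spider.Company" table="COMPANIES">
    <id name="id" column="ID">
      <generator class="native"/>
    </id>
    <property name="vatNumber"   column="VAT_NUMBER"/>
    <property name="phone"       column="PHONE"/>
    <property name="companyName" column="COMPANY_NAME"/>
    <property name="officialUrl" column="OFFICIAL_URL"/>
  </class>
</hibernate-mapping>
```

With a mapping like this, the scorer can persist each match result as a plain Java object instead of writing SQL by hand.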
Results Achieved
On a sample of 56,000 companies:

Query           Coverage (#)   Coverage (%)
Sign                   2,747          4.43%
Phone                 25,715         41.47%
VAT Number             4,369          7.05%
Company Name          27,487         44.33%
How many companies can you cover with these queries? What precision can you achieve?
Phone and VAT number: these types of query are very good for both coverage and reliability.
Sign: low coverage.
Company name: very good coverage but low precision.
Query           Precision (%)
Sign                       1%
Phone                     25%
VAT Number                55%
Company Name               3%
Results Achieved
[Chart “Trend (s)”: total computation time in seconds (scale 0–300,000) for samples of 500, 1,000, 10,000 and 56,000 companies.]
For a sample of 56,000 companies, one personal computer works for 77 hours (for this computation alone).
Personal computer used: desktop, Intel Dual Core 2.4 GHz, 2 GB of RAM and a 1 TB hard disk.
Possible problems: the personal computer may go down; there may be new companies (updates) or some Web-sites may change, in which case the computation must continue…
The matchmaking and indexing processes are frequent over time!
For 1,000,000 companies to analyse, one personal computer works for about 1,389 hours (nearly 58 days), and that is the ideal case…
This is not a scalable solution!
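The scaling above can be checked with a few lines of Java; the 5-seconds-per-company figure is the measured average from the previous slides, and the rest is simple arithmetic:

```java
// Back-of-the-envelope check of the centralized Scorer's running time:
// ~5 seconds per company, processed sequentially on one machine.
public class TimeEstimate {

    static long totalSeconds(long companies, long secondsPerCompany) {
        return companies * secondsPerCompany;
    }

    public static void main(String[] args) {
        long small = totalSeconds(56_000, 5);     // 280,000 s, about 77-78 h
        long big   = totalSeconds(1_000_000, 5);  // 5,000,000 s, about 1,389 h
        System.out.println("56,000 companies:    " + small + " s");
        System.out.println("1,000,000 companies: " + big + " s");
    }
}
```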
A possible solution: Distributed Scorer
We want to build a scalable solution for our Web-Spider.
There are some important constraints that we have to respect:
1) Energy saving;
2) Efficient resource management and utilization;
3) Cost cutting;
4) Analysing more companies in the same amount of time.
We can submit each set of queries to a different computer!
Distributed Scorer
We built a Distributed Scorer using the Condor middleware.
This is a possible architecture on which to execute our Distributed Scorer.
Example of an architecture used by the National Institute of Nuclear Physics.
Distributed Scorer
We used both vertical and horizontal distribution.
We built a wrapper class to prepare the work environment on Condor. This class implements the logical connection between the application and Condor, and it runs on the server (the Central Manager).
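A minimal sketch of what such a wrapper might look like, assuming one Java-universe job per company; the class, method and file names here are invented for illustration and are not the thesis's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the wrapper ("Builder") class: for each company
// it writes a Condor Submit Description File and would then hand it to
// condor_submit on the Central Manager.
public class JobBuilder {

    // Builds the text of a Submit Description File for one company's
    // Scorer job (Java universe). All names are illustrative.
    static String makeSubmitFile(String companyId) {
        return String.join("\n",
            "universe   = java",
            "executable = Scorer.class",
            "arguments  = Scorer " + companyId,
            "output     = scorer_" + companyId + ".out",
            "error      = scorer_" + companyId + ".err",
            "log        = scorer_" + companyId + ".log",
            "queue");
    }

    public static void main(String[] args) throws IOException {
        Path sub = Path.of("scorer_83879.sub");
        Files.writeString(sub, makeSubmitFile("83879"));
        // On the Central Manager the wrapper would now invoke:
        //   new ProcessBuilder("condor_submit", sub.toString()).start();
        System.out.println("wrote " + sub);
    }
}
```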
Distributed Scorer
[Figure: on the Central Manager, the Builder calls makeFile(impresa) and Execute() to generate one Job per company and submit the Jobs to the pool.]
Distributed Scorer
We can look at some tests of our application on Condor:
Some examples of Submit Description Files, which Condor uses for the matchmaking processes between the resources and the jobs.
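For reference, here is a hedged example of what a Submit Description File for one Scorer job could look like; the file names, arguments and requirements expression are invented for this illustration:

```text
# Illustrative Condor Submit Description File for one Scorer job
# (file names, arguments and requirements are invented for this example)
universe                = java
executable              = Scorer.class
jar_files               = scorer.jar
arguments               = Scorer company_83879
requirements            = (OpSys == "LINUX") || (OpSys == "WINDOWS")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = scorer.out
error                   = scorer.err
log                     = scorer.log
queue
```

The requirements expression feeds the ClassAd matchmaking described earlier, so a heterogeneous pool can run the same Java job on different operating systems.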
This is our Condor Pool during the tests
Distributed Scorer
Our application submits the jobs… We can check the status of our jobs…
If we have more jobs, we can also check the status of our resources…
Now we have to check which results we achieved with this Distributed Scorer! Which is better?
New Results Achieved
(1) We have excellent work-load balancing and efficient resource utilization…
[Chart: number of computations (scale 0–7,000,000) completed in 10, 50 and 500 hours by 1 PC, 50 PCs and 500 PCs.]
(2) We can see how it is possible to increase the number of computations in a given period of time (using high-throughput computing). It works even better with a larger sample of companies. (+ scalability!)
[Chart: seconds (scale 0–3,000,000) needed to analyse 56,000, 250,000 and 500,000 companies with 1 PC, 50 PCs and 500 PCs, showing the marginal gain of adding machines.]
New Results Achieved
We can think of using Internet users' machines when they are idle… or we can use the companies' machines, because they can use our Web-Spider for direct marketing…
Make profit with your idle CPU cycles!
[Chart: yearly energy cost (scale €0–€1,200,000) for 1, 50 and 500 machines, comparing the company's yearly energy cost with its own machines against the increase in energy cost when users' machines are employed.]
You can save a lot of money and a lot of energy!
From “local” business case to the big business case…
The Googleplex is the corporate headquarters complex of Google, Inc., located at 1600 Amphitheatre Parkway in Mountain View, Santa Clara County, California, near San Jose.Google purchased some of Silicon Graphics' properties, including the Googleplex, for $319 million.In late 2006 and early 2007 the company installed a series of solar panels, capable of producing 1.6 megawatts of electricity. At the time, it was believed to be the largest corporate installation in the United States. About 30 percent of the Googleplex's electricity needs will be fulfilled by this project, with the remainder being purchased.