
1616 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 10, NO. 2, MAY 2014

Self-Adaptive Semantic Focused Crawler for Mining Services Information Discovery

Hai Dong, Member, IEEE, and Farookh Khadeer Hussain

Abstract—It is well recognized that the Internet has become the largest marketplace in the world, and online advertising is very popular with numerous industries, including the traditional mining service industry, where mining service advertisements are effective carriers of mining service information. However, service users may encounter three major issues – heterogeneity, ubiquity, and ambiguity – when searching for mining service information over the Internet. In this paper, we present the framework of a novel self-adaptive semantic focused (SASF) crawler, with the purpose of precisely and efficiently discovering, formatting, and indexing mining service information over the Internet, taking into account the three major issues. This framework incorporates the technologies of semantic focused crawling and ontology learning in order to maintain the performance of this crawler, regardless of the variety in the Web environment. The innovations of this research lie in the design of an unsupervised framework for vocabulary-based ontology learning, and a hybrid algorithm for matching semantically relevant concepts and metadata. A series of experiments are conducted in order to evaluate the performance of this crawler. The conclusion and the direction of future work are given in the final section.

Index Terms—Mining service industry, ontology learning, semantic focused crawler, service advertisement, service information discovery.

I. INTRODUCTION

IT is well recognized that information technology has a profound effect on the way business is conducted, and the Internet has become the largest marketplace in the world. It is estimated that there were over 2 billion Internet users in 2011, with an estimated annual growth of over 16%, compared with 360 million users in 2000.¹ Innovative business professionals have realized the commercial applications of the Internet both for their customers and strategic partners, turning the Internet into an enormous shopping mall with a huge catalogue. Consumers are able to browse a huge range of products and service

Manuscript received January 19, 2012; revised April 06, 2012; accepted October 30, 2012. Date of publication December 20, 2012; date of current version May 02, 2014. Paper no. TII-12-0022.
H. Dong is with the School of Information Systems, Curtin Business School, Curtin University of Technology, Perth, WA 6845, Australia (e-mail: [email protected]).
F. K. Hussain is with the School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW 2007, Australia (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TII.2012.2234472

¹ http://www.internetworldstats.com/

advertisements over the Internet, and buy these goods directly through online transaction systems [1]. Service advertisements form a considerable part of the advertising which takes place over the Internet, and have the following features:

A. Heterogeneity

Given the diversity of services in the real world, many schemes have been proposed to classify services from various perspectives, including the ownership of service instruments [2], the effects of services [3], the nature of the service act, delivery, demand and supply [4], and so on. Nevertheless, there is no publicly agreed scheme available for classifying service advertisements over the Internet. Furthermore, whilst many commercial product and service search engines provide classification schemes of services with the purpose of facilitating a search, they do not really distinguish between the product and the service advertisement; instead, they combine both into one taxonomy.

B. Ubiquity

Service advertisements can be registered by service providers through various service registries, including 1) global business search engines, such as Business.com² and Kompass³; 2) local business directories, such as Google™ Local Business Center⁴ and local Yellowpages®⁵; 3) domain-specific business search engines, such as healthcare, industry and tourism business search engines; and 4) search engine advertising, such as Google™⁶ and Yahoo!®⁷ Advertising Home [5]. These service registries are geographically distributed over the Internet.

C. Ambiguity

Most of the online service advertising information is embedded in a vast amount of information on the Web and is described in natural language; therefore, it may be ambiguous. Moreover, online service information does not have a consistent format and standard, and varies from Web page to Web page.

Mining is one of the oldest industries in human history, having emerged with the beginning of human civilization. Mining services refer to a series of services which support mining, quarrying, and oil and gas extraction activities [6]. In Australia, the mining industry contributed about 7.7% of the

² http://www.business.com/
³ http://www.kompass.com/
⁴ http://www.google.com/local/add/businessCenter/
⁵ http://www.yellowpages.com/
⁶ http://www.google.com/
⁷ http://www.yahoo.com/


Australian GDP between 2007 and 2008, to which the field of mining services contributed 7.65% [6]. Since the advent of the information age, mining service companies have realized the power of online advertising, and they have attempted to promote themselves by actively joining the service advertising community. It was found that nearly 50,000 companies worldwide have registered their services on the Kompass website. However, these mining service advertisements are also subject to the issues of heterogeneity, ubiquity and ambiguity, which prevent users from precisely and efficiently searching for mining service information over the Internet.

Service discovery is an emerging research area in the domain of industrial informatics, which aims to automatically or semi-automatically retrieve services or service information in particular environments by means of various IT methods. Many studies have been carried out in the environments of wireless networks [7]–[9] and distributed industrial systems [10]. However, few studies have been planned for industrial service advertisement discovery in the Web environment, taking into account the heterogeneous, ubiquitous and ambiguous features of service advertising information.

In order to address the above problems, in this paper, we propose the framework of a novel self-adaptive semantic focused (SASF) crawler, combining the technologies of semantic focused crawling and ontology learning, whereby semantic focused crawling technology is used to solve the issues of heterogeneity, ubiquity and ambiguity of mining service information, and ontology learning technology is used to maintain the high performance of crawling in the uncontrolled Web environment. This crawler is designed with the purpose of helping search engines to precisely and efficiently search mining service information by semantically discovering, formatting, and indexing information.

The rest of this paper is organized as follows: in Section II, we review the related work in the field of ontology-learning-based focused crawling and address the research issues in this field; in Section III, we present the framework of the SASF crawler, including the mining service ontology, the metadata schema, and the workflow of this crawler; in Section IV, we deliver a hybrid concept-metadata matching algorithm to help this crawler semantically index the mining service information; in Section V, we conduct a series of experiments in order to empirically evaluate the framework of the crawler; and in the final section, we discuss the features and the limitations of this work and propose our future work.

II. RELATED WORK

In this section, we briefly introduce the fields of semantic focused crawling and ontology-learning-based focused crawling, and review previous work on ontology-learning-based focused crawling.

A semantic focused crawler is a software agent that is able to traverse the Web, and to retrieve as well as download Web information related to specific topics by means of semantic technologies [11], [12]. Since semantic technologies provide shared knowledge for enhancing the interoperability between heterogeneous components, they have been broadly applied in the field of industrial automation [13]–[15]. The goal of semantic focused crawlers is to precisely and efficiently retrieve and download relevant Web information by automatically understanding the semantics underlying the Web information and the semantics underlying the predefined topics. A survey conducted by Dong et al. [16] found that most of the crawlers in this domain make use of ontologies to represent the knowledge underlying topics and Web documents. However, the limitation of ontology-based semantic focused crawlers is that the crawling performance crucially depends on the quality of the ontologies. Furthermore, the quality of ontologies may be affected by two issues. The first issue is that, since an ontology is the formal representation of specific domain knowledge [17] and ontologies are designed by domain experts, a discrepancy may exist between the domain experts' understanding of the domain knowledge and the domain knowledge that exists in the real world. The second issue is that knowledge is dynamic and constantly evolving, whereas ontologies are relatively static. These two contradictory situations could lead to the problem that ontologies sometimes cannot precisely represent real-world knowledge, considering the issues of differentiation and dynamism. The reflection of this problem in the field of semantic focused crawling is that the ontologies used by semantic focused crawlers cannot precisely represent the knowledge revealed in Web information, since Web information is mostly created or updated by human users with different understandings of knowledge, and human users are efficient learners of new knowledge. The eventual consequence of this problem is reflected in the gradually descending performance curves of semantic focused crawlers.

In order to solve the defects in ontologies and maintain

or enhance the performance of semantic focused crawlers, researchers have begun to pay attention to enhancing semantic focused crawling technologies by integrating them with ontology learning technologies. The goal of ontology learning is to semi-automatically extract facts or patterns from a corpus of data and turn these into machine-readable ontologies [18]. Various techniques have been designed for ontology learning, such as statistics-based techniques, linguistics (or natural language processing)-based techniques, logic-based techniques, etc. These techniques can also be classified into supervised, semi-supervised, and unsupervised techniques from the perspective of learning control. Obviously, ontology learning techniques can be used to solve this issue of semantic focused crawling, by learning new knowledge from crawled documents and integrating the new knowledge with ontologies in order to constantly refine the ontologies. In the rest of this section, we will review the two existing studies in the field of ontology-learning-based semantic focused crawling.

Zheng et al. [19] proposed a supervised ontology-learning-based focused crawler that aims to maintain the harvest rate of the crawler in the crawling process. The main idea of this crawler is to construct an artificial neural network (ANN) model to determine the relatedness between a Web document and an ontology. Given a domain-specific ontology and a topic represented by a concept in the ontology, a set of relevant concepts is selected to represent the background knowledge of the topic by counting the distance between the topic concept and the other concepts in the ontology. The crawler then calculates the term frequency of the relevant concepts occurring in the visited Web documents. Next, the authors used the backpropagation algorithm to train a three-layer feedforward ANN model. The first layer is a linear layer with a transfer function f(x) = x. The number of input nodes depends on the number of relevant concepts. The hidden layer is a sigmoid layer with a transfer function f(x) = 1/(1 + e^(-x)), and there are four nodes in the hidden layer. The output layer is also a sigmoid layer. The output of the ANN is the relevance score between the topic and a Web document. The training process follows a supervised paradigm, whereby the ANN is trained on labeled Web documents. The training does not stop until the root mean square error (RMSE) is less than 0.01. The limitations of this approach are: 1) it can only be used to enhance the harvest rate of crawling but does not have the function of classification; 2) it cannot be used to evolve ontologies by enriching the vocabulary of ontologies; and 3) the supervised learning may not work within an uncontrolled Web environment with unpredicted new terms.

Su et al. [20] proposed an unsupervised ontology-learning-based focused crawler in order to compute the relevance scores between topics and Web documents. Given a specific domain ontology and a topic represented by a concept in this ontology, the relevance score between a Web document and the topic is the weighted sum of the occurrence frequencies of all the concepts of the ontology in the Web document. The original weight of each concept c_i is γ^d(t, c_i), where γ (0 < γ < 1) is a predefined discount factor, and d(t, c_i) is the distance between the topic concept t and c_i. Next, this crawler makes use of reinforcement learning, which is a probabilistic framework for learning optimal decision making from rewards or punishments [21], in order to train the weight of each concept. The learning step follows an unsupervised paradigm, which uses the crawler to download a number of Web documents and learn statistics based on these Web documents. The learning step can be repeated many times. The weight of a concept c_i with respect to a topic t in learning step s is mathematically expressed as follows:

w_s(c_i) = [n_s(c_i, t) / n_s(t)] / [n_s(c_i) / N_s]    (1)

where n_s(c_i) is the number of Web documents in which c_i occurs, n_s(c_i, t) is the number of Web documents in which c_i and t co-occur, N_s is the total number of Web documents crawled, and n_s(t) is the number of Web documents in which t occurs. Compared with Zheng et al.'s approach [19], this approach is able to classify Web documents by means of the concepts in an ontology, to learn the weights of relations between concepts, and to work in an uncontrolled Web environment thanks to the unsupervised learning paradigm. The limitations of Su et al.'s approach are: 1) it cannot be used to enrich the vocabulary of ontologies; and 2) although the unsupervised learning paradigm can work in an uncontrolled Web environment, it may not work well when numerous new terms emerge or when ontologies have a limited range of vocabulary.

By means of a comparative analysis of the two ontology-based focused crawlers, we found a common limitation, which is that neither of the two crawlers is able to really evolve ontologies by enriching their contents, namely their vocabularies. It is found that both approaches attempt to use learning models to deduce the quantitative relationship between the occurrence frequencies of the concepts in an ontology and the topic, which may not be applicable in the real Web environment. When numerous unpredictable new terms outside the scope of the vocabulary of an ontology emerge in Web documents, these approaches cannot determine the relatedness between the new terms and the topic, and cannot make use of the new terms for the relatedness determination, which could result in a decline in their performance. Consequently, in order to address this research issue, we propose to design an innovative ontology-learning-based focused crawler to precisely discover, format and index relevant Web documents in the uncontrolled Web environment.
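The ANN-based relevance model attributed to Zheng et al. above can be sketched in a few lines of Python. Only the network shape comes from the text (linear input layer, a four-node sigmoid hidden layer, a sigmoid output, and backpropagation until an RMSE target); the class and variable names, the learning rate, the weight initialization, and the toy data are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RelevanceANN:
    """Three-layer feedforward net: linear input layer, 4-node sigmoid
    hidden layer, sigmoid output scoring topic-document relevance in [0, 1]."""

    def __init__(self, n_concepts, n_hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.5, size=(n_concepts, n_hidden))
        self.W2 = rng.normal(scale=0.5, size=(n_hidden, 1))

    def forward(self, X):
        # X: one row per document, holding term frequencies of the
        # relevant concepts (the linear input layer passes them through)
        self.H = sigmoid(X @ self.W1)
        return sigmoid(self.H @ self.W2)

    def train(self, X, y, lr=2.0, rmse_target=0.01, max_epochs=20000):
        # plain batch backpropagation; stop once RMSE drops below the target
        y = y.reshape(-1, 1)
        for _ in range(max_epochs):
            out = self.forward(X)
            err = out - y
            if np.sqrt(np.mean(err ** 2)) < rmse_target:
                break
            d_out = err * out * (1.0 - out)                      # output-layer delta
            d_hid = (d_out @ self.W2.T) * self.H * (1.0 - self.H)  # hidden-layer delta
            self.W2 -= lr * self.H.T @ d_out / len(X)
            self.W1 -= lr * X.T @ d_hid / len(X)
        return float(np.sqrt(np.mean((self.forward(X) - y) ** 2)))
```

After training on labeled documents, a document whose concept frequencies resemble the positive examples scores close to 1, and the score can be used as the harvest-rate filter described above.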
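Su et al.'s concept weighting can likewise be sketched from the quantities the text names: the discounted initial weight, and the four document-frequency counts used in the learned weight. The learned-weight formula below (the ratio of the concept's document frequency among topic documents to its overall document frequency) is one plausible reading of those counts, not necessarily Su et al.'s exact expression, and gamma = 0.5 is an illustrative choice.

```python
def initial_weight(distance, gamma=0.5):
    """Initial weight gamma ** distance of a concept: gamma is the
    predefined discount factor (0 < gamma < 1), distance the ontology
    distance between the topic concept and this concept."""
    return gamma ** distance

def learned_weight(docs, concept, topic):
    """Statistical weight of `concept` w.r.t. `topic` over one crawled
    batch, estimated from document-frequency counts.  `docs` is a list of
    term sets, one per downloaded Web document."""
    N = len(docs)                                        # total documents crawled
    n_c = sum(1 for d in docs if concept in d)           # documents containing the concept
    n_t = sum(1 for d in docs if topic in d)             # documents containing the topic
    n_ct = sum(1 for d in docs if concept in d and topic in d)  # co-occurrences
    if n_c == 0 or n_t == 0:
        return 0.0
    # P(concept | topic) / P(concept): > 1 means the concept is
    # positively associated with the topic in this batch
    return (n_ct / n_t) / (n_c / N)
```

Repeating the learning step over successive batches refines the weights without any labeled data, which is what makes the paradigm unsupervised.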

III. SYSTEM COMPONENTS AND WORKFLOW

In this section, we introduce the system architecture and the workflow of the proposed SASF crawler. It needs to be noted that this crawler is built upon the semantic focused crawler designed in our previous research [11], [12]. The differences between this work and the previous work can be summarized as follows.
• Our previous research created a purely semantic focused crawler, which does not have an ontology-learning function to automatically evolve the utilized ontology. This research aims to remedy this shortcoming.

• Our previous work utilized service ontologies and service metadata formats especially designed for the transportation service domain and the health care service domain. In this research, we design a mining service ontology and a mining service metadata schema to solve the problem of self-adaptive service information discovery for the mining service industry.

An overview of the system architecture and the workflow is shown in Fig. 1. As can be seen, the SASF crawler consists of two knowledge bases – a Mining Service Ontology Base and a Mining Service Metadata Base – and a series of processes, as well as a workflow coordinating these processes. In the rest of this section, we will introduce the two knowledge bases and each process in this workflow.

A. Mining Service Ontology Base and Mining Service Metadata Base

The Mining Service Ontology Base is used to store a mining service ontology, which is the representation of specific mining service domain knowledge. Concepts in the mining service ontology are associated by a generalization/specialization relationship and are organized in the form of a four-level hierarchy. Fig. 2 shows the first and second levels of the concepts in the mining service ontology. Each concept in the mining service ontology represents a mining service sub-domain, and is defined by three properties – conceptDescription, learnedConceptDescription, and linkedMetadata – which are expressed as follows:
• The conceptDescription property is a datatype property used to store the textual descriptions of a mining service


Fig. 1. System architecture and workflow of the proposed self-adaptive semantic focused crawler.

Fig. 2. The mining service ontology.

concept, which consists of several phrases in order to concisely summarize the discriminating features of the corresponding mining service sub-domain. The contents of each conceptDescription property are manually specified by domain experts, and will be used to calculate the similarity value between a mining service concept and a mining service metadata.

• The learnedConceptDescription property is a datatype property that has a purpose similar to that of the conceptDescription property. The difference between the two properties is that the former is automatically learned from Web documents by the SASF crawler.

• The linkedMetadata property is an object property used to associate a mining service concept with semantically relevant mining service metadata. This property is used to semantically index the generated mining service metadata by means of the concepts in the mining service ontology.


The Mining Service Metadata Base is used to store the automatically generated and indexed mining service metadata. Mining service metadata is the abstraction of an actual mining service advertisement published in a Web document. The mining service metadata schema follows a structure similar to the health service metadata schema defined in our previous work [12], by which mining service metadata comprise two parts – mining service provider metadata and mining service metadata. Mining service provider metadata is the abstraction of a service provider's profile, including the service provider's basic introduction, address, contact information, and so on. Mining service metadata has the properties of serviceDescription and linkedConcept, which are stated as follows.
• The serviceDescription property is a datatype property, which contains the texts used to describe the general features of an actual service. The contents of this property are automatically extracted from the mining service advertisements by the SASF crawler. This property will be used for the subsequent concept-metadata similarity computation.

• The linkedConcept property is the inverse property of the linkedMetadata property, which stores the URIs of the semantically relevant mining service concepts of a mining service metadata. It needs to be noted that mining service metadata and mining service concepts can have a many-to-many relationship.

In addition, a mining service metadata and the relevant mining service provider metadata are associated by an object property of isProvidedBy, and this association follows a many-to-one relationship, because, in fact, a service provider can provide more than one service.
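The two schemas and the inverse linkedMetadata/linkedConcept association can be sketched as plain records. This is a hypothetical Python rendering of the property names above, not code from the paper; the field names and the helper are our own.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServiceConcept:
    uri: str
    concept_description: List[str]        # conceptDescription, written by domain experts
    learned_concept_description: List[str] = field(default_factory=list)    # learnedConceptDescription
    linked_metadata: List["ServiceMetadata"] = field(default_factory=list)  # linkedMetadata

@dataclass
class ProviderMetadata:
    introduction: str
    address: str
    contact: str

@dataclass
class ServiceMetadata:
    service_description: str              # serviceDescription, extracted from the advertisement
    linked_concepts: List[ServiceConcept] = field(default_factory=list)     # linkedConcept (inverse of linkedMetadata)
    provider: Optional[ProviderMetadata] = None                             # isProvidedBy (many-to-one)

def associate(concept: ServiceConcept, metadata: ServiceMetadata) -> None:
    """Maintain linkedMetadata and its inverse linkedConcept together,
    keeping the many-to-many association consistent in both directions."""
    concept.linked_metadata.append(metadata)
    metadata.linked_concepts.append(concept)
```

Keeping the forward and inverse links in one helper mirrors how an OWL object property and its declared inverse behave in the ontology.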

B. System Workflow

In this section, we will introduce the system workflow of the SASF crawler step by step, as shown in Fig. 1. The primary goals of this crawler are: 1) to generate mining service metadata from Web pages; and 2) to precisely associate the semantically relevant mining service concepts with mining service metadata at a relatively low computing cost. The second goal is realized by: 1) measuring the semantic relatedness between the conceptDescription and learnedConceptDescription property values of the concepts and the serviceDescription property values of the metadata; and 2) automatically learning new values, namely descriptive phrases, for the learnedConceptDescription properties of the concepts.

As can be seen in Fig. 1, the first step is preprocessing, which processes the contents of the conceptDescription property of each concept in the ontology before matching the metadata and the concepts. This processing is realized by using the Java WordNet Library⁸ (JWNL) to implement tokenization, part-of-speech (POS) tagging, nonsense word filtering, stemming, and synonym searching for the conceptDescription property values of the concepts.

The second and third steps are crawling and term extraction. The aim of these two processes is to download a batch of Web pages from the Internet at one time (the batch size will be explained in Section IV-B), and to extract the required information from the

⁸ http://sourceforge.net/projects/jwordnet/

downloaded Web pages, according to the mining service metadata schema and the mining service provider metadata schema defined in Section III-A, in order to prepare the property values to generate a new group of metadata. These two processes are realized by the semantic focused crawler designed in our previous work [11], [12].

The next step is term processing, which processes the content of the serviceDescription property of the metadata in order to prepare for subsequent concept-metadata matching. The implementation of this process is similar to that of the preprocessing process. The major difference is that term processing does not need the function of synonym searching, for two major reasons: 1) the synonyms of the terms in the conceptDescription properties of the concepts have already been retrieved in preprocessing; and 2) the computing cost of synonym searching for the terms in the serviceDescription property is relatively high, which may influence the scalability of the SASF crawler, as term processing is a real-time process.

The rest of the workflow can be integrated as a self-adaptive

metadata association and ontology learning process. The details of this process are as follows. First of all, the direct string matching process examines whether or not the contents of the serviceDescription property of the metadata are included in the conceptDescription and learnedConceptDescription properties of a concept. If the answer is 'yes', then the concept and the metadata are regarded as semantically relevant. By means of the metadata generation and association process, the metadata can then be generated and stored in the mining service metadata base, as well as being associated with the concept. If the answer is 'no', an algorithm-based string matching process is invoked to check the semantic relatedness between the metadata and the concept, by means of a concept-metadata semantic similarity algorithm (introduced in Section IV). If the concept and the metadata are semantically relevant, the contents of the serviceDescription property of the metadata can be regarded as a new value for the learnedConceptDescription property of the concept. The metadata is thus allowed to go through the metadata generation and association process; otherwise, the metadata is regarded as semantically non-relevant to the concept. The above process is repeated until all the concepts in the mining service ontology have been compared with the metadata. If none of the concepts is semantically relevant to the metadata, the metadata is regarded as semantically non-relevant to the mining service domain and will be dropped.

It needs to be noted that only the conceptDescription property values of the concepts can be used in the algorithm-based string matching process, due to the fact that the semantic relatedness between the concept and the metadata is determined by comparing their algorithm-based property similarity values with a threshold value. If the maximum similarity value between the serviceDescription property value of a metadata and the conceptDescription property values of a concept is higher than the threshold value, the metadata and the concept are regarded as semantically relevant; otherwise not. Hence, the threshold value can be viewed as the boundary for determining whether or not the serviceDescription property value of the metadata is semantically relevant to a conceptDescription property value of a concept, and the conceptDescription property values of this concept


can be viewed as the foundation for constructing this boundary. On one hand, this boundary is relatively stable, as both the threshold value and the conceptDescription property values are relatively stable; on the other hand, the maximum similarity values between the conceptDescription property values and the learnedConceptDescription property values are relatively dynamic (within the interval [threshold, 1] according to the algorithm introduced in Section IV). Therefore, the learnedConceptDescription property values of the concepts cannot be used in the algorithm-based string matching process and can only be used in the direct string matching process.
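The two-stage matching described above can be sketched as follows. This is a minimal illustration only: the class and function names are assumptions, and the similarity function is a trivial token-overlap stand-in for the actual algorithm of Section IV.

```python
# Hedged sketch of the direct / algorithm-based matching and ontology
# learning loop; not the authors' implementation.
from dataclasses import dataclass, field

THRESHOLD = 0.6  # the threshold value discussed in Section V


@dataclass
class Concept:
    concept_descriptions: list                      # conceptDescription values
    learned_descriptions: list = field(default_factory=list)  # learnedConceptDescription


def algorithm_similarity(service_desc: str, concept_desc: str) -> float:
    """Placeholder for the concept-metadata semantic similarity algorithm
    of Section IV; here a trivial token-overlap proxy."""
    a, b = set(service_desc.split()), set(concept_desc.split())
    return len(a & b) / max(len(a), len(b)) if a and b else 0.0


def match(service_desc: str, concept: Concept) -> bool:
    # 1) Direct string matching against both original and learned descriptions.
    if service_desc in concept.concept_descriptions + concept.learned_descriptions:
        return True
    # 2) Algorithm-based matching against conceptDescription values only.
    best = max((algorithm_similarity(service_desc, cd)
                for cd in concept.concept_descriptions), default=0.0)
    if best > THRESHOLD:
        # Ontology learning: the service description becomes a new
        # learnedConceptDescription value of the concept.
        concept.learned_descriptions.append(service_desc)
        return True
    return False
```

Once a description has been learned, subsequent identical descriptions are caught by the cheap direct match, which is the source of the speed-up reported in Section V.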

IV. CONCEPT-METADATA SEMANTIC SIMILARITY ALGORITHM

In this section, we introduce a novel concept-metadata semantic similarity algorithm to judge the semantic relatedness between concepts and metadata in the algorithm-based string matching process (Fig. 1). The major goal of this algorithm is to measure the semantic similarity between a concept description and a service description. This algorithm follows a hybrid pattern by aggregating a semantic-based string matching (SeSM) algorithm and a statistics-based string matching (StSM) algorithm. In the rest of this section, we will describe these two algorithms in detail.

A. Semantic-Based String Matching Algorithm

The key idea of the SeSM algorithm is to measure the text similarity between a concept description and a service description, by means of WordNet⁹ and a semantic similarity model. As the concept description and the service description can be regarded as two groups of terms after the preprocessing and term processing phase, first of all, we need to examine the semantic similarity between any two terms from these two groups. Here we make use of Resnik [22]'s information-theoretic model and WordNet to achieve this goal. Since terms (or concepts) in WordNet are organized in a hierarchical structure, in which concepts have the relationships of hypernym/hyponym, it is possible to assess the similarity between two concepts by comparing their relative positions in WordNet. Resnik's model can be expressed as follows:

$$sim_R(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[-\log p(c)\right] \qquad (2)$$

where $c_1$ and $c_2$ are two concepts in WordNet, $S(c_1, c_2)$ is the set of concepts that subsume both $c_1$ and $c_2$, and $p(c)$ is the probability of encountering a sub-concept of $c$. Hence,

$$p(c) = \frac{n(c)}{N} \qquad (3)$$

where $n(c)$ is the number of concepts subsumed by $c$ and $N$ is the total number of concepts in WordNet. It needs to be noted that a concept sometimes consists of more than one term in WordNet, so concepts sometimes do not equate to terms. Since the result of Resnik's model is within the interval $[0, \log N]$, we

9 http://wordnet.princeton.edu/

Fig. 3. Graphical representation of the assignment in the bipartite graph problem.

make use of a model introduced by Dong et al. [23] to normalize the result into the interval $[0, 1]$, which can be expressed as follows:

$$sim(t_1, t_2) = \begin{cases} 1, & \text{if } t_1 = t_2 \text{ or } s(t_1) \cap s(t_2) \neq \emptyset \\ \dfrac{sim_R(t_1, t_2)}{\log N}, & \text{otherwise} \end{cases} \qquad (4)$$

where $s(t)$ is the synset of $t$.

In the above step, we calculate the similarity values between any two terms in a concept description and a service description. We now make use of Plebani et al. [24]'s bipartite graph model to optimally assign the matching between the terms from each group. Given a graph $G = (V, E)$, where $V$ is a group of vertices and $E$ is a group of edges linking the vertices, a matching $M$ in $G$ is defined as $M \subseteq E$ such that no two edges in $M$ share a common end vertex. An assignment in $G$ is a matching $M$ such that each vertex in $G$ has an incident edge in $M$. Let us suppose that the set of vertices $V$ is partitioned into two sets, $V_1$ (namely the terms in the service description) and $V_2$ (namely the terms in the concept description), and each edge in this graph has an associated weight $w_{ij}$ within the interval $[0, 1]$ given by (4). A function $f(V_1, V_2)$ returns the maximum weighted assignment, i.e., an assignment such that the average weight of the edges is maximum. Fig. 3 shows the graphical representation of the assignment in the bipartite graph problem. The assignment in bipartite graphs can be expressed in a linear programming model, which is

$$f(V_1, V_2) = \max \frac{\sum_{i=1}^{|V_1|} \sum_{j=1}^{|V_2|} w_{ij} x_{ij}}{\max(|V_1|, |V_2|)}, \quad \text{s.t. } \sum_{j} x_{ij} \leq 1, \; \sum_{i} x_{ij} \leq 1, \; x_{ij} \in \{0, 1\} \qquad (5)$$
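The SeSM pipeline can be illustrated under a toy is-a taxonomy standing in for WordNet: Resnik similarity as in (2) and (3), a [0, 1] normalization in the spirit of (4) (simplified here to an exact-match test instead of synset overlap), and a brute-force optimal assignment in the sense of (5). All names and the toy hierarchy are assumptions for illustration, not the authors' implementation.

```python
# Toy-taxonomy sketch of the SeSM algorithm: Resnik similarity plus
# optimal bipartite assignment of term-to-term scores.
import math
from itertools import permutations

# child -> parent links of a tiny is-a hierarchy (stand-in for WordNet)
PARENT = {"drilling": "mining", "blasting": "mining",
          "mining": "industry", "catering": "service",
          "service": "industry", "industry": None}


def ancestors(c):
    """All concepts that subsume c (excluding c itself)."""
    out = set()
    while PARENT.get(c) is not None:
        c = PARENT[c]
        out.add(c)
    return out


def subsumed(c):
    """Concepts subsumed by c, including c itself."""
    return {x for x in PARENT if c in ancestors(x) | {x}}


N = len(PARENT)


def p(c):                       # (3): p(c) = n(c) / N
    return len(subsumed(c)) / N


def sim_resnik(c1, c2):         # (2): information content of the best shared subsumer
    common = (ancestors(c1) | {c1}) & (ancestors(c2) | {c2})
    return max((-math.log(p(c)) for c in common), default=0.0)


def sim(t1, t2):                # (4)-style normalization into [0, 1]
    if t1 == t2:
        return 1.0
    return sim_resnik(t1, t2) / math.log(N)


def assignment(terms1, terms2):  # (5): average weight of the best assignment
    if len(terms1) < len(terms2):
        terms1, terms2 = terms2, terms1
    best = max(sum(sim(a, b) for a, b in zip(perm, terms2))
               for perm in permutations(terms1, len(terms2)))
    return best / max(len(terms1), len(terms2))
```

The brute-force `permutations` search is exponential and only suitable for the short term lists of service and concept descriptions; a Hungarian-algorithm solver would be used at scale.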


Fig. 4. Graphical representation of the probabilistic model.

B. Statistics-Based String Matching Algorithm

The StSM algorithm is a complementary solution for the SeSM algorithm, in case the latter does not work effectively in some circumstances. For example, for a service description "old mine workings consolidation contractor" and a concept description "mining contractor", the similarity value produced by the SeSM algorithm is relatively lower than the actual extent of their semantic relevance. In this circumstance, we need to find an alternative way to measure their similarity. Here we make use of a statistics-based model [25] to achieve this goal. In the crawling process and the subsequent processes indicated in Fig. 1, the SASF crawler downloads Web pages at the beginning, and automatically obtains the statistical data from the Web pages, in order to compute the semantic relevance between a service description ($sd$) and a concept description ($cd$) of a concept ($c$). The StSM algorithm follows an unsupervised training paradigm aimed at finding the maximum probability that $sd$ and $cd$ co-occur in the Web pages. A graphical representation of the StSM algorithm is shown in Fig. 4. The StSM algorithm is shown as follows:

$$sim_{StSM}(sd, c) = \max_{i} \frac{n(sd \wedge cd_i)}{n(cd_i)} \qquad (6)$$

where $cd_i$ is a concept description of $c$, $n(sd \wedge cd_i)$ is the number of downloaded Web pages that contain both $sd$ and $cd_i$, and $n(cd_i)$ is the number of downloaded Web pages that contain $cd_i$.
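A minimal sketch of the co-occurrence count in (6) over a corpus of downloaded page texts follows; the function names and the list-of-strings corpus representation are assumptions, not the authors' implementation.

```python
# Hedged sketch of the statistics-based (StSM) co-occurrence measure.
def stsm(service_desc, concept_descs, pages):
    """Maximum conditional co-occurrence score of a service description
    against each concept description, over downloaded page texts."""
    def n(*phrases):
        # number of pages containing every given phrase
        return sum(all(ph in page for ph in phrases) for page in pages)
    scores = [n(service_desc, cd) / n(cd) for cd in concept_descs if n(cd)]
    return max(scores, default=0.0)
```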

C. Hybrid Algorithm

On top of the SeSM and StSM algorithms, a hybrid algorithm is required to seek the maximum of the similarity values produced by the two algorithms:

$$sim(sd, c) = \max\left(sim_{SeSM}(sd, c), \; sim_{StSM}(sd, c)\right) \qquad (7)$$
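The hybrid rule simply keeps the larger of the two scores; a trivial sketch (the function name is an assumption):

```python
def hybrid(sesm_score: float, stsm_score: float) -> float:
    """sim(sd, c) = max(sim_SeSM, sim_StSM), in the spirit of (7)."""
    return max(sesm_score, stsm_score)
```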

V. SYSTEM IMPLEMENTATION AND EVALUATION

In this section, in order to systematically evaluate the proposed SASF crawler, we implement a prototype of this crawler, and compare the performance of the crawler with the existing work reviewed in Section II, based on several performance indicators adopted from the information retrieval (IR) field.

A. Prototype Implementation

We implement the prototype of this SASF crawler in Java within the platform of Eclipse 3.7.1¹⁰. This prototype is an extension of our previous versions presented in [11], [12]. The Mining Service Ontology Base and Mining Service Metadata Base are built in OWL-DL within the platform of Protégé 3.4.7¹¹. The mining service ontology consists of 158 concepts, knowledge of which is mostly referenced from Wikipedia¹², the Australian Bureau of Statistics¹³, and the websites of nearly 200 Australian and international mining service companies.

B. Performance Indicators

We define the parameters for a comparison between our crawler and the existing ontology-learning-based focused crawlers. All the indicators are adopted from the field of IR and need to be redefined in order to be applied in the scenario of ontology-based focused crawling.

Harvest rate is used to measure the harvesting ability of a crawler. The harvest rate $HR(n)$ for a crawler after crawling $n$ Web pages is determined as follows:

$$HR(n) = \frac{m_a(n)}{m_g(n)} \qquad (8)$$

where $m_a(n)$ is the number of associated metadata from the $n$ Web pages, and $m_g(n)$ is the number of generated metadata from the $n$ Web pages.

Precision is used to measure the preciseness of a crawler. The precision $P(c, n)$ for a concept $c$ after crawling $n$ Web pages is determined as follows:

$$P(c, n) = \frac{|A(c, n) \cap R(c, n)|}{|A(c, n)|} \qquad (9)$$

where $A(c, n)$ is the set of associated metadata from the $n$ Web pages for $c$, $|A(c, n)|$ is the number of associated metadata from the $n$ Web pages for $c$, and $R(c, n)$ is the set of relevant metadata from the $n$ Web pages for $c$. It needs to be noted that the set of relevant metadata needs to be manually identified by operators before the evaluation.

Recall is used to measure the effectiveness of a crawler. The recall $Rec(c, n)$ for a concept $c$ after crawling $n$ Web pages is determined as follows:

$$Rec(c, n) = \frac{|A(c, n) \cap R(c, n)|}{|R(c, n)|} \qquad (10)$$

10 http://www.eclipse.org/
11 http://protege.stanford.edu/
12 http://en.wikipedia.org/
13 http://www.abs.gov.au/


where $|R(c, n)|$ is the number of the relevant metadata from the $n$ Web pages for $c$.

Harmonic mean is a measure of the aggregated performance of a crawler. The harmonic mean $F(c, n)$ for a concept $c$ after crawling $n$ Web pages is determined as follows:

$$F(c, n) = \frac{2 \cdot P(c, n) \cdot Rec(c, n)}{P(c, n) + Rec(c, n)} \qquad (11)$$

Fallout is used to measure the inaccuracy of a crawler. The fallout $FA(c, n)$ for a concept $c$ after crawling $n$ Web pages is determined as follows:

$$FA(c, n) = \frac{|A(c, n) \cap NR(c, n)|}{|NR(c, n)|} \qquad (12)$$

where $NR(c, n)$ is the set of non-relevant metadata from the $n$ Web pages for $c$, and $|NR(c, n)|$ is the number of non-relevant metadata from the $n$ Web pages for $c$. It needs to be noted that the set of non-relevant metadata needs to be manually identified by operators before the evaluation.

Crawling time is used to measure the efficiency of a crawler. The crawling time of the SASF crawler for a Web page is defined as the time interval from processing the Web page from the Crawling process to the Metadata Generation and Association process or to the Filtering process, as shown in Fig. 1.
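The set-based indicators above can be computed as follows; a worked sketch with illustrative names (note that the harvest rate is crawler-level, while the other indicators are per-concept):

```python
# Hedged sketch of indicators (8)-(12) over metadata sets.
def indicators(associated, relevant, non_relevant, generated_count):
    A, R, NR = set(associated), set(relevant), set(non_relevant)
    harvest = len(A) / generated_count                        # (8)
    precision = len(A & R) / len(A) if A else 0.0             # (9)
    recall = len(A & R) / len(R) if R else 0.0                # (10)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)                      # (11)
    fallout = len(A & NR) / len(NR) if NR else 0.0            # (12)
    return harvest, precision, recall, f, fallout
```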

C. System Evaluation

In this section, we will evaluate our SASF crawler by comparing its performance with that of the existing ontology-learning-based focused crawlers of Zheng et al. and Su et al., introduced in Section II.

1) Testing Data Source: As mentioned in Section II, one common defect of the existing ontology-learning-based focused crawlers is that these crawlers are not able to work in an uncontrolled Web environment with unpredicted new terms, due to the limitations of the adopted ontology learning approaches. Hence, our proposed SASF crawler aims to remedy this defect by combining a real-time SeSM algorithm and an unsupervised StSM algorithm. In order to evaluate our model and the existing models in the uncontrolled Web environment, we choose two mainstream business advertising websites – Australian Kompass¹⁴ (abbreviated as Kompass below) and Australian Yellowpages®¹⁵ (abbreviated as Yellowpages® below), as the test data source. There are around 800 downloadable mining-related service or product advertisements registered in Kompass, and around 3200 similar advertisements registered in Yellowpages®. All of them are published in English.

2) Test Environment and Targets: Owing to the primary objective of the ontology-learning-based focused crawlers, that is, to precisely and efficiently download and index relevant Web information, we subsequently employ the ontology learning and Web page classification models used in each of the focused crawlers, namely the ANN model used in Zheng et al.'s crawler, the probabilistic model used in Su et al.'s crawler, and our self-adaptive model, together in our SASF crawler framework, for

14 http://au.kompass.com/
15 http://www.yellowpages.com.au/

Fig. 5. Comparison of the ontology-learning-based focused crawling models on harvest rate.

the task of metadata harvesting and indexing (or classification) (i.e., the Metadata Generation and Ontology Learning process in Fig. 1), and compare their performance in this process. It is recognized that all of these models need a training process before harvesting and classification, since the ANN model follows a supervised training paradigm, and the other two models follow an unsupervised training paradigm. Therefore, we use the Kompass website as the training data source, and label the Web pages from this website for the ANN training. Following that, we test and compare the performance of these models by using the unlabelled data source from Yellowpages®, with the purpose of evaluating their capability in an uncontrolled environment. In addition, by means of a series of experiments, we find that 0.6 is the optimal threshold value for the self-adaptive model to determine the relatedness between a pair of service description and concept description.

3) Test Results: We compare the performance of the ANN model, the probabilistic model, and the self-adaptive model in terms of the six parameters defined in Section V-B. It needs to be noted that we evaluate the performance of the ANN model based only on the parameters of harvest rate and crawling time, since the major purpose of the ANN model is to enhance the harvest rate, and it does not contain the function of classification. In the rest of this section, we will present and discuss the test results.

4) Harvest Rate: The graphic representation of the comparison of the probabilistic, self-adaptive and ANN models on harvest rate, along with the increasing number of visited Web pages, is shown in Fig. 5. It needs to be noted that the harvest rate concerns only the crawling ability, not the accuracy, of a crawler. A high proportion (around 40%) of Web pages are peer-reviewed as non-mining-service-related Web pages in the unlabeled data source, which is part of the reason that the overall harvest rates of these three models are all below 60%. It can be seen that the self-adaptive model has the optimal performance (more than 50%), compared to the ANN model (around 16%) and the probabilistic model (between 8% and 9%). This proves that the self-adaptive model has a positive impact on improving the crawling ability of the semantic focused crawler, as more service descriptions extracted from the Web pages are matched to the learned concept descriptions.

5) Precision: The graphic representation of the comparison

of the probabilistic and self-adaptive models on precision, along with the increasing number of visited Web pages, is shown in Fig. 6. It can be observed that the overall precision of the self-adaptive model is 32.50%, and the overall precision of the probabilistic model is 13.46%, which is less than half of that of the


Fig. 6. Comparison of the ontology-learning-based focused crawling models on precision.

Fig. 7. Comparison of the ontology-learning-based focused crawling models on recall.

former. This is because the self-adaptive model is able to filter out more non-relevant mining service Web pages based on its vocabulary-based ontology learning function. This proves that the self-adaptive model significantly enhances the preciseness of semantic focused crawling.

6) Recall: The graphic representation of the comparison of the probabilistic and self-adaptive models on recall, along with the increasing number of visited Web pages, is shown in Fig. 7. It can be seen that the overall recall for the self-adaptive model is 65.86%, compared to only 9.62% for the probabilistic model. This is because the self-adaptive model is able to generate more relevant mining service metadata based on its vocabulary-based ontology learning function, and thus improves the effectiveness of the semantic focused crawler.

7) Harmonic Mean: The graphic representation of the

comparison of the probabilistic and self-adaptive models on harmonic mean, along with the increasing number of visited Web pages, is shown in Fig. 8. As an aggregated parameter, the overall harmonic mean values for both of these models are below 50% (11.22% for the probabilistic model, and 43.51% for the self-adaptive model), due to their low performance on precision. Since the self-adaptive model outperforms the probabilistic model on both precision and recall, it is not surprising that the overall harmonic mean of the former is nearly four times as high as that of the latter.

8) Fallout Rate: The graphic representation of the comparison of the probabilistic and self-adaptive models on fallout rate, along with the increasing number of visited Web pages, is shown in Fig. 9. It can be seen that the overall fallout rate of the self-adaptive model is 0.46%, and for the probabilistic model it is around 0.49%. This indicates that the former generates fewer false results than the latter, which proves the low inaccuracy of the self-adaptive model.

9) Crawling Time: The graphic representation of the comparison of the probabilistic, self-adaptive and ANN models

Fig. 8. Comparison of the ontology-learning-based focused crawling models on harmonic mean.

Fig. 9. Comparison of the ontology-learning-based focused crawling models on fallout rate.

on crawling time, along with the increasing number of visited Web pages, is shown in Fig. 10. It can be seen that the ANN model uses less time than the others, since the ANN model does not need time for metadata classification. For a comparison between the probabilistic model and the self-adaptive model, for the first 200 pages, the former runs faster than the latter. However, after 200 pages, the latter gradually gathers speed, in contrast to the relatively fixed speed of the former. Eventually, from 1200 pages onwards, the total time of the latter is gradually less than that of the former. This gap becomes larger and larger as the number of visited Web pages increases. This is because this crawler is able to use the self-adaptive metadata association and ontology learning process in the self-adaptive model to learn new concept descriptions at the beginning. Afterwards, along with the enrichment of learned concept descriptions, it needs less and less time to run the algorithm-based string matching and ontology learning processes. Instead, it runs the direct string matching process only, in order to implement the concept-metadata matching, which greatly saves crawling time. Overall, the average crawling speed for the ANN model is 77.3 ms/page, for the probabilistic model it is 91.9 ms/page, and for the self-adaptive model it is 81.5 ms/page. This test result proves the efficiency of the self-adaptive model for crawling large numbers of Web pages.

10) Summary: From the above comparison of the three models, it can be seen that the self-adaptive model has the strongest performance on nearly all of the six indicators, except for a slightly slower crawling time compared with that of the ANN model. The self-adaptive model outperforms its competitors several times over on the parameters of harvest rate, precision, recall, and harmonic mean, which shows the significant technical advantage of this model. In addition, in contrast to its competitors, the self-adaptive model incurs a relatively lower computing cost while retaining strong performance. Last


Fig. 10. Comparison of the ontology-learning-based focused crawling models on crawling time.

but not least, from Figs. 5–10, it can be observed that all of the performance curves of the self-adaptive model are relatively smooth, regardless of the variety in the number of visited Web pages, thereby partly fulfilling our goals. Therefore, according to these test results, the proposed framework of the SASF crawler is shown to be feasible.

VI. CONCLUSION AND FUTURE WORK

In this paper, we presented an innovative ontology-learning-based focused crawler – the SASF crawler – for service information discovery in the mining service industry, taking into account the heterogeneous, ubiquitous and ambiguous nature of mining service information available over the Internet. This approach involves an innovative unsupervised framework for vocabulary-based ontology learning, and a novel concept-metadata matching algorithm, which combines a semantic-similarity-based SeSM algorithm and a probability-based StSM algorithm for associating semantically relevant mining service concepts and mining service metadata. This approach enables the crawler to work in an uncontrolled environment in which numerous new terms occur while the ontologies used by the crawler have a limited range of vocabulary. Subsequently, we conducted a series of experiments to empirically evaluate the performance of the SASF crawler, by comparing the performance of this approach with that of the existing approaches, based on the six parameters adopted from the IR field.

We describe a limitation of this approach and our future work

as follows: in the evaluation phase, it can be clearly seen that the performance of the self-adaptive model did not completely meet our expectations regarding the parameters of precision and recall. We deduce two reasons for this issue, as follows.

Firstly, in this research, we tried to find a universal threshold value for the concept-metadata semantic similarity algorithm in order to set up a boundary for determining concept-metadata relatedness. However, in order to achieve optimal performance, each concept should have its own particular boundary, namely a particular threshold value, for the judgment of relatedness. Consequently, in future research, we intend to design a semi-supervised approach by aggregating the unsupervised approach and the supervised ontology-learning-based approach, with the purpose of automatically choosing the optimal threshold value for each concept, while keeping the optimal performance without being constrained by the limitations of the training data set.

Secondly, the relevant service descriptions for each concept are manually determined through a peer-reviewed process; i.e., many relevant service descriptions and concept descriptions are determined on the basis of common sense, which cannot be judged by string similarity or term co-occurrence. Hence, in our future research, it is necessary to enrich the vocabulary of the mining service ontology by surveying those unmatched but relevant service descriptions, in order to further improve the performance of the SASF crawler.

REFERENCES

[1] H. Wang, M. K. O. Lee, and C. Wang, "Consumer privacy concerns about Internet marketing," Commun. ACM, vol. 41, pp. 63–70, 1998.

[2] R. C. Judd, "The case for redefining services," J. Marketing, vol. 28, pp. 58–59, 1964.

[3] T. P. Hill, "On goods and services," Rev. Income Wealth, vol. 23, pp. 315–338, 1977.

[4] C. H. Lovelock, "Classifying services to gain strategic marketing insights," J. Marketing, vol. 47, pp. 9–20, 1983.

[5] H. Dong, F. K. Hussain, and E. Chang, "A service search engine for the industrial digital ecosystems," IEEE Trans. Ind. Electron., vol. 58, no. 6, pp. 2183–2196, Jun. 2011.

[6] Mining Services in the US: Market Research Report, IBISWorld, 2011.

[7] B. Fabian, T. Ermakova, and C. Muller, "SHARDIS – A privacy-enhanced discovery service for RFID-based product information," IEEE Trans. Ind. Informat., to be published.

[8] H. L. Goh, K. K. Tan, S. Huang, and C. W. d. Silva, "Development of Bluewave: A wireless protocol for industrial automation," IEEE Trans. Ind. Informat., vol. 2, no. 4, pp. 221–230, Nov. 2006.

[9] M. Ruta, F. Scioscia, E. D. Sciascio, and G. Loseto, "Semantic-based enhancement of ISO/IEC 14543-3 EIB/KNX standard for building automation," IEEE Trans. Ind. Informat., vol. 7, no. 4, pp. 731–739, Nov. 2011.

[10] I. M. Delamer and J. L. M. Lastra, "Service-oriented architecture for distributed publish/subscribe middleware in electronics production," IEEE Trans. Ind. Informat., vol. 2, no. 4, pp. 281–294, Nov. 2006.

[11] H. Dong and F. K. Hussain, "Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems," IEEE Trans. Ind. Electron., vol. 58, no. 6, pp. 2106–2116, Jun. 2011.

[12] H. Dong, F. K. Hussain, and E. Chang, "A framework for discovering and classifying ubiquitous services in digital health ecosystems," J. Comput. Syst. Sci., vol. 77, pp. 687–704, 2011.

[13] J. L. M. Lastra and M. Delamer, "Semantic web services in factory automation: Fundamental insights and research roadmap," IEEE Trans. Ind. Informat., vol. 2, no. 1, pp. 1–11, Feb. 2006.

[14] S. Runde and A. Fay, "Software support for building automation requirements engineering—An application of semantic web technologies in automation," IEEE Trans. Ind. Informat., vol. 7, no. 4, pp. 723–730, Nov. 2011.

[15] M. Ruta, F. Scioscia, E. Di Sciascio, and G. Loseto, "Semantic-based enhancement of ISO/IEC 14543-3 EIB/KNX standard for building automation," IEEE Trans. Ind. Informat., vol. 7, no. 4, pp. 731–739, Nov. 2011.

[16] H. Dong, F. Hussain, and E. Chang, "State of the art in semantic focused crawlers," in Proc. ICCSA 2009, O. Gervasi, D. Taniar, B. Murgante, A. Lagana, Y. Mun, and M. Gavrilova, Eds., Berlin, Germany, 2009, vol. 5593, pp. 910–924.

[17] T. R. Gruber, "A translation approach to portable ontology specifications," Knowledge Acquisition, vol. 5, pp. 199–220, 1993.

[18] W. Wong, W. Liu, and M. Bennamoun, "Ontology learning from text: A look back and into the future," ACM Comput. Surveys, vol. 44, pp. 20:1–36, 2012.

[19] H.-T. Zheng, B.-Y. Kang, and H.-G. Kim, "An ontology-based approach to learnable focused crawling," Inf. Sciences, vol. 178, pp. 4512–4522, 2008.

[20] C. Su, Y. Gao, J. Yang, and B. Luo, "An efficient adaptive focused crawler based on ontology learning," in Proc. 5th Int. Conf. Hybrid Intell. Syst. (HIS '05), Rio de Janeiro, Brazil, 2005, pp. 73–78.

[21] J. Rennie and A. McCallum, "Using reinforcement learning to spider the Web efficiently," in Proc. 16th Int. Conf. Mach. Learning (ICML '99), Bled, Slovenia, 1999, pp. 335–343.


[22] P. Resnik, "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," J. Artif. Intell. Res., vol. 11, pp. 95–130, 1999.

[23] H. Dong, F. K. Hussain, and E. Chang, "A context-aware semantic similarity model for ontology environments," Concurrency Comput.: Practice Exp., vol. 23, pp. 505–524, 2011.

[24] P. Plebani and B. Pernici, "URBE: Web service retrieval based on similarity evaluation," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1629–1642, Nov. 2009.

[25] H. Dong, F. K. Hussain, and E. Chang, "Ontology-learning-based focused crawling for online service advertising information discovery and classification," in Proc. 10th Int. Conf. Service Oriented Comput. (ICSOC 2012), Shanghai, China, 2012, pp. 591–598.

Hai Dong (S'08–M'11) received the B.S. degree in information management from Northeastern University, Shenyang, China, in 2003, and the M.S. and Ph.D. degrees in information technology from Curtin University of Technology, Perth, Australia, in 2006 and 2010, respectively. He received the award of a Curtin Research Fellowship from Curtin University of Technology in 2012.

He is currently a research fellow in the School of Information Systems at Curtin University of Technology. Prior to this position, he was a research associate in the Digital Ecosystems and Business Intelligence Institute, Curtin University of Technology. His research interests include: service-oriented computing, semantic search, ontology, digital ecosystems, Web services, machine learning, project monitoring, information retrieval, and cloud computing.

Farookh Khadeer Hussain received the B.Tech. degree in computer science and computer engineering and the M.S. degree in information technology from La Trobe University, Melbourne, Australia, and the Ph.D. degree in information systems from Curtin University of Technology, Perth, Australia, in 2006.

He is currently a faculty member at the School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. His areas of active research are cloud computing, services computing, trust and reputation modeling, semantic web technologies and industrial informatics. He works actively in the domains of cloud computing and business intelligence. In the area of business intelligence, the focus of his research is to develop smart technological measures, such as trust and reputation technologies and the semantic web, for enhanced and accurate decision making.