
Page 1: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Automatic

Automatic Sales Lead Generation from Web Data

Ganesh Ramakrishnan, IBM India Research Lab, [email protected]

Sachindra Joshi, IBM India Research Lab, [email protected]

Sumit Negi, IBM India Research Lab, [email protected]

Raghu Krishnapuram, IBM India Research Lab, [email protected]

Sreeram Balakrishnan, IBM India Research Lab, [email protected]

Abstract

Speed to market is critical to companies that are driven by sales in a competitive market. The earlier a potential customer can be approached in the decision-making process of a purchase, the higher are the chances of converting that prospect into a customer. Traditional methods to identify sales leads, such as company surveys and direct marketing, are manual, expensive and not scalable.

Over the past decade the World Wide Web has grown into an information mesh, with most important facts being reported through Web sites. Several newspapers, press releases, trade journals, business magazines and other related sources are on-line. These sources could be used to identify prospective buyers automatically. In this paper, we present a system called ETAP (Electronic Trigger Alert Program) that extracts trigger events from Web data that help in identifying prospective buyers. Trigger events are events of corporate relevance, indicative of the propensity of companies to purchase new products associated with these events. Examples of trigger events are change in management, revenue growth and mergers & acquisitions. The unstructured nature of the information makes the extraction of trigger events difficult. We pose trigger event extraction as a classification problem and develop methods for learning trigger event classifiers using existing classification methods. We present methods to automatically generate the training data required to learn the classifiers. We also propose a method of feature abstraction that uses named entity recognition to address the problem of data sparsity. We score and rank the trigger events extracted by ETAP for easy browsing. Our experiments show the effectiveness of the method and thus establish the feasibility of automatic sales lead generation using Web data.

1 Introduction

Various marketing techniques have been used by companies to target and entice prospective buyers in order to increase their sales. The process of identification of potential buyers is also known as sales lead generation. Live seminars, trade shows, cold calling, mass mailing, advertising and partner referrals are some examples of sales lead generation. These methods can be classified into two types¹: 1) reactive methods, and 2) proactive methods. Reactive methods are those in which a customer contacts a company following an advertisement, such as one in the yellow pages. Subsequently, sales representatives handle the sale. On the other hand, proactive methods are those in which a possible customer is contacted cold by a company. Sales representatives then try to verify that the possible customer is actually a prospect. Identifying possible customers based on certain information about them is the most important part of proactive methods. Several data mining methods that use customer information for the identification of likely buyers can be found in the literature [9]. However, these methods cannot be used when relevant information about customers is not readily available. Automatic methods to gather relevant information about customers are required to help identify possible customers.

In this paper, we propose a proactive method of sales lead generation that gathers new or evolving information about customers from the Web. Our approach is inspired by the growth of information (such as company descriptions, life cycles, profits and revenue changes) on the Web, posted by several newspapers, press releases, trade journals, business magazines and other related sources. We present a system called ETAP (Electronic Trigger Alert Program) that uses this information to discover sales leads. The system is suited for the identification of prospective companies as opposed to the identification of individual buyers, making it suitable for B2B (business-to-business) scenarios. Analysis of sales leads is a complex process that requires skill, time and effort. The proposed system is aimed at gathering sales leads from the Web and presenting them to domain specialists for final validation.

¹http://www.consultancymarketing.co.uk/sales-lead.htm

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 0-7695-2570-9/06 $20.00 © 2006 IEEE

ETAP is based on a traditional method of sales lead generation. In the traditional method, companies identify a set of sales drivers for their products/services. A sales driver represents a class of events whose existence indicates a strong propensity to buy the products/services of companies associated with the events. Revenue growth, change in management and mergers & acquisitions are examples of sales drivers. The set of sales drivers may differ across industries. As an example, mergers & acquisitions could be a sales driver for the IT industry but may not be one for the steel industry. This is because mergers and acquisitions of companies could lead to the integration of the IT systems of the companies, thereby generating demand for new IT products. Typically, the drivers for a specific industry are determined by expert opinion. Suitable questionnaires are developed and surveys are conducted to gather information about the existence of trigger events related to the known sales drivers. Trigger events are events that occur in the context of a company (or its environment) and belong to a given sales driver. As an example, given mergers & acquisitions as a sales driver for the IT industry, an event such as “Company X plans to acquire Company Y later this year” is a trigger event. Similarly, given revenue growth as a sales driver, an event such as “Company X reported a revenue growth of 10% in the fourth quarter” is a trigger event. Identification of such a trigger event makes Company X a prospective buyer of IT products. The traditional method of gathering trigger events is to call representatives of different companies or conduct interviews with them. This method, being manual, tedious and expensive, is not scalable.

In our work, we assume that the set of sales drivers for the industry is known beforehand. Given a set of sales drivers for the industry, ETAP automatically identifies trigger events on the Web. The ETAP system crawls the Web to gather documents related to companies and financial news, extracts trigger events from the crawled documents, and then ranks the trigger events based on a scoring function for easy browsing. Thus, ETAP assists sales representatives in prioritizing the companies that need to be approached.

Event extraction is the most challenging part of the system, due to the unstructured nature of the data. Traditional information extraction [4, 5, 13] deals with extracting entities, relations and events from documents according to pre-defined templates, and can be used to identify events. In our case, identifying trigger events for the mergers & acquisitions sales driver would involve extracting entities that play the roles of “the acquired company” and “the acquiring company”. Most existing systems for information extraction employ linguistic patterns to fill the slots corresponding to entities and relations. As an example, a linguistic pattern “company acquired company” can be used to extract information about acquired and acquiring companies. Since several different patterns need to be created manually for a single information extraction task, this method involves a large amount of effort. Learning-based methods for information extraction have been proposed; however, they suffer from poor precision and recall².

We argue that for identifying trigger events, it suffices to distill from a large collection of documents those snippets (groups of sentences) that are related to a particular sales driver and present them in ranked order of relevance. In ETAP, events are associated with snippets of text. We refer to a snippet that contains information relevant to a sales driver as a trigger event. As an example, to find information pertaining to mergers & acquisitions, it is good enough to determine snippets consisting of one or more sentences that describe the event. In our proposed method, we formulate the problem of trigger event identification as a two-class classification problem. The two classes are: a positive class of data relevant to the sales driver and a negative class comprising a large random sample of data from the Web. Since traditional text classification based on a bag-of-words model suffers from the data sparsity problem [7], we address this drawback by using feature abstraction. We obtain feature abstraction by replacing certain words with their type information. The types we consider in this paper are restricted to named entity types and part-of-speech types.

Another issue with training a classifier is the availability of relevant training data. We solve this problem by querying the Web to generate ‘noisy’ training data, and then using a method similar to that of Brodley [3] to train the classifier on this data. Other classifiers that can learn in the presence of noise could also be used instead.

²http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/overview.html


The number of extracted trigger events could be large. Therefore, the ranking component of ETAP ranks the extracted trigger events based on a scoring function for easy browsing. Thus the ranking component of ETAP acts as a precursor to the validation of the generated sales leads by domain experts. The simplest scoring function is the posterior probability of the sales-driver class. The ranked list of trigger events can be used by sales representatives for further sales-related processes.
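The simplest scoring function described above, the posterior probability of the sales-driver class, reduces ranking to a sort over classifier scores. A minimal sketch (the snippets and scores below are made up for illustration):

```python
def rank_trigger_events(scored_snippets):
    # scored_snippets: (snippet, score) pairs, where the score is the
    # classifier's posterior probability for the sales-driver class.
    # Higher-confidence trigger events come first.
    return sorted(scored_snippets, key=lambda pair: pair[1], reverse=True)

events = [("Company X appoints a new CEO", 0.91),
          ("Weekly weather round-up", 0.12),
          ("Company Y to acquire Company Z", 0.87)]
ranked = rank_trigger_events(events)
```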

The organization of the rest of the paper is as follows. In Section 2, we present an overview of the ETAP system. In [2] we discuss the details of the data gathering component. In Section 3, we present the method that we use to extract trigger events and discuss feature abstraction and other issues involved in learning the classifiers. We present several scoring methods used in ETAP in Section 4. Finally, we present our conclusions and future work in Section 6.

2 System Description

ETAP is based on two main concepts, viz., sales drivers and trigger events. A sales driver represents a class of events whose existence indicates a high propensity to buy products/services by the companies associated with the events. Trigger events are events that occur in the context of a company (or its environment) and are indicative of a given sales driver. ETAP currently considers three sales drivers, viz., mergers & acquisitions, change in management, and revenue growth.

The ETAP system consists of three components, viz., the data gathering component, the event identification component and the ranking component. Figure 1 depicts the three components of ETAP and their interactions. Below, we briefly describe the functionality of each component.

1. The data gathering component [2] gathers a collection of documents D from various sources, such as proprietary databases and corpora, as well as from a focused crawl of the Web.

2. The event identification component splits each document in D into snippets and associates with each snippet a score of its relevance to the given sales drivers.

3. The ranking component orders snippets so that snippets with higher confidence values for being trigger events are ranked higher. This component also provides a facility for ranking companies based on all the trigger events associated with them.

Figure 1. Overview of the system

Details of components (2) and (3) are provided in the following sections.

3 Event identification component

3.1 Overview

The event identification component consists of three sub-components, viz., the snippet generator, the annotator and the classifiers. This is shown in Figure 2. Each document is first split into snippets (groups of sentences). The choice of operating at the snippet level was motivated by the observation that a snippet conveys a precise piece of information, in contrast with the entire document that contains it. We have built a sentence chunker based on rules for sentence boundary detection. The snippet generator uses the chunker and splits the documents into snippets, each of which is a group of n consecutive sentences. We have used n = 3 in our system.
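The snippet generation just described can be sketched as follows. The paper does not give the chunker's boundary-detection rules, so a naive period-based rule stands in for it here, and non-overlapping windows of n sentences are assumed:

```python
import re

def split_sentences(text):
    # Stand-in for ETAP's rule-based sentence boundary detector:
    # split after ., ! or ? when followed by whitespace and a capital.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]

def make_snippets(document, n=3):
    # A snippet is a group of n consecutive sentences (n = 3 in ETAP);
    # non-overlapping windows are assumed in this sketch.
    sentences = split_sentences(document)
    return [' '.join(sentences[i:i + n])
            for i in range(0, len(sentences), n)]

doc = ("Company X acquired Company Y. The deal closed in June. "
       "Analysts expect an IT systems integration. A new CEO was named. "
       "Revenue grew by 10%. Shares rose on the news.")
snippets = make_snippets(doc)
```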

A two-class classifier is employed to determine the relevance of a snippet to each sales driver. The two classes are (1) a positive class of snippets that pertain to the sales driver and (2) a large negative class of snippets randomly sampled from the Web. The snippets are annotated by a named entity annotator before being classified. The named entity annotator provides feature abstraction for the classification task. We present the motivation and other details behind feature abstraction in Section 3.2. Following named entity annotation, a snippet is passed through the two-class classifiers, each of which determines the probability that the given snippet pertains to the sales driver corresponding to that classifier.

Figure 2. Event Identification Component

In practice, it is difficult to obtain a sufficiently large amount of manually labeled data for training the classifiers. We get around this problem by smartly querying the Web to generate noisy positive data. Details of the procedure employed for obtaining the noisy positive data set are provided in Section 3.3.1. In Section 3.3.2, we discuss the technique that we adopt for learning the classifier in the presence of noise.

3.2 Named Entity Annotations and Feature Abstraction

3.2.1 Named Entity Annotation

Data sparsity poses a big problem for learning accurate text classifiers. Given the large amount of information in text data sets, it is desirable to find a compact representation of the data set, losing as little information as possible. This is traditionally handled by a pre-processing step in text classification called feature selection. After simple operations such as changing all text to lower case, stemming, and stop-word elimination, statistical measures are used to compute the amount of information that tokens (features) carry with respect to the label set. Standard measures are χ², information gain, and mutual information. Features are ranked by one of these measures and only the top few (an ad hoc tunable parameter in most experiments) features are retained.

Research in natural language processing technology allows us to take more of the semantics of the text into account. Named entity and part-of-speech tags have been found to be useful representations of textual content in some text classification applications, such as gender classification and authorship attribution. We propose the use of named entity and part-of-speech annotations as candidate representations of the text data set. We achieve two broad objectives through entity annotation:

1. Generalization: While training, it helps to associate trigger events with general concepts rather than specific instantiations of those concepts. For example, the fact that IBM made profits of $5 billion in the year 1996 could be generalized to learn that potentially any ORGANIZATION could make a profit of CURRENCY in some particular TIME PERIOD.

2. Reducing the number of model parameters: Machine-learning-based techniques, such as classification, suffer from over-fitting when there are too many model parameters. If no annotations are used, all entities tend to become individual features. There are millions of person names, company names, place names, dates, currency and time expressions across the Web. Learning algorithms can save a lot of memory and can also avoid over-fitting if these individual expressions are replaced with their annotations.

In ETAP, we apply two types of annotations to documents, viz., part-of-speech annotations and named entity annotations. Named entity annotation is a task in which person names, location names, company and organization names, monetary amounts, time and percentage expressions are recognized in a text document. Below, we present details of the method we employ in ETAP to determine the right level of abstraction using these annotations.

The named entity recognizer [11] employed in ETAP identifies and annotates entities falling under one of the following categories: (1) ORG (organization name), (2) DESIG (designation), (3) OBJ (object name), (4) TIM (time), (5) PERIOD (months, days, dates, etc.), (6) CURRENCY (currency measure), (7) YEAR (sole mention of a year), (8) PRCNT (percentage figure), (9) PROD (product name), (10) PLC (name of a place), (11) PRSN (person name), (12) LNGTH (all units of measurement other than currency), and (13) CNT (count figures). Any entity that did not fall into the above categories was assigned a part-of-speech category as determined by QTag³.
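The annotation step amounts to a token-level substitution. The sketch below illustrates it with a tiny hand-written lexicon, which is a hypothetical stand-in for the named entity recognizer of [11] (tag names follow the categories listed above):

```python
# Hypothetical mini-lexicon standing in for the NE recognizer of [11];
# tokens it does not list simply stay as lower-cased words here
# (the real system falls back to part-of-speech tags instead).
ENTITY_LEXICON = {
    'ibm': 'ORG',
    'daksh': 'ORG',
    'ceo': 'DESIG',
    '1996': 'YEAR',
    '10%': 'PRCNT',
}

def abstract_features(tokens):
    # Replace each recognized entity with its category tag, so that e.g.
    # every organization name collapses to the single feature ORG.
    return [ENTITY_LEXICON.get(tok.lower(), tok.lower()) for tok in tokens]

features = abstract_features('IBM made profits in 1996'.split())
```

This is what lets the classifier generalize from "IBM made profits in 1996" to "ORG made profits in YEAR".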

3.2.2 Choosing the right abstraction levels

Feature abstraction is the process of replacing words by their type information. In ETAP, we obtain one of two different types of information, namely part-of-speech type and named entity type, for each word. In this section, we present the method we employ to determine which type of information should be used for feature abstraction for a given sales driver.

³http://www.nlplab.cn/zhangle/morphix-nlp/manual/node17.html

Let Y be the class random variable that takes on values corresponding to the different classes. Let X be a random variable corresponding to a particular abstraction category. We consider two sets of abstraction categories: entity categories and part-of-speech categories.

For each abstraction category, we contrast the relative information gains for two random variable representations, viz., the presence-absence and instance-valued representations.

1. Presence-Absence (PA): In this representation, the abstraction category variable X takes the values 1 and 0, corresponding to its presence and absence, respectively, in a document.

2. Instance-valued (IV): In this representation, the abstraction category variable X takes values corresponding to all the instances of that category that occur in the labeled data set. For instance, the variable for the abstraction category “PLACE” will take the values “Washington”, “New Zealand”, etc., while that for the category “VB” (verb) will take values such as “acquired”, “announced”, etc.

Relative information gain (RIG) is defined as follows: given two random variables X and Y, and given that Y is to be transmitted, what fraction of bits would be saved if X were known at both the sender’s and the receiver’s ends. Relative information gain is denoted RIG(Y|X) and is defined in Equation 1:

RIG(Y|X) = (H(Y) − H(Y|X)) / H(Y)    (1)

where H(Y) is the entropy of the random variable Y and H(Y|X) is the conditional entropy of Y given X.

In our classification setting, RIG(Y|X) serves as a measure of the correlation of the feature random variable X with the class random variable Y. We compute the relative information gain for each abstraction random variable X, comparing its representations PA(X) and IV(X).
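Equation 1 can be estimated directly from co-occurrence counts. The sketch below computes RIG(Y|X) from a list of (x, y) observations; the toy data is illustrative:

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def rig(pairs):
    # pairs: (x, y) observations of the abstraction variable X and the
    # class variable Y.  RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y), Equation 1.
    n = len(pairs)
    h_y = entropy(list(Counter(y for _, y in pairs).values()))
    h_y_given_x = 0.0
    for x, n_x in Counter(x for x, _ in pairs).items():
        cond = Counter(y for xi, y in pairs if xi == x)
        h_y_given_x += (n_x / n) * entropy(list(cond.values()))
    return (h_y - h_y_given_x) / h_y

# X perfectly predicts Y here, so all of H(Y) is saved and RIG is 1.
perfect = rig([(1, 'pos'), (1, 'pos'), (0, 'neg'), (0, 'neg')])
# An uninformative X saves nothing, so RIG is 0.
useless = rig([(1, 'pos'), (1, 'neg'), (0, 'pos'), (0, 'neg')])
```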

Figure 3. Relative Information Gains for two alternative random variable representations of each abstraction category for the mergers & acquisitions sales driver.

Figures 3 and 4 compare the relative information gains for the PA and IV representations of several abstraction categories. Figure 3 is for the pure positive and negative classes of the mergers & acquisitions sales driver, and Figure 4 for those of the change in management sales driver. The generation of the pure positive and negative data sets is described in Section 3.3.1. Note that the Y-axis in each figure corresponds to the logarithm of the relative information gain. Also note that all named entity category names are capitalized, while the part-of-speech category names are expressed in lower case.

We summarize some of our observations based on Figures 3 and 4:

1. Verbs (vb), adverbs (rb), nouns (nn and np) and adjectives (jj) should not be abstracted at all, because the relative information gain for the corresponding IV representation is much higher than that for PA.

2. Entities (such as PLC and ORG) should be abstracted, because the relative information gain for the corresponding PA representation is higher than that for IV.

Based on these observations, we choose the PA representation for all entity category abstractions and the IV representation for part-of-speech abstraction categories such as vb, rb, nn and jj.

3.3 Snippet Classifier

3.3.1 Training data generation

Figure 4. Relative Information Gains for two alternative random variable representations of each abstraction category for the change in management sales driver.

Learning a classifier for snippets requires snippets labeled as ‘positive’ and ‘background’. The snippets for training must capture all variations that express trigger events pertaining to each sales driver. Thus, learning requires a large collection of labeled data. In ETAP, if labeled data for a sales driver is available, any of the existing classifiers, such as naïve Bayes [10] and SVM [7], could be used to identify trigger events. However, it is difficult to obtain a manually labeled document collection for most sales drivers. Moreover, one may want to introduce new categories of sales drivers quite frequently, and hand-labeling to produce training data for new categories can be very tedious. Therefore, following [1, 6], we devise a simple technique for generating training data consisting of snippets for classifier construction. Three types of data are involved in our classifier construction:

1. Pure Positive Data P^p: This is a set of manually labeled snippets that pertain to the given sales driver. Generally, it is difficult to find large amounts of pure positive data for a given sales driver.

2. Noisy Positive Data P^n: Noisy positive data contains some fraction of irrelevant data in addition to relevant data for the given sales driver. We use a two-step method to accumulate a large noisy positive data set.

Figure 5. Positive snippet in the result for query “new ceo”.

Figure 6. Noise in the result for query “new ceo”.

• Step 1: In this step, we fetch documents from the Web by querying a search engine using smart queries. These queries are manually formed such that the documents returned in response to them are relevant to the sales driver with high confidence. As an example, to generate the noisy positive data set for change in management, we use the query “new ceo” on a search engine to obtain a large number of highly ranked documents. Most of these documents are relevant to the change in management sales driver. Figure 5 shows the first hit that results from this query, along with interesting trigger events present on the page. However, not all sentences of a relevant document form trigger events. Figure 6 shows several sentences on the same page that are not valid trigger events. We therefore introduce a second-level snippet filtering heuristic, as explained below.

The smart queries for a sales driver could be obtained by analyzing the pure positive data set. However, in many cases when pure positive data for a sales driver is not available, it is difficult to formulate smart queries. In such cases, we find it feasible to extract instances of trigger events (pertaining to the sales driver) by querying the Web for a recent trigger event related to the sales driver. As an example, consider the mergers & acquisitions sales driver. Searching the Web with a search engine using the naïve query mergers and acquisitions results in many documents that do not contain instances of mergers and acquisitions. Instead, if one queries the Web with an example of a recent occurrence of a merger or acquisition, the ratio of relevant documents to total hits will be relatively high. For example, if one queries the Web with “IBM Daksh”, most of the documents that are returned are about the recent IBM acquisition of Daksh.

• Step 2: To distill positive snippets from the relevant documents obtained in the first step, we use simple filters to extract only those snippets that contain specific combinations of named entity tags or keywords. For instance, one of the combinations used as a snippet-level filter for the sales driver change in management was “Designation AND (Person OR Organization)”. For the sales driver revenue growth, one of the filters used was “Organization AND (Currency OR percent figure)”. The resultant distilled set of snippets forms the noisy positive data set for the corresponding sales driver.

3. Data for Negative Class N: We construct the negative class by randomly picking a large number of snippets from the Web. The same set of negative-class snippets can be used across different sales-driver categories. Note that the negative class may contain a small fraction of data that is positive for a sales driver; however, we assume this fraction is very small and ignore it. In ETAP, we use a collection of over 2 million randomly sampled snippets from the Web as the negative-class data.
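The snippet-level filters of Step 2 above amount to boolean tests over the set of entity tags found in a snippet. The sketch below encodes the two example filters from the text, using the tag names of Section 3.2.1:

```python
def passes_filter(tags, required, any_of):
    # Keep a snippet only if it carries every tag in `required`
    # and at least one tag from `any_of`.
    tags = set(tags)
    return required <= tags and bool(tags & any_of)

# "Designation AND (Person OR Organization)" for change in management.
def mgmt_filter(tags):
    return passes_filter(tags, {'DESIG'}, {'PRSN', 'ORG'})

# "Organization AND (Currency OR percent figure)" for revenue growth.
def revenue_filter(tags):
    return passes_filter(tags, {'ORG'}, {'CURRENCY', 'PRCNT'})

kept = mgmt_filter(['DESIG', 'PRSN', 'TIM'])   # True: DESIG plus PRSN
dropped = mgmt_filter(['PRSN', 'ORG'])         # False: no DESIG
```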

3.3.2 Choice of Classifier

The choice of the classifier depends on the availability of sufficient pure positive data for a sales driver. Traditional methods of classification such as naïve Bayes [10] and SVM [7] could be used to identify trigger events if a sufficient amount of pure positive data is available.

For the cases in which pure positive data is not available in sufficient amount, noisy positive data can be generated as described in Section 3.3.1. If a small amount of pure positive data is available, we use it after oversampling it by a factor of 3. To learn from the noisy positive data in conjunction with the pure positive and negative data, we construct a naïve Bayes classifier through an iterative method that aims at reducing the effect of the noise. At each iteration, the noise is reduced by reclassifying the noisy positive data set using the classifier trained in the previous iteration. The method is similar to that proposed in [3]. Given a noisy positive data set P^n, a pure positive data set P^p, a negative data set N and a classifier Cθ with parameters θ, the iterative method does the following:

1. Learns the parameters θ for the classifier using Pn, Pp and N. Pn and Pp form the positive class, while N forms the negative class for training the classifier.

2. Using the classifier trained at the present iteration, classifies the noisy positive data set Pn. For the next iteration, Pn is set to the collection of snippets that are assigned the positive class by the classifier. This reduces the noise in the noisy positive data in each iteration.

3. Iterates on the above two steps until the noisy positive data does not change considerably.
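The three steps above can be sketched as the following loop. The toy keyword-frequency classifier here merely stands in for the naïve Bayes model Cθ; it is an illustrative assumption, not ETAP's implementation.

```python
from collections import Counter

def train(pos, neg):
    # Toy stand-in for fitting C_theta: keep words that are relatively
    # more frequent in the positive snippets than in the negative ones.
    pc = Counter(w for s in pos for w in s.split())
    nc = Counter(w for s in neg for w in s.split())
    return {w for w in pc if pc[w] / len(pos) > nc[w] / max(len(neg), 1)}

def predict_positive(model, snippet):
    # A snippet is labeled positive if most of its words are in the model.
    words = snippet.split()
    return sum(w in model for w in words) > len(words) / 2

def iterative_train(pure_pos, noisy_pos, neg, max_iter=10):
    # Step 1: train on Pn + Pp (positive class) vs. N (negative class).
    model = train(pure_pos + noisy_pos, neg)
    for _ in range(max_iter):
        # Step 2: reclassify Pn, keeping only snippets still labeled positive.
        kept = [s for s in noisy_pos if predict_positive(model, s)]
        # Step 3: stop once the noisy positive set no longer changes.
        if kept == noisy_pos:
            break
        noisy_pos = kept
        model = train(pure_pos + noisy_pos, neg)
    return model
```

Each pass shrinks the noisy positive set toward snippets the current model agrees are positive, which is the noise-reduction effect described above.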

Alternatively, any one of the proposed methods for learning classifiers in the presence of noise [8, 12] can be used to train classifiers using the generated noisy positive data along with the background data and pure positive data, if available.

4 Snippet ranking component

Analysis of sales leads is a complex process that requires skill, time and effort. Many organizations gather competitive information. However, due to the associated costs, only a few formally analyze the information and integrate the results of their analysis into their business strategy. The ranking component of ETAP acts as a precursor to the analysis task. The analysis task still needs to be done by a domain specialist, since the issues to be analyzed are often quite complex and the overall reality of the situation may not be all that obvious from the snippet. This component assigns a rank to each trigger event to reduce the time and effort required in the analysis stage.

The output of a classifier of the event identification component is a list of snippets along with the


Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 8: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Automatic

Figure 7. Snapshot of ETAP output that contains trigger events along with their ranking based on classification scores for the change in management sales driver

associated scores. The score for a snippet stands for the confidence with which it was classified as a trigger event for the sales driver. The number of trigger events flagged by the event identification component could be large. Therefore, methods for ranking the snippets are required so that the more useful trigger events appear higher in the order. The usefulness of a trigger event can be based on its score given by the classifier(s). Figure 7 shows a snapshot of ETAP output that contains trigger events along with their ranking based on classification scores for the change in management sales driver.

ETAP also provides sales-driver-specific scoring functions. These functions capture the business value of the sales drivers. As an example, for the revenue growth sales driver, trigger events may be ordered based on the percentage change in the revenue for companies in a given year. This requires extraction of exact revenue growth figures from snippets. Determining exact numbers (either in % or in dollar figures) is not easy in most cases and may require some prior knowledge. For example, the revenue declared in the previous year could be required to determine the percentage change in the revenue of a company for the current year. In ETAP, we use a simpler approach of scoring snippets using the semantic orientation of the words in the snippet. Phrases that convey a stronger sense, e.g., 'sharp decline' and 'worst losses', are weighted more than other phrases, e.g., 'loss' and 'profit'. The semantic orientation of words in the text can be used to judge the overall orientation of the snippet by combining the scores of individual words and phrases to obtain an overall score for the snippet. We constructed a lexicon of positive and negative phrases and assigned weights to each phrase. Examples of positive phrases for the revenue growth sales driver are 'significant growth' and 'solid quarter'. 'Severe losses' and 'sharp decline' are examples of negative phrases for revenue growth.

Figure 8. Snapshot of ETAP output containing example trigger events along with their ranking based on semantic orientation scores for the revenue growth sales driver

Figure 8 shows a snapshot of ETAP output that contains example trigger events along with their ranking based on semantic orientation scores for the revenue growth sales driver. Currently this lexicon is constructed manually for each sales driver. Automated methods of generating lexicons using positive and negative seed terms, as described in [14], could also be used.
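A toy version of this lexicon-based scoring is sketched below. The phrases and weights are invented for illustration; the actual ETAP lexicon is hand-built per sales driver. Longer phrases are matched and blanked out first so that, e.g., 'severe losses' is not additionally counted via the substring 'loss'.

```python
# Illustrative lexicon for the revenue growth sales driver; weights invented.
LEXICON = {
    "significant growth": 2.0, "solid quarter": 1.5, "profit": 1.0,
    "loss": -1.0, "severe losses": -2.0, "sharp decline": -2.0,
}

def orientation_score(snippet):
    text = snippet.lower()
    score = 0.0
    # Match longer phrases first and consume them, so a strong phrase
    # is not double-counted through one of its substrings.
    for phrase in sorted(LEXICON, key=len, reverse=True):
        n = text.count(phrase)
        if n:
            score += n * LEXICON[phrase]
            text = text.replace(phrase, " ")
    return score

print(orientation_score("A solid quarter with significant growth"))    # 3.5
print(orientation_score("Severe losses and a sharp decline in sales")) # -4.0
```

Snippets are then ranked by this combined score, with strongly oriented snippets at the top.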

The ranking component can also aggregate the scores of all the trigger events that are associated with a company to give the company a score that reflects its overall propensity of buying new products/services. We do this using a variant of the mean reciprocal rank4

(MRR). Let sd_i be the i-th sales driver and let |SD| be the total number of sales drivers. Let te_j(c, sd_i) be the j-th trigger event pertaining to company c and sales driver sd_i, and let rank(te_j(c, sd_i)) be the rank assigned to the corresponding snippet by the ranking component. Let |TE(c, sd_i)| be the total number of trigger events that are associated with company c and sales driver sd_i. We define the MRR(c) score for company c as in (2).

MRR(c) = ( Σ_i Σ_j 1/rank(te_j(c, sd_i)) ) / ( Σ_i |TE(c, sd_i)| )    (2)

MRR(c) assigns a score to each company c by taking into consideration all trigger events across all sales drivers that pertain to company c. In this way, the ranking component also provides the facility for ranking companies based on all the trigger events that are associated with each company.
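Equation (2) is straightforward to compute once each company's snippet ranks are grouped by sales driver. The dict-of-lists layout below is an assumed representation for illustration, not ETAP's data model.

```python
def mrr(ranks_by_driver):
    """Company-level MRR per equation (2): the sum of reciprocal ranks of
    the company's trigger-event snippets across all sales drivers,
    normalized by the company's total number of trigger events."""
    total_events = sum(len(r) for r in ranks_by_driver.values())
    if total_events == 0:
        return 0.0
    recip = sum(1.0 / rank for r in ranks_by_driver.values() for rank in r)
    return recip / total_events

# Company with two revenue-growth events ranked 1 and 4, and one
# change-in-management event ranked 2:
score = mrr({"revenue growth": [1, 4], "change in management": [2]})
print(round(score, 3))  # (1 + 0.25 + 0.5) / 3 = 0.583
```

Companies can then be sorted by this score to surface those with the strongest overall buying signals.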

4http://trec.nist.gov/data/qa.html


5 Experiments & Results

5.1 Data Preparation

We discuss the performance of the ETAP system through the F1 measure on two sales drivers, viz., mergers & acquisitions and change in management. The F1 measure is computed as the harmonic mean of the precision and recall measures. We used the procedure described in Section 3.3.1 for generating the three sets of data required for training, namely, pure positive, noisy positive and negative data. A portion of the pure positive data was used in the classifier training phase, while the remaining portion was used for the purpose of evaluation. Five queries were used for generation of the noisy positive training data for each sales driver. Some of the example queries used for the mergers & acquisitions driver were "IBM Daksh", "Coors Molson" and "Jobsahead Monster". "New CEO", "new CTO", "new Manager" and "new President" were example queries used for the change in management driver. We gathered the top 200 documents returned by the search engine Google5 for each query. We split the documents into snippets and annotated them using the snippet generator and the annotator, respectively. We then applied a set of simple filters based on query terms and named entity annotations to the annotated snippets to distill out the noisy positive data for each driver. An example filter for the change in management driver was "Discard all snippets not containing a (PRSN AND ORG) or (DESIG AND ORG) annotation". Similarly, an example filter for the mergers & acquisitions driver was "Discard all snippets not containing two ORG annotations". See Section 3.2.1 for a description of these annotations. The noisy positive data comprised approximately 3500 snippets for each sales driver.

The pure positive data for each category was manually gathered from news Web sites. We held out a fraction of the pure positive data for evaluation. We created a common test data set for the classifiers corresponding to both sales drivers. Our test data contained 72 instances of true positives for the mergers & acquisitions driver, 56 instances of true positives for the change in management driver, and 2265 snippets that did not belong to either of the two sales drivers.

5.2 Evaluation

In Table 1, we present results obtained after the classification step (cf. Section 3.3) for the two sales drivers under consideration. The classification was carried out

5http://www.google.com

following the feature abstraction on the documents performed using the named entity annotator (cf. Section 3.2). We used Weka's6 naïve Bayes classifier.

Sales driver              Precision   Recall   F1
Mergers & acquisitions    0.744       0.806    0.773
Change in management      0.656       0.786    0.715

Table 1. Results after two iterations, using the naïve Bayes classifier for the two sales drivers

It is important to observe that certain sales drivers, such as change in management, contain a large number of misleading trigger events. For the case of change in management, a recurring example is the biographical description of a person. Biographical descriptions typically contain sentences such as "Mr. Andersen was the CEO of XYZ Inc. from 1980-1985". These are not instances of the change in management sales driver, but will deceive the classifier because of their features. Thus, certain sales drivers are bound to have a lower F1 measure because of the wide presence of such outliers. These can be further tackled by the ranking component by making the score corresponding to each snippet a function of the time period associated with the snippet.
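One way to realize this time-based adjustment, sketched below, is to damp each snippet's classifier score exponentially with the age of the reported event, so that a dated biographical mention ranks below a recent announcement. The half-life value is an illustrative assumption, not a tuned ETAP parameter.

```python
import math

def time_adjusted_score(score, age_in_days, half_life_days=90.0):
    """Halve the classifier's confidence score for every half_life_days
    of event age, pushing stale mentions down the ranking."""
    return score * math.exp(-math.log(2) * age_in_days / half_life_days)

recent = time_adjusted_score(0.9, age_in_days=7)     # recent announcement
stale = time_adjusted_score(0.9, age_in_days=7300)   # ~20-year-old mention
print(recent > stale)  # True
```

This presupposes that a date can be associated with each snippet, which, as noted in Section 6, is itself a nontrivial extraction problem.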

6 Conclusion and Future Work

In this paper, we described a system called ETAP that generates sales leads from Web data by extracting new or evolving facts about companies. The approach is based on the knowledge of sales drivers. A sales driver represents a class of events whose existence indicates a good propensity to buy products/services by the companies associated with the events. Given the sales drivers, ETAP identifies trigger events from Web pages. We pose the problem of trigger identification given a sales driver as a two-class classification problem. We use feature abstraction to address the problem of data sparsity and to achieve generalization. We use a novel technique that helps in identifying the right level of abstraction. To address the lack of availability of training data for the classifiers, we present a method that uses smart queries to generate a noisy positive data set. Although we use a specific method, any existing method for learning classifiers in the presence of noise can be used to obtain classifiers using the generated noisy positive data. We introduce

6http://www.cs.waikato.ac.nz/ml/weka/


ranking methods to assist easy evaluation, verification and use of the identified trigger events.

The ranked trigger events obtained from ETAP have been found to be useful for actual sales lead generation. However, several problems need to be addressed for better identification of trigger events. The overall result of ETAP is heavily dependent on the accuracy of the named entity recognizer. Wrong annotation of company and person names leads to incorrect trigger events. For a trigger event to be useful, it should belong to a relevant time period. We need to associate a time with each trigger event to evaluate its relevance. This is not always easy, and methods need to be developed to resolve phrases such as "last year" and "previous quarter". To determine an overall score of a company based on its trigger events, we need to know all the name variations by which the company is referenced. This information is not always available, and automated methods to determine variations of a company name need to be developed.

References

[1] E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003.

[2] N. Agrawal, R. Ananthanarayanan, R. Gupta, S. Joshi, R. Krishnapuram, and S. Negi. Eshopmonitor: A web content monitoring tool. In ICDE, pages 817–820, 2004.

[3] C. E. Brodley and M. A. Friedl. Identifying and eliminating mislabeled training instances. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 799–805, Portland, Oregon, 1996.

[4] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. In Working Notes of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, pages 6–11, Menlo Park, CA, 1998. AAAI Press.

[5] D. Freitag. Toward general-purpose learning for information extraction. In C. Boitet and P. Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 404–408, San Francisco, California, 1998. Morgan Kaufmann Publishers.

[6] R. Ghani and R. Jones. Automatic training data collection for semi-supervised learning of information extraction systems. Accenture Technology Labs Technical Report, 2002.

[7] T. Joachims. A statistical learning model of text classification for support vector machines. In Proceedings of SIGIR 2001, 2001.

[8] W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In T. Fawcett and N. Mishra, editors, Proceedings of the 20th International Conference on Machine Learning, pages 448–455, Washington DC, US, 2003.

[9] C. X. Ling and C. Li. Data mining for direct marketing: Problems and solutions. In Knowledge Discovery and Data Mining, pages 73–79, 1998.

[10] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Using EM to classify text from labeled and unlabeled documents, 1998.

[11] G. Ramakrishnan. Word Associations in Text Mining. PhD thesis, IIT Bombay, 2005.

[12] G. Ramakrishnan, K. P. Chitrapura, R. Krishnapuram, and P. Bhattacharya. A model for handling approximate, noisy or incomplete labeling in text classification. In Proc. of ICML 2005. http://www.cse.iitb.ac.in/~hare/bayesanil.pdf.

[13] S. Sarawagi and W. W. Cohen. Semi-Markov conditional random fields for information extraction. In NIPS, 2004.

[14] P. D. Turney. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167, 2001.
