[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Automatic Sales Lead Generation from Web Data

Download [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Automatic Sales Lead Generation from Web Data

Post on 07-Mar-2017

212 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

  • Automatic Sales Lead Generation from Web Data

    Ganesh RamakrishnanIBM India Research Labganramkr@in.ibm.com

    Sachindra JoshiIBM India Research Labjsachind@in.ibm.com

    Sumit NegiIBM India Research Labnegsum@in.ibm.com

    Raghu KrishnapuramIBM India Research Labkraghura@in.ibm.com

    Sreeram BalakrishnanIBM India Research Lab

    sreevb@us.ibm.com

    Abstract

    Speed to market is critical to companies that aredriven by sales in a competitive market. The earliera potential customer can be approached in the deci-sion making process of a purchase, the higher are thechances of converting that prospect into a customer.Traditional methods to identify sales leads such as com-pany surveys and direct marketing are manual, expen-sive and not scalable.

    Over the past decade the World Wide Web has growninto an information-mesh, with most important factsbeing reported through Web sites. Several news pa-pers, press releases, trade journals, business magazinesand other related sources are on-line. These sourcescould be used to identify prospective buyers automati-cally. In this paper, we present a system called ETAP(Electronic Trigger Alert Program) that extracts triggerevents from Web data that help in identifying prospec-tive buyers. Trigger events are events of corporate rel-evance and indicative of the propensity of companiesto purchase new products associated with these events.Examples of trigger events are change in management,revenue growth and mergers & acquisitions. The un-structured nature of information makes the extractiontask of trigger events difficult. We pose the problem oftrigger events extraction as a classification problem anddevelop methods for learning trigger event classifiersusing existing classification methods. We present meth-ods to automatically generate the training data requiredto learn the classifiers. We also propose a method offeature abstraction that uses named entity recognitionto solve the problem of data sparsity. We score andrank the trigger events extracted from ETAP for easybrowsing. Our experiments show the effectiveness ofthe method and thus establish the feasibility of auto-

    matic sales lead generation using the Web data.

    1 Introduction

    Various marketing techniques have been used bycompanies to target and entice prospective buyers inorder to increase their sales. The process of identifica-tion of potential buyers is also known as sales lead gen-eration. Live seminars, trade shows, cold calling, massmailing, advertising and partner referrals are some ex-amples of sales lead generation. These methods can beclassified into two types1: 1) reactive methods, and 2)proactive methods. Reactive methods are the ones inwhich a customer contacts a company following an ad-vertisement such as one in yellow pages. Subsequently,sales representatives handle the sale. On the otherhand, proactive methods are the ones in which a pos-sible customer is contacted cold by a company. Salesrepresentatives then try to verify that that the possi-ble customer is actually a prospect. Identifying possiblecustomers based on certain information about the cus-tomers is the most important part of proactive meth-ods. Several data mining methods that use customerinformation for the identification of likely buyers canbe found in the literature [9]. However, these meth-ods cannot be used when relevant information aboutcustomers is not readily available. Automatic methodsto gather relevant information about customers are re-quired to help identify possible customers.

    In this paper, we propose a proactive method of saleslead generation that gathers new or evolving informa-tion about customers from the Web. Our approach isinspired by the growth of information (such as companydescriptions, life cycles, profits and revenue changes) on

    1http://www.consultancymarketing.co.uk/sales-lead.htm

    1

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

  • the Web, posted by several news papers, press releases,trade journals, business magazines and other relatedsources. We present a system called ETAP (ElectronicTrigger Alert Program) that uses this information todiscover sales leads. The system is suited for iden-tification of prospective companies as opposed to theidentification of individual buyers, making it suitablefor B2B (business to business) scenarios. Analysis ofsales leads is a complex process that requires skill, timeand effort. The proposed system is aimed at gather-ing sales leads from the Web and presenting them todomain specialists for the final validation.

    ETAP is based on a traditional method of sales leadgeneration. In the traditional method, companies iden-tify a set of sales drivers for their products/services. Asales driver represents a class of events whose existenceindicates a strong propensity to buy products/servicesof companies associated with the events. Revenuegrowth, change in management and mergers & acqui-sitions are examples of sales drivers. The set of salesdrivers could be different for different industries. As anexample, mergers & acquisitions could be a sales driverfor the IT industry but may not be a sales driver forthe steel industry. This is because mergers and ac-quisitions of companies could lead to the integrationof IT systems of the companies thereby generating de-mand for new IT products. Typically, the drivers fora specific industry are determined by expert opinions.Suitable questionnaires are developed and surveys areconducted to gather information about the existence oftrigger events related to the known sales drivers. Trig-ger events are events that occur in the context of acompany (or its environment) and belong to a givensales driver. As an example, given mergers & acqui-sitions as a sales driver for the IT industry, an eventsuch as Company X plans to acquire Company Y laterthis year is a trigger event. Similarly, given revenuegrowth as a sales driver, an event such as Company Xreported a revenue growth of 10% in the fourth quar-ter is a trigger event. Identification of such a trig-ger event makes Company X a prospective buyer of ITproducts. The traditional method of gathering triggerevents is to call representatives of different companiesor conduct interviews with them. This method, beingmanual, tedious and expensive, is not scalable.

    In our work, we assume that the set of sales driversfor the industry is known before hand. Given a setof sales drivers for the industry, ETAP automaticallyidentifies trigger events on the Web. The ETAP sys-tem crawls the Web to gather documents related tocompanies and financial news, extracts trigger eventsfrom the crawled documents and then ranks the triggerevents based on a scoring function for easy browsing.

    Thus, ETAP assists sales representatives in prioritizingcompanies that need to be approached.

    Event extraction is the most challenging part of thesystem, due to the unstructured nature of the data.Traditional information extraction [4, 5, 13] deals withextracting entities, relations and events from docu-ments according to pre-defined templates, and can beused to identify events. In our case, identifying trig-ger events for the mergers & acquisitions sales driverwould involve extracting entities that play the rolesof the acquired company, and the acquiring com-pany. Most existing systems for information extrac-tion employ linguistic patterns to fill the slots corre-sponding to entities and relations. As an example, alinguistic pattern company acquired company can beused to extract information about acquired and acquir-ing companies. Since several different patterns need tobe created manually for a single information extrac-tion task, this method involves a large amount of ef-fort. Learning based methods for information extrac-tion have been proposed. However, they suffer frompoor precision and recall2.

    We argue that for identifying trigger events, it suf-fices to distill from a large collection of documents,snippets (a collection of sentences) that are related toa particular sales driver and present them in a rankedorder of relevance. In ETAP, events are associated withsnippets of text. We refer to a snippet that containsinformation relevant to a sales driver as a trigger event.As an example, to find information pertaining to merg-ers & acquisitions, it is good enough to determine snip-pets consisting of one or more sentences that describethe event. In our proposed method, we formulate theproblem of trigger event identification as a two-classclassification problem. The two classes are: a positiveclass of data relevant to the sales driver and a negativeclass comprising a large random sample of data fromthe Web. Since traditional text classification based ona bag-of-words model suffers from data sparsity prob-lem [7], we address this drawback by using feature ab-straction. We obtain feature abstraction by replacingcertain words with their type information. The typeswe consider in this paper are restricted to named entitytypes and part-of-speech types.

    Another issue with training a classifier is the avail-ability of relevant training data. We solve this problemby querying the Web to generate noisy training data,and then using a method similar to Carla Brodley [3]to train the classifier using this data. Other classifiersthat can learn in presence of noise could also be usedinstead.

    2http://www.itl.nist.gov/iaui/894.02/related

    projects/muc/proceedings/muc 7 proceedings/overview.html

    2

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

  • The number of extracted trigger events could belarge. Therefore, the ranking component of ETAPranks the extracted trigger events based on a scoringfunction for easy browsing. Thus the ranking compo-nent of ETAP acts as a precursor to the validation ofthe generated sales leads by domain experts. The sim-plest scoring function is the posterior probability of thesales-driver class. The ranked list of trigger events canbe used by the sales representatives for the further salesrelated processes.

    The organization of the rest of the paper is as fol-lows. In Section 2, we present an overview of the ETAPsystem. In [2] we discuss the details of the data gath-ering component. In Section 3, we present the methodthat we used to extract trigger events and discuss fea-ture abstraction and other issues involved in learningclassifiers. We present several scoring methods used inETAP in Section 4. Finally, we present our conclusionsand future work in Section 6.

    2 System Description

    ETAP is based on two main concepts, viz., salesdrivers and trigger events. A sales driver represents aclass of events whose existence indicates a high propen-sity to buy products/services by the companies associ-ated with the events. Trigger events are events thatoccur in the context of a company (or its environment)and indicative of a given sales driver. ETAP currentlyconsiders three sales drivers, viz., mergers & acquisi-tions, change in management, and revenue growth.

    The ETAP system consists of three components,viz., the data gathering component, an event identi-fication component and a ranking component. Figure1 depicts the three components of ETAP and there in-teractions. Below, we briefly describe the functionalityof each component.

    1. The data gathering component [2] gathers a col-lection of documents D from various sources suchas proprietary databases and corpora as well asfrom a focused crawl of the Web.

    2. The event identification component splits eachdocument in D into snippets and associates witheach snippet, a score of its relevance to the givensales drivers.

    3. The ranking component orders snippets so thatsnippets with higher confidence values for beingtrigger events are ranked higher. This componentalso provides the facility for ranking companiesbased on all trigger events associated with them.

    Figure 1. Overview of the system

    Details of components (2) and (3) are provided inthe following sections.

    3 Event identification component

    3.1 Overview

    The event identification component consists of threesub-components, viz., snippet generator, annotator andclassifiers. This is shown in Figure 2. Each documentis first split into snippets (groups of sentences). Thechoice of operating at the snippet level was motivatedby the observation that a snippet conveys a precisepiece of information, in contrast with the entire docu-ment that contains the snippet. We have built a sen-tence chunker based on rules for sentence boundarydetection. The snippet generator uses the chunker andsplits the documents into snippets, each of which is agroup of n consecutive sentences. We have used n = 3in our system.

    A two-class classifier is employed for determiningthe relevance of a snippet to each sales-driver. Thetwo classes are (1) a positive class of snippets that per-tain to the sales-driver and (2) a large negative class ofsnippets randomly sampled from the Web. The snip-pets are annotated by a named entity annotator be-fore being classified. The named entity annotator pro-vides feature abstraction for the classification task. Wepresent the motivation and other details behind fea-ture abstraction in section 3.2. Following named en-tity annotation, a snippet is passed through the two-class classifiers, each of which determines the probabil-ity that the given snippet pertains to the sales drivercorresponding to that classifier.

    In practice, it is difficult to obtain sufficiently largeamount of manually labeled data for training the classi-

    3

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

  • Figure 2. Event Identification Component

    fiers. We get around this problem by smartly queryingthe Web to generate noisy-positive data. Details of theprocedure employed for obtaining the noisy positivedata set are provided in Section 3.3.1. In Section 3.3.2,we discuss the technique that we adopt for learning theclassifier in the presence of noise.

    3.2 Named Entity Annotations and Feature Ab-straction

    3.2.1 Named Entity Annotation

    Data sparsity poses a big problem for learning accuratetext classifiers. Given the large amount of informationin text data sets, it is desirable to find a compact rep-resentation of the data set, losing as little informationas possible. This is traditionally a pre-processing stepin text classification called feature selection. After sim-ple operations such as changing all text to lower case,stemming, and stop-word elimination, statistical mea-sures are used to compute the amount of informationthat tokens (features) contain with respect to the label-set. Standard measures used are 2, information gain,and mutual information. Features are ranked by one ofthese measures and only the top few (an ad hoc tunableparameter in most experiments) features are retained.

    Research in natural language processing technologyallows us to take more of the semantics of the textinto account. Named entity and part-of-speech tagshave...

Recommended

View more >