
Ontology induction for mining experiential knowledge from customer reviews

Thomas Lee

Department of Operations and Information Management

University of Pennsylvania, The Wharton School, Philadelphia, PA 19104

Abstract

The rapid rate of product innovation and the increasing degree of product complexity in categories ranging from digital cameras to toaster ovens fuel the need for resources to help consumers navigate the product space. For a given product category, rather than relying upon domain experts, we turn to the collective experience of the user community as captured in online customer reviews. In order to learn from users, we must first understand the concepts, the relationships, and the vocabulary with which the community discusses the product category. We induce the corresponding ontology from all reviews of all products in the category.

In this paper, we represent the set of customer reviews for a product category as a graph. Vocabulary tokens are represented as vertices and edges capture the co-occurrence of tokens within and between reviews. We learn hyponym (is-a) relationships between concepts by identifying the connected components of the graph. Meronym/holonym (part-of/has-a) relationships are defined by heuristically searching for a maximal clique. We discuss preliminary results from approximately 8000 on-line digital camera reviews and specify a set of quantitative metrics that is currently being evaluated.

Working Paper, Winter Information Systems Conference, Eccles School of Business

Introduction

At the intersection of current manufacturing trends such as mass customization, personalization, and short product life cycles, the modern consumer is faced with a dizzying selection of makes and models in a myriad of product categories. Customers formerly relied upon knowledgeable sales clerks and informed colleagues. Unfortunately, today's product variety challenges the ability of even professional reviewers like the Consumers Union to keep pace. Consumer Reports might review 10 or 20 digital cameras in a single study. Amazon.com recently listed more than 600 available digital cameras (and camera packages). Fortunately, the World Wide Web has enabled a new knowledge base for navigating the product space. On-line customer reviews represent the community of users. The challenge is to harness the knowledge contained within the virtual community of customer reviews to complement input from the physical community of colleagues, neighbors, and clerks.

We previously proposed use-centric mining as one approach to leveraging reviews [18]. From the user community, as represented by customer reviews, we learn those features (product attributes) that align with particular product uses and applications. For example, what features does the user community make note of when purchasing a digital camera for vacation travel? Which features are noteworthy if 'taking eBay auction photos' is the primary reason for purchasing? A customer's prospective interests are matched to an associated set of product features. Products are recommended based upon their support for the desired mix of features.

In this position paper, we present a graph-based approach to semi-automatically learn a domain-specific ontology of product features for the purpose of enabling use-centric mining. We define an ontology as a controlled vocabulary for discussing both concepts and relationships between concepts [13, 21]. We learn a category-specific ontology of product features from the customer reviews of products in that category. For example, the 'battery types' used for motor vehicles are not the same 'battery types' used in digital cameras. Using a common vocabulary of product features (concepts) enables use-centric mining to discover associations between user interests and product features from customer reviews.

We model the space of online customer reviews in a product category as a weighted graph where vocabulary terms are vertices in the graph and edges are weighted by the frequency with which two terms are used together to describe a product feature. Three contributions that follow from our graph-based approach are presented in this position paper. First, we use co-occurrences of terms within and between review documents to heuristically prune the graph. Second, the principle of connected components is used to automatically define subclass or hyponym (is-a) relationships between concepts. Third, we use cliques to automatically discover meronym/holonym (part-of/has-a) relationships between concepts.

In the remainder of this paper, we begin with a background statement both to motivate our work and to clarify the problem statement. We then provide a formal model and include initial results from processing a set of 8695 digital camera reviews from Epinions.com [1]. The preliminary status of this research leads us to describe a more formal set of evaluation methods that we are in the process of conducting but for which we do not yet have results. We close with a review of related work on ontology learning, a discussion, and future work.

Background

This section begins with a brief overview of use-centric mining to motivate our work. We then focus on the specific problem that is addressed in this paper: inducing an ontology of product features. We describe the ontology and the nature of the learning problem in the context of use-centric mining.


Use-centric mining

Prior research involving customer reviews tends to fall into one of two categories: product-focused or customer-focused. Product-focused researchers cluster reviews by manufacturer and model. One objective is to help designers understand positive and negative aspects of a design, both for incremental improvements and for new product development [17, 26, 30]. Moreover, prospective buyers use product-focused reviews to evaluate a specific item [17]. By contrast, customer-focused research on product reviews attempts to model individual consumers. Researchers combine data from all (multiple) reviews written by a single reviewer; the objective is to create a user profile and predict future purchases [5].

The goal of use-centric mining is to model a particular product category based upon feedback from that category's user community. For use-centric mining, we develop two category-specific ontologies: one of product features and one of customer uses. Using ontology-based extraction [8, 25], each review is mapped into a Boolean vector representation (see Figure 1). Finally, we mine the review vectors for those frequent item sets that associate particular usages with particular features [15].

<!-- Ontology entries -->
<Concept>
  <CName>Ease of use</CName>
  ...
  <ClassifyExtract>
    <RegExp>'ease[ -]of[ -]use'</RegExp>
    <RegExp>'easy[ -]to[ -]use'</RegExp>
  </ClassifyExtract>
  ...
</Concept>
<Concept>
  <CName>Compactness</CName>
  ...
  <ClassifyExtract>
    <RegExp>'handy'</RegExp>
    <RegExp>'pocket-size'</RegExp>
  </ClassifyExtract>
  ...
</Concept>

Review vector (ProdID-ReviewerID): Ease of use = 1, Compactness = 1, Zoom = 0, Lens Cover = 0, ...

Figure 1. Ontology support for mapping a review into a Boolean vector
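The mapping in Figure 1 can be sketched as follows. The concept names and patterns below are illustrative stand-ins modeled on the figure, not the actual ontology entries:

```python
import re

# Illustrative concept -> regular-expression patterns, modeled on Figure 1
CONCEPTS = {
    "Ease of use": [r"ease[ -]of[ -]use", r"easy[ -]to[ -]use"],
    "Compactness": [r"handy", r"pocket-size"],
}

def review_vector(review_text):
    """Map a raw review into a Boolean vector keyed by ontology concept."""
    text = review_text.lower()
    return {concept: int(any(re.search(p, text) for p in patterns))
            for concept, patterns in CONCEPTS.items()}
```

Each review thus becomes a row of 0/1 concept indicators, ready for frequent-item-set mining.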


Product feature ontologies

We define an ontology as a controlled vocabulary for discussing both concepts and relationships between concepts [13, 21]. Each concept in our product ontology corresponds to a particular product feature. Abstractly, each on-line review is a list of those features or concepts that a customer likes or dislikes. Because there are many different ways that different customers might refer to a particular feature (e.g., '2 AA NiMH' and 'rechargeable lithium ion' are different string literals describing a battery type), the ontology normalizes the language for distinguishing between different features. Normalizing the terminology enables us to mine the feature-use patterns for use-centric recommendations. Having identified the concepts, we now describe relationships between concepts.

There are two types of relationships between concepts that we would like to learn. Where each review is a list of camera features, the first relationship is the hyponym relationship that subdivides a concept like camera feature into its subclasses. Battery type is a feature of digital cameras, as are resolution and compression. We can then distinguish review phrases like '2 AA NiMH,' which are instances of battery type, from other literals like 'MPEG-1'. Moreover, a given subclass might be hierarchically decomposed into a more precise level of specificity. For example, MPEG, JPEG, and TIFF are hyponyms of compression. The string 'MPEG-1' is an instance of the concept MPEG, while the string 'JPEG' is an instance of the concept JPEG, which is-a compression scheme, which is-a camera feature. By representing concept features at different levels of precision, we aggregate reviews in different ways and can isolate whether the concept of compression is important or whether customers care about a specific compression scheme because it is easier to import into their software for making holiday greeting cards.


The second type of concept relationship that we would like to learn is the 'part-of' or 'has-a' relationship (meronym/holonym) that describes the 'attributes' of a particular product feature. For example, a battery type might be described by the size (instances of size include 'kcrv3' and 'AA'), by the chemistry (instances of battery chemistry are 'LiMH' and 'Alkaline'), or by the voltage. Each attribute constitutes a distinct concept. For example, digital camera users who travel worldwide are concerned about battery size. AA and AAA sized batteries are easily available worldwide while kcrv3 cells may be more difficult to purchase overseas. In this case, use-centric mining would identify the holonym size rather than the more general concept battery type for travelers.

Approach

In this paper, we focus on an unsupervised approach to inducing the ontology of features for a particular product category from on-line customer reviews. Although the most common approach to ontology development involves expert intervention, products within a single category evolve over time, and the breadth of product variety makes finding experts for multiple categories implausible.

Abstractly, each customer review is a list of features. To minimize the complexities of natural language processing and information extraction, we model each review solely by its list of Pros and Cons (see Figure 2). In addition, we gain additional data on ontology concepts from the 'product detail' page for each item in the product category. From each 'product detail' page, we extract a list of features (battery type, compression type, etc.) and the corresponding relationship data. For example, in Figure 3, 'MPEG,' 'TIFF,' and 'JPEG' are listed as instances of Compression Types.


Every review and every line of text from the product detail page is parsed into a list of phrases using naïve, lexical processing techniques. Pro and Con comments are inter-mixed because, for ontology construction, we are only interested in identifying the concepts that customers are interested in. Whether the feature is a strength or a weakness of a particular product is not important. Phrases that emerge from lexical processing of the reviews in Figure 2 include 'small size,' 'movie mode,' 'great auto-focus,' '3.2 megapixels,' 'weak flash,' and 'very slow response when using redeye reduction'. Corresponding phrases for the concepts in the Figure 3 Product Description include '2 Lithium Batteries (LB-01),' '4 AA Alkaline,' '4 AA NIMH Rechargeable,' '4 AA Lithium,' and '4 AA NiCd Rechargeable Batteries (Optional).'

Figure 2. Epinions Customer Review Page
Figure 3. Epinions Product Description Page

Each phrase is tokenized into its constituent words, and the resulting token list is pruned using shallow linguistic processing techniques including stop-word removal, case normalization, Porter stemming, and a co-occurrence heuristic described below. For example, the phrase 'very slow response when using redeye reduction' is reduced to redey reduct.
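A minimal sketch of this lexical pipeline, assuming an illustrative stop-word subset and a crude suffix-stripper standing in for a real Porter stemmer (the paper's full pipeline prunes further, via the co-occurrence heuristic, to reach 'redey reduct'):

```python
STOP_WORDS = {"very", "when", "using", "the", "a", "of", "and", "is"}  # illustrative subset

def crude_stem(token):
    # Stand-in for Porter stemming: strip a few common suffixes.
    for suffix in ("ing", "ion", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) > 2:
            return token[: -len(suffix)]
    return token

def prune_phrase(phrase):
    """Lowercase, drop stop words, and stem the remaining tokens of one phrase."""
    tokens = [t.lower() for t in phrase.split()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

Note how 'redeye reduction' already stems to the tokens redey and reduct seen in Figure 4.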

Each phrase is a graph where each token is a vertex and every token pair in the phrase is an edge. Edges are weighted by their frequency. The graphs for each phrase in a review and each review in the set of all reviews are combined. A portion of the graph constructed in an initial experiment of more than 8500 digital camera reviews from Epinions.com is depicted in Figure 4. Edge weights are omitted for readability.

Hyponyms are defined in terms of a Breadth First Search (BFS) of the token graph. All phrases involving only tokens in the BFS traversal of the graph from a given node to a parameterized depth (we used a depth of 2) are clustered in the same hyponym (see Tables 1 and 2).


Figure 4. Graph of tokens from Epinions digital camera reviews

Meronyms and holonyms are discovered by heuristically searching for a maximal clique within the connected component that defines a single concept (see Table 3).

Model

Our basic model builds on the familiar definitions of a graph [2]. A graph G = (V, E) is a pair of the set of vertices V and the set of edges E. An edge in E is a connection between two vertices and may be represented as a pair (vi, vj), vi, vj ∈ V. A path is a list of vertices in V where there is an edge in E from each vertex to the next vertex in the list.

A graph is connected if there exists a path between any two vertices in the graph. S = (V′, E′) is a subgraph of G if v ∈ V′ ⇒ v ∈ V and (vi, vj) ∈ E′ ⇒ (vi, vj) ∈ E and vi, vj ∈ V′. Gi is a maximally connected subgraph of G if Gi is a connected subgraph S = (V′, E′) and, for all vj ∈ V with vj ∉ V′, there is no vk ∈ V′ where (vj, vk) ∈ E. The connected components of G are the set G1 … Gn of maximally connected subgraphs of G. By definition, for any two maximally connected subgraphs Gi and Gj, Vi ∩ Vj is empty, and G1 ∪ … ∪ Gn = G.

A complete graph G = (V, E) is a graph for which there is an edge between every pair of vertices: ∀ vi, vj ∈ V, (vi, vj) ∈ E. A clique S is a complete subgraph of G. S′ = (V′, E′) is a maximal clique of G if S′ is a clique and there is no vj ∈ V with vj ∉ V′ such that ∀ vk ∈ V′, (vj, vk) ∈ E.

Having defined our notation, we can now construct a graph as described in the Approach section. One graph is created from the set of all customer reviews (see Figure 2), and a separate graph (with distinct connected components for each product feature) is created from the set of all product descriptions (see Figure 3). Every customer review is reduced to a list of phrases represented as string literals that are instances of the generic concept product feature. Phrases are tokenized into individual words and then transformed into a graph. Each token represents a vertex in the graph, and edges are defined by the token pairs (vi, vj), i ≠ j, in the phrase. A single graph is constructed from all phrases of all lists of all customer reviews.
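As a sketch, this construction (each phrase contributing a clique over its tokens, with edge weights accumulating co-occurrence counts) might look like:

```python
from collections import Counter
from itertools import combinations

def build_graph(phrases):
    """Combine per-phrase cliques into one weighted graph.

    phrases: iterable of token lists. Returns a Counter mapping each
    undirected edge (u, v), with u < v, to its co-occurrence count.
    """
    weights = Counter()
    for tokens in phrases:
        # every distinct token pair in a phrase is an edge
        for u, v in combinations(sorted(set(tokens)), 2):
            weights[(u, v)] += 1
    return weights
```

Sorting each pair gives a canonical key, so (u, v) and (v, u) accumulate into the same undirected edge.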

Likewise, every product description (one for each product instance in the category, e.g., the Canon S400 or the Olympus Camedia C 4000) is reduced to a set of phrase lists. Each phrase list contains instances of a specific product feature hyponym. The product description graph contains a connected component for each hyponym, and each such subgraph represents the corresponding lists from all product description pages.

Pruning the graph (co-occurrences)

The graph constructed by the basic model is comprised of tokens taken from product reviews that are directly input by customers. As a consequence, the set of vertices includes misspellings and other inconsistencies that can lead to spurious edges between unrelated concepts or dilute the weight of existing edges. Therefore, we actively prune tokens (vertices) as we parse a line into a list of phrases, construct the graph for each phrase, aggregate phrase-graphs into the graph for a line, and combine graphs for different lines.

Our first step in pruning involves a series of lexical processing steps. Beginning with a line, we naively parse on list separators including commas, slashes, or semicolons. We apply a simple heuristic to count list separators and assume that multiple instances of a separator correspond to a list (e.g., two or more commas) while single instances are ignored.

For each resulting phrase in the line, we apply the following filters:

• Discard stop words to reduce noise in the graph.

• Convert all tokens to lower case to address capitalization and proper nouns (e.g., 'Microsoft Windows' is a required operating system for some digital camera software).

• Apply Porter stemming to address plural forms as well as simple misspellings. Note that introducing a dictionary, thesaurus, or Wordnet-type variants might also be of use. However, in this initial work, we wanted to limit the use of external sources that might introduce domain dependence. In this way, we focus on the robustness of domain-independent lexical and graph-based processing.

• Discard any phrase that contains a non-alphanumeric or punctuation feature not eliminated by parsing phrases. Special exceptions were generated by hand for decimal numbers and hyphenated terms. Examples of discarded phrases taken from the digital camera product feature Computer Requirements include 'iboook(tm),' 'windows®98 second edition (se),' and 'windows 98*.' The intuition is that our data set ranges from several hundred to several thousand phrases depending upon the starting feature concept. Therefore, we can safely discard outlier phrases. Discarding does raise the question of an optimal sample size of phrases; we revisit this issue in the Evaluation and Discussion sections below.

The final pruning step is based on the co-occurrence of terms within a line. We assume that each phrase in a line addresses a distinct product feature or concept. By extension, we conclude that a single token that appears in multiple phrases of the same line is therefore not representative of any of the phrases in which it appears. For example, a customer review line might include the phrases 'great battery life' and 'great price.' In our graph-based representation, the token 'great' would connect subgraphs representing two distinct concepts: battery life and price. As a consequence, we prune tokens that appear in multiple phrases of the same line.

More formally, we apply the co-occurrence heuristic when combining the graphs of the constituent phrases into the graph for the corresponding line. A line containing n phrases should produce a graph of n connected components. Any vertex that appears in more than one clique of the graph is discarded. Note that we can also test for a violation by checking whether the number of connected components is fewer than the number of phrases.
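A sketch of the heuristic over the phrases of a single line, assuming phrases are already token lists:

```python
from collections import Counter

def cooccurrence_prune(line_phrases):
    """Drop any token that appears in more than one phrase of the same line."""
    phrase_count = Counter()
    for tokens in line_phrases:
        for t in set(tokens):          # count each phrase at most once per token
            phrase_count[t] += 1
    return [[t for t in tokens if phrase_count[t] == 1] for tokens in line_phrases]
```

With the 'great battery life' / 'great price' example, the shared token 'great' is removed from both phrases, keeping the two concepts in separate components.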

Our application of co-occurrence runs counter to traditional research in information retrieval and text mining. Co-occurrence is traditionally interpreted as an indicator of interestingness [15, 29]. We believe that the traditional notions of intra-document frequency and inter-document frequency have a different interpretation for text corpora like customer reviews. We use co-occurrence to identify tokens that introduce noise into the graph and acknowledge that the assumption may be unique to data sets that are domain specific, short in length, and dense in token significance.


Learning hyponyms

We equate the task of finding hyponyms (is-a relationships) with finding the connected components of the graph. We find the connected components by arranging the vertices of the graph in descending order by degree and repeatedly executing a breadth first search (BFS) to find the distinct connected components. For example, in the case of a graph representing the list of phrases describing the product description concept battery life, we can distinguish the subclasses of battery life by hour and battery life by number of images. The phrases in Table 1 are excerpted from the 201 Epinions.com digital camera product descriptions, in our data set of nearly 600 digital cameras, that provide phrases for the feature concept battery life.
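The component search can be sketched as repeated BFS over an adjacency map, visiting start vertices in descending degree order as described above:

```python
from collections import deque

def connected_components(adj):
    """adj: dict mapping each vertex to its set of neighbours.
    Returns a list of vertex sets, one per connected component."""
    seen, components = set(), []
    for start in sorted(adj, key=lambda v: -len(adj[v])):  # descending degree
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components
```

On the battery-life phrases, the 'hr/hour' tokens and the 'shot/imag' tokens fall into two separate components, matching Table 1.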

Phrases (excerpted): 2.5 hr | 1.66 hr | 2.5 hour | 1.67 hour | 250 shot | 650 shot | 250 imag | 180 imag | 135 imag | 390 imag | pentax shot | shot accord

Connected component by time: 1.66 hr | 2.5 hr | 2.5 hour | 1.67 hour

Connected component by images: 250 shot | 250 imag | 650 shot | 180 imag | shot accord | 135 imag | pentax shot | 390 imag

Table 1. Connected components for representative phrases of Battery Life taken from Epinions Product Descriptions

The hypothesis that connected components isolate concepts breaks down in the general graph generated from customer review phrases because of the tremendous expressiveness of the English language. Figure 4 illustrates how the token redey, associated with the concept of redeye reduction, is connected to the concept of warranty by the path redey, reduct, ineffect, warranti. We address this problem in two ways. First, the co-occurrence heuristic decreases noise by deleting spurious tokens such as 'great' or 'good' that might otherwise link distinct concepts. Second, instead of searching for connected components, we cap the BFS at a depth of two (which corresponds to the average length of our pruned phrases). Table 2 identifies the phrases that define the concept of redeye reduction based upon a breadth first search of depth two over the pruned graph generated from 8695 Epinions.com digital camera customer reviews.
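The depth-capped traversal can be sketched as follows, using the redey-to-warranty path from Figure 4 as illustrative test data:

```python
from collections import deque

def bfs_to_depth(adj, source, max_depth=2):
    """Return {vertex: distance} for all vertices within max_depth hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_depth:
            continue                    # cap the search: do not expand further
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist
```

Capping at depth two keeps ineffect in the redeye-reduction cluster but excludes warranti, which would only be reached at depth three.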


path length | pruned phrases from the graph of customer reviews

0 | redey

1 | redey indoor; reduct redey

2 | indoor wb; indoor optim; reduct ineffect; unaccept reduct; indoor unus; indoor tint yellow; indoor focuss troubl

Table 2. BFS of depth 2 in the customer review graph for the concept of redeye reduction

Learning meronyms/holonyms

Finding meronyms and holonyms is addressed by heuristically searching for a maximal clique in the graph of a single hyponym. The (strong) hypothesis is that, if the data set is large enough, we can discover at least one instance of the Cartesian product of all feature attributes that define the concept, and that we can find that in a maximal clique. Colloquially, there may not be a single phrase that includes an instance of all feature attributes, but transitive relationships (edges between different phrases) could complete the relation. Because the general problem of finding a maximum clique is NP-complete, we adopt a simple heuristic for finding a maximal clique that is presented as Figure 5 [4, 31]. We iterate repeatedly, selecting the largest such maximal clique.

Application to the hyponyms of battery life in Table 1 is trivial. For connected components by time, because 'hr' and 'hour' never appear in the same phrase, the maximal clique (which happens to be the maximum in this case) is represented by a token (integer) for duration and the token for units ('hr' or 'hour'). The resulting meronym/holonym relationships are included as Table 3. Note that the technique does not assign names to the concepts, although in many cases, concept names naturally emerge. Note also a limitation of the strong assumption regarding the maximal clique. For connected components by images, the tokens 'accord' and 'pentax' are necessarily forced, by default, into the meronym/holonym concept quantity. We take up this issue again in the Discussion below.


Table 4 shows excerpted results from applying the maximal clique heuristic to the graph for battery type. Beginning from 575 product descriptions containing phrases for battery type, we pruned the list to 21 tokens that combined to form 63 distinct phrases. As seen in Table 4, phrases need not have a token for each meronym/holonym concept. Several phrases in Table 4 do not include the token 'recharg' and one phrase does not have a quantity. Logical constraints can assist in assigning the tokens of a given phrase to their constituent meronym/holonym concepts. However, there are times that such logical constraints are insufficient. Additional information is required to assign the token 'kcrv3' to size. Notice also that battery voltage, as captured by the phrase '9v', is omitted from the discovered meronym/holonym concepts. By first applying the connectedness criterion for hyponyms, we would remove the phrase '9v' altogether, but it is not clear that voltages are a hyponym of battery type rather than a meronym/holonym in the same sense as battery chemistry or size. We discuss relaxing the strong clique assumption below.

# input g: the graph as a list of vertex edge pairs
# v1: the first vertex in the clique
gen_maximal_clique(v1, g):
    v_list = edge_list(v1, g)   # vertices pairwise adjacent to v1, ordered by degree
    c_list = edge_list(v1, g)   # candidate vertices for addition to the clique
    clique = [v1]               # initialize the clique
    return iterate_clique(clique, c_list, v_list, g)

iterate_clique(clique, c_list, v_list, g):
    if there are no candidates to add and no vertices to check, return clique
    v_car is the first vertex in v_list
    v_cdr is the remainder of v_list
    e_car is the edge list of v_car
    if v_car is in c_list:
        # v_car is adjacent to everything in the clique so far
        add(v_car, clique)
        # update c_list to the adjacent nodes of the new clique
        c_list = intersect(c_list, e_car)
        recurse with v_cdr
    else:
        recurse with v_cdr

Figure 5. Algorithm to find a maximal clique [31]
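An iterative Python rendering of the Figure 5 heuristic; this is a sketch in which `adj` maps each vertex to its neighbour set:

```python
def maximal_clique(adj, v1):
    """Greedily grow a maximal clique seeded at v1, as in Figure 5."""
    clique = [v1]
    candidates = set(adj[v1])           # vertices adjacent to every clique member so far
    for v in sorted(adj[v1], key=lambda u: -len(adj[u])):  # descending degree
        if v in candidates:
            clique.append(v)
            candidates &= set(adj[v])   # keep only vertices adjacent to the grown clique
    return clique
```

The result is maximal (no vertex can be added) but not necessarily maximum, which is why the paper iterates from multiple seeds and keeps the largest clique found.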


Connected components by time:
  Maximal clique: 1.66 hr
  Meronym/holonym concept names: quantity | units
  Token assignments: 1.66 hr | 2.5 hr | 2.5 hour | 1.67 hour

Connected components by images:
  Maximal clique: 250 shot
  Meronym/holonym concept names: quantity | units
  Token assignments: 250 shot | 650 shot | accord shot | pentax shot | 250 imag | 135 imag | 390 imag | 180 imag

Table 3. Meronym/holonym relationships for hyponyms of the feature Battery Life

Connected component by rechargeable-or-not:
  Maximal clique: recharg nimh aa 4
  Meronym/holonym concept names: recharg | chemistry | size | qty
  Token assignments (excerpted initial phrases): recharg nimh aa 4 | aa recharg alkalin 4 | recharg alkalin aa | nimh recharg 2 | recharg nimh 2 | 2 aaa alkalin | alkalin aaa 2 | aa lithium 4 | lithium aa 4 | recharg lithium built | kcrv3 2

Connected component by voltage:
  Maximal clique: 9v
  Meronym/holonym concept name: voltage
  Token assignment: 9v

Table 4. Meronym/holonym relationships for Battery Type

Evaluation

In this section, we outline on-going and future work to develop evaluation metrics. Historically, evaluation of research in ontology induction was quite limited. Criteria were limited to subjective assessments by the researchers and/or domain experts [19, 24, 27]. Assessment might simply entail publishing representative results, as we provided in the previous section. Only more recently has research in ontology induction begun to develop more objective metrics. Modica, Gal and Jamil [25] use precision and recall to systematically document the effectiveness of different strategies at correctly incorporating new concepts from a representative data source into an existing ontology hierarchy. Popescu, Yates and Etzioni [28] develop multiple techniques to extract and construct sub-class structures from text. They define their reference standard for correctness as the union of all sub-class structures from all techniques and use precision and recall as an internal consistency check.

We describe three different techniques for evaluating the validity of our co-occurrence heuristic, the accuracy of ontology relationships, and the robustness of the induced ontologies, respectively. Moreover, while all three evaluation approaches are applicable to the existing category of digital cameras, we have also collected reviews and product description information for additional product categories including DVD players, camcorders, toaster ovens, and vacuum cleaners. The intention is to evaluate the assumptions and heuristics across multiple product categories that reflect the same inter-document and intra-document characteristics.

Validity of the co-occurrence heuristic

In our approach, we described a co-occurrence heuristic for discarding irrelevant terms in a pre-

processing step to the connectedness algorithm for categorizing phrases into distinct subclasses.

While we currently capture the tokens that fail the co-occurrence heuristic, we need to analyze

the discarded tokens to determine the Type I and Type II errors. Type I errors in this context

would discard terms that are indeed significant and Type II errors would report terms that linked

concepts within the graph that could/should have been discarded but were missed using this

heuristic. A more sophisticated, probabilistic model that considers not only co-occurrence within

a single review but co-occurrences between reviews might provide a better filter for terms to

discard. Because of the size of the data set and possible bias as a function of particular reviews

or classes of subsets of products, we also would like to check the heuristic by repeatedly

sampling from smaller data sets for validation.
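The within-review co-occurrence heuristic can be sketched as follows. This is a minimal illustration, not the paper's implementation; the pair-count threshold `min_cooccur` is a hypothetical parameter.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_filter(reviews, min_cooccur=2):
    """Keep tokens that co-occur with at least one other token in
    `min_cooccur` or more reviews; discard the rest.

    `reviews` is a list of token sets, one per review. The threshold
    is an illustrative assumption, not a value from the paper."""
    pair_counts = Counter()
    for tokens in reviews:
        for a, b in combinations(sorted(tokens), 2):
            pair_counts[(a, b)] += 1
    kept = set()
    for (a, b), n in pair_counts.items():
        if n >= min_cooccur:
            kept.update((a, b))
    all_tokens = set().union(*reviews)
    return kept, all_tokens - kept
```

Logging the discarded set, as described above, is what makes the Type I and Type II error analysis possible afterward.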


Accuracy of the ontology relationships

The traditional approach to validating ontologies involves asking domain experts to vet the

results. For an automated system that is intended to capture knowledge from multiple,

heterogeneous product domains, soliciting input from domain experts is not scalable. Instead,

we propose to compare our induced ontologies to publicly available expert-generated ontologies:

professional reviews and on-line buying guides. We can calculate metrics with respect to both

the product features as well as the categories or levels into which individual product features are

subdivided. Precision asks, "How many of the induced features and levels are actually used in professional reviews and on-line buying guides?" Recall asks, "How many of the features and levels used in practice are automatically induced?"
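Treating the induced and expert feature sets as plain sets of normalized names, the two metrics reduce to a short computation; the set representation is an assumption for illustration.

```python
def precision_recall(induced, reference):
    """Precision: fraction of induced features/levels that appear in the
    expert reference; recall: fraction of reference features/levels that
    were induced. Inputs are sets of normalized feature names."""
    induced, reference = set(induced), set(reference)
    tp = len(induced & reference)  # features both induced and in the reference
    precision = tp / len(induced) if induced else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall
```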

We have collected 22 different buying guides from print and on-line sources including

those by the Consumer Union (which publishes Consumer Reports), Consumer Electronics HQ,

and Forbes [6, 7, 23]. Collectively, the guides document 56 different product features, some of

which are considered subclasses by some recommenders and distinct features by others. For

example, some experts identify optical zoom and digital zoom as unique product attributes while

others classify optical and digital zoom as subclasses of zoom.

In addition, we have identified 11 different on-line shopping and recommendation sites

for digital cameras and a comparable number for the other product categories for which we have

data. As depicted in Figure 3, on-line shopping sites and recommendation sites, such as

Epinions.com [1], provide browsing interfaces that decompose product attributes into categories that customers can easily navigate. As with professional reviews, we can compare our induced

product features and categories to those used in on-line interfaces.


Figure 3. Epinions.com Digital Camera Selection Wizard

Robustness of the induced ontology

While we can validate the accuracy of automatically induced ontologies with respect to one or

more external sources, our ultimate objective is to support the automated extraction of features

from on-line reviews for the purpose of tailoring usage-based recommendations [18]. As a

consequence, whether our automated ontologies align with external sources or not, the true test is

our ability to facilitate usage-based recommendations.

To that end, from our initial set of customer reviews, we have created a hand labeled set

of 594 digital camera customer reviews from Epinions.com. The data set was developed by two

independent student coders and includes inter-coder overlap on 203 of the reviews. The data set

includes 21 different digital camera products ranging in price from $45 to $1000 and covers


resolutions of 1 MP through 6+ MP. The coders manually extracted and categorized both

product features and customer uses. Categories were assigned based upon product features and

levels (subclasses) as defined by the expert reviews used to externally validate ontology

accuracy.

The manual data set not only constitutes a reference for assessing the accuracy of

ontology-based extraction but also provides us with a means for incrementally evaluating our

usage-based recommendations [18]. Because use-extraction involves an approach independent

of ontology-based feature extraction, we can combine manually extracted usage data with

automatically extracted feature data to set a lower-bound on the salience of our resulting

recommendations.

Related work

There is a large body of work on the topic of ontology induction that appears in different

disciplines under different names. Linguists learn specific taxonomic relationships such as

hyponyms and meronyms. Others use the term 'concept hierarchy.' 'Clustering categorical data'

defines a third body of similar research. Although the terminology may differ, the underlying

objective is the same: to group terms into concepts and to identify different types of

relationships between concepts. We can loosely separate the literature based upon whether the

source data being categorized is relational or not. The different approaches will also vary in the

degree to which they are supervised or unsupervised.


Non-relational data

Most of the approaches to ontology induction that begin with non-relational data identify two

distinct processes: extract data and then learn relationships. Supervision may be introduced in

either of the two processes.

One approach is to begin with a single seed ontology and then to grow that ontology by

using supervised learning techniques to extract data from relevant Web pages. Missikoff and

Navigli [24] begin with a generic reference ontology and then extract from domain-specific web

pages to tailor the resulting ontology. By contrast, Modica, Gal and Jamil [25] seed their

ontology with a single, representative Web source. They use the source HTML to define concept

names and the underlying Document Object Model (DOM) to define the initial relationship

between concepts. Incremental refinement is managed using techniques borrowed from the

database schema matching research literature. In particular, they use unsupervised techniques

such as term similarity, concept (instance) matching, and external sources such as thesauruses.

Rather than explicitly separating extraction from concept learning, it is possible to

combine the two steps to define extraction patterns that are explicitly indicative of a particular

conceptual relation. Hearst did early work applying this approach to learning hyponym

relationships [16]. Maedche and Staab [20, 22] use lexical analysis and syntactic patterns to

extract sets of candidate hyponym concept pairs. They then apply frequent item-set analysis [3].

The confidence and support measures from association rule mining are used to estimate the

validity of a particular concept pair. Instead of association rule mining, KnowItAll [9, 28] draws

upon external sources such as WordNet and labeled training data to assess the validity of

relationships. Specifically, the frequency of a term pair is used to calculate the probability that

the pair is an instance of a particular sub-class relationship.


While Maedche and Staab use generic extraction rules that are indicative of class

relationships in many different domains (e.g. X such as Y), Finkelstein-Landau and Morin [10]

explicitly draw on supervised techniques to learn domain-specific patterns of hyponyms. They

identify a training set of specific sentences containing relationship instances from which to build

their extraction patterns. Instead of an automated concept assessment technique, Finkelstein-

Landau and Morin rely upon iteration with experts to evaluate the resulting concept pairs.

Finkelstein-Landau and Morin contrast their supervised approach with an unsupervised

technique that returns to the earlier separation of extraction from relationship learning.

Extraction is performed using shallow natural language processing (NLP) techniques such as

tokenizers, Parts-of-Speech (POS) labeling, and chunkers. Relationships are defined based upon

statistical measures of co-occurrence both within a single document and between different

documents.

Like Finkelstein-Landau and Morin's unsupervised approach, we separate "information

extraction" from relationship learning. While we do rely upon lexical analysis to clean the data,

by focusing exclusively on customer review data and using a very narrow representation, we

greatly simplify the extraction step. In this way, we do not need a seed ontology like Missikoff

and Navigli and we do not explicitly need to leverage the source HTML or underlying DOM as

Modica, Gal, and Jamil do. At the same time, we believe that the heterogeneity of products

reviewed in on-line forums presents a reasonable test for the robustness of our approach.

More significantly, many of the lexico-syntactic rules used for extracting hyponym

concept pairs are less applicable in the domain of customer reviews where grammatical

conventions might break down and text, as in the case of lists of Pros or Cons, is simply a list of

phrases with no associated syntactic or semantic context. A customer might simply write that


"NiMH rechargeables are great" without reference to the concept's hypernym. As a

consequence, techniques to learn syntactic patterns that are representative of hyponym

relationships, like those of Hearst or of Maedche and Staab, might prove less applicable.

By contrast, phrases in customer reviews often encode 'part-of' meronym relationships.

We leverage co-occurrences within a single review as a heuristic for discarding certain tokens

and discuss the use of co-occurrences between reviews as a weighting factor to relax some of our

stronger assumptions about cliques.

Relational data

Processing relational data greatly facilitates the use of unsupervised techniques by leveraging

knowledge of the structure to simplify the task.

Han and Fu [14] and Lu [19] develop routines for structuring both numerical and nominal

data. For numerical data, they use binning strategies to induce categories within a single attribute domain; for example, categorizing different levels of digital camera resolution ranging from less than 1 megapixel to more than 8 megapixels. One binning criterion might be to evenly distribute the domain into a pre-defined number of categories. For

nominal data, they do not categorize the values within a particular domain. Instead, they attempt

to learn the hyponym relationship between attribute domains. Based upon the cardinality and the

frequency histogram of each attribute domain, they define a partial order on the attribute

domains that defines the hierarchical relationship between domains.

Suryanto and Compton [32] begin with a database in the form of a rule-based classifier.

Their objective is to define a taxonomy on the classes that are identified by the classifier. They

distinguish between subsumption (the hyponym relationship), mutual exclusivity (disjoint


subclasses), and similarity. The intuition is that concepts with similar classification rules are

similar and the derivation to subclasses follows from transitivity of the rules.

Ganti, Gehrke, and Ramakrishnan [11] as well as Gibson, Kleinberg, and Raghavan [12]

assume a relational table and attempt to find subcategories within a particular attribute domain

Di. This differs from the work by Han and Fu and Lu, which define orderings between attribute

domains. Ganti et al. define the 'inter-attribute' summary as a weighted graph where all values of

all domains in the relation are vertices in the graph. The weight of an edge between ai in Di and aj in Dj, where i ≠ j, is the number of tuples in which ai and aj appear. In addition, they assume

a priori knowledge of the relation and use that knowledge to calculate the 'intra-attribute'

summary between two attribute values in the same domain. Finally, their relations have no nulls

and the mapping of every value to its corresponding attribute domain is known. Gibson et al. use

the same relational assumptions but instantiate multiple instances of a vertex-weighted graph

which they call basins. They iterate over the set of basins until achieving a fixed-point where

vertex weights are separated into distinct positive and negative subsets defining the domain

subclasses.
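A minimal sketch of the 'inter-attribute' summary described above follows. Representing tuples as attribute-to-value dicts with None for nulls is an assumption for illustration; Ganti et al.'s relations contain no nulls, whereas review phrases do.

```python
from collections import Counter
from itertools import combinations

def inter_attribute_summary(tuples):
    """Build a weighted graph in the spirit of Ganti et al.'s
    'inter-attribute' summary: vertices are (attribute, value) pairs,
    and the weight of an edge between values from different attribute
    domains is the number of tuples in which both appear."""
    weights = Counter()
    for t in tuples:
        items = [(a, v) for a, v in t.items() if v is not None]
        for (a1, v1), (a2, v2) in combinations(items, 2):
            if a1 != a2:  # only edges between different attribute domains
                edge = tuple(sorted([(a1, v1), (a2, v2)]))
                weights[edge] += 1
    return weights
```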

In contrast to the work that assumes a priori knowledge of the relations, our approach

bears similarities to schema induction. Using the metaphor of learning relational schemas, each

concept is a relation; the corresponding phrases, extracted from the text of customer reviews or

product descriptions, are tuples of the relation. Finding the largest maximal clique defines the

relational schema. However, our tuples have many nulls. Thus, though we begin with a graph

representation that is similar to Ganti's, our graph is incomplete. Furthermore, we exploit the

graph in a different way. First, we use the principle of connectedness to cluster the tuples of a


relation into hyponyms rather than categorizing the values of a single attribute domain. Second,

we cannot leverage intra-attribute knowledge because the schema is unknown.

To the degree that we can induce a relational structure, Gibson et al. and Ganti et al. offer

a complementary strategy to our connectedness approach for learning hyponym relations within

a domain. Where Han and Fu and Lu define a partial order over the attribute domains in the

schema, we treat the attribute domains as meronyms. In general, neither case is always true.

Discussion and future work

Earlier, we outlined a number of quantitative evaluation measures that are the subject of ongoing

work. In this section, we reflect on the preliminary results as represented by the features and

levels induced on a data set of more than 8000 digital camera product reviews from

Epinions.com covering more than 500 digital camera products, each with a product description

page also from Epinions.com. Even without more quantitative evaluation metrics, a number of

interesting conceptual and pragmatic limitations are revealed by the exploratory work on digital

camera reviews and descriptions. We conclude by presenting some natural extensions to the

current work beyond use-centric mining.

Conceptual issues

Conceptually, we can separate our observations into comments on hyponyms and comments on

meronym/holonyms. For hyponyms, we rely upon a depth-constrained BFS of the token graph.

As a consequence, certain phrases can appear in the subgraphs of more than one hyponym

concept. This overlap exists because we resort to a bound on the BFS rather than relying upon

connected components to distinguish hyponyms. As seen in Table 2, certain phrases, such as

'indoor focuss troubl' are most likely not relevant to the concept redeye reduction. Two solutions


are to either refine the strategy for parsing phrases and pruning the tree (see below) or to

integrate edge weights into the construction of hyponym subgraphs.
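A depth-bounded BFS of the kind described above can be sketched as follows. The adjacency-list graph encoding is an illustrative assumption; as noted, overlapping neighborhoods can place the same token under more than one hyponym concept.

```python
from collections import deque

def bounded_bfs(graph, seed, max_depth):
    """Collect the tokens reachable from a seed concept within
    `max_depth` hops of the token graph. `graph` maps each token to
    its neighbors (adjacency lists)."""
    seen = {seed: 0}  # token -> depth at which it was first reached
    queue = deque([seed])
    while queue:
        v = queue.popleft()
        if seen[v] == max_depth:
            continue  # do not expand beyond the depth bound
        for w in graph.get(v, ()):
            if w not in seen:
                seen[w] = seen[v] + 1
                queue.append(w)
    return set(seen)
```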

Other issues emerge regarding our approach to inducing meronyms and holonyms. First,

we make the very strong assumption that a heuristic search for a maximal clique over the graph

of phrases that instantiate a concept will capture the complete set of meronyms and holonyms for

that concept. Although the graph of a concept is comparatively small, there is no guarantee that

our heuristic process does not miss a larger maximal clique that more completely represents the

appropriate concept attributes.
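A typical greedy heuristic for growing a maximal clique, of the kind whose limitation is noted above, can be sketched as follows; the highest-degree-first visiting order is an illustrative choice, not the paper's procedure.

```python
def greedy_maximal_clique(graph, order=None):
    """Grow a clique greedily: visit vertices (highest degree first by
    default) and add each one adjacent to every vertex already in the
    clique. The result is maximal -- no further vertex can be added --
    but may miss a larger maximum clique. `graph` maps each vertex to
    its set of neighbors."""
    if order is None:
        order = sorted(graph, key=lambda v: len(graph[v]), reverse=True)
    clique = []
    for v in order:
        if all(v in graph[u] for u in clique):
            clique.append(v)
    return clique
```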

In actuality, our search for a maximal clique makes the implicit assumption that the set of

all meronyms and holonyms is captured by a maximum clique. In practice, this assumption may

not prove reasonable. As demonstrated in Table 4 with the phrase '9v,' there is no reason to expect that a particular feature attribute, in even a single instance, will necessarily appear across all pairwise attributes. Instead, we suggest using edge weights, in concert with co-occurrence, to search for dense subgraphs that may prove more suitable.

The limitations of our assumption about maximal cliques are exacerbated by the

incompleteness of phrases in reviews and product descriptions. Unlike the relational tuples in

related work by Ganti et al. and by Gibson et al. that do not contain nulls, phrases are quite

sparse. For a sufficiently complex concept, it is unusual for a single phrase to comprise tokens

that instantiate every meronym or holonym of the complex concept.

Discovering meronyms and holonyms is not the only difficulty; for sparse phrases, assigning tokens to the appropriate meronym or holonym is a separate challenge. First, the maximal clique

might miss the relevant concept. As seen in Table 4, the maximal clique does not include the

voltage concept. Moreover, there is sometimes little contextual information to guide token


assignment. For example, absent additional information, the token 'kcrv3' could refer to the

concept of chemistry or size or even rechargeable. Finally, the data set may simply include

erroneous data. Our list of phrases for battery type includes the phrase 'neck strap.'

We are currently experimenting with two approaches to the challenges related to inducing

meronym/holonyms. First, we are attempting to identify additional sources of relation

information to assign the tokens in a phrase to the appropriate meronym or holonym. Inter-

document edge frequency is one source of additional information. A second source of

information involves foreign key relationships between one concept and other concepts. For

example, a review has an associated product description. That description contains additional

data and may in turn link to product descriptions (and reviews) for similar items (e.g. made by

the same manufacturer). A second approach to inducing meronym and holonym relationships

involves traditional clustering. If we can identify appropriate similarity measures, it is possible

to cluster tokens to maximize within-attribute-domain similarity while minimizing between-

attribute-domain similarity.
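The clustering idea above can be sketched with co-occurrence profiles as the similarity basis. This is one possible instantiation under stated assumptions: cosine over co-occurrence counts as the similarity measure, and a hypothetical merge threshold; the paper leaves the measure open.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_tokens(reviews, threshold=0.5):
    """Represent each token by its co-occurrence profile over reviews
    and merge (single-link, via union-find) tokens whose cosine
    similarity meets a hypothetical threshold, grouping them toward
    candidate attribute domains."""
    profiles = {}
    for tokens in reviews:
        for a in tokens:
            prof = profiles.setdefault(a, Counter())
            for b in tokens:
                if b != a:
                    prof[b] += 1
    parent = {t: t for t in profiles}
    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t
    for a, b in combinations(profiles, 2):
        if cosine(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for t in profiles:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())
```

Tokens that appear in similar contexts (e.g. two battery chemistries that each co-occur with 'battery') end up in one cluster, while the shared context token itself does not.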

Pragmatic issues

In addition to conceptual issues, our early efforts reveal a number of pragmatic considerations.

First, our technique is strongly dependent upon good parsing. The robustness of our approach in the face of parsing errors is an open question. For example, we parse the line '2 aa lithium ion, nimh,

alkaline batteries' into three phrases: '2 aa lithium ion,' 'nimh', and 'alkaline batteries.' The

grammatically correct decomposition would distribute the prefix '2 aa' and the suffix 'batteries'

across all three chemistry types (lithium ion, nimh, and alkaline). Moreover, the parse reveals

the limitations of our graph approach to hyphenated terms. 'lithium-ion' is a single vertex in the

graph whereas 'lithium' and 'ion' are distinct vertices. As suggested earlier, additional text


processing to address spelling mistakes such as 'focuss' in Figure 4 and again in Table 2 will also

help. Alternatively, one might also suggest that our technique is complementary to other

approaches that develop more sophisticated parsing techniques.
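The grammatically correct decomposition described above, distributing a shared prefix and suffix across coordinated conjuncts, can be sketched for this one phrase shape. The fixed prefix/suffix lengths are an illustrative assumption, not a general grammar.

```python
def distribute_coordination(phrase, prefix_len=2):
    """Split a comma-coordinated phrase and distribute a shared prefix
    and single-token suffix across the conjuncts, e.g.
    '2 aa lithium ion, nimh, alkaline batteries' -> one phrase per
    chemistry. Assumes the prefix spans `prefix_len` leading tokens."""
    parts = [p.strip().split() for p in phrase.split(',')]
    prefix = parts[0][:prefix_len]          # e.g. ['2', 'aa']
    suffix = [parts[-1][-1]]                # e.g. ['batteries']
    conjuncts = [parts[0][prefix_len:]] + parts[1:-1] + [parts[-1][:-1]]
    return [' '.join(prefix + c + suffix) for c in conjuncts]
```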

Second, the technique may prove sensitive to data sample size. We rely upon a large data

set to yield the phrases from which we identify a maximal clique. The large data set is also a

boon because we discard phrases quite liberally to minimize the effects of naïve parsing (see

above). To measure the sensitivity to sample size, we propose to perform cross-validation on

smaller samples of reviews and product descriptions. We can plot the trade-off between sample

size and ontology performance measures to identify diminishing returns and attempt to estimate

a minimal number of required reviews. Care needs to be taken in ensuring sufficient

heterogeneity in the sample selection with respect to different product features and the

corresponding feature attributes.
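The proposed sensitivity analysis can be sketched as repeated evaluation over random subsamples. The `evaluate` callback, trial count, and seed are placeholders; any ontology performance measure from the evaluation section could be plugged in.

```python
import random

def sample_size_curve(reviews, evaluate, sizes, trials=5, seed=0):
    """Estimate how an ontology-quality measure varies with sample size
    by repeatedly inducing from random subsamples and averaging.
    `evaluate` is any function from a list of reviews to a score."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    curve = {}
    for n in sizes:
        scores = [evaluate(rng.sample(reviews, n)) for _ in range(trials)]
        curve[n] = sum(scores) / trials
    return curve
```

Plotting `curve` against `sizes` gives the trade-off described above, from which a point of diminishing returns can be read off. Stratifying the sampling over product features would address the heterogeneity concern.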

Finally, the generalizability of our approach may appear limited at first glance. However,

our approach, with its attendant limitations, is quite general with respect to domain knowledge.

The process is designed to apply across heterogeneous product domains, and we are currently

testing the approach on other product categories. Moreover, while the strong assumptions that

we make about co-occurrence may appear limiting, there are other domains where similar

phrase-like text strings prevail as opposed to prose. Progress notes in medical records are one

such example. Finally, the technique is most susceptible to parsing (see above). To the degree

that complementary approaches can provide phrases as input, a graph-based approach may still

prove tractable.

Future work


In addition to a more rigorous, quantitative analysis of our current approach as well as steps to

address some of the issues that emerged from early trials, there are a number of ways in which

we might enrich the relationships that we are learning.

For example, we currently learn hyponym relationships between classes of features.

However, with respect to a single instance, some subclasses are mutually exclusive while others

are not. For example, the feature camera type might have the following subclasses: slr,

standard, and compact. Digital cameras also have the feature lens, of which fixed focus and zoom are two subclasses. A single instance of a digital camera would be categorized in only a single type subclass but could support multiple lens subclasses. From a marketing and

recommendation perspective, it would be particularly useful to extend our induced ontology to

distinguish between mutually exclusive subclasses and those that are not.

The two features of camera type and lens also exhibit a second dimension of the

hyponym relationship. Some subclasses are actually ordinal in nature. slr, standard, and

compact are decreasing in size; fixed focus and zoom are not similarly ordered. Learning

ordinal subcategories is particularly useful because of the closely related task of aligning

different orderings. From a recommendation standpoint, aligning orderings is critical because

different customers may address a concept using parallel ordinal categories. For example, one

customer might seek a small, medium, or large digital camera whereas another might seek a

compact, standard, or professional. The problem is most commonly manifested in digital

cameras where resolution is mapped to digital photo enlargement. Some customers seek a 3.0

MP camera whereas others seek a resolution appropriate for printing high quality 8 x 10

enlargements.


From aligning parallel orderings we might extend to aligning or merging entire

ontologies. There are many different sources of on-line customer reviews, and one of our end

objectives is to support use-centric mining across reviews gathered from heterogeneous sources.

Doing so, however, may also require learning ontologies unique to distinct sites and then aligning those ontologies [25]. Ontology alignment can help ensure that the review vectors derived from distinct sources remain comparable.

While there are many buying guides that provide recommendations for specific products

or services, most guides tend to rely upon domain-specific experts. Unfortunately, reliance upon

experts is not scalable. Automated support for product-specific recommendations is necessitated

by the heterogeneity of product categories and the increasing complexity within a category. A

critical step in providing automated support lies in simply understanding the language used to

describe a particular product category. In this paper, we discussed a generic, graph-based

approach to learning the ontology for specific product categories for the purpose of creating use-

centric product recommendations.

Acknowledgements: The author would like to thank Alan Abrahams, Sean McGrew, Ellen Ngai, Sojeong Song, and the OPIM-IDT group for their invaluable help. This work was supported in part by the University of Pennsylvania Research Foundation.

References

[1] "Epinions.com Customer Reviews," accessed 5 July 2004, http://www.epinions.com/Digital_Cameras.

[2] P. Black, "Dictionary of Algorithms and Data Structures," 3 Jan 2005, accessed February 2005, www.nist.gov/dads.

[3] C. Borgelt and R. Kruse, "Induction of Association Rules: Apriori Implementation," presented at 15th Conf. on Computational Statistics (Compstat), Berlin, Germany, 2002.

[4] D. Brelaz, "New Methods to Color the Vertices of a Graph," Communications of the ACM, vol. 22, pp. 251-256, 1979.

[5] K.-W. Cheung, J. T. Kwok, M. H. Law, and K.-C. Tsui, "Mining customer product ratings for personalized marketing," Decision Support Systems, vol. 35, pp. 231-243, 2003.

[6] Consumer Electronics HQ, "Buying Your First Digital Camera: The Basics," accessed 19 November 2004, http://www.digitalcamera-hq.com/hqguides/firsttime-buyer.html.

[7] Consumer Union, "Digital Cameras," Consumer Reports Buying Guide 2005, pp. 33-36, 237-240.

[8] D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, Y.-K. Ng, D. Quass, and R. D. Smith, "Conceptual-model-based data extraction," Data & Knowledge Engineering, vol. 33, pp. 227-251, 1999.

[9] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Web-scale information extraction in KnowItAll (preliminary results)," presented at WWW 2004, New York, NY, 2004.

[10] M. Finkelstein-Landau and E. Morin, "Extracting Semantic Relationships between Terms: Supervised vs. Unsupervised Methods," presented at International Workshop on Ontological Engineering on the Global Information Infrastructure, Dagstuhl Castle, Germany, 1999.

[11] V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS - Clustering categorical data using summaries," presented at ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 1999.

[12] D. Gibson, J. Kleinberg, and P. Raghavan, "Clustering categorical data: An approach based on dynamical systems," presented at 24th International Conference on Very Large Databases (VLDB), 1998.

[13] T. Gruber, "Toward principles for the design of ontologies used for knowledge sharing," Stanford KSL-93-04, presented at International Workshop on Formal Ontology, 1993.

[14] J. Han and Y. Fu, "Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases," presented at AAAI '94 Workshop on Knowledge Discovery in Databases (KDD-94), 1994.

[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann, 2001.

[16] M. Hearst, "Automatic acquisition of hyponyms from large text corpora," presented at Fourteenth International Conference on Computational Linguistics (COLING), Nantes, France, 1992.

[17] M. Hu and B. Liu, "Mining and summarizing customer reviews," presented at KDD '04, Seattle, WA, 2004.

[18] T. Lee, "Use-centric mining of customer reviews," presented at Workshop on Information Technology and Systems (WITS), Washington, D.C., 2004.

[19] Y. Lu, "Concept hierarchy in data mining: Specification, generation and implementation," M.Sc. thesis, School of Computing Science, Simon Fraser University, Vancouver, BC, p. 106.

[20] A. Maedche and S. Staab, "Semi-automatic engineering of ontologies from text," presented at Twelfth International Conference on Software Engineering and Knowledge Engineering (SEKE 2000), Chicago, 2000.

[21] A. Maedche and S. Staab, "Ontology learning for the Semantic Web," IEEE Intelligent Systems, pp. 72-79, 2001.

[22] A. Maedche, G. Neumann, and S. Staab, "Bootstrapping an ontology-based information extraction system," in Intelligent Exploration of the Web, P. Szczepaniak, J. Segovia, J. Kacprzyk, and L. Zadeh, Eds. Heidelberg: Springer/Physica-Verlag, 2002.

[23] S. Manes, "Digital camera guide," 11 June 2004, accessed 19 November 2004, http://msnbc.msn.com/id/4758940.

[24] M. Missikoff and R. Navigli, "Integrated approach to Web ontology learning and engineering," IEEE Computer, pp. 54-57, 2002.

[25] G. Modica, A. Gal, and H. Jamil, "The use of machine-generated ontologies in dynamic information seeking," presented at CoopIS 2001, Trento, Italy, 2001.

[26] T. Nasukawa and J. Yi, "Sentiment analysis: Capturing favorability using natural language processing," presented at K-CAP '03, Sanibel Island, FL, 2003.

[27] B. Omelayenko, "Learning of ontologies for the Web: The analysis of existent approaches," presented at ICDT 2001 International Workshop on Web Dynamics, London, UK, 2001.

[28] A.-M. Popescu, A. Yates, and O. Etzioni, "Class extraction from the World Wide Web," presented at AAAI 2004 Workshop on Adaptive Text Extraction and Mining (ATEM), 2004.

[29] G. Salton and M. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.

[30] E. Schwartz, "Edmunds.com deploys text mining tool for user forums," InfoWorld, 6 August 2004.

[31] S. Skiena, The Algorithm Design Manual. New York: TELOS/Springer-Verlag, 1998.

[32] H. Suryanto and P. Compton, "Learning classification taxonomies from a classification knowledge based system," presented at ECAI 2000 Workshop on Ontology Learning, Berlin, 2000.
