
Investigating ConceptNet

Dustin Smith {[email protected]}; Advisor: Stan Thomas, Ph.D.
December 2004

Wake Forest Department of Computer Science

Abstract

There are many interdependent levels of organization behind natural text, and those which address the meaning are the most difficult for computers to work with. However, programs which deal with natural language will inevitably make mistakes if they ignore semantics. Human communication makes use of thousands of common sense facts, beliefs, and associations, and the ConceptNet project attempts to make this information available to software. In addition to aiding Natural Language Processing (NLP), the natural language knowledge representation used in ConceptNet makes it ideal for various common sense reasoning applications. This paper explores the organization of ConceptNet and the CN2 project. Finally, it discusses future directions for ConceptNet and possible improvements.

1 Introduction

Programs which deal with natural language (e.g., spelling and grammar validators, translators, and information indexing agents) will inevitably make mistakes, no matter how sophisticated, unless they address the meaning behind the text. ConceptNet is a knowledge base that contains everyday knowledge relevant to the semantic and pragmatic levels of human language. ConceptNet has both immediate applications (e.g., Interface Agents) and the potential to aid Natural Language Processing (NLP). Ideally, conceptual databases will be used to enrich programs which involve natural language so that they can deal effectively with semantics.

ConceptNet is a large common sense semantic network freely available to the Artificial Intelligence (AI) community. To date it contains 1.6 million common sense statements that were derived from the contributions of more than 14,000 authors [7]. These concepts encompass thousands of pieces of knowledge that most adults already know.

The ConceptNet knowledge base stemmed from the related OMCS project, which aimed to give computers access to knowledge like: “Touching fire will burn you”, “Hitting somebody will make them unhappy” and “A clock tells time.” The relationships ConceptNet uses can be combined and processed by more sophisticated reasoning models. The data in ConceptNet was automatically derived from the OMCS-2 corpus, and the current version is 2.1.

This paper is the result of an exploration into ConceptNet (Sections 1 and 3), its applications (Section 2) and weaknesses, and some of the author's attempts to improve the knowledge base (Sections 4 and 5).

1.1 History of ConceptNet

In 2000, researchers at MIT's Media Lab began the Open Mind Common Sense project (OMCS). OMCS is accessed through a website (http://www.openmind.org/commonsense/) where the general Internet public can contribute common sense knowledge by answering fill-in-the-blank questionnaires [12]. For example, a web form would prompt the user with a statement such as "A hammer is used to ____", instructing the person to fill in an appropriate response (e.g., hit, drive nails, pound). The responses are stored in a database, which is periodically mined to generate predicate lists for the various ConceptNet, LifeNet, and StoryNet projects.

The predicates which comprise ConceptNet's knowledge base originated as patterns mapped from OMCS that have been converted into semi-structured natural language binary relations. Next, during the normalization process, all of the predicates undergo various filtering and lexical distillation. Verbs and nouns are reduced to their canonical base forms; determiners (e.g., a, the) and modals (e.g., may, could, will) are removed. Additionally, Part of Speech (POS) taggers are used to ensure predicates all correspond to a specific set of valid word orders.
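As a rough illustration of this normalization step, here is a minimal Python sketch. The word lists and lemma table are toy stand-ins for the POS taggers and morphological tools the real pipeline uses; this is not the project's code.

import re

# Toy stand-ins for the real pipeline's resources (illustrative only).
DETERMINERS = {"a", "an", "the"}
MODALS = {"may", "might", "can", "could", "will", "would", "shall", "should"}
LEMMAS = {"hammers": "hammer", "drives": "drive", "driving": "drive", "nails": "nail"}

def normalize(phrase: str) -> str:
    """Lower-case, drop determiners and modals, and reduce words to base forms."""
    words = re.findall(r"[a-z]+", phrase.lower())
    kept = [w for w in words if w not in DETERMINERS and w not in MODALS]
    return " ".join(LEMMAS.get(w, w) for w in kept)

print(normalize("The hammers could drive nails"))  # -> "hammer drive nail"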

Afterwards, during the relaxation phase, duplicate entries are merged. By exploiting the underlying taxonomic structure, the IsA hierarchy is used to 'lift' knowledge up from child nodes to their parents. If concepts X1, X2, X3 all have relationships of the form (IsA X* Y), which implies they are in the same taxonomic subset, and (PropertyOf X* Z), indicating they all share the same property Z, then a new predicate of the form (PropertyOf Y Z) is inferred and added to the knowledge base.
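The following Python fragment is a minimal sketch of that 'lifting' rule over toy triples; it is not ConceptNet's code, and it assumes the (strict) criterion that every known child of a parent must share a property before the property is lifted.

from collections import defaultdict

# Toy predicate store: (Relation, Concept, Concept) triples.
predicates = {
    ("IsA", "apple", "fruit"),
    ("IsA", "pear", "fruit"),
    ("IsA", "plum", "fruit"),
    ("PropertyOf", "apple", "sweet"),
    ("PropertyOf", "pear", "sweet"),
    ("PropertyOf", "plum", "sweet"),
}

def lift_properties(preds):
    """Lift a property to a parent when all of its IsA children share it."""
    children = defaultdict(set)   # parent -> set of IsA children
    props = defaultdict(set)      # concept -> set of its properties
    for rel, a, b in preds:
        if rel == "IsA":
            children[b].add(a)
        elif rel == "PropertyOf":
            props[a].add(b)
    inferred = set()
    for parent, kids in children.items():
        shared = set.intersection(*(props[k] for k in kids))
        inferred.update(("PropertyOf", parent, p) for p in shared)
    return inferred - preds

print(lift_properties(predicates))  # -> {('PropertyOf', 'fruit', 'sweet')}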

Each predicate also has two numeric metadata attributes, f and i, which default to zero. Whenever the same relationship is derived from the OMCS corpus, the predicate's f value is incremented. If a duplicate predicate is inferred during the relaxation phase, then i is incremented.

In addition to the concept predicates, ConceptNet includes a platform-independent Python interface and a developers' API with natural language parsing software.

2 ConceptNet’s Applications

Ambiguity, redundancy, and noisy data undermine the feasibility of using ConceptNet in programs which demand a high degree of accuracy. Despite this, certain applications, such as non-intrusive Interface Agents, can still benefit from the common sense knowledge that ConceptNet possesses. The API that accompanies ConceptNet v2.1 permits contextual realm-filtering, topic generation, analogy-making, projection, affect sensing, and concept identification/approximation [7]. As will be discussed in section 3.2, the defeasible knowledge within ConceptNet may be ideal for common sense reasoning in natural language processing software precisely as a consequence of its ambiguity and redundancy.

2.1 Interface Agents

Recently a number of people, including those involved with the development of ConceptNet, have used the knowledge base to build interactive applications that deal intelligently with (often user-supplied) English text. These applications are dissimilar to those of earlier common sense efforts, like Cyc. Instead of the common sense component being the center of the program (e.g., "question answering applications" where the software seeks to correctly respond to human-level text), ConceptNet has been recommended for fail-safe agents which do not purport to deliver intelligent conclusions without exception. Interface Agents are non-intrusive fail-safe agents which run within an interface and learn from a user's activity to provide assistance or help improve efficiency.

They run in the background, occasionally interjecting the results of their common sense reasoning for the user to consider. If their results are wrong or do not apply to the given case, the user can simply ignore their advice. They can accumulate various common sense related inferences, making helpful suggestions to the user, or ask the user to 'fill in' missing information that may help produce better results [5].

Although the development of Interface Agents is less ambitious than all-encompassing projects like Cyc, there are still many uses for common sense reasoning and applications which can be realized immediately. Additionally, the sort of common sense knowledge which ConceptNet aims to provide is crucial for developing more intelligent NLP software.

2.2 Natural Language Processing

Human communication makes use of thousands of common sense facts, beliefs, and causality chains in speech and written text. Because this information is obvious to most people, it is rarely stated explicitly. However, the meaning behind the communicated message is heavily dependent on underlying assumptions, which makes it very difficult for a program to make more than a "shallow" assessment of the text. ConceptNet contains thousands of common sense assertions, and thus it should be considered for NLP applications.

Here is an example statement which illustrates the need for common sense knowledge in NLP programs:

The teacher said, “That was an A paper.”

The teacher's remark would likely be rejected if parsed by grammar-correcting software. The software is likely to conclude that the string "an A paper" is invalid at the syntactic level (seeing two articles). A typical human reader would have no trouble with it, since they already have the knowledge: a paper is a type of assignment, teachers grade their students' assignments, and grades can come in the form of letters. (The predicates in ConceptNet are acknowledged to be defeasible, in that they are not always true; often they are only valid within certain contexts.)

Semantic resolution is challenging because it requires both a large amount of background knowledge and human-like ways to reason with this knowledge (or, at least, enough good tricks for a program to accomplish its objective well).

There are many different levels of organization behind both speech and text, but the levels which encompass the meaning behind words are the most difficult to deal with from a computational standpoint. Three such levels are listed below [4]:



• Semantics - the meaning behind words and groups of words.

• Pragmatics - the use of language to accomplish tasks (e.g., to give a command, share an idea, or draw attention).

• Discourse - making sense of linguistic units larger than a single utterance.

The inability to deal with these three levels of language is what holds NLP back from being able to "understand what words mean." Even seemingly computationally savvy applications such as grammar correction software (since syntax is a formal set of rules) will not always successfully parse a language where colloquialisms and homonyms abound.

3 Organization of ConceptNet

ConceptNet expresses predicates in the form:

(Relationship Concept Concept)

The concepts themselves are often called nodes, of which there are 300,000. All relationships are binary (having two arguments), and there are twenty relationship types.

3.1 Relationships

The predicates in ConceptNet use 20 relationship types which fall into 8 categories. Each category contains one or more relationship types. Listed below are the categories and their corresponding relationship types:

• K-Lines: ConceptuallyRelatedTo, ThematicKLine, SuperThematicKLine

• Things: IsA, PartOf, PropertyOf, DefinedAs, MadeOf

• Spatial: LocationOf

• Events: SubeventOf, PrerequisiteEventOf, FirstSubeventOf, LastSubeventOf

• Causal: EffectOf, DesirousEffectOf

• Affective: MotivationOf, DesireOf

• Functional: CapableOfReceivingAction, UsedFor

• Agents: CapableOf

These various types of relationships cover a vast spectrum of problem categories which are commonly available to humans. The information can be used to answer questions about where and what an object is, what it is used for, and its possible motivations and goals for a given action. Sequential information is also available, so that for a given event, the events that commonly precede or follow it are related.
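As a hedged sketch of that kind of lookup (toy data and my own helper, not the ConceptNet API), answering "where is it", "what is it", and "what is it used for" questions reduces to filtering predicates by relation type:

# Toy predicates; real data would be loaded from the ConceptNet predicate files.
predicates = [
    ("IsA", "hammer", "tool"),
    ("LocationOf", "hammer", "toolbox"),
    ("UsedFor", "hammer", "drive nail"),
    ("SubeventOf", "hit nail", "build house"),
]

def about(concept, relation, preds=predicates):
    """Return the right-hand concepts related to `concept` by `relation`."""
    return [right for rel, left, right in preds if rel == relation and left == concept]

print(about("hammer", "LocationOf"))  # where is it?     -> ['toolbox']
print(about("hammer", "UsedFor"))     # what is it for?  -> ['drive nail']
print(about("hammer", "IsA"))         # what is it?      -> ['tool']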

One problem with this organization is that all nodes are treated equally. This can lead to poorly directed reasoning attempts, like: "What was the chair's motivation for breaking?" The lack of consistency in the information makes complex reasoning chains unlikely to succeed.

Also, some nodes have more conceptual relationships than others, and are thus more useful to reasoning methods. Helpful information like this should be available at a metadata level, so that the more "dense" nodes (which have more relations) are tried before the sparsely connected nodes, which are more likely to lead to dead ends. Unfortunately, the available metadata in ConceptNet is scarcely useful because its values are not well distributed among concepts.

In my CN2 project, I worked towards this objectiveby implementing a connectivity index. The methodused and its results are explained in 4.2.1.

3.2 Representing Knowledge with Natural Language

ConceptNet uses semi-structured natural language to represent information. Its authors chose this format because of the straightforwardness of natural language, which they also thought to be ideal for common sense reasoning. First, an explanation of why "how knowledge is represented" is such an important issue:

Knowledge representation (KR) is a key issue in AI. Essentially, all KR systems are methods for representing surrogates for entities in the real world (or a virtual domain). These models are never entirely accurate; they always contain some discrepancies or omissions, because perfect knowledge representation is impossible [2]. (A surrogate representation can never have perfect fidelity with the corresponding real-world object: usually only specific aspects of the object are represented, so some aspects are omitted, and even a full-blown replica of the external object would still differ from the original, at least in location.) Most importantly, the way the knowledge is represented entails the ways the knowledge can be manipulated. In other words, KR is tightly bound to the reasoning methods that are deployed on it.

Its authors wanted ConceptNet to cover common sense knowledge, similar to the Cyc project, while incorporating the ease of use of WordNet (http://www.cogsci.princeton.edu/~wn/), which uses a KR based in natural language. Using semi-structured natural language in their representations marked a divergence from the mainstream opinions concerning common sense reasoning in AI.



Liu and Singh [6] make a good argument against using logic alone for common sense representations: in common sense reasoning, which ConceptNet is built for, recursive definitions are commonplace, multiple answers can exist simultaneously, and contradictions are permitted. However, representations which use logic require strict consistency among the predicates in their knowledge bases; the existence of two contradictory predicates would jeopardize the whole system. John McCarthy, a proponent of using logic in common sense reasoning, used the following to illustrate a representation which is problematic for logical consistency. Although most people will agree that the statements Birds can fly and Penguins are birds are both true, there is a logical inconsistency with reality that cannot be avoided while maintaining these two beliefs [8]. Penguins and ostriches are anomalies among feathered vertebrates with wings; they are birds that cannot fly. (To be fair, McCarthy came up with a solution for these atypical cases: he defined a separate category of predicates with the prefix 'ab-', indicating abnormality. In turn, this approach has an added complexity that alternatives do not.)

Artificial Intelligence researcher Marvin Minsky argues that redundancy and inconsistency are properties of human-like intelligence. In other words, the approach which ConceptNet takes is more appropriate for its objective than the other common sense initiatives that use logical deduction exclusively. Minsky believes that the overlapping, interconnecting networks of knowledge are what allow human common sense reasoning to be effective in various situations [10]:

“If you ‘understand’ something in only one way then you scarcely understand it at all. For then, if anything should go wrong, you’ll have no other place to go. But if you represent something in multiple ways, then when one of them fails you can switch to another, until you find one that works for you.

It’s the same when you solve a new kind of problem: along with refining the method you used, you also should try to find other ways to do it. Then whenever you get into trouble, you’ll be able to switch to a different technique. If you only know a single technique, then you’ll get stuck when that method fails. But if you have multiple ways to proceed, then you can deal with more kinds of predicaments.”

The normalized natural language fragments make ConceptNet a good candidate for NLP applications: it is easier to map natural language to itself than to a closed domain of symbols. However, this connectionist approach also raises some implementation issues. Singh and Liu point out that the ambiguity in natural language will result in redundant relationships [6]. Additionally, there are many context-sensitive relationships that are simply false when taken in the wrong word sense. These two factors limit the immediate types of applications for ConceptNet; however, there are some ways to address both of these issues, and some ideas are brought up in section 5 of this paper.

3.3 Overview: The Substance

The architecture of ConceptNet is well designed; however, the data within is often problematic. When examining the actual data, the metadata figures show that very few nodes were actually inherited. Accounting for all predicates in the non-concise files, all nodes had values below 4 for both f and i, and very few nodes had a value greater than one (fig. 1).

Figure 1: Distribution of f and i values.

As a result, this metadata is not helpful for distinguishing between two "otherwise identical" concepts.

The K-Line category (K-Lines are contextual identifiers that relate a given concept to a theme; they are the most general of the concept types) contains almost three times as many assertions as the other relations combined, and so it is stored in a separate file in the ConceptNet kit. Figure 2 gives a breakdown of the full (non-concise) distribution of predicates over the various categories.



Category / Relationship          Assertions
K-Lines                           1,035,035
    ConceptuallyRelatedTo           816,737
    SuperThematicKLine              160,181
    ThematicKLine                    58,117
Functional                          103,556
    CapableOfReceivingAction         57,600
    UsedFor                          45,956
Agents                               89,313
    CapableOf                        89,313
Things                               46,828
    IsA                              16,720
    PartOf                           12,934
    PropertyOf                        9,135
    DefinedAs                         6,520
    MadeOf                            1,519
Events                               35,317
    SubEventOf                       22,764
    FirstSubEventOf                   4,453
    PrerequisiteEventOf               4,092
    LastSubEventOf                    4,008
Affective                            31,196
    MotivationOf                     24,483
    DesireOf                          6,713
Spatial                              28,805
    LocationOf                       28,805
Causal                               15,303
    EffectOf                          9,057
    DesirousEffectOf                  6,246

Figure 2: Breakdown of concepts over categories.

4 CN2 Explorer

As part of an undergraduate research project, I investigated methods to reorganize the data in ConceptNet so that reasoning methods could be more easily and consistently applied. Most of my work centered around the IsA nodes; the objective was to build a consistent taxonomic hierarchy that categorized all of the concepts. This idea is similar to the inference scheme deployed in the relaxation phase of ConceptNet's generation.

The idea is that a given node should be able to inherit relationships from its parent nodes (at any level). For example, consider these concepts:

(IsA Governor Politician)

(IsA Politician Person)

(CapableOf Person Speaking)

A reasoning system should be able to infer that a governor, being a person, is capable of all the things that people are (in this case: speaking). This structure would allow the reasoning methods to treat nodes differently (e.g., recognize the distinction between objects and agents) and also permit a more effective use of the existing relationships, without necessitating repeating information for each individual concept. By creating a well-formed ontological tree using IsA relationships, all of the other relations in ConceptNet would benefit.
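A minimal sketch of this inheritance idea is shown below; the data and traversal are my own illustration, not CN2's implementation.

# Toy predicates from the example above.
predicates = [
    ("IsA", "governor", "politician"),
    ("IsA", "politician", "person"),
    ("CapableOf", "person", "speaking"),
]

def capabilities(concept, preds=predicates):
    """Collect CapableOf relations for a concept and everything it IsA, transitively."""
    caps, frontier, seen = set(), [concept], set()
    while frontier:
        node = frontier.pop()
        if node in seen:          # guards against cyclic IsA links
            continue
        seen.add(node)
        for rel, left, right in preds:
            if rel == "CapableOf" and left == node:
                caps.add(right)
            elif rel == "IsA" and left == node:
                frontier.append(right)
    return caps

print(capabilities("governor"))  # -> {'speaking'}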

There were two major issues that needed to be resolved first:

Removing Cyclic Relationships There were several nodes that had cyclic relationships, like (IsA Something Object) and (IsA Object Something). The existence of nodes like this would cause non-termination problems for reasoning agents if they tried to traverse the taxonomic tree. The "object/something" type of problem could be eliminated with one SQL statement; however, cyclic relationships that had more than one degree of separation were difficult to detect (a small detection sketch follows these two items).

Dealing With Bad Data Unfortunately, the mass-collaboration approach of gathering data from the public affects the quality of the data. ConceptNet contains enough misspelled words, false concepts, and overly specific data that it is troublesome to organize on a large scale. These problems result in a lot of repeated information (i.e., duplicate information with different spellings) that makes any effort to bring consistency difficult.
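As referenced above, the sketch below detects IsA cycles of arbitrary length with a depth-first search over toy edges. It is an illustration only; the actual CN2 work was carried out against the SQL tables.

from collections import defaultdict

# Toy IsA edges, including a three-node cycle: object -> entity -> something -> object.
isa_edges = [
    ("object", "entity"),
    ("entity", "something"),
    ("something", "object"),
    ("hammer", "tool"),
]

def find_isa_cycles(edges):
    """Return the cycles reachable in the IsA graph, each as a list of nodes."""
    graph = defaultdict(list)
    for child, parent in edges:
        graph[child].append(parent)

    cycles, done = [], set()

    def dfs(node, path):
        if node in path:                      # a back-edge closes a cycle
            cycles.append(path[path.index(node):] + [node])
            return
        if node in done:
            return
        for parent in graph[node]:
            dfs(parent, path + [node])
        done.add(node)

    for node in list(graph):
        dfs(node, [])
    return cycles

print(find_isa_cycles(isa_edges))
# -> [['object', 'entity', 'something', 'object']]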

In the following sections I explore the steps taken to develop CN2, my findings, and some concluding remarks concerning future directions of this and similar projects.

4.1 From Python to SQL

ConceptNet ships as a client-side Python application, through which the potential of a full semantic network could be realized; however, the volume of the predicate files impaired performance when building graph structures. The objectives of CN2 necessitated modifying and querying large groups of concepts at once, which is why SQL was most appropriate.

In order to port the predicates to a file which could be loaded by a DBMS, conversion to a comma-delimited format was necessary. I wrote a program to convert predicates of the form:

(LocationOf "plane" "at airport" "f=8;i=0;")

(LocationOf "army" "in war" "f=3;i=0;")


into CSV format (the converter is available for download and execution on Windows: http://www.ogghelp.com/dsmith/conceptnet/predToCSV.zip). The data was then transferred into a Microsoft SQL Server 2000 database.
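My original converter is linked above; the sketch below reimplements the general idea in a few lines of Python with a regular expression, purely for illustration.

import csv
import io
import re

# Matches lines like: (LocationOf "plane" "at airport" "f=8;i=0;")
PRED = re.compile(r'\((\w+) "([^"]*)" "([^"]*)" "f=(\d+);i=(\d+);"\)')

def predicates_to_csv(lines, out):
    """Write the relation, both concepts, and the f/i counters as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["relation", "concept1", "concept2", "f", "i"])
    for line in lines:
        match = PRED.match(line.strip())
        if match:                 # skip anything that does not parse cleanly
            writer.writerow(match.groups())

sample = ['(LocationOf "plane" "at airport" "f=8;i=0;")',
          '(LocationOf "army" "in war" "f=3;i=0;")']
buffer = io.StringIO()
predicates_to_csv(sample, buffer)
print(buffer.getvalue())
# relation,concept1,concept2,f,i
# LocationOf,plane,at airport,8,0
# LocationOf,army,in war,3,0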

4.2 Adding Metadata

When the data is listed in a database instead of a graph, access to the information behind each concept is limited. For example, it is impossible to directly compute in a declarative language (like SQL) exactly how many hierarchical levels lie below or above a given IsA node. Fortunately, there were two alternatives:

1. The specific DBMS used in the project permitted a more powerful language called Transact-SQL (T-SQL), which includes iteration and conditional looping.

2. Building automated scripts (which continually self-loaded until they reached completion) in ColdFusion allowed automation of tasks and full access to a procedural language.

For IsA concepts, it was helpful to have information concerning a concept's distinctiveness at hand. I added a metadata field which was the total of how many other nodes shared that same parent node. The nodes with the highest value for this were person (96), place (95), instrument (86), animal (73), and tool (64). In other words, there were 96 different assertions of the form:

(IsA X Person)
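Counting those assertions is straightforward once the predicates are in tabular form; here is a toy Python sketch of the idea (the actual computation in CN2 was done against the SQL tables).

from collections import Counter

# Toy IsA predicates as (child, parent) pairs.
isa = [("governor", "person"), ("teacher", "person"), ("child", "person"),
       ("hammer", "tool"), ("saw", "tool")]

# For each parent concept, count the assertions of the form (IsA X parent).
parent_counts = Counter(parent for _, parent in isa)
print(parent_counts.most_common())  # -> [('person', 3), ('tool', 2)]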

Additionally, for each IsA concept, boolean metadata values Up and Down were added. Given a node, these specify whether or not there are any parent or child nodes, respectively.

4.2.1 “Connectivity” Index

Another metadata attribute, which was added to all nodes and not just IsA relationships, was a weighted index value. The index is a relative value which specifies the number of connections the first concept in a given assertion has, compared to a base, "average", value.

If each node had one relationship for each relationship type (the K-Line category of relationships was ignored altogether), it would have a total of 17. Using this as the average value, any node that had 17 relationships would have a connectivity index value of 100 (the base value). Each index value is calculated as (100 * Connections) / 17.
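A small sketch of that calculation, assuming the connection counts have already been computed with K-Line relations excluded:

BASE_RELATIONS = 17  # one relation per non-K-Line relationship type

def connectivity_index(connections: int) -> int:
    """100 means the node has exactly one relation per non-K-Line relationship type."""
    return (100 * connections) // BASE_RELATIONS

print(connectivity_index(17))      # -> 100, the base value
print(connectivity_index(17838))   # -> 104929, the value reported below for 'person'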


Interestingly, there was a rather large number of independent concepts that did not have any connections at all (which proved to be one of the most difficult problems in reasoning with knowledge from ConceptNet). Of the 191,334 distinct nodes in the modified data set (after applying many cleaning methods to eliminate redundant data), 2,520 concepts had more than 17 connections (above the base index), 206 had exactly 17, and 188,708 had fewer than 17 (most of which had none). In other words, much of the data is not well connected to the rest. The real average of relations per concept was 1.175. Here are the "top 5" most connected concepts, in descending order of their connectivity index value:

#  Node     Connections  Index Value
1  person   17,838       104,929
2  human    1,369        8,052
3  child    1,186        6,976
4  man      1,086        6,388
5  dog      971          5,711

Slightly over 10% of all relationships involved these top 5 concepts.

What can be gathered from this information (besides evidence that dog is "man's best friend")? The distribution of concepts is rather heavily biased towards concepts of specific types: those which involve people.

5 Concluding Remarks

5.1 Re-organization

The future direction of ConceptNet will depend on how effectively its data can be regulated. There is great potential in its architecture: many of the types of relationships it represents are indeed of practical value; however, much of the helpful common sense knowledge is spoiled by the minority of noisy data. Fortunately, there are many possible ways to help eliminate bad data and improve the quality of existing and future data.

One approach attacks the problem at the point where the information originates: those involved with the mass collaboration. Perhaps methods to increase the motivation of the knowledge base's contributors, including ways to assure them that their contributions have been purposeful thus far, will stimulate their desire to contribute quality information.

Other possible approaches deal with the actual infrastructure. For example, a semi-structured approach, such as using the IsA relations to form a hierarchical taxonomy through which all other nodes are interconnected, will permit more sophisticated common sense reasoning methods to be executed upon the data. Merging ConceptNet and WordNet may expedite this effort, as the latter project already offers a well organized IsA ontology. (Something similar was already accomplished with an early edition of OMCS data: http://www.eturner.net/omcsnetcpp/wordnet/.)

Another option is to approach the problem of reorganization in the same spirit as the rest of the OMCS collective: through mass collaboration. Chklovski developed a method of data validation in his Learner knowledge acquisition project. From a web-based interface, statements would be inferred (via cumulative analogy) from the existing data, and human contributors would verify the generated assertions [1]. I have also taken this approach with CN2's online parser (http://www.ogghelp.com/cn2/). From user input, it can infer motivations by traversing up to two levels of IsA connections. The user may then delete incorrect information or fill in missing information. If this approach is taken, however, it should be done in only one location (to avoid splitting the project), at which many users already contribute.
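The deployed parser itself is linked above; a hedged sketch of the two-level traversal it performs might look like the following (toy data and my own helper, not the ColdFusion/SQL code that actually runs online).

# Toy predicates: IsA links plus one MotivationOf assertion.
predicates = [
    ("IsA", "governor", "politician"),
    ("IsA", "politician", "person"),
    ("MotivationOf", "person", "be liked"),
]

def inferred_motivations(concept, preds=predicates, max_levels=2):
    """Collect MotivationOf values on the concept or up to two IsA levels above it."""
    found, current = [], {concept}
    for _ in range(max_levels + 1):
        for rel, left, right in preds:
            if rel == "MotivationOf" and left in current:
                found.append(right)
        # climb one IsA level for the next pass
        current = {right for rel, left, right in preds if rel == "IsA" and left in current}
    return found

print(inferred_motivations("governor"))  # -> ['be liked']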

5.2 Targeting Relevant Applications

A knowledge base's relevance to particular types of problems determines which applications it is most appropriate for. Some have claimed that this lack of targeting relevant information is what has stunted the progress of the Cyc project: "The initial CyC philosophy of simply entering knowledge regardless of its possible uses is arguably one of the main reasons it has failed to have a significant impact so far" [11]. In a recent independent study which surveyed three different knowledge representation and reasoning systems, those which were designed specifically for their objective produced better results than the massive amount of unspecialized knowledge in the Cyc system [3]. Based on the findings described in section 4.2.1, ConceptNet may be very useful for applications which deal with social, interpersonal information.

Targeting a specific domain of knowledge can be done at the knowledge acquisition level and through the choice of which projects users of ConceptNet implement. In any case, focusing on ConceptNet's current strengths may set an unprecedented course for the development of programs which exhibit social intelligence.


References

[1] Chklovski, T. Using analogy to acquire commonsense knowledge from human contributors. Tech. Rep. AITR-2003-002, MIT AI Lab, Feb. 2003. Available online at ftp://publications.ai.mit.edu/ai-publications/2003/AITR-2003-002.pdf.

[2] Davis, R., Shrobe, H., and Szolovits, P. What is a knowledge representation? AI Magazine 14, 1 (1993), 17.

[3] Friedland, N. Project Halo: Towards a digital Aristotle. AI Magazine 25, 4 (2004), 29.

[4] Jurafsky, D., and Martin, J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000.

[5] Lieberman, H., Liu, H., Singh, P., and Barry, B. Beating some common sense into interactive applications. AI Magazine 25, 4 (2004), 63.

[6] Liu, H., and Singh, P. Commonsense reasoning in and over natural language. In Proceedings of the 8th International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES-2004) (Wellington, New Zealand, 2004), Springer-Verlag.

[7] Liu, H., and Singh, P. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211.

[8] McCarthy, J. Applications of circumscription to formalizing common sense knowledge. Artificial Intelligence 28 (1986), 89-116. Reprinted in [9].

[9] McCarthy, J. Formalization of Common Sense: Papers by John McCarthy, edited by V. Lifschitz. Ablex, 1990.

[10] Minsky, M. The Emotion Machine. Forthcoming; Simon & Schuster, 2005.

[11] Richardson, M., and Domingos, P. Building large knowledge bases by mass collaboration, 2003.

[12] Singh, P., Lin, T., Mueller, E., Lim, G., Perkins, T., and Zhu, W. Open Mind Common Sense: Knowledge acquisition from the general public, 2002.
