An Article about the Cyc Project



This was originally published by the Harvard Science Review. It has been approved for publication by Cycorp, Inc.


Jared Friedman
October 22, 2003

Harvard Science Review

The Sole Contender for AI

"AI has been brain-dead since the 1970s,"1 said Marvin Minksy, one of the undisputed founding fathers of the field of artificial intelligence (AI), at a recent speech at Boston University. Minsky was referring to the fact that while AI has had success in narrow, expert domains, like playing chess or composing music, it remains stumped by the ultimate goal of building a single computer system that is generally intelligent about the world. While there are many AI researchers working on various sub-problems of the field, it is surprising how few are actually tackling the age-old dream of building a computer one could just have a conversation with, like HAL 9000 in 2001: A Space Odyssey. Indeed, Minsky went on to say that, in his opinion, only one project was on the right track to solving the deep problems of AI: the Cyc project.

The Cyc (derived from en-cyc-lopedia) project is based on the idea that the primary impediment to a HAL-like system is the lack of commonsense knowledge in computers. In AI, commonsense knowledge is loosely the set of facts that any normal adult person would be expected to know, such as "when people die, they stay dead," "a dog is a type of animal," and "submarines are meant to travel under water." While it is difficult to estimate the size of this set of facts, it is known to be very large, perhaps in the tens of millions, and there is currently no computer system that possesses it. Time and again, AI has met a roadblock in this lack of commonsense knowledge. The hallmark of AI, expert systems, which are programs that attempt to replicate tasks that human experts do, such as giving medical advice or making travel plans, are known for their brittleness - their tendency to break down in unfamiliar situations. A classic example is the program MYCIN, which diagnoses blood diseases better than most human experts but would be likely to prescribe penicillin for the ailments of a rusting car. The lack of commonsense knowledge also limits the accuracy of foreign language translation systems. For instance, a program that attempts to translate the phrase "bats and other small mammals" into Spanish must be able to tell whether the word "bat" refers to an animal ("murciélago") or to a baseball bat ("bate"), and to do this it must know that the former meaning is a kind of small mammal and the latter is not. Because of the commonsense problem, even the best computerized translators give very mediocre performance and commercially viable expert systems are rare.
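To make the "bat" example concrete, here is a toy Python sketch. Everything in it - the taxonomy, the lexicon, the function names - is invented for illustration; this is not how Cyc or any production translator is actually implemented. It shows how a program with even a tiny "kind-of" taxonomy could pick the right Spanish word:

# Toy word-sense disambiguation over a hand-made "kind-of" taxonomy.
# All names, links, and vocabulary here are invented for illustration.

GENLS = {                         # child -> parent ("is a kind of")
    "Bat-Animal": "Mammal",
    "Mammal": "Animal",
    "BaseballBat": "SportsEquipment",
}
SPANISH = {"Bat-Animal": "murciélago", "BaseballBat": "bate"}
SENSES = {"bat": ["Bat-Animal", "BaseballBat"]}

def is_kind_of(concept, ancestor):
    """Walk the kind-of links upward to test membership."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = GENLS.get(concept)
    return False

def translate_bat(required_kind):
    """Pick the sense of 'bat' that fits what the context demands."""
    for sense in SENSES["bat"]:
        if is_kind_of(sense, required_kind):
            return SPANISH[sense]
    return None

# "bats and other small mammals": the context demands a mammal.
print(translate_bat("Mammal"))    # -> murciélago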

The Cyc project is an attempt to break this bottleneck on AI once and for all by programming the whole of commonsense knowledge into a computer. Commonsense facts in Cyc are stored in formal logic, and are hand-entered by human knowledge engineers. An inference engine is able to reason with these facts to answer queries, and a natural language interface can translate between English and Cyc's internal representation language. Potential applications for Cyc range from the prosaic, such as a sanity-checker for spreadsheets that would know that "24,562" is not a reasonable value for a person's age, to the world-changing, such as a question-answering system that could use information on the internet to respond to questions like "What is the second-tallest mountain on earth and how high is it?" with a single answer: "K2, 28,251 ft."

1 "AI Founder Blasts Modern Research." Wired News, May 13, 2003. <http://www.wired.com/news/technology/0,1282,58714,00.html> (cited 27 Sept. 2003).
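To give a flavor of the spreadsheet sanity-checker idea, here is a minimal Python sketch. The field names and plausible ranges are invented placeholders; a real Cyc-backed checker would derive such bounds from the KB rather than a hard-coded table:

# Hypothetical spreadsheet sanity-checker in the spirit described above.
# The ranges below are hand-coded stand-ins for knowledge a system like
# Cyc would instead derive from its commonsense KB.

PLAUSIBLE_RANGES = {
    "person_age_years": (0, 125),      # people do not live to 24,562
    "year_of_birth": (1850, 2003),
    "body_temperature_celsius": (30, 45),
}

def sanity_check(field, value):
    """Return a warning if the value violates commonsense bounds."""
    low, high = PLAUSIBLE_RANGES[field]
    if not low <= value <= high:
        return f"Suspicious: {field} = {value} is outside [{low}, {high}]"
    return None

print(sanity_check("person_age_years", 24562))
# -> Suspicious: person_age_years = 24562 is outside [0, 125]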

The Cyc project has been working on this problem for about twenty years, and Cyc now contains about 1.5 million facts, or assertions. This is by far the largest collection of commonsense knowledge in the world. Cycorp, the company that produces Cyc, has about eighty employees, most of whom are ontologists, who spend their time adding to and improving the Cyc knowledge base (KB). While Cyc's KB contains a significant fraction of commonsense knowledge, and Cyc has in general made great strides towards its ultimate goal, it remains a work in progress. In particular, Cyc has yet to be successfully deployed in a commercial situation, and its development is still funded primarily by DARPA, the Defense Advanced Research Projects Agency. Cyc's KB is large enough now to answer a substantial percentage of randomly chosen commonsense questions, like "What color is grass?", "Do fish fly?", "What shape is the earth?". Cyc's natural language system can almost always translate Cyc's internal knowledge and thinking into English, but Cyc cannot yet understand most normal English sentences.

Figure 1. Five Example Sentences in CycL

1. (#$isa #$HurricaneAndrew #$Hurricane)
2. (#$genls #$Hurricane #$RainProcess)
3. (#$relationAllExists #$eventOccursAt #$RainProcess #$RainyLocation)
4. (#$genls #$RainyLocation #$CloudyLocation)
5. (#$disjointWith #$CloudyLocation #$SunnyLocation)


Screenshot of Cyc. This is the Cyc KB browser, where users can browse and improve Cyc's knowledge. This page displays OpenCyc's knowledge of dogs.

Commonsense knowledge in Cyc is stored in a proprietary formal language called CycL. CycL consists of about 100,000 constants, each of which corresponds to a commonsense concept, like #$Dog, #$Hammer, or #$Shouting. Combining Cyc constants into expressions, Cycorp's ontological engineers can state facts about the real world, such as in the example sentences in Figure 1. Cyc's inference engine can then reason with the stored information to answer questions even if the answers are not directly stated in the KB. So, for instance, say that a user testing Cyc's knowledge asked whether it was cloudy during Hurricane Andrew. Even though no ontologist would assert such a fact directly, Cyc can still use the example sentences to derive it. Cyc would reason that (1) Hurricane Andrew is a hurricane, and (2) all hurricanes are rain processes, so Hurricane Andrew must be a rain process. Sentence (3) tells Cyc that all rain processes must occur at some rainy location, so Andrew must also occur at a rainy location. But (4) says that all rainy locations are also cloudy locations, so Cyc can conclude that Andrew must have occurred at a cloudy spot.
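This derivation is mechanical enough to reproduce in a few lines of code. The Python sketch below is emphatically not Cycorp's inference engine; it is a toy illustration, under the assumption that the Figure 1 facts are stored as simple isa/genls edges, of how chaining those facts yields a conclusion no one asserted:

# Toy reproduction of the Figure 1 derivation; not Cycorp's real engine.
# Facts mirror CycL assertions (1)-(4) as plain Python tuples.

ISA = {("HurricaneAndrew", "Hurricane")}                    # (1)
GENLS = {("Hurricane", "RainProcess"),                      # (2)
         ("RainyLocation", "CloudyLocation")}               # (4)
# (3): every instance of RainProcess occurs at some RainyLocation.
RELATION_ALL_EXISTS = {("eventOccursAt", "RainProcess", "RainyLocation")}

def all_genls(coll):
    """coll plus every more general collection reachable via GENLS."""
    result, frontier = {coll}, [coll]
    while frontier:
        c = frontier.pop()
        for sub, sup in GENLS:
            if sub == c and sup not in result:
                result.add(sup)
                frontier.append(sup)
    return result

def collections_of(instance):
    """Every collection the instance belongs to, via ISA then GENLS."""
    found = set()
    for inst, coll in ISA:
        if inst == instance:
            found |= all_genls(coll)
    return found

# Andrew is a Hurricane (1), hence a RainProcess (2) ...
types = collections_of("HurricaneAndrew")
assert "RainProcess" in types

# ... so by (3) Andrew occurs at some RainyLocation, and by (4) every
# RainyLocation is also a CloudyLocation.
for _, domain, loc in RELATION_ALL_EXISTS:
    if domain in types:
        print("Andrew occurs at some location of kinds:", sorted(all_genls(loc)))
# -> Andrew occurs at some location of kinds: ['CloudyLocation', 'RainyLocation']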

It may surprise you to learn that if asked whether the sun was shining during the middle of Hurricane Andrew, Cyc will not be able to use (5), which says that cloudy locations are never sunny, to prove that it was not. The problem here is a subtle logical flaw in the third sentence, which does not rule out the possibility that Andrew might have occurred at some other location (which might be sunny) in addition to the cloudy one. This is exactly the kind of problem that Cycorp's ontologists constantly confront. Subtle holes in Cyc's knowledge are omnipresent, and tremendously difficult to find and patch. Subtle contradictions are just as much of a problem. For example, combining (4) and (5), we can conclude that if it is rainy outside, it is not sunny. This may be a good general rule of thumb, but it obviously breaks down in the case of sun showers. Considering every such exceptional case is a daunting but necessary task for Cycorp's ontologists.
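The gap is easiest to see with sentence (3) written out in ordinary first-order logic (a paraphrase for illustration, not Cyc's exact internal rendering). As stated, (3) guarantees only that some location is rainy:

\forall e \,\bigl(\mathit{RainProcess}(e) \rightarrow \exists \ell \,(\mathit{eventOccursAt}(e,\ell) \land \mathit{RainyLocation}(\ell))\bigr)

whereas ruling out sunshine everywhere Andrew occurred would require the universal claim

\forall \ell \,\bigl(\mathit{eventOccursAt}(\mathit{HurricaneAndrew},\ell) \rightarrow \mathit{RainyLocation}(\ell)\bigr)

The existential form guarantees one rainy (hence cloudy, hence non-sunny) location but is silent about any others, so sentence (5) has nothing to contradict.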


Another pressing problem for Cycorp is the speed of inference. As Cyc's knowledge base has grown to an enormous size, it has become increasingly difficult for the inference engine to search through the entire KB, and inference in Cyc has slowed to a crawl. While simple inferences still usually finish in seconds, more complicated ones can take hours. To see what takes so long, consider the last example, in which we asked Cyc to prove that Hurricane Andrew could not have occurred in a sunny location. What would happen if, in addition to the facts in Figure 1, Cyc also knew the somewhat arcane but true fact that hurricanes never occur within about five degrees of the equator? Cyc's inference engine might then decide that the best way to prove that it wasn't sunny during Hurricane Andrew was to prove that it is never sunny more than five degrees from the equator. Cyc might have all sorts of facts that it could use to try to prove this new statement, and it might spend a great deal of time working on the problem before giving up and trying to prove in some other way that the hurricane wasn't sunny.
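This kind of wasted search is easy to simulate. The toy backward chainer below (rule names invented for illustration; no relation to Cyc's actual rule set) tries a tempting dead end before finding the short proof; on a real KB such dead ends can run thousands of subgoals deep:

# Toy depth-first backward chainer; all rules are invented for illustration.
# Each rule is (conclusion, [premises]); a fact is a rule with no premises.

RULES = [
    # Tempting dead end: prove "never sunny off the equator" instead.
    ("not_sunny_during_andrew", ["never_sunny_off_equator"]),
    ("never_sunny_off_equator", ["midnight_sun_everywhere"]),  # hopeless
    # The productive path, mirroring Figure 1:
    ("not_sunny_during_andrew", ["cloudy_during_andrew"]),
    ("cloudy_during_andrew", ["rainy_during_andrew"]),
    ("rainy_during_andrew", []),                               # asserted fact
]

expansions = 0

def prove(goal, depth=0, max_depth=20):
    """Try each rule for the goal in order, recursing on its premises."""
    global expansions
    expansions += 1
    if depth > max_depth:
        return False
    for conclusion, premises in RULES:          # the dead end is tried first
        if conclusion == goal:
            if all(prove(p, depth + 1) for p in premises):
                return True
    return False

print(prove("not_sunny_during_andrew"), "after", expansions, "subgoal expansions")
# -> True after 5 subgoal expansions (2 of them wasted on the dead end)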

Among Cycorp employees, this problem is sometimes called the "menstruating television problem,"2 after a famously long inference that, when halted midway out of sheer impatience, turned out to have decided that it could answer the question it had been asked if only it could prove that televisions can menstruate, and to have spent a great deal of time on this dubious subgoal. Hope is not lost, though, for Cycorp is working hard to make the inference engine smarter, so that it wastes less time trying to prove things that are obviously false. Many people at Cycorp conjecture a speedup of as much as a hundred-fold in the foreseeable future.

Even more critical for Cyc's ultimate success, though, will be its support for natural languages, like English. Successful reasoning about hurricanes is impressive to a few, but ordinary people do not know CycL and will be interested in Cyc only if they can ask such questions in English. But translating the free structure of a natural language into a rigid formalism like CycL is a famously difficult problem in AI. Part of the trouble is that the sentences translated into CycL don't look anything like the original English. To ask Cyc "Was it cloudy during Hurricane Andrew?", you would have to say:

(#$implies
  (#$eventOccursAt #$HurricaneAndrew ?LOC)
  (#$isa ?LOC #$CloudyLocation))

That is, the query asks whether every location at which Hurricane Andrew occurred is a cloudy location.

There is just no straightforward algorithm for this conversion. Most randomly chosen newspaper sentences can be translated into CycL without adding additional vocabulary - an enormous and unparalleled achievement. However, the CycL translations of most newspaper sentences fill at least a page and are very complex. Cycorp has made great strides in the translation of CycL to English, which is now very good. But the harder problem of English to CycL will likely remain a sticking point for Cycorp in the years to come.

In academia, Cyc has always been controversial. For one, it has been developed entirely in the private sector, and Cycorp has not published or participated in AI conferences in years. Cycorp largely put an end to this criticism two years ago, when it released a version of Cyc online for free at www.opencyc.org. OpenCyc is similar to full Cyc, but its knowledge base is just a few percent of the full KB and its functionality is greatly reduced. Since Cyc's success lies in the completeness of its knowledge base, the only people who really know the extent of Cyc's progress are Cycorp employees.

2 Personal correspondence with Daniel Mahler, a Cycorp employee.

But most of the criticism leveled at Cyc attacks its lack of a theoretical basis. There are many famously unsolved problems in the representation of commonsense knowledge, and Cycorp does not claim to have complete solutions. No one knows how humans store and reason with commonsense knowledge, and there is no reason to believe that Cyc's methods are anything like a human's. Since there is no sentient being that thinks like Cyc, it is impossible to be certain that any amount of knowledge could make Cyc intelligent.

Cyc's inference engine has also been a controversial topic. To allow CycL to be expressive enough to represent almost any conceivable knowledge, Cycorp had to give up both soundness and completeness in inference. This means that if you ask Cyc a question and it does not reply for a while, you cannot in general know whether it is unable to answer the question or whether it might respond if just given a little more time. It is also theoretically possible that Cyc's inference engine could give erroneous answers to questions, even if all the relevant knowledge in Cyc were correct.

Cycorp, however, bashes right back at the naysayers. Cyc, they say, is fundamentally an engineering project and is not meant to resolve theoretical issues. The subtle philosophical problems that AI researchers have spent so much time worrying about rarely come up in practice. Cycorp's president has accused AI researchers of "physics envy," of insisting that there must be some "free lunch" approach to achieving machine intelligence.3

Inference in Cyc is unpredictable. This diagram shows an inference tree that might have been created in an attempt to prove that Hurricane Andrew wasn't sunny. Currently, the engine is stuck down a nearly bottomless rabbit hole - a proof method that, while plausible, will not succeed.

Academic AI has made little fundamental progress in recent years, he argues, because its researchers are stuck in a circle of philosophizing and unwilling to do the hard work of actually creating artificial intelligence. The debate has been heated on both sides, but ultimately, argument is powerless to resolve the issue - the only way to know whether Cyc will work is to try it and see, and that is exactly what Cycorp is doing.

3 "The Know-It-All Machine", Lingua Franca. Volume 11, No. 6 (Sept. 2001).