This paper is an introductory tutorial on the Very Large Knowledge Base (VLKB) called CYC. It describes why the CYC project was started, the application areas in which it is intended to be useful, how CYC is being constructed, and, briefly, the supporting tools that have been developed to interact with the CYC knowledge base.
Many knowledge bases (KBs) have been developed to help people solve problems in very specific applications. These are relatively simple to build, since the system needs only those facts required to solve problems in that particular application. An example would be a KB that contains only the information needed to diagnose particular fungal and bacterial infections. Such a KB would need to know about the different attributes of the microscopic organisms and their effects on the host, but would not need to know that grass is green or that the earth revolves around the sun.
The CYC common sense knowledge base takes the opposite approach. CYC is being created to hold information that most people would consider to be common sense knowledge. The idea is to create a KB that supplies the basic knowledge needed across many different applications. By building a KB with this general knowledge, it is hoped that the KB will be able to learn (create new inferences) by itself and to tell when it does not have enough information in a particular domain to resolve a problem.
The CYC project was started by Doug Lenat in 1984 as a research project at MCC (Microelectronics and Computer Technology Corporation) in Austin, Texas, and was later continued at Cycorp, Inc. The approach used to capture this common sense information was to "capture all of the knowledge -- both implicit and explicit -- in a hundred randomly selected articles in the EnCYClopaedia Britannica." [Ginsberg] It was believed that if a knowledge base contained enough implicit knowledge, a computer program accessing it would be able to read and make sense of the rest of the Encyclopaedia Britannica without any additional information being given to it by humans. The ambition of this project is best conveyed through an example: for the sentence "Napoleon died in 1821 - Wellington was saddened", it took two months to enter into the knowledge base all the information needed to explain the concepts of life and death! [Rajkumar & Shah] When first conceived, the project was estimated to take 10 years to complete at a cost of approximately fifty million dollars. [Ginsberg]
CYC is a very large knowledge base (VLKB) that contains over 1.5 million "facts, rules-of-thumb and heuristics for reasoning about the objects and events of everyday life." [Cycorp (2)] It is meant to provide a broad range of in-depth but general understanding that can be used as base-level knowledge for other computer programs. Many existing expert systems contain only the information they need to solve problems within a very narrow range of knowledge. When the fringes of this knowledge are explored, the program typically fails, either by finding no solution or, sometimes worse, by finding the wrong one. It was reasoned that if a KB were developed that contained a broad set of common knowledge, other programs could use it (through some common language) as a resource for information that would not have to be built into the expert system itself. This would enable the more specific application to handle problems on the fringes of its own KB.
To successfully build a knowledge base of this magnitude that could be used by other expert systems, a rigorous approach was needed. The project was broken up into five essential components: the CYC KB, the CycL representation language, the CYC inference engine, the CYC interface tools and the CYC application modules. Other knowledge bases are built with frames (generic objects that assume certain default values unless specific information about those objects is present), but it was found that frames did not scale well to KBs of CYC's size. Instead, the CYC KB was built with terms and assertions that relate those terms. The team building the CYC KB thinks of it as a "sea of assertions, with each assertion being no more 'about' one of the terms involved than another." [Cycorp (2)]
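As a rough illustration of the "sea of assertions" idea, the following Python sketch stores each assertion once and indexes it under every term it mentions, so that no single term owns the assertion the way a frame owns its slots. The class, its methods and the sample terms are hypothetical and are not taken from the CYC system.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Assertion(NamedTuple):
    """One assertion: a predicate applied to a tuple of terms."""
    predicate: str
    args: Tuple[str, ...]


class AssertionStore:
    """Hypothetical store that indexes each assertion under every term in it."""

    def __init__(self) -> None:
        self._by_term = defaultdict(set)

    def add(self, predicate: str, *args: str) -> Assertion:
        assertion = Assertion(predicate, tuple(args))
        # The assertion is filed under the predicate and under each argument,
        # so it is no more "about" one term than another.
        for term in (predicate, *args):
            self._by_term[term].add(assertion)
        return assertion

    def about(self, term: str) -> list:
        """Return every assertion that mentions the given term, sorted."""
        return sorted(self._by_term.get(term, ()))


kb = AssertionStore()
kb.add("isa", "Napoleon", "Person")
kb.add("diedIn", "Napoleon", "Year1821")
kb.add("knows", "Wellington", "Napoleon")
```

Here kb.about("Napoleon") and kb.about("Wellington") both return the "knows" assertion, which a frame-based layout would have to attach to one of the two objects.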
This set of terms and assertions needed to be stored in a manner that allowed them to be easily used, so CycL was developed. It is a large, flexible representation language based on first-order predicate calculus, with extensions to "handle equality, default reasoning, skolemization and some second-order features." [Cycorp (2)] Information can be entered into the knowledge base via an English-to-CycL HTML parser, one of Cycorp's interface tools. Once the information is in the KB, other interface tools can be used to search and edit it (e.g., an HTML browser to surf the "sea" of information and an English generator that restates the rules and inferences used to answer a particular query). [Cycorp (2)]
Queries against the CYC KB use an inference engine that "performs general logical deduction (including modus ponens, modus tollens, and universal and existential quantification)." [Cycorp (2)] This inference engine uses a best-first search as well as proprietary heuristics and "microtheories" in order to restrict the searches. [Mayfield, et al] Since the basic goal of this project was to allow other expert systems (KBs) to use the information stored in the CYC KB, the ability to query the KB is an important one. Systems that wish to share information need a common language. Since this area of AI is still in its infancy, it is too early to choose a standard representation language (such as KIF, extended SQL or LOOM); instead, a common or translatable language is needed to allow this information exchange. Mayfield et al. decided to use the Knowledge Query and Manipulation Language (KQML) as an interface between the different knowledge bases used in their project. The team was able to get several Cyc-based agents (knowledge bases) to reason together to solve problems that none of the individual knowledge bases could have solved alone.
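The two named deduction rules, and the way a microtheory can narrow a search, can be sketched in a few lines of Python. This is a toy under stated assumptions, not the CYC inference engine: literals are plain strings, negation is a "not " prefix, rules are ground implications with no quantification, and the microtheory names, facts and rules are all invented for the example. Best-first search and Cycorp's proprietary heuristics are not modelled.

```python
def negate(literal: str) -> str:
    """Toggle the 'not ' prefix that this sketch uses for negation."""
    return literal[4:] if literal.startswith("not ") else "not " + literal


def deduce(facts, rules):
    """Close a set of literals under modus ponens and modus tollens."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            # Modus ponens: from P and (P implies Q), conclude Q.
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
            # Modus tollens: from not Q and (P implies Q), conclude not P.
            if negate(conclusion) in known and negate(premise) not in known:
                known.add(negate(premise))
                changed = True
    return known


# Hypothetical microtheories: each query consults only one of them.
MICROTHEORIES = {
    "HistoryMt": {
        "facts": {"not breathing(Napoleon)"},
        "rules": [
            ("alive(Napoleon)", "breathing(Napoleon)"),
            ("not alive(Napoleon)", "dead(Napoleon)"),
        ],
    },
    "FictionMt": {
        "facts": {"breathing(Napoleon)"},
        "rules": [("alive(Napoleon)", "breathing(Napoleon)")],
    },
}


def ask(microtheory: str, query: str) -> bool:
    """Answer a query using only the facts and rules of one microtheory."""
    context = MICROTHEORIES[microtheory]
    return query in deduce(context["facts"], context["rules"])
```

In this sketch ask("HistoryMt", "dead(Napoleon)") succeeds through one modus tollens step followed by one modus ponens step, while the same query against "FictionMt" fails because that microtheory's facts and rules are never mixed with the first one's.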
In his report on a 1994 visit with Ramanathan Guha for a demo of the CYC system, Vaughan Pratt noted that the system correctly identified an inconsistency between two different sources of information: one spreadsheet indicated that a (fictitious) organization had destroyed a village, while a second spreadsheet identified that same organization as a pacifist organization. Even though the information came from two different sources, the system recognized that a pacifist organization would not destroy a village, and therefore that one of the sources was incorrect.
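The kind of cross-source check described in this demo can be approximated as follows. This is a minimal sketch that hard-codes the single constraint "a pacifist organization does not destroy anything"; the predicate names, the organization and village names, and the function itself are hypothetical, and the real system reached the conclusion through general inference, not a hand-written check.

```python
def find_conflicts(sources):
    """Return (organization, source_of_destruction, source_of_pacifism) triples."""
    # Merge all assertions, remembering which source each one came from.
    origin = {}
    for name, assertions in sources.items():
        for assertion in assertions:
            origin.setdefault(assertion, name)

    conflicts = []
    for (predicate, subject, _obj), source in sorted(origin.items()):
        if predicate != "destroyed":
            continue
        # Hard-coded constraint: a pacifist organization destroys nothing.
        pacifist = ("isa", subject, "PacifistOrganization")
        if pacifist in origin:
            conflicts.append((subject, source, origin[pacifist]))
    return conflicts


# Hypothetical stand-ins for the two spreadsheets in the demo.
SOURCES = {
    "spreadsheet_1": {("destroyed", "OrgX", "VillageY")},
    "spreadsheet_2": {("isa", "OrgX", "PacifistOrganization")},
}
```

Calling find_conflicts(SOURCES) reports OrgX together with the two spreadsheets that disagree about it; neither spreadsheet is inconsistent when examined alone.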
In another portion of the demo, the team attempted to retrieve photos. Photos are especially difficult to manage, particularly for companies whose main business is photos and who therefore have thousands of them to store and retrieve. Most photo storage systems store photos with an identifying caption, but cataloging a photo and later finding an appropriate one via that caption is sometimes quite difficult. The CYC system allows the user to describe the photo (six axioms were used for the demo) and later use a querying system to find appropriate photos, sometimes via inferences. For instance, a query for "someone at risk for skin cancer" could retrieve a photo of a girl reclining on a beach even though the photo may not have been "cataloged" with any information directly related to cancer. In one part of the demo, photo retrieval was done on just 20 photos, and a query of "A tree" failed to find a photo whose caption was "A girl with presents in front of a Christmas tree." [Pratt] Although the CYC project was to be completed in ten years (by 1994), it appears that as of 1994 the KB applications were not yet able to fully utilize the system.
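Retrieval by inference, as opposed to retrieval by caption matching, can be sketched as below. Everything here is an assumption made for illustration: the concept names, the implication links and the two photo descriptions are invented, and the demo's actual axioms are not reproduced. The link from "ChristmasTree" to "Tree" shows the kind of knowledge the "A tree" query would have needed; it does not describe what the demonstrated system contained.

```python
# Hypothetical background knowledge: each concept implies the concepts listed.
IMPLIES = {
    "SunbathingOnBeach": {"ProlongedSunExposure"},
    "ProlongedSunExposure": {"SkinCancerRisk"},
    "ChristmasTree": {"Tree"},
}

# Hypothetical photo descriptions, standing in for the axioms a user would enter.
PHOTOS = {
    "photo_1": {"Girl", "SunbathingOnBeach"},
    "photo_2": {"Girl", "Presents", "ChristmasTree"},
}


def closure(concepts):
    """Return the given concepts plus everything reachable through IMPLIES."""
    seen = set(concepts)
    stack = list(concepts)
    while stack:
        concept = stack.pop()
        for implied in IMPLIES.get(concept, ()):
            if implied not in seen:
                seen.add(implied)
                stack.append(implied)
    return seen


def find_photos(query):
    """Return the ids of photos whose description implies the queried concept."""
    return sorted(
        photo_id
        for photo_id, concepts in PHOTOS.items()
        if query in closure(concepts)
    )
```

With these made-up links, find_photos("SkinCancerRisk") returns only photo_1 even though nothing in its description mentions cancer, and find_photos("Tree") returns photo_2 only because the generalization from Christmas tree to tree has been stated explicitly.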
The documentation available at Cycorp's web site indicates that Cycorp sees the CYC project as a "very long-term, high-risk gamble that has begun to pay off." [Cycorp (1)] On a related web page, Lenat & Guha identified ten possible applications of the CYC KB that they envisioned being implemented by the year 1999 (including direct mail marketing, smart spreadsheets and machine translation of technical documents). Although some of these applications do seem to be coming to fruition, this author has found no documentation on CYC's web site indicating that the CYC KB is being used in them. The developers and proponents of the CYC project seem to have had a positive outlook on the applicability of the CYC KB, but the interest in the CYC KB that existed in the first ten years of its existence appears to have waned. There was a lot of information on the CYC project early in its life, and it appears that there have been attempts to use the knowledge gained while building this KB, but the CYC KB itself does not appear to have survived.
Cycorp (1), "Applications for CYC", Cycorp, Inc, http://www.cyc.com/applications.html
Cycorp (2), "The CYC Technology", Cycorp, Inc, http://www.cyc.com
Ginsberg, Matthew L., Essentials of Artificial Intelligence. Morgan Kaufmann, 1993. ISBN 1-55860-221-6
Lenat, D.B., Guha, R.V., "Ideas for Applying CYC", Cycorp, Inc, http://www.cyc.com/tech-reports/act-cyc-407-91/act-cyc-407-91.html
Mayfield, James; Finin, Tim; Narayanaswamy, Rajkumar; Shah, Chetan; MacCartney, William; Goolsbey, Keith, "The Cycic Friends Network: getting Cyc agents to reason together", University of Maryland - Baltimore County, http://www.cs.umbc.edu/~cikm/iia/submitted/viewing/mayfield
Pratt, Vaughan, "CYC Report", Stanford University, April 16, 1994, http://www.cs.umbc.edu/~narayan/proj/cyc-critic.html
Rajkumar & Shah, "A Study to assess the usefulness of CYC in a mediated architecture", University of Maryland - Baltimore County, CYC KQML Project, http://www.cs.umbc.edu/~narayan/proj/doc.html