BAYESIAN LEARNING

~or~

Where There's Smoke, There's Fire


Abstract

Uncertainty has long been a difficult obstacle for artificial intelligence. Bayesian learning offers a mathematically sound method for dealing with uncertainty based upon Bayes' theorem. The theorem establishes a means for calculating the probability that an event will occur in the future, given some evidence, from the prior probability of the event and the likelihood that the evidence accompanies it. Its use in artificial intelligence has met with success in a number of research areas and applications, including the development of cognitive models and neural networks. At the same time, the theory has been criticized as philosophically unrealistic and logistically inefficient.

Bayesian Learning

The aim of artificial intelligence is to provide a computational model of intelligent behavior (Pearl, 1988). Expert systems are designed to embody the knowledge of an expert in a given field. But how do people become experts themselves?

While artificial intelligence can produce Ph.D.-quality experts, a more difficult challenge lies in creating a naive observer. The common sense people use in everyday reasoning provides one of the most difficult challenges in building intelligent systems: it operates on incomplete knowledge, yet it applies across an enormously broad range of situations. Intelligent systems have historically succeeded in specific domains with well-defined structures. To succeed in a broad arena, they would need either a far greater base of knowledge or the ability to deal with uncertainty and learn. Because the former option demands more resources and assumes that all the appropriate knowledge is obtainable, the latter is attractive.

Probability theories offer an intuitive guide to revising the beliefs in a knowledge system in the presence of partial or uncertain information. They give intelligent systems flexibility and a logical way to update their knowledge base. The appeal of probability theories in AI lies in the way they express qualitative relationships among beliefs and process those relationships to draw conclusions (Pearl, 1988).

One of the most formalized probabilistic theories used in AI relates to Bayes' theorem. Bayesian methods have been used for a variety of AI applications across many disciplines including cognitive modeling, medical diagnosis, learning causal networks, and finance.

Two years after his death, in 1763, Rev. Thomas Bayes' "An Essay towards solving a Problem in the Doctrine of Chances" was published. Bayes is regarded as the first to use probability inductively, and this now famous paper established a mathematical basis for probabilistic inference. The idea behind Bayes' method is simple: the probability that an event will occur in future trials can be calculated from the frequency with which it has occurred in prior trials. Let's consider some everyday knowledge to outline Bayes' rule: where there's smoke, there's fire. We use this everyday cliché to suggest cause and effect. But how are such relationships learned in and from everyday experience? Conditional probability provides a way to estimate the likelihood of some outcome given a particular situation. Bayes' theorem refines this idea by incorporating the past occasions on which that outcome was observed in that situation. Taking our cliché, we can present it in Bayesian terms as follows:

                                  P(fire|smoke) = [P(smoke|fire) x P(fire)] / P(smoke)

This equation states that the probability of finding a fire given an observation of smoke, the posterior probability P(fire|smoke), depends on three factors. The numerator is the product of the prior probability of observing fire alone (P(fire)) and the likelihood of observing smoke when there is a fire (P(smoke|fire)). The denominator, P(smoke), is a normalizing factor and can be represented in a number of other ways (Pearl, 1988). Together the terms weigh the prior odds of each event against how reliably smoke indicates fire. After each observation of smoke or fire, the prior probabilities are updated and the likelihood estimate is adjusted.
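To make the arithmetic concrete, the short Python sketch below estimates each term from simple co-occurrence counts and applies the theorem. The counts are hypothetical, invented only for illustration:

    # A minimal sketch of Bayes' rule applied to the smoke/fire example.
    # The counts come from (hypothetical) repeated observations.

    def posterior_fire_given_smoke(n_fire, n_smoke, n_both, n_total):
        """Estimate P(fire|smoke) from raw co-occurrence counts."""
        p_fire = n_fire / n_total                 # prior: P(fire)
        p_smoke = n_smoke / n_total               # evidence: P(smoke)
        p_smoke_given_fire = n_both / n_fire      # likelihood: P(smoke|fire)
        return p_smoke_given_fire * p_fire / p_smoke

    # Hypothetical experience: 1000 observations, 50 fires, 80 smoke
    # sightings, and 45 occasions where smoke and fire occurred together.
    print(posterior_fire_given_smoke(n_fire=50, n_smoke=80,
                                     n_both=45, n_total=1000))
    # -> 0.5625: seeing smoke raises the probability of fire
    #    from the 5% prior to about 56%.

Because every quantity is a running count, updating the belief after a new observation amounts to incrementing the appropriate tallies and recomputing.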

In human information processing, it is theorized that knowledge stored in memory is strengthened and associations are formed according to Bayesian principles (Anderson, 1991). Associations between memories grow out of repeated situations in which those memories are used together. In this way, people adapt to their environments by gaining new information through experience, and they can form new associations and draw conclusions under uncertain conditions.

The ACT theory of human cognition, developed by John Anderson at Carnegie Mellon University, has been implemented as a cognitive model in a system called ACT-R (which, ironically, stands for nothing). ACT-R is essentially an expert system that uses declarative knowledge coded as facts and procedural knowledge coded as production rules. The ACT theory takes Bayesian learning as a foundation for human learning, and the ACT-R system implements Bayesian learning in problem solving and conflict resolution. Associated with each fact and production in the model are a base level of activation and association strengths, which govern the likelihood that a unit will be retrieved and estimate how useful it will be in a given situation. These parameters are determined by Bayesian methods. ACT-R has been very successful in replicating human performance in problem-solving experiments and has been used to develop intelligent tutors for teaching a variety of cognitive skills, including computer programming, geometry, and algebra (Anderson, 1993).
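Anderson's base-level learning equation makes the activation idea concrete: a unit's activation reflects the log odds that it will be needed, estimated from its history of use. The sketch below assumes the conventional decay form with d = 0.5; the example times are invented for illustration:

    import math

    # A minimal sketch of ACT-R-style base-level activation.
    # Each past use of a memory unit contributes t**(-d), where t is the
    # time since that use and d is a decay parameter (conventionally 0.5).

    def base_level_activation(times_since_use, d=0.5):
        """Activation grows with frequency of use and decays with time."""
        return math.log(sum(t ** (-d) for t in times_since_use))

    # A fact used often and recently is more retrievable than one
    # used rarely and long ago.
    print(base_level_activation([1.0, 5.0, 20.0]))   # ~0.51
    print(base_level_activation([100.0, 400.0]))     # ~-1.90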

In the field of neural networks, Bayesian learning has been used to design networks that learn models of complex relationships. A neural network is parameterized by weights and biases that define the function it computes from inputs to outputs. Such a network is "taught" by adjusting the weights and biases of each node on Bayesian principles: prior probabilities over the parameters are chosen by any of a number of methods (Neal, 1996), and the network is then run through training sets to refine them into posterior estimates. Such networks have been used to determine the effect of air pollution on housing prices and to classify the origin of glass fragments found at the scene of a crime (Neal, 1996).
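To illustrate the prior-to-posterior updating such training performs, the sketch below applies Bayes' rule to the single weight of a toy one-input linear "network." Neal's own networks are trained with Markov chain Monte Carlo; a simple grid approximation stands in for that machinery here, and the data and hyperparameters are invented:

    import numpy as np

    # Toy Bayesian learning of one weight w in the model y = w*x + noise.
    x = np.array([0.5, 1.0, 1.5, 2.0])
    y = np.array([1.1, 1.9, 3.2, 4.1])          # roughly y = 2x plus noise
    sigma = 0.5                                  # assumed noise std. dev.

    w_grid = np.linspace(-5, 5, 1001)            # candidate weight values
    prior = np.exp(-0.5 * (w_grid / 2.0) ** 2)   # Gaussian prior, sd = 2
    prior /= prior.sum()

    # Likelihood of the training data under each candidate weight.
    residuals = y[None, :] - w_grid[:, None] * x[None, :]
    log_like = -0.5 * np.sum((residuals / sigma) ** 2, axis=1)

    posterior = prior * np.exp(log_like - log_like.max())
    posterior /= posterior.sum()

    print("posterior mean of w:", np.sum(w_grid * posterior))  # close to 2

The posterior concentrates near w = 2 because the data strongly favor it; with fewer or noisier observations, the prior would pull the estimate further toward zero.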

Despite the success of Bayesian learning in some areas of AI, there has been debate about the use of probability theories as a whole. Since McCarthy and Hayes proclaimed probabilities to be "epistemologically inadequate" (McCarthy & Hayes, 1969), they have been regarded with academic suspicion. Probability theories have been criticized for requiring massive amounts of data for accurate statistical calculations, and some critics claim that all possibilities in a given situation must be enumerated before reliable predictions can be made. Researchers have also noted that people are actually poor estimators of probability (Krueger, 1984), which calls into question the assumptions behind the use of Bayesian learning in cognitive models. Another common critique of Bayesian learning concerns the assignment of prior probabilities before any evidence for calculating them has been collected. Although many procedures have been used, there appears to be no general consensus on the most appropriate method.

Research continues into the uses of Bayesian learning in artificial intelligence. It provides a mathematically solid way to deal with uncertainty in problem solving, which has traditionally been an obstacle for AI research. Although some doubt surrounds the philosophical foundations of probability theories as applied to AI, the performance of the intelligent systems that use them speaks to their benefit and success.

References

Anderson, J. R. (1993). Rules of the Mind. New Jersey: Lawrence Erlbaum Associates.

Anderson, J. R. (1991). Is human cognition adaptive? Behavioral and Brain Sciences, 14, 471-517.

Krueger, L. E. (1984). Perceived numerosity: A comparison of magnitude production, magnitude estimation, and discrimination judgements. Perception & Psychophysics, 35(6), 536-542.

McCarthy, J., & Hayes, P. (1969). Some philosophical problems from the standpoint of artificial intelligence. In B. Meltzer & D. Michie (Eds.), Machine Intelligence Vol. 4 (pp. 463-502). Edinburgh, U.K.: Edinburgh University Press.

Neal, R. M. (1996). Bayesian Learning for Neural Networks. New York: Springer-Verlag.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann Publishers.