Bayesian Learning

Li M. Fu

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Introduction

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Topics

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Bayesian Classification and Decision

The Bayes decision rule is the rule that selects the category with minimum conditional risk. In the case of minimum-error-rate classification, the rule will select the category with the maximum posterior probability. Suppose there are k classes, c1, c2, ..., ck. Given a feature vector x, the minimum-error-rate rule will assign it to class cj if
Prob(cj | x) > Prob(ci | x) for all i ≠ j
Here, the posterior probability is used as the discriminant function. An alternative criterion for minimum-error-rate classification is to choose class cj so that
Prob(x | cj) Prob(cj) > Prob(x | ci) Prob(ci) for all i ≠ j
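which follows from the well-known Bayes theorem, because the denominator Prob(x) is the same for every class:
               Prob(x | c) Prob(c)
Prob(c | x) = ---------------------
                    Prob(x)
Note that risk can be incorporated into the discriminant function by weighting each posterior probability with the loss of the corresponding decision, which gives the minimum-conditional-risk rule stated above.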
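The two decision rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name bayes_decide, the loss-matrix convention loss[j][i] (cost of deciding class j when the true class is i), and all numbers are assumptions made for the example.

```python
def bayes_decide(likelihoods, priors, loss=None):
    """Pick a class index from Prob(x | c) and Prob(c) for each class.

    likelihoods[i] = Prob(x | c_i), priors[i] = Prob(c_i).
    loss[j][i] (hypothetical convention) = cost of deciding c_j when c_i is true.
    Returns (chosen index, list of posteriors).
    """
    # Bayes theorem: posterior is proportional to likelihood * prior.
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # Prob(x), identical for every class
    posteriors = [j / evidence for j in joint]
    k = len(posteriors)

    if loss is None:
        # Minimum-error-rate rule: maximum posterior probability.
        best = max(range(k), key=lambda j: posteriors[j])
    else:
        # Minimum-conditional-risk rule: R(c_j | x) = sum_i loss[j][i] * Prob(c_i | x).
        risks = [sum(loss[j][i] * posteriors[i] for i in range(k)) for j in range(k)]
        best = min(range(k), key=lambda j: risks[j])
    return best, posteriors


# Illustrative (made-up) numbers for two classes.
best, post = bayes_decide([0.2, 0.6], [0.7, 0.3])
print(best, post)

# A large cost for wrongly deciding class 1 can change the decision.
best_risk, _ = bayes_decide([0.2, 0.6], [0.7, 0.3], loss=[[0, 1], [5, 0]])
print(best_risk)
```

With a zero-one loss the risk-based rule reduces to the maximum-posterior rule; an asymmetric loss, as in the second call, may select a class that does not have the largest posterior.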

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Naive Bayes Classifier: An Example

A naive Bayes classifier adopts the assumption of conditional independence: given the class, the features are treated as independent of one another. Given that
P(pneumonia) = 0.01, P(flu) = 0.05
P(cough | pneumonia) = 0.9, P(fever | pneumonia) = 0.9,
P(chest-pain | pneumonia) = 0.8,
P(cough | flu) = 0.5, P(fever | flu) = 0.9,
P(chest-pain | flu) = 0.1,
suppose a patient has a cough and a fever but no chest pain. What is the probability ratio between pneumonia and flu? What is the best diagnosis?

[Solution:]
                    0.01 * 0.9 * 0.9 * (1 - 0.8)     0.00162
Probability ratio = ---------------------------- = --------- = 0.08
                    0.05 * 0.5 * 0.9 * (1 - 0.1)     0.02025
So flu is about 12.5 times more likely than pneumonia, and flu is the best diagnosis.
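The same calculation can be checked with a short Python sketch. The probabilities are the ones given in the example; the dictionary layout and the helper name naive_bayes_score are assumptions made for illustration.

```python
# Priors and conditional probabilities taken from the example above.
priors = {"pneumonia": 0.01, "flu": 0.05}
cond = {
    "pneumonia": {"cough": 0.9, "fever": 0.9, "chest-pain": 0.8},
    "flu": {"cough": 0.5, "fever": 0.9, "chest-pain": 0.1},
}


def naive_bayes_score(disease, findings):
    """Unnormalized posterior: prior times the product of per-symptom terms.

    A present symptom contributes P(symptom | disease); an absent one
    contributes 1 - P(symptom | disease). The product form is exactly the
    conditional-independence assumption.
    """
    score = priors[disease]
    for symptom, present in findings.items():
        p = cond[disease][symptom]
        score *= p if present else 1.0 - p
    return score


findings = {"cough": True, "fever": True, "chest-pain": False}
scores = {d: naive_bayes_score(d, findings) for d in priors}
ratio = scores["pneumonia"] / scores["flu"]
diagnosis = max(scores, key=scores.get)
print(round(ratio, 2), diagnosis)
```

The scores are left unnormalized because the common denominator P(cough, fever, no chest pain) cancels in the ratio and does not affect which diagnosis scores highest.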

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

A Case Study: Text Classification

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Bayesian Belief Networks

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Bayesian Belief Networks: Information Propagation and Inference

Pearl (1986) also devised a parallel-distributed approach for updating belief values in a causal network according to the Bayes theorem; hence it is called a Bayesian network. This scheme provides both forward (cause-to-effect) and backward (effect-to-cause) information propagation, so that information can arrive at any node in the network and be transmitted to all other nodes. In each node, the probability (belief value) of a variable value V after observing evidence E is computed by the Bayes theorem as follows:
P(V|E) = a * P(E|V)P(V)
where P is the probability function and a is a normalizing factor. Normalization makes the probabilities of all exhaustive and mutually exclusive values sum to one. To see information propagation, consider an example. In a simple network, suppose variable A is a causal variable connected to both variables B and C. The belief value of A can propagate to derive that of B by
P(B) = P(B|A)P(A) + P(B|not A)P(not A)
and
P(not B) = P(not B|A)P(A) + P(not B|not A)P(not A)
with subsequent normalization for ensuring
P(B) + P(not B) = 1
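A minimal Python sketch of this forward step for a binary A and B is given here. The function name propagate_forward and the numbers in the usage line are hypothetical, chosen only to illustrate the two equations and the normalization.

```python
def propagate_forward(p_a, p_b_given_a, p_b_given_not_a):
    """Derive the belief in B from the belief in A for binary variables.

    Returns (P(B), P(not B)) after normalization.
    """
    p_not_a = 1.0 - p_a
    # P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    # P(not B) = P(not B|A)P(A) + P(not B|not A)P(not A)
    p_not_b = (1.0 - p_b_given_a) * p_a + (1.0 - p_b_given_not_a) * p_not_a
    # Normalize so that the two values sum to one.
    total = p_b + p_not_b
    return p_b / total, p_not_b / total


# Made-up values: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.1.
p_b, p_not_b = propagate_forward(0.3, 0.8, 0.1)
print(p_b, p_not_b)
```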
The same is true for deriving the belief value of C. The link pointing from node A to node B is characterized by the conditional probabilities P(B|A), P(B|not A), P(not B|A), and P(not B|not A) (often they are represented as a matrix). The message passed from a parent node to a child node is called a pi message. For example, the pi message sent from node A to node B consists of P(A) and P(not A) modulated by the conditional probability matrix. This illustrates forward information propagation. Suppose information E1 arrives at node B. The probability of B is updated by the Bayes theorem:
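P(B|E1) = a * P(E1|B)P(B)
and
P(not B|E1) = a * P(E1|not B)P(not B)
where the same normalizing factor a = 1 / P(E1) appears in both equations, so that the two posterior values sum to one.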
This information can propagate to node A using the relation
P(E1|A) = P(E1|B)P(B|A) + P(E1|not B)P(not B|A)
and
P(E1|not A) = P(E1|B)P(B|not A) + P(E1|not B)P(not B|not A)
Then the probability of A is updated by the Bayes theorem. The message passed from a child node to a parent node is called a lambda message. For example, the lambda message received at node A from node B consists of P(E1|A) and P(E1|not A). Information propagates backwards this time. The same evidence should not be used more than once in updating the belief value at the same node. For example, suppose new information E2 arrives at node C. This information is passed backwards to node A, and in turn passed forward to node B. To avoid counting information E1 twice at node B, the pi message sent from node A to node B at this point should be divided by the lambda message (on a term-by-term basis) sent from node B to node A earlier.
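The backward step and the rule against counting evidence twice can be sketched for the three-node network A -> B, A -> C as follows. This is an illustrative sketch under stated assumptions, not Pearl's full algorithm: all variables are binary (index 0 = true, 1 = false), the helper names are hypothetical, every number is made up, and the term-by-term division assumes no lambda value is zero.

```python
def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def lambda_message(cpt, likelihood):
    """Message from child to parent.

    cpt[a][b] = P(child=b | parent=a); likelihood[b] = P(E | child=b).
    Returns [P(E | parent=true), P(E | parent=false)].
    """
    return [sum(cpt[a][b] * likelihood[b] for b in range(2)) for a in range(2)]


def pi_message(cpt, parent_dist):
    """Message from parent to child: parent distribution modulated by the CPT."""
    return [sum(cpt[a][b] * parent_dist[a] for a in range(2)) for b in range(2)]


# Hypothetical network parameters.
prior_a = [0.3, 0.7]                # P(A), P(not A)
cpt_ab = [[0.8, 0.2], [0.1, 0.9]]   # rows: A, not A; columns: B, not B
cpt_ac = [[0.6, 0.4], [0.2, 0.8]]   # rows: A, not A; columns: C, not C
lik_e1 = [0.9, 0.2]                 # P(E1|B), P(E1|not B)
lik_e2 = [0.7, 0.1]                 # P(E2|C), P(E2|not C)

# Backward propagation: E1 at B and E2 at C each reach A as a lambda message.
lam_b = lambda_message(cpt_ab, lik_e1)   # P(E1|A), P(E1|not A)
lam_c = lambda_message(cpt_ac, lik_e2)   # P(E2|A), P(E2|not A)
belief_a = normalize([prior_a[i] * lam_b[i] * lam_c[i] for i in range(2)])

# Forward propagation to B: divide out B's own lambda message term by term,
# so that E1 is not counted a second time when B combines it below.
to_b = normalize([belief_a[i] / lam_b[i] for i in range(2)])
pi_b = pi_message(cpt_ab, to_b)
belief_b = normalize([pi_b[j] * lik_e1[j] for j in range(2)])

print(belief_a, belief_b)
```

If belief_a were sent to B without the division, E1 would enter the belief at B twice: once through the message from A and once through the local likelihood P(E1|B).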

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

EM Algorithm


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Summary

Other Supplementary Material

Basic Ideas