Reinforcement Learning

Li M. Fu

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

What is reinforcement learning?

It is concerned with how an autonomous agent like a robot learns or adaptes by receiving global feedbacks from an environment.
It differs from other function approximation tasks in the following respects:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The Basic Idea:

The agent learns to choose optimal actions that maximize the reward accumulated over time by the agent.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Basic Learning Elements:

Basic functions: Optimal policy:
pi* = argmaxpi Vpi(s), for all s
Absorbing state G:
G = s(G, pi(G))

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Learning Automata

In one paradigm, reinforcement learning refers to a class of algorithms for learning automata. The automaton takes one of a set of actions according to its corresponding probabilities. The environment or teacher responds to the automaton's action by showing ``success'' (+1) or ``failure'' (-1). We may call this feedback from the environment the global reinforcement signal. The automaton learns by altering the probabilities associated with the actions in response to the reinforcement signal. The learning process continues until the automaton behaves properly.

The extent to which a local decision or action is rewarded or penalized depends on how it correlates with the reinforcement signal. If enough samples are taken, the noise caused by the variations in the other variables is averaged out, and the effect due to a single variable becomes evident. Therefore, with a sufficiently long learning process, an optimal probability can be learned for each local variable.

In contrast to backpropagation learning, reinforcement learning does not have to compute derivatives. This feature makes it suitable in a complex system where derivatives are difficult to obtain. On the other hand, reinforcement learning is very inefficient in a large system. Additionally, it may get stuck in a local optimum.

An example of this class of algorithms is the linear reward-penalty algorithm (Narendra and Thathachar 1974). In this algorithm, assume that there are r actions,

a1, a2, ..., ar
and the respective probabilities for these actions are
P1, P2, ..., Pr
Let P(k) denote the probability at iteration k. Suppose at iteration k, the action taken is
ai
When the reinforcement signal is +1 (i.e., positive feedback), the learning rule is
Pi(k + 1) = Pi(k) + a[1 - Pi(k)]
Pj(k + 1) = (1 - a)Pj(k), for j \= i 
If the reinforcement signal is -1 (i.e., negative feedback), the learning rule is
Pi(k + 1) = (1 - b)Pi(k) 
Pj(k + 1) = b/(r - 1) + (1 - b)Pj(k), for j \= i 
In the above rules, a and b are learning rates. Thus, in the case of positive feedback, the probability of the current action is increased with relative decrease in the probabilities of the other actions. The adjustment is reversed in the case of negative feedback.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Q Learning

Start with
pi* = argmaxa[r(s, a) + dV*(s(s, a)]
Define
Q(s, a) = r(s, a) + dV*(s(s, a))
Then
pi* = argmaxaQ(s, a)
Suppose we define
V*(s) = maxa'Q(s, a')
Then we can rewrite the Q function as
Q(s, a) = r(s, a) + dmaxa'Q(s(s, a), a')
In each iteration of learning, we do
Q'(s, a) <-- r(s, a) + dmaxa'Q'(s(s, a), a')
and we update the Q function table S X A.
It can be proven that Q' converges to Q in the limit. That is,
Q' --> Q
Once the Q function is learned, the optimal policy is learned since
pi*(s) = argmaxaQ(s, a)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Q Learning Algorithm

Comments: Similar to dynamic programming techniques.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Q Learning Example

Consider the textbook example. How to update the Q function, Q'(s1, move-to-right)? where s1 corresponds to the left upper position. Assume that the reward decay constant = 0.9.
Q'(s1, move-to-right) = r(s1, move-to-right) + dmaxa'Q'(s2, a')
where s2 corresponds to the middle upper position. Note that s(s1, move-to-right) = s2.
It is given that
r(s1, move-to-right) = 0
The current Q function table shows that
Q'(s2, move-to-right) = 100
Q'(s2, move-to-left) = 63
Q'(s2, move-down) = 81
Thus,
maxa' Q'(s2, a') = 100
It leads to
Q'(s1, move-to-right) = 0 + 0.9 * 100 = 90

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Convergence

The convergence property of the Q learning algorithm is implied by the following two facts:
For all s, a, n, Q'n+1(s, a) > = Q'n(s, a)
where n denotes the nth iteration.
For all s, a, n 0 < = Q'n(s, a) < = Q(s, a)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Temporal Difference Learning

A kind of problem is to predict future outcome by using past experience. This problem is quite common, for example, in game playing, weather forecast, and stock market prediction. Dealing with this problem, we are interested in on-line prediction. That is, we continue to make our predictions and adjust our belief and not wait until the last minute. The solution to this problem is especially important in the application domains where data continue to flow in and we have to make our decision or action in a real-time manner.

Sutton (1988) describes temporal difference methods which calculate parameter adjustment based on the difference between temporally successive predictions of the output, instead of the difference between the predicted and the desired outputs, and are thus different from backpropagation learning.

In game playing, for example, backpropagation cannot be used to update the weights after each move, since the outcome is not available until the end of the game. Even though it is possible to update the weights at the end, this approach requires the storage of all the information along the way. The temporal difference method, on the other hand, need not store all information. It saves weight update information at each move and performs on-line prediction. Besides storage efficiency, it is claimed that temporal difference methods are computationally more efficient than ordinary methods and converge faster to an optimum for multistep problems, the problems for which we make prediction several steps ahead of time. For example, predicting Sunday's weather based on Wednesday is a multistep problem. It can be argued that most prediction problems are of this type.

How to apply temporal difference to Q learning?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Stochastic Models

The deterministic model can be extended to a stochastic model by taking the expected value for the reward function.

However, the control parameter for future adjustment must decay over time to ensure convergence.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Generalization from Examples

To improve generalization of Q learning, a possible idea is to use a neural network as a substitute for the table. However, this approach may not guarantee convergence (?).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Relationship to Dynamic Programming

The novel aspects of Q learning is that it assumes no knowledge about state transition and reward functions, which must be learned from the environment.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Supplemental Web Material

Reinforcement Learning