%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
s: S X A --> A s(si, ai) = si+1
r: S X A --> R (real number) r(si, ai) = ri
pi: S --> A pi(si) = ai
Vpi(si) = ri + dri+1 + d2ri+2 + ...where d is the decay constant.
pi* = argmaxpi Vpi(s), for all sAbsorbing state G:
G = s(G, pi(G))
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The extent to which a local decision or action is rewarded or penalized depends on how it correlates with the reinforcement signal. If enough samples are taken, the noise caused by the variations in the other variables is averaged out, and the effect due to a single variable becomes evident. Therefore, with a sufficiently long learning process, an optimal probability can be learned for each local variable.
In contrast to backpropagation learning, reinforcement learning does not have to compute derivatives. This feature makes it suitable in a complex system where derivatives are difficult to obtain. On the other hand, reinforcement learning is very inefficient in a large system. Additionally, it may get stuck in a local optimum.
An example of this class of algorithms is the linear reward-penalty algorithm (Narendra and Thathachar 1974). In this algorithm, assume that there are r actions,
a1, a2, ..., arand the respective probabilities for these actions are
P1, P2, ..., PrLet P(k) denote the probability at iteration k. Suppose at iteration k, the action taken is
aiWhen the reinforcement signal is +1 (i.e., positive feedback), the learning rule is
Pi(k + 1) = Pi(k) + a[1 - Pi(k)] Pj(k + 1) = (1 - a)Pj(k), for j \= iIf the reinforcement signal is -1 (i.e., negative feedback), the learning rule is
Pi(k + 1) = (1 - b)Pi(k) Pj(k + 1) = b/(r - 1) + (1 - b)Pj(k), for j \= iIn the above rules, a and b are learning rates. Thus, in the case of positive feedback, the probability of the current action is increased with relative decrease in the probabilities of the other actions. The adjustment is reversed in the case of negative feedback.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
pi* = argmaxa[r(s, a) + dV*(s(s, a)]Define
Q(s, a) = r(s, a) + dV*(s(s, a))Then
pi* = argmaxaQ(s, a)Suppose we define
V*(s) = maxa'Q(s, a')Then we can rewrite the Q function as
Q(s, a) = r(s, a) + dmaxa'Q(s(s, a), a')In each iteration of learning, we do
Q'(s, a) <-- r(s, a) + dmaxa'Q'(s(s, a), a')and we update the Q function table S X A.
Q' --> QOnce the Q function is learned, the optimal policy is learned since
pi*(s) = argmaxaQ(s, a)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Q'(s1, move-to-right) = r(s1, move-to-right) + dmaxa'Q'(s2, a')where s2 corresponds to the middle upper position. Note that s(s1, move-to-right) = s2.
r(s1, move-to-right) = 0The current Q function table shows that
Q'(s2, move-to-right) = 100 Q'(s2, move-to-left) = 63 Q'(s2, move-down) = 81Thus,
maxa' Q'(s2, a') = 100It leads to
Q'(s1, move-to-right) = 0 + 0.9 * 100 = 90
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For all s, a, n, Q'n+1(s, a) > = Q'n(s, a) where n denotes the nth iteration. For all s, a, n 0 < = Q'n(s, a) < = Q(s, a)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Sutton (1988) describes temporal difference methods which calculate parameter adjustment based on the difference between temporally successive predictions of the output, instead of the difference between the predicted and the desired outputs, and are thus different from backpropagation learning.
In game playing, for example, backpropagation cannot be used to update the weights after each move, since the outcome is not available until the end of the game. Even though it is possible to update the weights at the end, this approach requires the storage of all the information along the way. The temporal difference method, on the other hand, need not store all information. It saves weight update information at each move and performs on-line prediction. Besides storage efficiency, it is claimed that temporal difference methods are computationally more efficient than ordinary methods and converge faster to an optimum for multistep problems, the problems for which we make prediction several steps ahead of time. For example, predicting Sunday's weather based on Wednesday is a multistep problem. It can be argued that most prediction problems are of this type.
How to apply temporal difference to Q learning?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
However, the control parameter for future adjustment must decay over time to ensure convergence.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%