SwiftTD

SwiftTD: A Fast and Robust Algorithm for Temporal Difference Learning

Khurram Javed, Arsalan Sharifnassab, Richard S. Sutton

SwiftTD is a fast and robust algorithm for temporal difference learning. It can learn effectively with large step-sizes without diverging. The key to its success is (1) its ability to increase the step-size of important signals while reducing it for less important ones, and (2) a bound on the rate of learning that prevents overly large updates, which can cause numerical instability and divergence. This demo highlights the capabilities of SwiftTD for learning to predict the sum of discounted future rewards. The user can select an Atari game and visualize the learning process. The user can also change the hyperparameters of the algorithm and see the effect on learning. Finally, the user can visualize the step-sizes that the algorithm learns.

Control Center

Select game

Hyperparameters

Meta step-size: -2
Step-size decay: 0.9
Learning rate bound: 0.5
Initial step-size: -7
Speed: 1
Lambda: 0.9

Agent state construction

The canvas above shows the gameplay frame and the meta-learned step-sizes. The step-sizes tell us which pixels are getting credit for learning. If you let the learner run for a while, you will see that it assigns credit to pixels that are important for predicting the onset of reward.

Metrics

We visualize the reward signal, predictions and the rate of learning.

Predictions

The graph above shows the predictions of the system, and the reward. The y-axis on the left, -3 to +3, is for predictions, and the other y-axis is for the reward. After learning, the agent's prediction should anticipate the onset of reward.
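To see why a good prediction rises before the reward arrives, it can help to compute the quantity being predicted, the sum of discounted future rewards, for a toy reward sequence. The sketch below is only an illustration: the function name `discounted_returns`, the discount factor `gamma`, and the reward sequence are assumptions made for this example and are not taken from the demo.

```python
def discounted_returns(rewards, gamma):
    """Return, for every step t, the discounted sum of rewards from t onward.

    rewards[t] is treated as the reward that follows step t (an assumption
    made for this sketch).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A single reward at step 3; gamma = 0.5 is chosen only to keep numbers simple.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]
targets = discounted_returns(rewards, gamma=0.5)

for t, (r, g) in enumerate(zip(rewards, targets)):
    print(f"step {t}: reward={r:.1f} target={g:.3f}")
```

The targets grow as the reward approaches and drop once it has passed, which is the shape a well-trained prediction curve should follow.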

Rate of Learning

We define the learning rate as the effective rate of learning of the system. A learning rate of 0.5 means that a single update toward the target halves the prediction error. A learning rate over 1 means that the agent overshoots the target after the update. Ideally, we don't want to overshoot. The following graph shows the per-step learning rate of the system.
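The definition above can be checked numerically in a simplified setting. The sketch below assumes a linear predictor with one step-size per feature and a plain one-step update without eligibility traces, in which case the effective learning rate equals the sum of step-size times squared feature value. All function names are hypothetical, and `cap_step_sizes` is only one simple way to enforce a bound; it is not claimed to be the rule SwiftTD uses.

```python
def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def effective_learning_rate(step_sizes, features):
    # Fraction of the error removed by one update of a linear predictor.
    return sum(a * x * x for a, x in zip(step_sizes, features))


def update(weights, step_sizes, features, target):
    # One-step update toward the target with per-feature step-sizes.
    error = target - predict(weights, features)
    return [w + a * error * x
            for w, a, x in zip(weights, step_sizes, features)]


def cap_step_sizes(step_sizes, features, bound):
    # Illustrative rule: shrink all step-sizes uniformly so that the
    # effective learning rate does not exceed the bound.
    rate = effective_learning_rate(step_sizes, features)
    if rate <= bound:
        return list(step_sizes)
    scale = bound / rate
    return [a * scale for a in step_sizes]


features = [1.0, 1.0]
weights = [0.0, 0.0]
target = 1.0

# Rate 0.5: the error goes from 1.0 to 0.5 after one update.
halving = [0.25, 0.25]
error_after = target - predict(update(weights, halving, features, target), features)

# Rate 2.0: the prediction jumps past the target (overshoot).
too_large = [1.0, 1.0]
overshoot = predict(update(weights, too_large, features, target), features)

# Capping at 0.5 brings the step-sizes back to the halving case.
capped = cap_step_sizes(too_large, features, bound=0.5)

print(effective_learning_rate(halving, features), error_after)
print(effective_learning_rate(too_large, features), overshoot)
print(capped, effective_learning_rate(capped, features))
```

With a rate of 0.5 the remaining error is exactly half of the original error, while a rate of 2.0 sends the prediction to 2.0 for a target of 1.0, which illustrates why rates above 1 are undesirable.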