Random Walk n-Step TD

A parameter study of n-step Temporal-Difference learning (Sutton & Barto, Figure 7.2).

About the Simulation

This application explores the simple **Random Walk** environment. An agent starts in the center of a 19-state chain and moves left or right with equal probability on each step. The goal is to learn the **true value** of each state: the expected return, which here equals the probability of ending at the right terminal (+1 reward) minus the probability of ending at the left terminal (-1 reward). This makes it a clean testbed for comparing reinforcement learning algorithms.
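Below is a minimal sketch of such an environment in Python. The constants, function names, and episode format are illustrative assumptions, not the application's actual source; the true values follow from the terminal rewards described above.

```python
import numpy as np

N_STATES = 19      # non-terminal states, numbered 1..19
START = 10         # the agent starts in the center state

# True state values: expected return from state i equals
# P(end right) * (+1) + P(end left) * (-1) = i/10 - 1.
TRUE_VALUES = np.arange(1, N_STATES + 1) / 10.0 - 1.0

def run_episode(rng):
    """Generate one episode: the visited states and the reward of each transition."""
    state, states, rewards = START, [START], []
    while True:
        state += 1 if rng.random() < 0.5 else -1   # move left or right with equal probability
        if state == 0:                              # left terminal
            rewards.append(-1.0)
            return states, rewards
        if state == N_STATES + 1:                   # right terminal
            rewards.append(1.0)
            return states, rewards
        rewards.append(0.0)
        states.append(state)
```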

Online vs. Offline Updates

This experiment compares two ways of applying the n-step TD update (a sketch of both variants follows below):

- **Online (Standard TD):** The value function is updated immediately after each n-step return is computed, so later updates within an episode already use the newest estimates.
- **Offline (MC-style):** Each update is computed from the value function as it stood at the start of the episode, and the accumulated updates are applied only once the episode ends. This delays learning within an episode but can be more stable for some parameter settings.
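Here is a minimal sketch of both variants, operating on a pre-generated episode in the format produced by the environment sketch above. The function name and structure are assumptions for illustration, not the app's implementation.

```python
import numpy as np

def n_step_td_episode(V, states, rewards, n, alpha, online=True, gamma=1.0):
    """Apply n-step TD updates for one episode.

    states[t]  : state visited at time t (terminals excluded)
    rewards[t] : reward received on the transition out of states[t]
    """
    T = len(states)                       # number of transitions in the episode
    V_old = V if online else V.copy()     # offline: bootstrap from start-of-episode values
    updates = np.zeros_like(V)            # offline: accumulate updates, apply at the end

    for t in range(T):
        # n-step return: up to n discounted rewards, then bootstrap if the episode continues
        end = min(t + n, T)
        G = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
        if end < T:
            G += gamma ** n * V_old[states[end]]

        if online:
            V[states[t]] += alpha * (G - V[states[t]])            # apply immediately
        else:
            updates[states[t]] += alpha * (G - V_old[states[t]])  # defer to episode end

    if not online:
        V += updates
    return V

# Example usage with the environment sketch above:
# V = np.zeros(N_STATES + 2)   # indices 0 and 20 are terminals and are never bootstrapped here
# states, rewards = run_episode(np.random.default_rng(0))
# n_step_td_episode(V, states, rewards, n=4, alpha=0.4)
```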

Run Experiment

Click the button below to run the full parameter study. The simulation sweeps a range of values for the step count `n` and the step size `α`, and reports the best-performing combination for each update method.
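A sketch of what such a sweep might look like, reusing the helpers defined above. The error measure (RMS over the 19 states, averaged over episodes and runs) follows Sutton & Barto's Figure 7.2 setup, but the grid sizes and run counts here are illustrative guesses rather than the app's actual settings.

```python
import numpy as np

def rms_error(V, true_values):
    """Root-mean-square error over the 19 non-terminal states."""
    return np.sqrt(np.mean((V[1:N_STATES + 1] - true_values) ** 2))

def parameter_study(ns, alphas, online=True, episodes=10, runs=100, seed=0):
    """Average RMS error for each (n, alpha) pair, averaged over runs and episodes."""
    rng = np.random.default_rng(seed)
    errors = np.zeros((len(ns), len(alphas)))
    for i, n in enumerate(ns):
        for j, alpha in enumerate(alphas):
            for _ in range(runs):
                V = np.zeros(N_STATES + 2)      # values for states 0..20 (terminals stay 0)
                for _ in range(episodes):
                    states, rewards = run_episode(rng)
                    n_step_td_episode(V, states, rewards, n, alpha, online=online)
                    errors[i, j] += rms_error(V, TRUE_VALUES)
            errors[i, j] /= runs * episodes
    return errors   # one row per n, one column per alpha
```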

RMS Error vs. α (Online)

Shows the RMS error for the online update method as a function of `α`. Lower values are better; each line corresponds to a different `n`.

RMS Error vs. α (Offline)

Shows the final error for the offline update method. Note how performance changes compared to the online version.

Learned vs. True State Values

This chart shows the final learned value function for the best parameter combination found (solid blue line), compared against the analytically known true values (dashed grey line).
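A small sketch of how such a comparison chart could be drawn with matplotlib, assuming the learned `V` comes from the sweep above; the styling is a placeholder, not the app's actual plotting code.

```python
import matplotlib.pyplot as plt

def plot_learned_vs_true(V, true_values):
    """Plot learned state values against the true values for states 1..19."""
    xs = range(1, N_STATES + 1)
    plt.plot(xs, V[1:N_STATES + 1], color="tab:blue", label="Learned values")
    plt.plot(xs, true_values, linestyle="--", color="grey", label="True values")
    plt.xlabel("State")
    plt.ylabel("Value")
    plt.legend()
    plt.show()
```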