Actor Critic model

Based on Richard Sutton (1996), Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Simbrain implementation by Jeff Yoshimi and Jonathon Vickrey.

Getting started

A model which learns the location of rewarding stimuli. Run through a few sets of 5 trials using the "run" button on the control panel. With the default parameter values, the rat should learn how to get to the cheese.

Parameters and what they mean

Epsilon: Probability of taking a random action. 0 means no random actions; 1 means all actions are random. Doya (2007) suggests this may be related to noradrenaline (or norepinephrine), which regulates overall arousal (it decreases during sleep, rises during waking, etc.).
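
As a rough illustration of how epsilon is used (a sketch, not the Simbrain code; the names choose_action, action_values, and epsilon are placeholders), epsilon-greedy action selection can be written as:

    import random

    def choose_action(action_values, epsilon):
        # With probability epsilon, explore: take a random action.
        if random.random() < epsilon:
            return random.choice(list(action_values))
        # Otherwise exploit: take the action with the highest estimated value.
        return max(action_values, key=action_values.get)

With epsilon = 0 the agent always exploits its current estimates; with epsilon = 1 every action is random.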

Learning rate: How much weights are updated at each time step. Doya (2007) suggests this may be related to acetylcholine in the brain, which regulates some forms of plasticity.

Discount factor (gamma): Determines how "future oriented" the agent is. Range is 0-1. For values closer to 0 the agent focuses on immediate rewards; it is "short-sighted" or impulsive. For values closer to 1 the agent focuses more on distant rewards. The agent will not learn when gamma is 0, because it only cares about immediate reward and never learns to attach value to the states that lead to reward.

Higher values of gamma produce better results in this model. The agent thinks ahead and takes actions that will, in the long run, lead to the cheese; it attaches value to states that are connected to the cheese through a chain of actions.
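
To make the roles of the learning rate and discount factor concrete, here is a minimal sketch of a tabular TD(0) value update (the Simbrain model uses a network over coarse-coded state rather than a table, and the names V, alpha, and gamma are placeholders):

    def td_update(V, state, next_state, reward, alpha, gamma):
        # Target: immediate reward plus the discounted value of the next state.
        target = reward + gamma * V[next_state]
        # TD error: how far the current estimate is from that target.
        td_error = target - V[state]
        # The learning rate (alpha) scales how far the estimate moves toward the target.
        V[state] += alpha * td_error
        return td_error

When gamma is 0 the target reduces to the immediate reward alone, so states that merely lead toward the cheese never acquire value; larger gamma lets value propagate backward along the chain of states.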

Tanaka et al. (2007) have related the discount factor to serotonin in the brain:

to elucidate the role of serotonin in the evaluation of delayed rewards, we performed a functional brain imaging experiment in which subjects chose small-immediate or large-delayed liquid rewards under dietary regulation of tryptophan, a precursor of serotonin. A model-based analysis revealed that the activity of the ventral part of the striatum was correlated with reward prediction at shorter time scales, and this correlated activity was stronger at low serotonin levels. By contrast, the activity of the dorsal part of the striatum was correlated with reward prediction at longer time scales, and this correlated activity was stronger at high serotonin levels. (Tanaka et al., 2007, our emphasis)

Reward, Value, TD Error

The activation of the reward neuron is shown in the red time series. In this simulation, it only goes up when the agent is on top of the cheese.

The activation of the value neuron is shown in the green time series. When it goes up, the agent is either experiencing reward or is in a state that it believes will lead to reward.

The activation of the td-error neuron is shown in the blue time series. When more value occurs than was expected, td-error is positive and the network learns to associate the current state and action with more value; the last action is also reinforced. Conversely, when less value occurs than was expected, td-error is negative and the network learns to associate the current state and action with less value; the last action is weakened.
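
The following is a minimal sketch of a tabular actor-critic step, assuming placeholder tables V (state values, the critic) and preferences (state-action strengths, the actor); it is illustrative rather than the Simbrain implementation, but it shows how the sign of td-error drives both updates:

    def actor_critic_step(V, preferences, state, action, reward, next_state, alpha, gamma):
        # Critic: compare the reward plus the discounted value of the next state
        # with the value predicted for the current state.
        td_error = reward + gamma * V[next_state] - V[state]
        # Positive td_error: the outcome was better than expected, so raise the
        # value of this state and strengthen the action just taken.
        # Negative td_error: the outcome was worse than expected, so do the opposite.
        V[state] += alpha * td_error
        preferences[(state, action)] += alpha * td_error
        return td_error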

Changes in values with learning

Running many trials tends to increase value (green), make reward (red) occur more frequently, and make td-error (blue) occur less frequently. Of course, these patterns change as the parameters are changed.