Reinforcement learning

Reinforcement learning is a subfield of machine learning concerned with how an agent can learn to make a sequence of decisions in an environment so as to maximize some notion of cumulative reward. The key characteristic of reinforcement learning is that the agent learns by receiving feedback in the form of rewards and punishments from the environment, rather than being explicitly told what actions to take.

Formally, a reinforcement learning problem can be defined as a Markov decision process (MDP), which consists of five elements: a set of states S, a set of actions A, a transition function T giving the probability of reaching state s' after taking action a in state s, a reward function R, and a discount factor γ ∈ [0, 1].
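
As a data-structure sketch, the five elements can be grouped into a single container. The snippet below is only an illustration in Python, assuming a finite MDP; the class and field names are arbitrary choices, not part of any standard library.

    from typing import Callable, NamedTuple, Sequence

    # Illustrative container for a finite MDP (S, A, T, R, gamma).
    class MDP(NamedTuple):
        states: Sequence        # state space S
        actions: Sequence       # action space A
        transition: Callable    # T(s, a, s'): probability of reaching s' from s via a
        reward: Callable        # R(s, a): immediate reward
        gamma: float            # discount factor in [0, 1]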

At each time step t, the agent observes the current state s_t, selects an action a_t according to some policy π, receives a reward r_{t+1}, and transitions to a new state s_{t+1} according to the transition function T. The goal of the agent is to learn a policy that maximizes the expected cumulative reward over some finite or infinite time horizon, defined as:

G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}
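
The interaction loop and the return it accumulates can be sketched directly in code. The snippet below is a minimal illustration assuming a hypothetical env object with reset() and step(action) methods (in the style of common RL toolkits) and an arbitrary policy function; these names are assumptions, not references to a specific library.

    # Minimal sketch of the agent-environment loop, accumulating the discounted
    # return G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    # Assumes a hypothetical `env` with reset()/step() and a `policy` callable.
    def run_episode(env, policy, gamma=0.9):
        state = env.reset()                  # observe the initial state s_0
        g, discount, done = 0.0, 1.0, False
        while not done:
            action = policy(state)           # select a_t from the policy
            next_state, reward, done = env.step(action)  # receive r_{t+1} and s_{t+1}
            g += discount * reward           # add gamma^t * r_{t+1} to the return
            discount *= gamma
            state = next_state
        return g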

The optimal policy π* is the policy that maximizes the expected cumulative reward from every state, the value function V^π(s) is the expected cumulative reward obtained by starting from state s and following policy π, and the optimal value function V*(s) is the maximum of V^π(s) over all policies.
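
Because V^π(s) is defined as an expectation over returns, it can be approximated by averaging sampled returns that start in s. The sketch below assumes a hypothetical sample_rewards(state) helper that rolls out policy π from the given state and returns the list of rewards received; it illustrates the definition rather than any particular algorithm from this text.

    # Monte Carlo estimate of V^pi(s): average the discounted return over many
    # rollouts that start in `state` and follow pi. `sample_rewards` is a
    # hypothetical helper returning the rewards of one such rollout.
    def estimate_value(sample_rewards, state, gamma=0.9, num_episodes=1000):
        total = 0.0
        for _ in range(num_episodes):
            g, discount = 0.0, 1.0
            for reward in sample_rewards(state):
                g += discount * reward
                discount *= gamma
            total += g
        return total / num_episodes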

Reinforcement learning algorithms typically take a trial-and-error approach, iteratively updating the policy or value function based on observed rewards and transitions. Popular algorithms include Q-learning, SARSA, and policy gradient methods such as REINFORCE. These algorithms can be further categorized as model-based or model-free, depending on whether they use the transition function T and reward function R to explicitly model the environment or instead learn directly from experience.


One example of reinforcement learning is the classic problem of training an agent to play tic-tac-toe. In this problem, the state space S consists of all possible configurations of the tic-tac-toe board, the action space A consists of all legal moves the agent can make, and the reward function R assigns a reward of +1 for a win, -1 for a loss, and 0 for a draw or any non-terminal position.
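
One possible encoding of this state space, action space, and reward function is sketched below; the board representation (a tuple of nine cells) and the helper names are illustrative choices, not the only reasonable ones.

    # A state is a tuple of 9 cells, each ' ', 'X', or 'O'; an action is the
    # index (0-8) of an empty cell.
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

    def legal_actions(state):
        return [i for i, cell in enumerate(state) if cell == ' ']

    def winner(state):
        for a, b, c in LINES:
            if state[a] != ' ' and state[a] == state[b] == state[c]:
                return state[a]
        return None

    def reward(state, player='X'):
        # +1 for a win, -1 for a loss, 0 for a draw or a non-terminal position.
        w = winner(state)
        if w is None:
            return 0
        return 1 if w == player else -1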

The goal of the agent is to learn a policy π that maximizes the expected cumulative reward over the course of the game. To do this, the agent can use a value function V^π(s) that estimates the expected cumulative reward from state s when following policy π.

One popular algorithm for solving this problem is Q-learning, which iteratively updates an estimate Q(s,a) of the optimal action-value function Q*(s,a), the expected cumulative reward of taking action a in state s and then following the optimal policy thereafter. The update rule is:

Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') - Q(s,a)]

where r is the reward received for taking action a in state s and transitioning to state s', and α and γ are the learning rate and discount factor, respectively.
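
The update rule translates directly into a tabular implementation. The sketch below stores Q in a dictionary keyed by (state, action) pairs; the function and variable names are illustrative.

    from collections import defaultdict

    # Tabular Q-learning update:
    #   Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    def make_q_table():
        return defaultdict(float)   # unseen (state, action) pairs default to 0

    def q_update(Q, state, action, reward, next_state, next_actions,
                 alpha=0.1, gamma=0.9):
        # Max over the actions available in s'; an empty next_actions list
        # (terminal state) contributes 0 to the target.
        best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
        td_target = reward + gamma * best_next
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])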

During training, the agent selects actions according to an exploration policy, such as ε-greedy, that balances exploiting the current estimate of the optimal action-value function against exploring new actions. As training progresses, and given sufficient exploration and a suitably decaying learning rate, the agent's estimate of the optimal action-value function converges to the true optimal values, and the agent's policy becomes increasingly deterministic.
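
A common choice of exploration policy is ε-greedy, sketched below against the Q-table from the previous snippet; the ε value shown is an arbitrary example.

    import random

    # epsilon-greedy action selection: with probability epsilon explore a random
    # legal action, otherwise exploit the action with the highest current Q-value.
    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])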

Once the agent has learned an optimal policy, it can use this policy to play tic-tac-toe against human opponents or other agents. The agent's performance can be evaluated by measuring its win rate against a set of opponents, and the agent's policy and value function can be further improved by continuing to train it against a variety of opponents with different playing styles.
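
Evaluation can be as simple as counting wins over a batch of games. The sketch below assumes a hypothetical play_game(agent_policy, opponent_policy) helper that plays one full game and reports the outcome from the agent's perspective; both names are assumptions made for illustration.

    # Estimate the agent's win rate against a fixed opponent. `play_game` is a
    # hypothetical helper returning 'win', 'loss', or 'draw' for the agent.
    def win_rate(play_game, agent_policy, opponent_policy, num_games=1000):
        wins = sum(play_game(agent_policy, opponent_policy) == 'win'
                   for _ in range(num_games))
        return wins / num_games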