Markov Decision Process
An MDP is a mathematical framework for modeling decision-making under uncertainty. It consists of a set of states, a set of actions, transition probabilities, and rewards. At each time step, an agent selects an action in the current state, which leads to a new state and yields a reward. The process satisfies the Markov property: future states depend only on the current state and action. The goal is to find an optimal policy that maximizes the expected cumulative reward over time. MDPs are widely used in fields such as artificial intelligence and robotics to model and solve sequential decision problems.
Model
An MDP is defined as a tuple $(S, A, P, R, \gamma)$ where
- $S$ is the set of environmental states
- $A$ is the set of actions that can be performed (it can depend on the state, $A(s)$)
- $P(s' \mid s, a)$ is a transition function giving the probability of reaching state $s'$ after taking action $a$ in state $s$
- $R(s, a)$ is a reward function
- $\gamma \in [0, 1)$ is a discount factor
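As a concrete illustration, such a tuple can be written down directly in Python; the following is a minimal sketch in which every state name, probability, and reward value is invented for the example:

    # Minimal illustrative MDP; all names and numbers are made up.
    S = ["s0", "s1"]                       # set of states
    A = {"s0": ["stay", "go"],             # actions available in each state
         "s1": ["stay"]}
    # P[(s, a)] maps each successor state s' to P(s' | s, a)
    P = {("s0", "stay"): {"s0": 1.0},
         ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
         ("s1", "stay"): {"s1": 1.0}}
    # R[(s, a)] is the expected immediate reward for taking a in s
    R = {("s0", "stay"): 0.0,
         ("s0", "go"):   1.0,
         ("s1", "stay"): 2.0}
    gamma = 0.9                            # discount factor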
The objective of an agent in this setup is to maximize the expected cumulative reward by learning what to do in each state, that is, by finding an optimal policy.
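In the notation introduced below (with $r_t$ the reward received at time $t$), this objective can be written as

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t\, r_t\right].$$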
Mathematical framework
Let
- $s_t$ be the state of the agent at time $t$
- $r_t$ be the immediate reward at time $t$
- $\pi(a \mid s)$: the probability of choosing action $a$ in state $s$ (the policy)
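A stochastic policy of this form is simply a table of action probabilities per state. The following sketch, with invented probabilities and reusing the state names of the example above, samples an action according to $\pi(a \mid s)$:

    import random

    # pi[s] maps each available action to its probability; values are illustrative
    pi = {"s0": {"stay": 0.5, "go": 0.5},
          "s1": {"stay": 1.0}}

    def sample_action(s):
        """Draw an action a with probability pi(a | s)."""
        actions, probs = zip(*pi[s].items())
        return random.choices(actions, weights=probs, k=1)[0]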
Bellman equation
Denote by $V^\pi(s)$ the value of state $s$ under policy $\pi$, i.e. the expected discounted return when starting in $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k\, r_{t+k} \,\middle|\, s_t = s\right].$$

This value function satisfies the Bellman equation

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\left(R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\right).$$

This is a consequence of the law of total probability and of the fact that, for a given policy, the Markov property lets the expected return from $s$ decompose into the immediate reward plus the discounted expected return from the successor state.
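For $\gamma < 1$ the right-hand side of the Bellman equation is a contraction, so $V^\pi$ can be computed by applying it repeatedly as an update until convergence (iterative policy evaluation). A minimal sketch, reusing the illustrative S, P, R, gamma, and pi defined in the examples above:

    def evaluate_policy(S, P, R, gamma, pi, tol=1e-8):
        """Iteratively apply the Bellman operator until V stops changing."""
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                v_new = 0.0
                for a, p_a in pi[s].items():
                    # expected value of the successor state under P(. | s, a)
                    ev = sum(p * V[s2] for s2, p in P[(s, a)].items())
                    v_new += p_a * (R[(s, a)] + gamma * ev)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V

Calling evaluate_policy(S, P, R, gamma, pi) then returns a dictionary mapping each state to its value under pi.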
If the environment is stochastic, that is, if taking action $a$ in state $s$ does not determine the successor state but only a distribution $P(\cdot \mid s, a)$ over possible successors $s'$, then the expectations above run over both the randomness of the policy and that of the transitions.
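One step of interaction with such a stochastic environment can then be simulated by sampling the successor state from $P(\cdot \mid s, a)$; a sketch in the same illustrative setup:

    import random

    def step(s, a):
        """Sample s' ~ P(. | s, a) and return it together with the reward."""
        successors, probs = zip(*P[(s, a)].items())
        s_next = random.choices(successors, weights=probs, k=1)[0]
        return s_next, R[(s, a)]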