Markov Decision Process
An MDP is a mathematical framework for modeling decision-making under uncertainty. It consists of a set of states, a set of actions, transition probabilities, and rewards. At each time step, an agent selects an action in the current state, which leads to a new state and yields a reward. The process satisfies the Markov property: future states depend only on the current state and action. The goal is to find an optimal policy that maximizes the expected cumulative reward over time. MDPs are widely used in fields such as artificial intelligence and robotics to model and solve sequential decision problems.
Model
An MDP is defined as a tuple $(S, A, P, R, \gamma)$ where
- $S$ is the set of environmental states
- $A$ is the set of actions that can be performed (it can depend on the state, $A(s)$)
- $P(s' \mid s, a)$ is a transition function giving the probability of reaching state $s'$ after taking action $a$ in state $s$
- $R(s, a)$ is a reward function
- $\gamma \in [0, 1)$ is a discount factor
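As a concrete illustration, such a tuple can be written down directly in Python; the following is a minimal sketch in which every state name, probability, and reward value is invented for the example:

    # Minimal illustrative MDP; all names and numbers are made up.
    S = ["s0", "s1"]                       # set of states
    A = {"s0": ["stay", "go"],             # actions available in each state
         "s1": ["stay"]}
    # P[(s, a)] maps each successor state s' to P(s' | s, a)
    P = {("s0", "stay"): {"s0": 1.0},
         ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
         ("s1", "stay"): {"s1": 1.0}}
    # R[(s, a)] is the expected immediate reward for taking a in s
    R = {("s0", "stay"): 0.0,
         ("s0", "go"):   1.0,
         ("s1", "stay"): 2.0}
    gamma = 0.9                            # discount factor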
The objective of an agent in this setup is to maximize the expected cumulative reward by learning what to do in each state, that is, by finding an optimal policy.
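In the notation introduced below (with $r_t$ the reward received at time $t$), this objective can be written as

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t\, r_t\right].$$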
Mathematical framework
Let
- $s_t$ be the state of the agent at time $t$
- $r_t$ be the immediate reward at time $t$
- $\pi(a \mid s)$: the probability of choosing action $a$ in state $s$ (the policy)
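A stochastic policy of this form is simply a table of action probabilities per state. The following sketch, with invented probabilities and reusing the state names of the example above, samples an action according to $\pi(a \mid s)$:

    import random

    # pi[s] maps each available action to its probability; values are illustrative
    pi = {"s0": {"stay": 0.5, "go": 0.5},
          "s1": {"stay": 1.0}}

    def sample_action(s):
        """Draw an action a with probability pi(a | s)."""
        actions, probs = zip(*pi[s].items())
        return random.choices(actions, weights=probs, k=1)[0]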
Bellman equation
Denote by $V^\pi(s)$ the value of state $s$ under policy $\pi$, i.e. the expected discounted return when starting in $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k\, r_{t+k} \,\middle|\, s_t = s\right].$$

This value function satisfies the Bellman equation

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\left(R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\right).$$

This is a consequence of the law of total probability and of the fact that, for a given policy, the Markov property lets the expected return from $s$ decompose into the immediate reward plus the discounted expected return from the successor state.
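For $\gamma < 1$ the right-hand side of the Bellman equation is a contraction, so $V^\pi$ can be computed by applying it repeatedly as an update until convergence (iterative policy evaluation). A minimal sketch, reusing the illustrative S, P, R, gamma, and pi defined in the examples above:

    def evaluate_policy(S, P, R, gamma, pi, tol=1e-8):
        """Iteratively apply the Bellman operator until V stops changing."""
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                v_new = 0.0
                for a, p_a in pi[s].items():
                    # expected value of the successor state under P(. | s, a)
                    ev = sum(p * V[s2] for s2, p in P[(s, a)].items())
                    v_new += p_a * (R[(s, a)] + gamma * ev)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V

Calling evaluate_policy(S, P, R, gamma, pi) then returns a dictionary mapping each state to its value under pi.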
If the environment is stochastic, that is, if taking action $a$ in state $s$ does not determine the successor state but only a distribution $P(\cdot \mid s, a)$ over possible successors $s'$, then the expectations above run over both the randomness of the policy and that of the transitions.
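One step of interaction with such a stochastic environment can then be simulated by sampling the successor state from $P(\cdot \mid s, a)$; a sketch in the same illustrative setup:

    import random

    def step(s, a):
        """Sample s' ~ P(. | s, a) and return it together with the reward."""
        successors, probs = zip(*P[(s, a)].items())
        s_next = random.choices(successors, weights=probs, k=1)[0]
        return s_next, R[(s, a)]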