Markov Decision Process

An MDP is a mathematical framework for modeling decision-making under uncertainty. It consists of a set of states, a set of actions, transition probabilities, and rewards. At each time step, an agent selects an action in its current state, which leads to a new state and yields a reward. The process satisfies the Markov property: the next state depends only on the current state and action, not on the history that led to them. The goal is to find an optimal policy that maximizes the expected cumulative reward over time. MDPs are widely used in fields such as artificial intelligence and robotics for modeling and solving sequential decision problems.

Model

An MDP is defined as a tuple (S,A,T,R,γ) where

  • S is the set of environmental states
  • A is the set of actions that can be performed (the available actions may depend on the state)
  • T : S×A×S → [0,1] is the transition function, where T(s, a, s′) = p(s′ | s, a) is the probability of reaching state s′ after taking action a in state s
  • R : S×A → ℝ is the reward function
  • γ ∈ [0,1) is the discount factor

The objective of an agent in this setup is to maximize the expected discounted sum of rewards by learning what to do in each state, that is, by finding an optimal policy.
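
As an illustration, the sketch below writes such a tuple out in Python for a hypothetical two-state environment; the state names, action names, and all numeric values are invented for the example and do not come from these notes.

```python
# A toy MDP spelled out as the tuple (S, A, T, R, gamma).
# States, actions, and all numbers are hypothetical, chosen only
# to make the structure of the tuple concrete.

S = ["idle", "working"]        # set of environmental states
A = ["rest", "work"]           # set of actions (here available in every state)

# T[(s, a)][s2] = probability of landing in state s2 after taking a in s
T = {
    ("idle", "rest"):    {"idle": 1.0},
    ("idle", "work"):    {"working": 0.9, "idle": 0.1},
    ("working", "rest"): {"idle": 1.0},
    ("working", "work"): {"working": 0.8, "idle": 0.2},
}

# R[(s, a)] = expected immediate reward r(s, a)
R = {
    ("idle", "rest"):    0.0,
    ("idle", "work"):    1.0,
    ("working", "rest"): 0.5,
    ("working", "work"): 2.0,
}

gamma = 0.9                    # discount factor

# Sanity check: every transition distribution sums to 1
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```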

Mathematical framework

Let t be a non-negative integer (the time step) and denote by:

  • S_t the state of the agent at time t
  • A_t the action taken at time t
  • R_t the immediate reward received at time t
  • p(s′, r | s, a) = P(S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a)
  • p(s′ | s, a) = P(S_t = s′ | S_{t−1} = s, A_{t−1} = a) = ∑_r p(s′, r | s, a)
  • r(s, a) = E[R_t | S_{t−1} = s, A_{t−1} = a]
  • G_t = ∑_{k=0}^∞ γ^k R_{t+k+1}, the discounted return (a small numerical sketch of G_t follows this list)
  • π(a | s), the probability of choosing action a in state s (the policy)
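
To make the return G_t concrete, here is a small numerical sketch that truncates the infinite sum to a finite list of future rewards; the reward values and the discount factor are made up for the example.

```python
# Discounted return G_t = sum_{k>=0} gamma^k * R_{t+k+1},
# truncated to a finite, hypothetical sequence of future rewards.

gamma = 0.9
future_rewards = [1.0, 0.0, 2.0, 1.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

G_t = sum(gamma**k * r for k, r in enumerate(future_rewards))
print(G_t)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 ≈ 3.349
```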

Bellman equation

Denote by Q⋆ the optimal action-value function: Q⋆(s, a) is the expected return obtained by taking action a in state s and acting optimally afterwards. If the transitions are deterministic, so that taking action a in state s leads to a single successor state s′, then Q⋆ satisfies the Bellman equation below:

Q⋆(s, a) = r(s, a) + γ max_{a′} Q⋆(s′, a′).

This is a consequence of the law of total probability and of the fact that, for a given policy π:

q_π(s, a) = E_π[R_{t+1} + γ G_{t+1} | S_t = s, A_t = a]

If the environment is stochastic, that is, if taking action a in state s leads to a successor state s′ drawn at random according to p(s′ | s, a), the Bellman equation becomes:

Q⋆(s, a) = ∑_{s′} p(s′ | s, a) [ r(s, a) + γ max_{a′} Q⋆(s′, a′) ].
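
To illustrate how this equation can be used in practice, the sketch below applies it repeatedly as an update rule (Q-value iteration) on the same hypothetical two-state MDP used in the Model section; the stopping threshold of 1e-6 is an arbitrary choice.

```python
# Q-value iteration: repeatedly apply the stochastic Bellman optimality
# equation  Q(s,a) <- sum_{s'} p(s'|s,a) * [ r(s,a) + gamma * max_{a'} Q(s',a') ]
# until the Q-values stop changing. The MDP below is the same hypothetical
# two-state example sketched earlier.

S = ["idle", "working"]
A = ["rest", "work"]
T = {  # T[(s, a)][s2] = p(s2 | s, a)
    ("idle", "rest"):    {"idle": 1.0},
    ("idle", "work"):    {"working": 0.9, "idle": 0.1},
    ("working", "rest"): {"idle": 1.0},
    ("working", "work"): {"working": 0.8, "idle": 0.2},
}
R = {  # R[(s, a)] = r(s, a), the expected immediate reward
    ("idle", "rest"): 0.0, ("idle", "work"): 1.0,
    ("working", "rest"): 0.5, ("working", "work"): 2.0,
}
gamma = 0.9

Q = {(s, a): 0.0 for s in S for a in A}
while True:
    delta = 0.0
    for s in S:
        for a in A:
            # One Bellman backup for the pair (s, a)
            new_q = sum(p * (R[(s, a)] + gamma * max(Q[(s2, a2)] for a2 in A))
                        for s2, p in T[(s, a)].items())
            delta = max(delta, abs(new_q - Q[(s, a)]))
            Q[(s, a)] = new_q
    if delta < 1e-6:   # stop once the largest update is negligible
        break

# Greedy policy with respect to Q*: pick the argmax action in every state
policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
print(policy)   # with these made-up numbers, "work" is optimal in both states
```

This is one way to solve for Q⋆ when the model p(s′ | s, a) is known; Q-Learning (see below) instead estimates Q⋆ from sampled transitions without a model.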

See also

  • Q-Learning
  • Deep Q-Learning