Markov Decision Processes

The multi-armed bandit problem is a nonassociative problem; that is, it does not involve learning to act in more than one situation. Consider the newsvendor problem: the inventory of newspapers at the start of each morning is always 0, regardless of the previous day's orders and sales, so the decision maker always faces the same situation. Many real-world problems, however, are associative, and the agent must learn to choose different actions in different situations.

Markov decision processes (MDPs) are associative problems in which the action taken in the current period affects future states and rewards. MDPs are idealized models of reinforcement learning problems. In this section, we assume complete knowledge of the environment's dynamics is available.

Agent-Environment Interface

MDPs are used to formulate sequential decision-making problems. The agent interacts with the environment to achieve a goal.

  • Agent: The learner and decision-maker

  • Environment: Everything outside the agent

The agent interacts with the environment at discrete time steps t = 0, 1, 2, \ldots. At each time step t, the agent receives a representation of the environment's state S_t \in \mathcal{S}, selects an action A_t \in \mathcal{A}(S_t), and then receives a reward R_{t+1} \in \mathcal{R} \subset \mathbb{R} and the next state S_{t+1}.

The interaction between the agent and the environment can be summarized as a trajectory:

S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots
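To make the interface concrete, here is a minimal sketch of the interaction loop using the Gymnasium CliffWalking-v0 environment (the same environment used in the application at the end of this section), with actions chosen uniformly at random in place of a learned policy.

import gymnasium as gym

# create the environment; the agent-environment interface is reset()/step()
env = gym.make('CliffWalking-v0')

state, info = env.reset(seed=0)          # S_0
trajectory = [state]
for t in range(5):
    action = env.action_space.sample()   # A_t (random policy, for illustration only)
    next_state, reward, terminated, truncated, info = env.step(action)
    trajectory += [action, reward, next_state]   # ..., A_t, R_{t+1}, S_{t+1}, ...
    state = next_state
    if terminated or truncated:
        break

print(trajectory)   # S_0, A_0, R_1, S_1, A_1, R_2, ...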

Formulation of MDP

A Markov decision process (MDP) is a tuple (\mathcal{S}, \mathcal{A}, p, r, \gamma), illustrated with a small example below, where:

  • \mathcal{S}: Set of states

  • \mathcal{A}: Set of actions

  • p(s', r | s, a): Dynamics function p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]

  • r(s, a): Reward function r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}

  • \gamma: Discount factor \gamma \in [0, 1]
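To make the components concrete, the sketch below writes out a toy two-state, two-action MDP as NumPy arrays, using the same representation as the Cliff Walking code later in this section: state-transition probabilities p[s, a, s'] together with an expected-reward array r[s, a] (a simplification of the full joint dynamics p(s', r | s, a)). All numbers are made up for illustration.

import numpy as np

# hypothetical toy MDP: states 0 and 1; actions 0 ("stay") and 1 ("switch")
n_state, n_action = 2, 2

# p[s, a, s'] = probability of moving to s' when taking action a in state s
p = np.zeros((n_state, n_action, n_state))
p[0, 0] = [0.9, 0.1]   # staying in state 0 mostly keeps the agent there
p[0, 1] = [0.2, 0.8]   # switching from state 0 usually reaches state 1
p[1, 0] = [0.0, 1.0]
p[1, 1] = [1.0, 0.0]

# r[s, a] = expected immediate reward for taking action a in state s
r = np.array([[0.0, -1.0],
              [1.0, -1.0]])

gamma = 0.9            # discount factor

# each row p[s, a, :] must be a probability distribution over next states
assert np.allclose(p.sum(axis=2), 1.0)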

Dynamics function

The dynamics function p gives the probability of transitioning to state s' and receiving reward r, given state s and action a.

p(s', r | s, a) = \Pr\{S_{t} = s', R_{t} = r | S_{t-1} = s, A_{t-1} = a\}

Markov Property:

A state has the Markov property if it includes all information from the interaction history that is relevant to the future.

Note that p captures the environment's dynamics completely: the probability of each possible value of S_t and R_t depends only on the preceding state S_{t-1} and action A_{t-1}, not on the earlier history.

Reward function

The reward function r(s, a) gives the expected reward given state s and action a.

\begin{aligned} r(s, a) &= \mathbb{E}[R_{t} | S_{t-1} = s, A_{t-1} = a] \\ &= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r | s, a) \end{aligned}
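When the joint dynamics p(s', r | s, a) are available as an array, the expected reward follows directly from the formula above by summing out the next state and averaging over the reward values. A minimal sketch for a single, hypothetical state-action pair:

import numpy as np

# joint dynamics for one (s, a) pair: rows index next states s',
# columns index the possible reward values listed in `rewards`
rewards = np.array([0.0, 1.0])           # the finite reward set R (illustrative)
p_joint = np.array([[0.6, 0.1],          # p(s' = 0, r | s, a)
                    [0.1, 0.2]])         # p(s' = 1, r | s, a)

# r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)
r_sa = np.sum(rewards * p_joint.sum(axis=0))
print(r_sa)   # (0.1 + 0.2) * 1.0 = 0.3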

Return

The objective of the agent is to maximize the expected value of the cumulative sum of the scalar reward signal it receives.

To formalize the objective, we define the return G_t as a certain function of the rewards received after time t. The simplest form of return is the sum of rewards:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \ldots + R_T

where T is the final time step. Note that this definition of return is suitable for episodic tasks, that is, tasks that eventually end. In such tasks, each episode ends in a terminal state.

There are also MDPs that do not have terminal states. We call such MDPs continuing tasks. In continuing tasks, the return that we defined above could be infinite. To handle such cases, we introduce the concept of discounted return:

\begin{aligned} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \\ &= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \end{aligned}

where \gamma is the discount factor, satisfying 0 \leq \gamma \leq 1.

The discounted return can be written in the following recursive form:

\begin{aligned} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \\ &= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \ldots) \\ &= R_{t+1} + \gamma G_{t+1} \end{aligned}
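This recursion also gives a convenient way to compute returns from a recorded reward sequence by sweeping backwards from the end. A small sketch with made-up rewards:

gamma = 0.9
rewards = [-1.0, -1.0, -1.0, 10.0]   # R_1, R_2, R_3, R_4 (illustrative values)

# backward sweep: G_T = 0, then G_t = R_{t+1} + gamma * G_{t+1}
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

# direct definition: G_0 = sum_k gamma^k R_{k+1}
g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))
print(returns[0], g0_direct)   # both ≈ -1 - 0.9 - 0.81 + 7.29 = 4.58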

Policy and Value function

Policy

The agent interacts with the environment by selecting actions. To describe the agent's behavior, we introduce the concept of a policy. A policy is a mapping from states to probabilities of selecting each possible action, and is denoted by \pi.

\pi(a | s) = \Pr\{A_t = a | S_t = s\}
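For a finite MDP, a policy can be stored as an array pi[s, a] whose rows are probability distributions over actions; acting in state s then amounts to sampling from the corresponding row. A minimal sketch with arbitrary probabilities:

import numpy as np

rng = np.random.default_rng(0)

# pi[s, a] = probability of taking action a in state s (2 states, 2 actions)
pi = np.array([[0.8, 0.2],
               [0.5, 0.5]])
assert np.allclose(pi.sum(axis=1), 1.0)   # each row is a probability distribution

state = 0
action = rng.choice(len(pi[state]), p=pi[state])   # A_t ~ pi(. | S_t = state)
print(action)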

Value function

The value function is the expected return when starting in state s and following policy \pi thereafter. The state-value function for policy \pi is denoted by v_{\pi}(s).

v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]

Similarly, we use an action-value function to represent the expected return when starting in state s, taking action a, and following policy \pi thereafter. The action-value function for policy \pi is denoted by q_{\pi}(s, a).

q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]

An important property of value functions is that they satisfy recursive relationships. The value of a state can be expressed in terms of the value of its possible successor states:

\begin{aligned} v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t | S_t = s] \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s']] \\ &= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')] \end{aligned}

Bellman Equation:

The last equation is known as the Bellman equation.

v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]

It is written in a recursive form that expresses the relationship between the value of a state, v_{\pi}(s), and the values of its possible successor states, v_{\pi}(s').
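Because the Bellman equation is linear in the unknowns v_{\pi}(s), for a finite MDP with known dynamics it can be solved directly as a linear system, (I - \gamma P_{\pi}) v_{\pi} = r_{\pi}, where P_{\pi} and r_{\pi} average the dynamics and rewards over the policy. Below is a minimal sketch using the toy two-state MDP and policy from the earlier examples.

import numpy as np

def policy_evaluation(p, r, pi, gamma):
    # p: (n_state, n_action, n_state), r: (n_state, n_action), pi: (n_state, n_action)
    n_state = p.shape[0]
    # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a);  r_pi[s] = sum_a pi(a|s) r(s, a)
    p_pi = np.einsum('sa,sat->st', pi, p)
    r_pi = np.sum(pi * r, axis=1)
    # Bellman equation in matrix form: v = r_pi + gamma * P_pi v
    return np.linalg.solve(np.eye(n_state) - gamma * p_pi, r_pi)

# toy two-state MDP and policy (same illustrative numbers as above)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[0.0, -1.0], [1.0, -1.0]])
pi = np.array([[0.8, 0.2], [0.5, 0.5]])
print(policy_evaluation(p, r, pi, gamma=0.9))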

Optimal Policies and Optimal Value Functions

A policy \pi is better than or equal to a policy \pi', denoted by \pi \geq \pi', if v_{\pi}(s) \geq v_{\pi'}(s) for all s \in \mathcal{S}. There is always at least one policy that is better than or equal to all other policies. Such a policy is called an optimal policy and is denoted by \pi_*.

Using the concept of optimal policy, we can define the optimal state-value function and optimal action-value function as follows:

v_*(s) = \max_{\pi} v_{\pi}(s), \quad \forall s \in \mathcal{S}
q_*(s, a) = \max_{\pi} q_{\pi}(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}

Bellman Optimality Equation

\begin{aligned} v_*(s) &= \max_{a} q_*(s, a) \\ &= \max_{a} \mathbb{E}_{\pi_*}[G_t | S_t = s, A_t = a] \\ &= \max_{a} \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\ &= \max_{a} \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] \\ &= \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')] \end{aligned}

Bellman Optimality Equation:

The last equation is known as the Bellman optimality equation for the state-value function.

v_*(s) = \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]

The Bellman optimality equation is a system of nonlinear equations whose solution is the optimal value function. For a finite MDP with n states, the system has n equations and n unknowns (one per state).
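Because of the max operator, this system cannot be solved by ordinary linear algebra. One common numerical approach, sketched below for the toy two-state MDP used earlier, is to apply the right-hand side repeatedly as a fixed-point update (value iteration); the linear-programming approach presented later in this section is an alternative.

import numpy as np

def value_iteration(p, r, gamma, tol=1e-8):
    # p: (n_state, n_action, n_state), r: (n_state, n_action)
    v = np.zeros(p.shape[0])
    while True:
        # q[s, a] = r(s, a) + gamma * sum_{s'} p(s'|s, a) v(s')
        q = r + gamma * p @ v
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)   # optimal values and a greedy policy
        v = v_new

# toy two-state MDP (same illustrative numbers as in the earlier sketches)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[0.0, -1.0], [1.0, -1.0]])
v_star, pi_star = value_iteration(p, r, gamma=0.9)
print(v_star, pi_star)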

In addition, the Bellman optimality equation for the action-value function is:

q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]

Note that if we know the optimal value function v_*(s) for all s \in \mathcal{S}, we can easily obtain an optimal policy \pi_*(s) by selecting, in each state, the action that maximizes the right-hand side of the Bellman optimality equation.

\pi_*(s) = \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]

Linear Programming

The Bellman optimality equation can also be solved using linear programming, although this is a less frequently used method for solving MDPs. The idea is as follows:

  • If v(s) \geq r(s, a) + \gamma \sum_{s'} p(s' | s, a) v(s') for all s \in S and a \in A, then v(s) is an upper bound on v_*(s).

  • v_*(s) must be the smallest such solution.

The linear programming formulation of the Bellman optimality equation is as follows:

\begin{aligned} \textnormal{minimize } & \sum_{s \in S} \alpha_s v(s) \\ \textnormal{s.t. } & v(s) \geq r(s, a) + \gamma \sum_{s'} p(s' | s, a) v(s'), \quad \forall s \in S, \forall a \in A \end{aligned}

The constants \alpha_s are arbitrary positive numbers.

Notes:

  • Linear programming is an alternative to dynamic programming (DP) methods for solving finite MDPs.

  • Linear programming methods become impractical at a much smaller number of states than do DP methods (by a factor of about 100).

Application: Cliff Walking Problem

The goal is to find the optimal policy for moving an agent from a starting position to a goal position as quickly as possible while avoiding falling off a cliff. We solve the problem by formulating the Bellman optimality conditions as the linear program above and solving it with Gurobi. The first block below defines the LP solver; the second block builds the Cliff Walking MDP from the Gymnasium environment and recovers the optimal values and policy.

# r: reward matrix, n_state * n_action
# p: transition probability matrix, n_state * n_action * n_state
# gamma: discount factor

from gurobipy import GRB, Model, quicksum

def lp_solver(r, p, gamma):

    action_set = set(range(r.shape[1]))
    state_set = set(range(r.shape[0]))
    n_state = len(state_set)

    # create a model instance
    model = Model()

    # create variables
    for s in range(n_state):
        model.addVar(name=f'v_{s}', lb=-GRB.INFINITY)
    
    # update the model
    model.update()

    # create constraints: v(s) >= r(s, a) + gamma * sum_{s'} p(s'|s, a) * v(s')
    for state in state_set:
        for action in action_set:
            model.addConstr(
                model.getVarByName(f'v_{state}')
                >= r[state, action] + quicksum(
                    gamma * p[state, action, next_state] * model.getVarByName(f'v_{next_state}')
                    for next_state in state_set))

    # set objective: minimize sum_s v(s), i.e. all weights alpha_s = 1
    model.setObjective(
        quicksum(model.getVarByName(f'v_{state}') for state in state_set),
        GRB.MINIMIZE)

    # optimize
    model.optimize()

    return model

# main script: solve the Cliff Walking MDP with the LP solver defined above
import gymnasium as gym
import numpy as np

from lp_solver import lp_solver

# create an environment
env = gym.make('CliffWalking-v0')
n_state = env.unwrapped.nS
n_action = env.unwrapped.nA
state_set = set(range(n_state))
action_set = set(range(n_action))
# The agent can never occupy a cliff cell, and the goal state is terminal
terminal_state_set = [47]
unreachable_state_set = [37, 38, 39, 40, 41, 42, 43, 44, 45, 46]
# the reachable states are all states except the cliff cells and the goal;
# only these states have meaningful dynamics in the optimization problem
reachable_state_set = state_set - set(terminal_state_set) - set(unreachable_state_set)

# set parameters
gamma = 1

# initialize reward and transition probability
r = np.zeros((n_state, n_action))
p = np.zeros((n_state, n_action, n_state))

# env.unwrapped.P[s][a] is a list of (prob, next_state, reward, terminated) tuples
for state in reachable_state_set:
    for action in action_set:
        for prob, next_state, reward, terminated in env.unwrapped.P[state][action]:
            r[state, action] += prob * reward
            p[state, action, next_state] += prob

# solve the mdp problem using linear programming
model = lp_solver(r, p, gamma)

# state values from the LP solution
value_function = {}
for state in reachable_state_set:
    value_function[state] = model.getVarByName(f'v_{state}').x

# the terminal (goal) state has value 0 by definition
for state in terminal_state_set:
    value_function[state] = 0

# recover a greedy (optimal) policy from the optimal value function
policy = {}
    
for state in reachable_state_set:
    q_max_value = -np.inf
    for action in action_set:
        q_value_temp = sum([prob * (reward + gamma * value_function[next_state])
                            for prob, next_state, reward, terminated in env.unwrapped.P[state][action]])
        if q_value_temp > q_max_value:
            q_max_value = q_value_temp
            policy[state] = action
        
# print the value function as a 4 x 12 grid, one digit after the decimal point

print('value function = ')
for i in range(4):
    for j in range(12):
        if i * 12 + j in value_function:
            print('{:.1f}'.format(value_function[i * 12 + j]), end='\t')
        else:
            print('x', end='\t')
    print()

# print the greedy policy as a 4 x 12 grid (actions: 0=up, 1=right, 2=down, 3=left)
print('optimal policy = ')
for i in range(4):
    for j in range(12):
        if i * 12 + j in policy:
            print(policy[i * 12 + j], end='\t')
        else:
            print('x', end='\t')
    print()

# export the LP model to a file for inspection
model.write("model.lp")

Summary

  • Markov decision processes (MDPs) are used to formulate sequential decision-making problems.

  • MDPs are described by a tuple (\mathcal{S}, \mathcal{A}, p, r, \gamma).

  • The Bellman optimality equation is a system of nonlinear equations that can be solved to find the optimal value function.

  • Linear programming can be used to find the optimal value function.
