The multi-armed bandit problem is a nonassociative problem; that is, it does not involve learning to act in more than one situation. Consider the newsvendor problem: the number of newspapers on hand at the start of each morning is always 0, regardless of the previous day's orders and sales, so the decision is always made in the same situation. Many real-world problems, however, are associative: we need to learn to choose different actions in different situations.
Markov decision processes (MDPs) are associative problems in which the action taken in the current period affects future states and rewards. MDPs are idealized models of the reinforcement learning problem; in this formulation, complete knowledge of the environment's dynamics is assumed to be available.
Agent-Environment Interface
MDPs are used to formulate sequential decision-making problems. The agent interacts with the environment to achieve a goal.
Agent: The learner and decision-maker
Environment: Everything outside the agent
The agent interacts with the environment at discrete time steps $t = 0, 1, 2, \dots$. At each time step $t$, the agent receives a representation of the environment's state $S_t \in \mathcal{S}$, selects an action $A_t \in \mathcal{A}(S_t)$, and then receives a reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and the next state $S_{t+1}$.
The interaction between the agent and the environment can be summarized as a trajectory:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
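To make this interface concrete, the sketch below runs a few steps of the interaction loop with Gymnasium's CliffWalking-v0 environment (the same environment used in the application later in this section), using a random placeholder policy:

```python
import gymnasium as gym

env = gym.make('CliffWalking-v0')
state, info = env.reset(seed=0)         # S_0

trajectory = [state]                     # will hold S_0, A_0, R_1, S_1, A_1, R_2, ...
for t in range(20):
    action = env.action_space.sample()   # placeholder policy: choose A_t at random
    next_state, reward, terminated, truncated, info = env.step(action)
    trajectory += [action, reward, next_state]
    state = next_state
    if terminated or truncated:
        break

print(trajectory)
```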
Formulation of MDP
A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$ where:
$\mathcal{S}$: set of states
$\mathcal{A}$: set of actions
$p(s', r \mid s, a)$: dynamics function, $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$
$r(s, a)$: reward function, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$
$\gamma$: discount factor, $\gamma \in [0, 1]$
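As a concrete (hypothetical) example, the components of a small finite MDP can be stored as NumPy arrays; the array layout below matches the one used by the cliff-walking code later in this section:

```python
import numpy as np

n_state, n_action = 2, 2

# r[s, a]: expected reward for taking action a in state s (made-up numbers)
r = np.array([[0.0, 1.0],
              [0.0, 2.0]])

# p[s, a, s']: probability of moving to state s' after taking action a in state s
p = np.zeros((n_state, n_action, n_state))
p[0, 0] = [1.0, 0.0]   # action 0 in state 0 stays in state 0
p[0, 1] = [0.2, 0.8]   # action 1 in state 0 moves to state 1 with probability 0.8
p[1, 0] = [1.0, 0.0]
p[1, 1] = [0.0, 1.0]

gamma = 0.9            # discount factor

# each (s, a) pair defines a probability distribution over next states
assert np.allclose(p.sum(axis=2), 1.0)
```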
Dynamics function
The function $p$ defines the probability of transitioning to state $s'$ and receiving reward $r$, given state $s$ and action $a$.
$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
Markov Property:
The state has the Markov property if the state includes all relevant information from the interaction history that may affect the future.
Note that $p$ completely characterizes the environment's dynamics: the probability of each possible value of $S_t$ and $R_t$ depends only on the preceding state $S_{t-1}$ and action $A_{t-1}$.
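Because $p$ specifies a probability distribution for each state-action pair, it satisfies the normalization condition
$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s).$$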
Reward function
The reward function $r(s, a)$ gives the expected reward received after taking action $a$ in state $s$.
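In terms of the dynamics function $p$, the expected reward can be written as
$$r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a).$$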
The objective of the agent is to maximize the expected value of the cumulative sum of a received scalar signal, called the reward.
To formalize this objective, we define the return $G_t$ as a specific function of the rewards received after time $t$. The simplest form of return is the sum of the rewards:
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$
where $T$ is the final time step. This definition of return is suitable for tasks that eventually end; in such episodic tasks, each episode ends in a terminal state.
There are also MDPs that do not have terminal states; we call these continuing tasks. In continuing tasks, the return defined above could be infinite. To handle such cases, we introduce the discounted return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
where $\gamma$ is the discount factor, satisfying $0 \le \gamma \le 1$.
The discounted return can also be written in the following recursive form:
$$G_t = R_{t+1} + \gamma G_{t+1}$$
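As a quick numerical illustration (with made-up reward values), the following sketch computes $G_0$ both directly from the definition and via the recursion; the two agree:

```python
# hypothetical episode rewards R_1, R_2, ..., R_T and a discount factor
rewards = [1.0, 0.0, -1.0, 2.0]
gamma = 0.9

# directly from the definition: G_0 = sum_k gamma^k * R_{k+1}
g0_direct = sum(gamma ** k * reward for k, reward in enumerate(rewards))

# via the recursion G_t = R_{t+1} + gamma * G_{t+1}, working backwards from G_T = 0
g = 0.0
for reward in reversed(rewards):
    g = reward + gamma * g

print(g0_direct, g)   # both equal G_0
```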
The agent interacts with the environment by selecting actions. To describe the agent's behavior, we introduce the concept of a policy. A policy, denoted by $\pi$, is a mapping from states to probabilities of selecting each possible action:
$$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$$
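For a finite MDP, a stochastic policy can be stored as a simple table of probabilities. A minimal sketch with arbitrary numbers:

```python
import numpy as np

n_state, n_action = 3, 2

# pi[s, a] = probability of selecting action a in state s; each row sums to 1
pi = np.array([[0.5, 0.5],
               [0.9, 0.1],
               [0.0, 1.0]])

rng = np.random.default_rng(0)
s = 1
a = rng.choice(n_action, p=pi[s])   # sample A_t ~ pi(. | S_t = s)
print(a)
```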
Value function
The state-value function is the expected return when starting in state $s$ and following policy $\pi$ thereafter. The state-value function for policy $\pi$ is denoted by $v_\pi(s)$.
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
Similarly, we use an action-value function to represent the expected return when starting in state $s$, taking action $a$, and following policy $\pi$ thereafter. The action-value function for policy $\pi$ is denoted by $q_\pi(s, a)$.
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
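Although not derived here, the two value functions are connected by the standard identities
$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a), \qquad q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right].$$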
An important property of value functions is that they satisfy recursive relationships. The value of a state can be expressed in terms of the value of its possible successor states:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$
This equation is known as the Bellman equation for $v_\pi$. It is written in a recursive form that relates $v_\pi(s)$ to the values $v_\pi(s')$ of all possible successor states.
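The Bellman equation also suggests a way to compute $v_\pi$ numerically: start from an arbitrary guess and repeatedly apply the right-hand side as an update (iterative policy evaluation, a standard method not covered further here). A minimal sketch, assuming the tabular arrays $r(s,a)$ and $p(s' \mid s, a)$ in the layout used elsewhere in this section, and $\gamma < 1$ or an episodic task so that the iteration converges:

```python
import numpy as np

def policy_evaluation(pi, r, p, gamma, tol=1e-8):
    """Compute v_pi by repeatedly applying the Bellman equation as an update.

    pi: policy, shape (n_state, n_action), each row sums to 1
    r:  expected rewards, shape (n_state, n_action)
    p:  transition probabilities, shape (n_state, n_action, n_state)
    """
    v = np.zeros(r.shape[0])
    while True:
        # q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) * v(s')
        q = r + gamma * p @ v
        # v_new(s) = sum_a pi(a|s) * q(s, a)
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```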
Optimal Policies and Optimal Value Functions
If $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$, then $\pi$ is better than or equal to $\pi'$, denoted by $\pi \ge \pi'$. There is always at least one policy that is better than or equal to all other policies; such a policy is called an optimal policy and is denoted by $\pi_*$.
Using the concept of an optimal policy, we can define the optimal state-value function and the optimal action-value function as follows:
$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)$$
Bellman optimality equation
Because $v_*$ is the value function of an optimal policy, it satisfies the following self-consistency condition, known as the Bellman optimality equation for the state-value function:
$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$
The Bellman optimality equation is a system of nonlinear equations whose solution is the optimal value function. For a finite MDP with $n$ states, the system has $n$ equations and $n$ unknowns.
In addition, the Bellman optimality equation for the action-value function is:
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$
Note that if we know the optimal value function $v_*(s)$ for all $s \in \mathcal{S}$, we can easily find the optimal policy $\pi_*(s)$ by selecting the action that maximizes the right-hand side of the Bellman optimality equation:
$$\pi_*(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$
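Besides the linear program described next, the Bellman optimality equation can be solved by turning it into an update rule known as value iteration, a dynamic programming method. A minimal sketch, again assuming the tabular $r(s,a)$ and $p(s' \mid s, a)$ arrays and a setting where the iteration converges (e.g. $\gamma < 1$):

```python
import numpy as np

def value_iteration(r, p, gamma, tol=1e-8):
    """Solve the Bellman optimality equation by repeated max-updates."""
    v = np.zeros(r.shape[0])
    while True:
        # q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) * v(s')
        q = r + gamma * p @ v
        v_new = q.max(axis=1)            # v(s) <- max_a q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = q.argmax(axis=1)            # greedy policy: pi_*(s) = argmax_a q(s, a)
    return v_new, policy
```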
Linear Programming
The Bellman optimality equation can also be solved using linear programming, although this is a less frequently used method for solving MDPs. The idea is that:
If $v(s) \ge r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, then $v(s)$ is an upper bound on $v_*(s)$.
$v_*(s)$ must be the smallest such solution.
The linear programming formulation of the Bellman optimality equation is as follows:
$$
\begin{aligned}
\text{minimize} \quad & \sum_{s \in \mathcal{S}} \alpha_s\, v(s) \\
\text{subject to} \quad & v(s) \ge r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s'), \quad \forall s \in \mathcal{S},\ \forall a \in \mathcal{A}
\end{aligned}
$$
The constants $\alpha_s$ are arbitrary positive numbers.
Notes:
Linear programming methods provide an alternative to dynamic programming (DP) methods for solving MDPs.
Linear programming methods become impractical at a much smaller number of states than do DP methods (by a factor of about 100).
Application: Cliff Walking Problem
The goal is to find an optimal policy for moving an agent from a starting position to a goal position as quickly as possible while avoiding falling off a cliff. The example below uses the CliffWalking-v0 environment from Gymnasium and solves the resulting MDP with the linear program above.
```python
# r: reward matrix, n_state * n_action
# p: transition probability matrix, n_state * n_action * n_state
# gamma: discount factor
from gurobipy import GRB, Model, quicksum


def lp_solver(r, p, gamma):
    action_set = set(range(r.shape[1]))
    state_set = set(range(r.shape[0]))
    n_state = len(state_set)

    # create a model instance
    model = Model()

    # create one variable v_s per state (unbounded below)
    for s in range(n_state):
        model.addVar(name=f'v_{s}', lb=-GRB.INFINITY)

    # update the model so the variables can be queried by name
    model.update()

    # create constraints: v(s) >= r(s, a) + gamma * sum_s' p(s'|s, a) * v(s') for all s, a
    for state in state_set:
        for action in action_set:
            model.addConstr(
                model.getVarByName(f'v_{state}') >= quicksum(
                    gamma * p[state, action, next_state] * model.getVarByName(f'v_{next_state}')
                    for next_state in state_set
                ) + r[state, action]
            )

    # set objective: minimize sum_s v(s), i.e. all weights alpha_s are taken to be 1
    model.setObjective(
        quicksum(model.getVarByName(f'v_{state}') for state in state_set),
        GRB.MINIMIZE
    )

    # optimize
    model.optimize()
    return model
```
```python
from gurobipy import GRB, Model, quicksum
import gymnasium as gym
import numpy as np
from lp_solver import lp_solver

# create an environment
env = gym.make('CliffWalking-v0')
n_state = env.unwrapped.nS
n_action = env.unwrapped.nA
state_set = set(range(n_state))
action_set = set(range(n_action))

# The player cannot be at the cliff, nor at the goal
terminal_state_set = [47]
unreachable_state_set = [37, 38, 39, 40, 41, 42, 43, 44, 45, 46]

# the reachable state set is the set of all states except the cliff and the goal;
# only the states in the reachable state set are considered in the optimization problem
reachable_state_set = state_set - set(terminal_state_set) - set(unreachable_state_set)

# set parameters
gamma = 1

# initialize reward and transition probability
r = np.zeros((n_state, n_action))
p = np.zeros((n_state, n_action, n_state))
for state in reachable_state_set:
    for action in action_set:
        for prob, next_state, reward, terminated in env.unwrapped.P[state][action]:
            r[state, action] += prob * reward
            p[state, action, next_state] += prob

# solve the MDP problem using linear programming
model = lp_solver(r, p, gamma)

# read back the optimal state values
value_function = {}
for state in reachable_state_set:
    value_function[state] = model.getVarByName(f'v_{state}').x
for state in terminal_state_set:
    value_function[state] = 0

# recover a greedy (optimal) policy from the value function
policy = {}
for state in reachable_state_set:
    q_max_value = -np.inf
    for action in action_set:
        q_value_temp = sum(
            prob * (reward + gamma * value_function[next_state])
            for prob, next_state, reward, terminated in env.unwrapped.P[state][action]
        )
        if q_value_temp > q_max_value:
            q_max_value = q_value_temp
            policy[state] = action

# print value function as a 4 * 12 grid, 1 digit after the decimal point
print('value function = ')
for i in range(4):
    for j in range(12):
        if i * 12 + j in value_function:
            print('{:.1f}'.format(value_function[i * 12 + j]), end='\t')
        else:
            print('x', end='\t')
    print()

print('optimal policy = ')
for i in range(4):
    for j in range(12):
        if i * 12 + j in policy:
            print(policy[i * 12 + j], end='\t')
        else:
            print('x', end='\t')
    print()

model.write("model.lp")
```
Summary
Markov decision processes (MDPs) are used to formulate sequential decision-making problems.
An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$.
The Bellman optimality equation is a system of nonlinear equations that can be solved to find the optimal value function.
Linear programming can be used to find the optimal value function.