The objective of policy evaluation is to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$. Recall that the state-value function for $s \in S$ is defined as

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right].$$
In MDPs, the environment's dynamics $p(s', r \mid s, a)$ are completely known, so given an arbitrary policy $\pi$ the only unknowns in the above equation are the state values $v_\pi(s), \; \forall s \in S$. Consequently, taken over all states, it is a system of $|S|$ linear equations in $|S|$ unknowns.
This system of equations can be solved straightforwardly using linear algebra techniques.
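To make this concrete, here is a minimal NumPy sketch of the direct solve. It assumes a small tabular MDP represented by illustrative arrays `P[s, a, s']` (transition probabilities), `R[s, a]` (expected immediate rewards), and a stochastic policy `pi[s, a]`; these names and this interface are assumptions for the example, not part of the text.

```python
import numpy as np

def evaluate_policy_direct(P, R, pi, gamma=0.9):
    """Solve (I - gamma * P_pi) v = r_pi exactly with a linear solver.

    P  : (S, A, S) array, P[s, a, s'] = p(s' | s, a)   (assumed layout)
    R  : (S, A) array,    R[s, a]     = expected immediate reward
    pi : (S, A) array,    pi[s, a]    = probability of action a in state s
    """
    num_states = P.shape[0]
    # Average the dynamics and rewards over the policy's action probabilities.
    P_pi = np.einsum("sa,sat->st", pi, P)   # (S, S) state-to-state matrix under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # (S,)  expected one-step reward under pi
    # The Bellman equation v = r_pi + gamma * P_pi v rearranges to a linear system.
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
```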
In addition, iterative methods can be used to solve the system of equations. First, we initialize the state-value function arbitrarily, say $v_0(s) = 0, \; \forall s \in S$. Then, we repeatedly update the state-value function using the following update rule,

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_k(s')\right], \quad \forall s \in S.$$
Iteratively applying this update rule generates a sequence of state-value functions $v_0, v_1, v_2, \ldots$, which converges to the true state-value function $v_\pi$ as $k \to \infty$. This algorithm is known as iterative policy evaluation.
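A matching sketch of iterative policy evaluation, continuing the earlier snippet and using the same hypothetical `P`, `R`, and `pi` arrays:

```python
def evaluate_policy_iterative(P, R, pi, gamma=0.9, theta=1e-8):
    """Sweep the Bellman update until the value function changes by
    less than `theta` in every state (synchronous updates)."""
    v = np.zeros(P.shape[0])                          # v_0(s) = 0 for all s
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v_k(s') ]
        q = R + gamma * np.einsum("sat,t->sa", P, v)  # one-step lookahead values
        v_new = np.einsum("sa,sa->s", pi, q)          # average over the policy's actions
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```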
Policy Improvement
Given a policy $\pi$, the iterative policy evaluation algorithm can be used to estimate the state-value function $v_\pi$, which describes the expected return from each state under $\pi$.
Once the state-value function $v_\pi$ is estimated, can we improve the policy to obtain a better expected return? The answer is yes.
For a given state $s \in S$, suppose we choose an action $a \in A$ and follow $\pi$ thereafter. The value of behaving this way is the action-value function

$$q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right].$$
The policy improvement theorem states that if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in S$, then $\pi'$ is as good as, or better than, $\pi$, that is, $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in S$. In the proof of this theorem, a term such as $\mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2\, v_\pi(S_{t+2}) \mid S_t = s\right]$ means that we select actions according to $\pi'$ for the first two steps, receiving $R_{t+1}$ and $R_{t+2}$, and then follow $\pi$ thereafter.
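For completeness, here is a sketch of the standard chain of inequalities behind the theorem (not spelled out above); each step replaces one more action choice by $\pi'$ using the premise $v_\pi(s) \le q_\pi(s, \pi'(s))$:

$$\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
         &= \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right] \\
         &\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\right] \\
         &= \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2\, v_\pi(S_{t+2}) \mid S_t = s\right] \\
         &\le \cdots \\
         &\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right] = v_{\pi'}(s).
\end{aligned}$$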
With the policy improvement theorem, given a policy $\pi$ and its state-value function $v_\pi$, we can construct a new policy $\pi'$ that is as good as, or better than, $\pi$. The new policy $\pi'$ is constructed by selecting, in each state $s \in S$, the action that maximizes the action-value function $q_\pi(s, a)$,

$$\pi'(s) = \arg\max_{a} q_\pi(s, a) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right].$$
If the new greedy policy $\pi'$ is as good as, but not better than, the old policy $\pi$, then $v_{\pi'} = v_\pi$, and the equation above becomes the Bellman optimality equation. Therefore, $v_{\pi'}(s) = v_*(s)$ for all $s \in S$, and $\pi'$ is an optimal policy $\pi_*$.
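A minimal sketch of this greedy improvement step, again assuming the illustrative `P` and `R` arrays and an evaluated value vector `v` from the snippets above:

```python
def improve_policy(P, R, v, gamma=0.9):
    """Greedy policy improvement: act greedily with respect to q_pi(s, a),
    computed by a one-step lookahead from the evaluated state values `v`."""
    num_states, num_actions = R.shape
    q = R + gamma * np.einsum("sat,t->sa", P, v)   # q_pi(s, a) for every state-action pair
    greedy_actions = np.argmax(q, axis=1)          # best action in each state
    # Encode the deterministic greedy policy as a one-hot (S, A) array.
    pi_new = np.zeros((num_states, num_actions))
    pi_new[np.arange(num_states), greedy_actions] = 1.0
    return pi_new
```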
Policy Iteration
Given an arbitrary policy $\pi$, we can use the iterative policy evaluation algorithm to estimate the state-value function $v_\pi$. Then, we can use the policy improvement step to construct a new policy $\pi'$ that is as good as, or better than, $\pi$. Iteratively applying policy evaluation and policy improvement generates a sequence of policies $\pi_0, \pi_1, \pi_2, \ldots$. If the new policy $\pi'$ is the same as the old policy $\pi$, then we have found the optimal policy $\pi_*$.
This algorithm is known as policy iteration. A sketch of the full policy iteration loop is given below.
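This is a minimal Python sketch rather than formal pseudo-code, reusing the hypothetical evaluation and improvement routines from the earlier examples:

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy stops changing."""
    num_states, num_actions = R.shape
    pi = np.full((num_states, num_actions), 1.0 / num_actions)  # start from the uniform random policy
    while True:
        v = evaluate_policy_iterative(P, R, pi, gamma)   # policy evaluation
        pi_new = improve_policy(P, R, v, gamma)          # policy improvement
        if np.array_equal(pi_new, pi):                   # policy stable -> pi is optimal
            return pi_new, v
        pi = pi_new
```

Because `np.argmax` breaks ties consistently, comparing successive policies for equality is a safe stopping test here; comparing successive value functions against a small threshold is a common alternative.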