Dynamic Programming

Policy Evaluation

The objective of policy evaluation is to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. Recall that the state-value function for $s \in \mathcal{S}$ is defined as

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t | S_t = s] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
&= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s']] \\
&= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]
\end{aligned}
$$

In dynamic programming we assume the environment's dynamics $p(s', r | s, a)$ are completely known, so for a given policy $\pi$ the only unknowns in the above equation are the state values $v_{\pi}(s), \forall s \in \mathcal{S}$. Consequently, the equation below defines a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns,

$$
v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')], \quad \forall s \in \mathcal{S}
$$

This system of equations can be solved straightforwardly using linear algebra techniques.
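As a concrete illustration of the direct approach, here is a minimal NumPy sketch that solves $(I - \gamma P_{\pi}) v_{\pi} = r_{\pi}$. The tabular representation used here (arrays `P`, `R`, and `pi`, with rewards given as expected rewards $r(s, a)$ rather than the joint distribution $p(s', r | s, a)$) is an assumption made for this example, not something fixed by the text above.

```python
import numpy as np

def evaluate_policy_direct(P, R, pi, gamma=0.9):
    """Solve v_pi = r_pi + gamma * P_pi v_pi exactly.

    Assumed shapes (illustrative only):
      P  : (S, A, S) transition probabilities p(s' | s, a)
      R  : (S, A)    expected rewards r(s, a)
      pi : (S, A)    policy probabilities pi(a | s)
    """
    n_states = P.shape[0]
    # State-to-state transition matrix and expected one-step reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)   # (S, S)
    r_pi = np.einsum("sa,sa->s", pi, R)     # (S,)
    # v_pi = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

For large state spaces, forming and solving this $|\mathcal{S}| \times |\mathcal{S}|$ system becomes expensive, which is one reason the iterative method below is commonly used instead.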

Alternatively, the system of equations can be solved iteratively. First, we initialize the state-value function arbitrarily, say $v_0(s) = 0, \forall s \in \mathcal{S}$. Then, we iteratively update the state-value function using the following update rule,

$$
v_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')], \quad \forall s \in \mathcal{S}
$$

Iterating this update rule generates a sequence of state-value functions $v_0, v_1, v_2, \ldots$ that converges to the true state-value function $v_{\pi}$ as $k \rightarrow \infty$. This algorithm is known as iterative policy evaluation.
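Using the same assumed tabular arrays as above, a minimal sketch of iterative policy evaluation sweeps the state space until the largest change in the value function falls below a small threshold:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation (sketch).

    Assumed shapes: P (S, A, S) transition probabilities, R (S, A) expected
    rewards, pi (S, A) policy probabilities.
    """
    v = np.zeros(P.shape[0])                  # v_0(s) = 0 for all s
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v_k(s') ]
        q = R + gamma * P @ v                 # (S, A): one-step lookahead for every (s, a)
        v_new = np.sum(pi * q, axis=1)        # expectation over actions under pi
        if np.max(np.abs(v_new - v)) < theta: # stop when the sweep changes v only negligibly
            return v_new
        v = v_new
```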

Policy Improvement

Given a policy $\pi$, the iterative policy evaluation algorithm can be used to estimate the state-value function $v_{\pi}$, which gives the expected return from each state when following $\pi$.

Once the state-value function $v_{\pi}$ has been estimated, can we improve the policy to obtain a greater expected return? The answer is yes.

For a given state $s \in \mathcal{S}$, consider selecting an action $a \in \mathcal{A}$ (not necessarily the one $\pi$ would select) and following $\pi$ thereafter. The value of behaving this way is

$$
q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]
$$
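This one-step lookahead is easy to compute once an estimate of $v_{\pi}$ is available. Below is a small helper, again using the assumed `P`/`R` arrays from the earlier sketches:

```python
import numpy as np

def q_from_v(P, R, v, gamma=0.9):
    """One-step lookahead: q_pi(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) v_pi(s').

    Returns an (S, A) array of action values computed from a state-value
    estimate v (assumed shape (S,)); P and R are the assumed dynamics arrays.
    """
    return R + gamma * P @ v
```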

Policy Improvement Theorem:

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,

$$
q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s)
$$

Then the policy $\pi'$ must be as good as, or better than, policy $\pi$. That is, for all $s \in \mathcal{S}$,

$$
v_{\pi'}(s) \geq v_{\pi}(s)
$$

Furthermore, if $q_{\pi}(s, \pi'(s)) > v_{\pi}(s)$ for at least one state $s \in \mathcal{S}$, then the policy $\pi'$ is strictly better than policy $\pi$.

Proof:

$$
\begin{aligned}
v_{\pi}(s) & \leq q_{\pi}(s, \pi'(s)) \\
& = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = \pi'(s)] \\
& = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \\
& \leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) | S_t = s] \\
& = \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) | S_t = s] \\
& \leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{\pi}(S_{t+2}, \pi'(S_{t+2})) | S_t = s] \\
& \;\;\vdots \\
& \leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s] \\
& = v_{\pi'}(s)
\end{aligned}
$$

Note that $\mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) | S_t = s]$ is the expected return when the first two actions are selected according to $\pi'$ (yielding $R_{t+1}$ and $R_{t+2}$) and $\pi$ is followed thereafter.

With the policy improvement theorem, given a policy $\pi$ and its state-value function $v_{\pi}$, we can construct a new policy $\pi'$ that is as good as, or better than, policy $\pi$ by selecting, for each state $s \in \mathcal{S}$, the action that maximizes the action-value function $q_{\pi}(s, a)$,

Ļ€ā€²(s)=argā”maxā”aqĻ€(s,a)=argā”maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€(sā€²)]\begin{aligned} \pi'(s) &= \arg\max_{a} q_{\pi}(s, a) \\ &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')] \end{aligned}

This algorithm is known as policy improvement.
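A minimal sketch of this greedy improvement step, reusing the hypothetical `q_from_v` helper defined above; ties among maximizing actions are broken arbitrarily by `argmax`:

```python
import numpy as np

def improve_policy(P, R, v, gamma=0.9):
    """Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a).

    Returns a deterministic policy encoded as an (S, A) array of one-hot rows.
    """
    q = q_from_v(P, R, v, gamma)                  # (S, A) action values
    greedy_actions = np.argmax(q, axis=1)         # best action in each state
    pi_new = np.zeros_like(q)
    pi_new[np.arange(q.shape[0]), greedy_actions] = 1.0
    return pi_new
```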

If the new policy $\pi'$ is the same as the old policy $\pi$, i.e. $\pi' = \pi$, then $v_{\pi'} = v_{\pi}$, and for all $s \in \mathcal{S}$ the greedy selection gives

Ļ€ā€²(s)=argā”maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€(sā€²)]=argā”maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€ā€²(sā€²)]\begin{aligned} \pi'(s) &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')] \\ &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi'}(s')] \\ \end{aligned}

Hence, since $\pi'(s)$ attains this maximum and $v_{\pi'}$ satisfies its own Bellman equation,

vĻ€ā€²(s)=maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€ā€²(sā€²)]\begin{aligned} v_{\pi'}(s) &= \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi'}(s')] \\ \end{aligned}

This is exactly the Bellman optimality equation. Therefore $v_{\pi'}(s) = v_{*}(s)$ for all $s \in \mathcal{S}$, and $\pi'$ is an optimal policy $\pi_{*}$.

Policy Iteration

Given an arbitrary policy $\pi$, we can use the iterative policy evaluation algorithm to estimate the state-value function $v_{\pi}$, and then use the policy improvement algorithm to construct a new policy $\pi'$ that is as good as, or better than, $\pi$. Alternating these two steps generates a sequence of monotonically improving policies $\pi_0, \pi_1, \pi_2, \ldots$. Because a finite MDP has only a finite number of deterministic policies, this process must terminate after finitely many iterations: when the improvement step leaves the policy unchanged, we have found an optimal policy $\pi_{*}$.

This algorithm is known as policy iteration. The full procedure is sketched below.
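The sketch alternates the two steps described above, evaluating the current policy and then acting greedily with respect to its value function, and stops when the policy no longer changes. It is built from the hypothetical helpers defined earlier, and the tabular `P`/`R` representation remains an assumption of these examples:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration (sketch): alternate evaluation and greedy improvement.

    Starts from a uniformly random policy and stops when improvement leaves
    the policy unchanged, at which point the policy is greedy with respect to
    its own value function and hence optimal.
    """
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)        # arbitrary initial policy
    while True:
        v = iterative_policy_evaluation(P, R, pi, gamma, theta) # policy evaluation
        pi_new = improve_policy(P, R, v, gamma)                 # policy improvement
        if np.array_equal(pi_new, pi):                          # policy stable -> optimal
            return pi_new, v
        pi = pi_new
```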
