Dynamic Programming

Policy Evaluation

The objective of policy evaluation is to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. Recall that the state-value function for $s \in \mathcal{S}$ is defined as

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t | S_t = s] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
&= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s']] \\
&= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]
\end{aligned}
$$

In dynamic programming we assume the environment's dynamics $p(s', r | s, a)$ are completely known, so for a given policy $\pi$ the only unknowns in the above equation are the state values $v_{\pi}(s), \forall s \in \mathcal{S}$. Consequently, the equation below defines a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns,

$$
v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')], \quad \forall s \in \mathcal{S}
$$

This system of equations can be solved straightforwardly using linear algebra techniques.
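As a concrete illustration of the direct approach, here is a minimal NumPy sketch that solves $(I - \gamma P_{\pi}) v_{\pi} = r_{\pi}$. The tabular representation used here (arrays `P`, `R`, and `pi`, with rewards given as expected rewards $r(s, a)$ rather than the joint distribution $p(s', r | s, a)$) is an assumption made for this example, not something fixed by the text above.

```python
import numpy as np

def evaluate_policy_direct(P, R, pi, gamma=0.9):
    """Solve v_pi = r_pi + gamma * P_pi v_pi exactly.

    Assumed shapes (illustrative only):
      P  : (S, A, S) transition probabilities p(s' | s, a)
      R  : (S, A)    expected rewards r(s, a)
      pi : (S, A)    policy probabilities pi(a | s)
    """
    n_states = P.shape[0]
    # State-to-state transition matrix and expected one-step reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)   # (S, S)
    r_pi = np.einsum("sa,sa->s", pi, R)     # (S,)
    # v_pi = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

For large state spaces, forming and solving this $|\mathcal{S}| \times |\mathcal{S}|$ system becomes expensive, which is one reason the iterative method below is commonly used instead.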

Alternatively, the system of equations can be solved iteratively. First, we initialize the state-value function arbitrarily, say $v_0(s) = 0, \forall s \in \mathcal{S}$. Then, we iteratively update the state-value function using the following update rule,

$$
v_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')], \quad \forall s \in \mathcal{S}
$$

Iterating this update rule generates a sequence of state-value functions $v_0, v_1, v_2, \ldots$ that converges to the true state-value function $v_{\pi}$ as $k \rightarrow \infty$. This algorithm is known as iterative policy evaluation.
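Using the same assumed tabular arrays as above, a minimal sketch of iterative policy evaluation sweeps the state space until the largest change in the value function falls below a small threshold:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation (sketch).

    Assumed shapes: P (S, A, S) transition probabilities, R (S, A) expected
    rewards, pi (S, A) policy probabilities.
    """
    v = np.zeros(P.shape[0])                  # v_0(s) = 0 for all s
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v_k(s') ]
        q = R + gamma * P @ v                 # (S, A): one-step lookahead for every (s, a)
        v_new = np.sum(pi * q, axis=1)        # expectation over actions under pi
        if np.max(np.abs(v_new - v)) < theta: # stop when the sweep changes v only negligibly
            return v_new
        v = v_new
```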

Policy Improvement

Given a policy $\pi$, the iterative policy evaluation algorithm can be used to estimate the state-value function $v_{\pi}$, which gives the expected return from each state when following $\pi$.

Once the state-value function $v_{\pi}$ has been estimated, can we improve the policy to obtain a greater expected return? The answer is yes.

For a given state $s \in \mathcal{S}$, consider selecting an action $a \in \mathcal{A}$ (not necessarily the one $\pi$ would select) and following $\pi$ thereafter. The value of behaving this way is

$$
q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]
$$
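This one-step lookahead is easy to compute once an estimate of $v_{\pi}$ is available. Below is a small helper, again using the assumed `P`/`R` arrays from the earlier sketches:

```python
import numpy as np

def q_from_v(P, R, v, gamma=0.9):
    """One-step lookahead: q_pi(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) v_pi(s').

    Returns an (S, A) array of action values computed from a state-value
    estimate v (assumed shape (S,)); P and R are the assumed dynamics arrays.
    """
    return R + gamma * P @ v
```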

Policy Improvement Theorem:

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,

$$
q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s)
$$

Then the policy $\pi'$ must be as good as, or better than, policy $\pi$. That is, for all $s \in \mathcal{S}$,

$$
v_{\pi'}(s) \geq v_{\pi}(s)
$$

Furthermore, if $q_{\pi}(s, \pi'(s)) > v_{\pi}(s)$ for at least one state $s \in \mathcal{S}$, then the policy $\pi'$ is strictly better than policy $\pi$.

Proof:

$$
\begin{aligned}
v_{\pi}(s) & \leq q_{\pi}(s, \pi'(s)) \\
& = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = \pi'(s)] \\
& = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \\
& \leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) | S_t = s] \\
& = \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) | S_t = s] \\
& \leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{\pi}(S_{t+2}, \pi'(S_{t+2})) | S_t = s] \\
& \;\;\vdots \\
& \leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s] \\
& = v_{\pi'}(s)
\end{aligned}
$$

Note that $\mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) | S_t = s]$ is the expected return when the first two actions are selected according to $\pi'$ (yielding $R_{t+1}$ and $R_{t+2}$) and $\pi$ is followed thereafter.

With the policy improvement theorem, given a policy $\pi$ and its state-value function $v_{\pi}$, we can construct a new policy $\pi'$ that is as good as, or better than, policy $\pi$ by selecting, for each state $s \in \mathcal{S}$, the action that maximizes the action-value function $q_{\pi}(s, a)$,

Ļ€ā€²(s)=argā”maxā”aqĻ€(s,a)=argā”maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€(sā€²)]\begin{aligned} \pi'(s) &= \arg\max_{a} q_{\pi}(s, a) \\ &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')] \end{aligned}

This algorithm is known as policy improvement.
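A minimal sketch of this greedy improvement step, reusing the hypothetical `q_from_v` helper defined above; ties among maximizing actions are broken arbitrarily by `argmax`:

```python
import numpy as np

def improve_policy(P, R, v, gamma=0.9):
    """Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a).

    Returns a deterministic policy encoded as an (S, A) array of one-hot rows.
    """
    q = q_from_v(P, R, v, gamma)                  # (S, A) action values
    greedy_actions = np.argmax(q, axis=1)         # best action in each state
    pi_new = np.zeros_like(q)
    pi_new[np.arange(q.shape[0]), greedy_actions] = 1.0
    return pi_new
```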

If the new policy $\pi'$ is the same as the old policy $\pi$, i.e. $\pi' = \pi$, then $v_{\pi'} = v_{\pi}$, and for all $s \in \mathcal{S}$ the greedy selection gives

Ļ€ā€²(s)=argā”maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€(sā€²)]=argā”maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€ā€²(sā€²)]\begin{aligned} \pi'(s) &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')] \\ &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi'}(s')] \\ \end{aligned}

Hence, since $\pi'(s)$ attains this maximum and $v_{\pi'}$ satisfies its own Bellman equation,

vĻ€ā€²(s)=maxā”aāˆ‘sā€²,rp(sā€²,rāˆ£s,a)[r+Ī³vĻ€ā€²(sā€²)]\begin{aligned} v_{\pi'}(s) &= \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi'}(s')] \\ \end{aligned}

This is exactly the Bellman optimality equation. Therefore $v_{\pi'}(s) = v_{*}(s)$ for all $s \in \mathcal{S}$, and $\pi'$ is an optimal policy $\pi_{*}$.

Policy Iteration

Given an arbitrary policy $\pi$, we can use the iterative policy evaluation algorithm to estimate the state-value function $v_{\pi}$, and then use the policy improvement algorithm to construct a new policy $\pi'$ that is as good as, or better than, $\pi$. Alternating these two steps generates a sequence of monotonically improving policies $\pi_0, \pi_1, \pi_2, \ldots$. Because a finite MDP has only a finite number of deterministic policies, this process must terminate after finitely many iterations: when the improvement step leaves the policy unchanged, we have found an optimal policy $\pi_{*}$.

This algorithm is known as policy iteration. The full procedure is sketched below.
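The sketch alternates the two steps described above, evaluating the current policy and then acting greedily with respect to its value function, and stops when the policy no longer changes. It is built from the hypothetical helpers defined earlier, and the tabular `P`/`R` representation remains an assumption of these examples:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration (sketch): alternate evaluation and greedy improvement.

    Starts from a uniformly random policy and stops when improvement leaves
    the policy unchanged, at which point the policy is greedy with respect to
    its own value function and hence optimal.
    """
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)        # arbitrary initial policy
    while True:
        v = iterative_policy_evaluation(P, R, pi, gamma, theta) # policy evaluation
        pi_new = improve_policy(P, R, v, gamma)                 # policy improvement
        if np.array_equal(pi_new, pi):                          # policy stable -> optimal
            return pi_new, v
        pi = pi_new
```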
