Foundations of Interpretable and Reliable Machine Learning

State Planning Policy Reinforcement Learning

1. Introduction

Question we address: How can we develop physics-informed reinforcement learning algorithms that guarantee safety and interpretability?

It is widely known that modern state-of-the-art reinforcement learning algorithms (DDPG, SAC, TD3, PPO) suffer from serious issues:

  • the trained policies are unstable and brittle with respect to perturbations,
  • it is challenging to transfer the trained policies to new tasks or environments,
  • for mission-critical applications, it is hard to provide safety guarantees (e.g. constraint satisfaction),
  • the trained policies (usually artificial neural networks) are hard or impossible to interpret.

Deployment of reinforcement learning in safety-critical industrial applications and real-life scenarios requires developing new approaches or significantly improving existing ones. The critical applications we have in mind include

  • Self-driving cars, developed by most major car makers
  • Autonomous spacecraft, developed for example by NASA
  • Robotic arms used in medical surgery

2. State Planning Policy Reinforcement Learning

State Planning Policy Reinforcement Learning (SPP-RL) is based on the principle of training an actor that operates in the state space, i.e. maps the current state to a desired target state, \[ \pi\colon \mathcal{S}\to\mathcal{S}, \] in contrast to traditional RL algorithms, in which the policy maps states to actions, $\pi\colon\mathcal{S}\to\mathcal{A}$. Our SPP-RL method is illustrated in Fig. 1.
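A minimal sketch of this action-selection principle: the actor plans a target state, and a separate control module converts the (state, target) pair into a low-level action. All class and function names here are illustrative placeholders, not the authors' implementation; the networks are stubbed with simple linear maps.

```python
import numpy as np

class StatePlanningActor:
    """Policy pi: S -> S, returning a desired target state."""

    def __init__(self, state_dim, rng=None):
        self.rng = rng or np.random.default_rng(0)
        # Stand-in for a trained network: a small random linear map.
        self.W = self.rng.normal(scale=0.1, size=(state_dim, state_dim))

    def plan(self, state):
        # Target state = current state plus a learned offset.
        return state + self.W @ state

class ControlModule:
    """CM: (S x S) -> A, e.g. a learned inverse-dynamics model."""

    def __init__(self, state_dim, action_dim):
        # Stand-in: fixed linear feedback on the state error.
        self.K = np.ones((action_dim, state_dim)) / state_dim

    def act(self, state, target_state):
        # Action proportional to the gap between target and current state.
        return self.K @ (target_state - state)

state_dim, action_dim = 4, 2
actor = StatePlanningActor(state_dim)
cm = ControlModule(state_dim, action_dim)

s = np.zeros(state_dim)
target = actor.plan(s)   # pi(s): a point in the state space
a = cm.act(s, target)    # action actually executed in the environment
print(a.shape)           # (2,)
```

The key structural difference from a classical actor is that `actor.plan` never outputs actions directly; only the control module touches the action space.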

Fig.1 Diagram illustrating our SPP-RL method

Fig.2 Our SPP-RL method pseudo-code

  • We demonstrate that the SPP approach is competitive with classical RL algorithms and can enable various new applications in safe and constrained RL domains.
  • We present the general SPP-RL pseudo-code in Fig. 2. We remark that the RL part can be replaced with virtually any RL algorithm (on-policy as well as off-policy); for example, we have already developed the SPP-DDPG, SPP-SAC, and SPP-TD3 algorithms, based on Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Twin Delayed DDPG (TD3), respectively.
  • Despite searching for a policy in a much larger space (the state space), we observe no curse of dimensionality empirically: the SPP method remains feasible, and we report comparable or higher average returns on a set of MuJoCo benchmarks (see Fig. 3).
  • In particular, the strongest support comes from the results obtained on the Ant-v3 and Humanoid-v3 environments, where SPP-DDPG converges to significantly higher average returns than vanilla DDPG (vanilla implementation based on the OpenAI Spinning Up resources).
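To illustrate why the underlying RL algorithm is interchangeable, here is a hedged skeleton of the data-collection loop: the state-planning actor and control module form a fixed interface, and the resulting transitions can be fed to any off-the-shelf update rule (DDPG, SAC, TD3, ...). The environment, actor, and controller below are toy stand-ins, not the authors' code.

```python
import numpy as np

def spp_rollout(env_step, actor_plan, control_act, s0, horizon, buffer):
    """Collect one trajectory, storing (s, z, r, s') transitions,
    where z = pi(s) is the planned target state."""
    s = s0
    for _ in range(horizon):
        z = actor_plan(s)                # target state from the SPP actor
        a = control_act(s, z)            # control module converts it to an action
        s_next, r = env_step(s, a)
        buffer.append((s, z, r, s_next))
        s = s_next
    return buffer

# Toy linear environment used only to make the sketch runnable.
def env_step(s, a):
    s_next = 0.9 * s + np.pad(a, (0, len(s) - len(a)))
    return s_next, -float(np.sum(s_next ** 2))   # reward: stay near the origin

actor_plan = lambda s: 0.5 * s                   # plan: shrink toward the origin
control_act = lambda s, z: (z - s)[:2]           # act on the first two coordinates

buffer = spp_rollout(env_step, actor_plan, control_act,
                     s0=np.ones(4), horizon=5, buffer=[])
print(len(buffer))  # 5
```

Because the buffer stores target states `z` alongside states and rewards, any off-policy learner that consumes (state, action)-style tuples can be plugged in with `z` playing the role of the action.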

3. Experiments

Results comparing our SPP-DDPG (orange, top row) and SPP-SAC (green, bottom row) with the vanilla implementations (OpenAI Spinning Up) on a set of MuJoCo benchmarks. Average returns over 10 seeds; the mean is shown as a solid curve and the standard deviation as a shaded area.

Fig. 3 Results obtained on a set of MuJoCo benchmarks
(a) HalfCheetah-v2, (SPP-)DDPG
(b) Ant-v2, (SPP-)DDPG
(c) Humanoid-v3, (SPP-)DDPG
(d) HalfCheetah-v2, (SPP-)SAC
(e) Ant-v2, (SPP-)SAC
(f) Humanoid-v3, (SPP-)SAC
environment      $\dim(\mathcal{S})$   $\dim(\mathcal{A})$
HalfCheetah-v2   17                    6
Ant-v2           111                   8
Humanoid-v3      376                   17
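The table above quantifies the search-space blow-up mentioned earlier: a classical policy outputs $\dim(\mathcal{A})$ numbers per step, while an SPP policy outputs $\dim(\mathcal{S})$. A quick computation of the ratio per environment (dimensions taken directly from the table):

```python
# Observation and action dimensionalities from the table above.
envs = {
    "HalfCheetah-v2": (17, 6),
    "Ant-v2": (111, 8),
    "Humanoid-v3": (376, 17),
}

for name, (s_dim, a_dim) in envs.items():
    # Ratio of SPP policy output size to a classical policy output size.
    print(f"{name}: SPP output is {s_dim / a_dim:.1f}x larger than dim(A)")
```

For Humanoid-v3 the SPP actor's output is roughly 22 times larger than the action vector, which makes the absence of an empirical curse of dimensionality notable.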

Below we present a short video clip of the trained policies.

[1] Jacek Cyranka, Jacek Płocharczyk, Misha Zanka, SPP-RL, preprint (available on request)
[2] SPP-RL software.

4. Computational Research

There are many opportunities for research on the SPP methods; below we present the most promising ones, each with potential for interesting experiments.

  • Interpretability -- the target states output by the policy make the actor's actions predictable; to demonstrate interpretability experimentally, we plan to perform a computation using the Ant-Maze environment,
  • Safety RL -- train policies that guarantee safety: the agent behaves under safety constraints, i.e. it does not enter an unsafe region and avoids moving enemies; see Fig. 4 for example environments,
  • Transfer -- transferring to a different, even slightly modified, task or environment is a formidable challenge for classical RL methods; the transfer issue can be addressed using our SPP approach,
  • Develop policies using more relevant, physics-informed neural architectures such as Neural ODEs or SIRENs,
  • Create a new environment -- a testbed for interpretable and safe RL methods, see Fig. 5 below,
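One appeal of planning in the state space is that safety constraints can be imposed directly on the planned target state before any action is taken. The following is a hedged sketch of one such mechanism (not the authors' method): projecting the planned target onto a known safe box, so the agent never plans into an unsafe region. The box bounds and the projection operator are illustrative assumptions.

```python
import numpy as np

def project_to_safe_box(target, low, high):
    """Clip a planned target state into an axis-aligned safe set."""
    return np.clip(target, low, high)

# Hypothetical 2-D safe region: the box [-1, 1] x [-1, 1].
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

unsafe_target = np.array([2.5, -0.3])    # planned state outside the safe box
safe_target = project_to_safe_box(unsafe_target, low, high)
print(safe_target)  # [ 1.  -0.3]
```

Because the constraint is checked on states rather than on opaque actions, such a filter is also directly interpretable: one can read off exactly where the agent intends to go.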

5. Theoretical Research

  • Formal Verification -- we can attempt a formal verification of trained policies, i.e. mathematically verify that the policy satisfies the safety constraints within some fixed time horizon,
  • Theoretical guarantees -- provide convergence guarantees for the Bellman iterates of our variant of the Q-function \[ Q^{\pi, CM}(s_t, a_t) = \mathbb{E}_{r_{i\ge t},\,s_{i>t}\sim E,\ z_{i>t}\sim \pi,\ a_i = CM(s_i,z_i)}\left[R_t \,|\, s_t,a_t\right]. \]
  • Combine the SPP approach with the Hamilton-Jacobi-Bellman (HJB) algorithm for RL safety, where a variational method is combined with the $Q$-function optimization in order to maximize the return under safety constraints \[ 0 = \min\left\{l(x)-V(x,t),\ \frac{\partial V}{\partial t}+\max_{u\in\mathcal{U}}{\nabla_x V^T f(x,u)}\right\}. \]
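For the convergence question above, the object of study would be the one-step Bellman operator whose fixed point is $Q^{\pi,CM}$. A hedged sketch of this operator, assuming a discount factor $\gamma$ and a deterministic planner for brevity (neither is stated explicitly in the text):

```latex
% One-step Bellman backup for the SPP Q-function (sketch):
% z_{t+1} = \pi(s_{t+1}) is the planned target state and
% a_{t+1} = CM(s_{t+1}, z_{t+1}) the action produced by the control module.
\[
(\mathcal{T}^{\pi,CM} Q)(s_t, a_t)
  = \mathbb{E}_{r_t,\, s_{t+1}\sim E}\!\left[
      r_t + \gamma\, Q\bigl(s_{t+1},\, CM(s_{t+1}, \pi(s_{t+1}))\bigr)
    \right].
\]
```

Term by term this matches the definition of $Q^{\pi,CM}$ above: future actions are never free variables but are always generated by the composition $CM(\cdot, \pi(\cdot))$, so standard contraction arguments would need to account for the control module's properties.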
Fig. 5 A testbed environment for safe and interpretable RL methods