Foundations of Interpretable and Reliable Machine Learning

State Planning Policy Reinforcement Learning

1. Introduction

The question we address: how can we develop physics-informed reinforcement learning algorithms that guarantee safety and interpretability?

It is widely known that modern state-of-the-art reinforcement learning algorithms
(DDPG, SAC, TD3, PPO) suffer from serious issues:

the trained policies are unstable and brittle with respect to perturbations,

the trained policies are difficult to transfer to new tasks or environments,

for mission-critical applications, it is hard to provide safety guarantees (e.g. constraint satisfaction),

the trained policies (usually artificial neural networks) are hard or impossible to interpret.

Deploying reinforcement learning in safety-critical industrial applications
and real-life scenarios requires developing new approaches or significantly improving existing ones.
The critical applications we have in mind include

Self-driving cars developed by most of the major car makers,

Autonomous spacecraft, developed for example by NASA,

Robotic arms developed for medical surgery.

2. State Planning Policy Reinforcement Learning

State Planning Policy Reinforcement Learning (SPP-RL) is based on the principle of training an actor that operates in the
state space (i.e. maps the current state to a desired target state)
\[
\pi\colon \mathcal{S}\to\mathcal{S},
\]
in contrast to traditional RL algorithms, in which the policy maps states to actions, $\pi\colon\mathcal{S}\to\mathcal{A}$. Our SPP-RL method is illustrated in Fig. 1.

Fig. 1 Diagram illustrating our SPP-RL method

Fig. 2 Our SPP-RL method pseudo-code
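The state-planning principle can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: the function and variable names (`spp_act`, `control_module`) and the toy proportional controller are our assumptions.

```python
import numpy as np

def spp_act(policy, control_module, state):
    """One SPP-RL control step: the actor proposes a target state and a
    control module CM converts it into a low-level action."""
    target = policy(state)                  # pi : S -> S
    action = control_module(state, target)  # CM : S x S -> A
    return action

# Toy illustration: a fixed linear "policy" proposing target states, and a
# proportional controller as the control module (here A and S coincide).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
policy = lambda s: np.tanh(W @ s)
control_module = lambda s, z: 0.5 * (z - s)

action = spp_act(policy, control_module, np.ones(4))
```

The key point is the signature: the actor never emits actions directly, so everything downstream (safety checks, interpretation) can reason about target states.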

We demonstrate that the SPP approach is competitive with classical RL algorithms and can enable various new applications in the safety and constrained RL domains.

We present the general SPP-RL algorithm pseudo-code in Fig. 2. We remark that the RL part can be replaced with virtually
any RL algorithm (on-policy as well as off-policy); for example, we have already developed the SPP-DDPG, SPP-SAC and SPP-TD3 algorithms, based on the
Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Twin Delayed DDPG (TD3) algorithms, respectively.
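The plug-in structure described above can be sketched as a generic training loop. This is a hedged sketch under our own assumptions (the toy integrator dynamics, the `update` placeholder, and all names are illustrative), not the pseudo-code of Fig. 2: the only point it demonstrates is that transitions carry the proposed target state z, so any RL update rule can be slotted in.

```python
import numpy as np

def spp_train(env_step, reset, policy, control_module, update,
              episodes=2, horizon=10):
    """Generic SPP-RL training loop (illustrative sketch).  Transitions
    store the target state z proposed by the actor, so any RL update rule
    (DDPG/SAC/TD3-style) can be plugged in via `update`."""
    buffer = []
    for _ in range(episodes):
        s = reset()
        for _ in range(horizon):
            z = policy(s)                    # actor plans in state space
            a = control_module(s, z)         # CM turns the plan into an action
            s2, r = env_step(s, a)
            buffer.append((s, z, r, s2))
            update(buffer)                   # placeholder for the RL update
            s = s2
    return buffer

# Toy instantiation: integrator dynamics, reward for approaching the origin.
reset = lambda: np.ones(2)
env_step = lambda s, a: (s + 0.1 * a, -float(np.linalg.norm(s)))
policy = lambda s: 0.9 * s                   # propose a target closer to 0
control_module = lambda s, z: (z - s) / 0.1  # exact one-step tracking control
update = lambda buf: None                    # no learning in this sketch

buffer = spp_train(env_step, reset, policy, control_module, update)
```

With exact one-step tracking the state contracts toward the origin, which mimics the intended division of labor: the actor plans, the control module tracks.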

Despite searching for a policy in a much larger space (the state space), empirically there is no curse of dimensionality:
the SPP method is feasible, and we report comparable or higher average returns on a set of MuJoCo benchmarks (see Fig. 3).

In particular, the strongest support comes from the results obtained for the Ant-v3 and Humanoid-v3 environments, where SPP-DDPG converges to significantly larger average returns
than vanilla DDPG (vanilla implementation based on the OpenAI Spinning Up resources).

3. Experiments

Results comparing our SPP-DDPG (orange, top) and SPP-SAC (green, bottom) with the vanilla implementations (OpenAI Spinning Up) on a set of MuJoCo benchmarks.
Average returns over 10 seeds; the mean is shown as a solid curve and the standard deviation as a shaded area.

Fig. 3 Results obtained on a set of MuJoCo benchmarks

environment        $\dim(\mathcal{S})$   $\dim(\mathcal{A})$
HalfCheetah-v2     17                    6
Ant-v2             111                   8
Humanoid-v3        376                   17

Below we present a short video clip with recordings of the trained policies.

[1] Jacek Cyranka, Jacek Płocharczyk, Misha Zanka, SPP-RL, preprint (available on request)
[2] SPP-RL software. https://github.com/MIMUW-RL/spp-rl

4. Computational Research

There are many opportunities for research on the SPP methods; below we present the most promising ones, each with potential for very interesting experiments.

Interpretability -- the target states output by the policy make the actor's behavior predictable; to demonstrate interpretability experimentally, we plan to perform a computation using the Ant-Maze environment,

Safety RL -- train policies that guarantee safety: the agent behaves under safety constraints,
i.e. it will not enter an unsafe region and will avoid moving enemies; see Fig. 4 for example environments,

Transfer -- transferring to a different, even slightly modified, task or environment is a formidable challenge for classical RL methods;
the policy transfer issue can be addressed using our SPP approach,

Develop policies using more relevant, 'physics-informed' neural architectures such as Neural ODEs or SIREN,

Create a new environment -- a testbed for interpretable and safe RL methods; see Fig. 5 below.
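The safety direction above is enabled by the state-space interface: since the actor outputs target states, constraints can be enforced on its output before any action is executed. The sketch below is one simple, assumed mechanism (a projection onto a box-shaped safe region), not necessarily the mechanism used in our experiments.

```python
import numpy as np

def project_to_safe(target, lo, hi):
    """Project a proposed target state onto a box-shaped safe region.
    Because the SPP actor plans in state space, safety constraints can be
    enforced directly on its output before the control module acts."""
    return np.clip(target, lo, hi)

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # safe region
proposed = np.array([2.0, 0.5])        # this target would leave the safe set
safe_target = project_to_safe(proposed, lo, hi)
```

For non-box safe sets the clip would be replaced by a projection onto the constraint set, but the principle is the same: the plan is sanitized before execution.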
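For the physics-informed architectures mentioned above, a SIREN network replaces standard activations with scaled sinusoids. The layer below is a minimal NumPy sketch with a SIREN-style uniform initialization; the dimensions and frequency scale are illustrative assumptions.

```python
import numpy as np

def siren_layer(x, W, b, omega0=30.0):
    """One SIREN layer: a sinusoidal activation with frequency scaling
    omega0, suited to representing smooth physical signals."""
    return np.sin(omega0 * (x @ W + b))

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 16
# SIREN-style initialization: uniform in +/- sqrt(6 / d_in) / omega0
bound = np.sqrt(6.0 / d_in) / 30.0
W = rng.uniform(-bound, bound, size=(d_in, d_hidden))
b = rng.uniform(-bound, bound, size=d_hidden)

y = siren_layer(np.ones((5, d_in)), W, b)
```

The bounded, smooth output of each layer is part of what makes such architectures attractive for policies that must respect physical structure.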

5. Theoretical Research

Formal Verification -- we can attempt a formal verification of trained policies, i.e. mathematically verify that
the policy will satisfy the safety constraints within some fixed time horizon,

Theoretical guarantees -- prove convergence of the Bellman iterates for our variant of the Q-function
\[
Q^{\pi, CM}(s_t, a_t) = \mathbb{E}_{r_{i\ge t},\, s_{i>t}\sim E,\ z_{i>t}\sim \pi,\ a_i = CM(s_i, z_i)}\left[R_t \mid s_t, a_t\right],
\]

Combine the SPP approach with the Hamilton-Jacobi-Bellman (HJB) algorithm for RL safety, where
a variational method is combined with the $Q$-function optimization in order to maximize the return under safety constraints
\[
0 = \min\left\{l(x)-V(x,t),\ \frac{\partial V}{\partial t}+\max_{u\in\mathcal{U}}{\nabla_x V^T f(x,u)}\right\}.
\]
\]
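The HJB variational inequality above has a standard discrete-time analogue that can be solved by value iteration on a grid: V(x) = min( l(x), max_u V(x + f(x,u) dt) ). The sketch below uses a 1-D toy setup (dynamics f(x,u) = u, control set {-1, +1}, l the signed distance to the unsafe set); all of these choices are our illustrative assumptions, not the construction from the poster.

```python
import numpy as np

# Value iteration for the discrete-time HJB safety equation
#   V(x) = min( l(x), max_u V(x + f(x, u) * dt) )
# on a 1-D grid, with f(x, u) = u and controls {-1, +1}.
xs = np.linspace(-2.0, 2.0, 81)       # grid with spacing 0.05
dt = 0.05
us = (-1.0, 1.0)                      # control set
l = 1.0 - np.abs(xs)                  # safe iff l(x) >= 0, i.e. |x| <= 1

V = l.copy()
for _ in range(200):
    # value after applying each control for one step (linear interpolation)
    V_next = np.stack([np.interp(xs + u * dt, xs, V) for u in us])
    V = np.minimum(l, V_next.max(axis=0))

# States with V(x) >= 0 are those from which the best controller
# can keep the agent safe forever.
```

In this toy problem the safe interval [-1, 1] is controlled-invariant, so the computed value function is non-negative exactly there.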
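For the formal-verification direction above, one standard tool is interval bound propagation: sound output bounds for a network over a box of inputs. The sketch below bounds a small random ReLU policy; in SPP-RL the output is a proposed target state, so bounding the output directly bounds where the agent is steered. The network, weights, and function names are illustrative assumptions, not a verified artifact from our experiments.

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Sound interval bounds for x -> W @ x + b over the box [lo, hi]:
    split W into positive and negative parts to get tight affine bounds."""
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def ibp_relu(lo, hi):
    # ReLU is monotone, so it maps interval bounds to interval bounds.
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# A small random ReLU policy network (illustrative weights).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(8, 2)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(2, 8)), np.zeros(2)

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # box of states
lo, hi = ibp_relu(*ibp_affine(lo, hi, W1, b1))
lo, hi = ibp_affine(lo, hi, W2, b2)
# If [lo, hi] lies inside the safe region, safety of the proposed target
# states is formally verified for every state in the input box.
```

Tighter relaxations (e.g. linear bounds instead of intervals) follow the same pattern; the interval version is just the simplest sound instance.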