Foundations of Interpretable and Reliable Machine Learning

Question we address: How can we develop physics-informed reinforcement learning algorithms that guarantee safety and interpretability?

It is widely known that policies trained using reinforcement learning (RL) to solve simulated robotics problems (MuJoCo) are extremely brittle and unstable, i.e. a trained policy will most likely break down after a small perturbation (e.g. poking the robot) or when transferred to a similar task. It is often impossible to provide any safety guarantees for constraint satisfaction, or to interpret how the trained policies work.

Challenges in Computation
The State Planning Policy (SPP) approach to reinforcement learning is based on the novel principle of training an actor that operates entirely in the state space, i.e. maps the current state to a desired target state: $\pi\colon \mathcal{S}\to\mathcal{S}$. Preliminary results show that the SPP-DDPG, SPP-SAC and SPP-TD3 algorithms, which build on state-of-the-art (SOTA) algorithms, obtain returns that are at least competitive and in some cases significantly better (e.g. SPP-DDPG on Humanoid compared with the classical DDPG algorithm).
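The division of labour between the state-space actor and the control model can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the trained SPP agents: the toy single-integrator dynamics, the hand-written `state_policy` standing in for a learned $\pi$, the analytic `control_model` standing in for a learned inverse dynamics model, and the constants `DT` and `A_MAX`.

```python
import numpy as np

DT = 0.1      # integration step of the toy system (assumed)
A_MAX = 1.0   # action bound (assumed)

def state_policy(s, goal):
    """Hypothetical actor pi: S -> S; proposes a target state, not an action."""
    return s + 0.5 * (goal - s)

def control_model(s, z):
    """Hypothetical control model CM(s, z): action that moves s towards z."""
    return np.clip((z - s) / DT, -A_MAX, A_MAX)

def env_step(s, a):
    """Toy single-integrator dynamics s' = s + DT * a."""
    return s + DT * a

def rollout(s0, goal, n_steps=100):
    s = np.asarray(s0, dtype=float)
    targets = []
    for _ in range(n_steps):
        z = state_policy(s, goal)   # the actor plans in state space
        a = control_model(s, z)     # the control model turns the target into an action
        s = env_step(s, a)
        targets.append(z)
    return s, np.array(targets)

final_state, planned_targets = rollout(s0=[0.0, 0.0], goal=np.array([1.0, -1.0]))
```

Because the actor outputs states, the sequence `planned_targets` can be inspected or plotted directly in the state space, which is the property the interpretability experiments aim to exploit.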
Current experimental challenges include:
Demonstrate the interpretability of the trained policies using the Ant-Maze environment;
Perform a safety-RL experiment -- train an agent that behaves under safety constraints, i.e. stays away from unsafe regions and avoids moving enemies;
Implement a novel safety environment -- a test-bed for constrained and safe policies.
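As a rough picture of what such a test-bed could look like, here is a minimal sketch of a safety environment. It is purely hypothetical: the class name `ToySafetyEnv`, the geometry (one static unsafe disc, one enemy patrolling up and down) and the choice to return a separate cost signal next to the reward are assumptions made for illustration, not the planned environment.

```python
import numpy as np

class ToySafetyEnv:
    """Hypothetical test-bed: a 2D point agent must reach a goal while
    avoiding a static unsafe disc and a moving enemy."""

    def __init__(self, dt=0.1):
        self.dt = dt
        self.goal = np.array([1.0, 0.0])
        self.unsafe_center = np.array([0.5, 0.0])
        self.unsafe_radius = 0.2
        self.enemy_radius = 0.1
        self.reset()

    def reset(self):
        self.pos = np.zeros(2)
        self.enemy = np.array([0.5, 0.8])
        self.enemy_vel = np.array([0.0, -0.5])
        return np.concatenate([self.pos, self.enemy])

    def step(self, action):
        a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
        self.pos = self.pos + self.dt * a
        self.enemy = self.enemy + self.dt * self.enemy_vel
        if abs(self.enemy[1]) > 0.8:
            # Turn the enemy back towards the centre of its patrol line.
            self.enemy_vel[1] = -np.sign(self.enemy[1]) * abs(self.enemy_vel[1])
        reward = -np.linalg.norm(self.goal - self.pos)
        in_unsafe = np.linalg.norm(self.pos - self.unsafe_center) < self.unsafe_radius
        hit_enemy = np.linalg.norm(self.pos - self.enemy) < self.enemy_radius
        cost = float(in_unsafe or hit_enemy)   # safety signal, kept apart from the reward
        return np.concatenate([self.pos, self.enemy]), reward, cost

def total_cost(env, actions):
    """Accumulated safety cost of an open-loop action sequence."""
    env.reset()
    return sum(env.step(a)[2] for a in actions)

env = ToySafetyEnv()
straight = [[1.0, 0.0]] * 10                                         # cuts through the unsafe disc
detour = [[0.0, -1.0]] * 4 + [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 4    # goes around it
```

Keeping the cost separate from the reward lets the same environment serve both constrained-RL formulations (a bound on accumulated cost) and hard-safety evaluations (zero cost along the whole trajectory).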

Challenges in Theory
There are many opportunities for work on the theoretical foundations of the SPP approach. This area is as yet unexplored; below we present the most promising research directions, aimed at proving important theorems.
Work towards theorems providing convergence guarantees for the Bellman iterates of our variant of the Q-function, which involves an inverse dynamics control model;
Establish a link between State Planning Policies and Hamilton-Jacobi-Bellman optimization in order to compute safety-guaranteeing policies;
Formal verification -- attempt a formal verification of trained policies, i.e. extract rules from the neural network and mathematically verify that the policy satisfies the safety constraints within some fixed time horizon.

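The Q-function of the SPP approach, in which the control model $CM$ converts the target states $z_i$ proposed by the policy $\pi$ into actions $a_i$, reads:
$Q^{\pi, CM}(s_t, a_t) = \mathbb{E}_{r_{i\ge t},\, s_{i>t}\sim E,\ z_{i>t}\sim \pi,\ a_i = CM(s_i,z_i)}\left[R_t \mid s_t,a_t\right].$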
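To make the object of the convergence question concrete, the definition of $Q^{\pi, CM}$ can be unrolled by one step. Assuming (this is not stated in the text) that $R_t$ is the discounted return $R_t = \sum_{i\ge t}\gamma^{i-t} r_i$ with a discount factor $\gamma\in[0,1)$, one obtains the following recursion, whose iterates a convergence theorem would have to control:
$Q^{\pi, CM}(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1}\sim E}\left[r_t + \gamma\, \mathbb{E}_{z_{t+1}\sim \pi}\left[Q^{\pi, CM}\left(s_{t+1}, CM(s_{t+1}, z_{t+1})\right)\right]\right].$
Compared with the standard actor-critic recursion, the only change is that the next action is not sampled from the policy directly but produced by the control model from the policy's target state.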
The Hamilton-Jacobi-Bellman variational inequality for the value function $V$ takes the form:
$0 = \min\left\{l(x)-V(x,t),\ \frac{\partial V}{\partial t}+\max_{u\in\mathcal{U}}\nabla_x V^\top f(x,u)\right\}.$
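A common way to approximate a variational inequality of this form numerically is a backward recursion in discrete time, $V(x,t) = \min\{l(x), \max_{u} V(x + \Delta t\, f(x,u), t+\Delta t)\}$ with terminal condition $V(x,T) = l(x)$. The sketch below applies it to a one-dimensional toy problem. All specifics are assumptions chosen for illustration: the reading of $l(x)\ge 0$ as "state $x$ satisfies the constraint", the dynamics $f(x,u) = x + u$, the three-element control set, the grid and the horizon.

```python
import numpy as np

def safety_value(x_grid, controls, l, f, dt, n_steps):
    """Backward recursion V_k(x) = min{ l(x), max_u V_{k+1}(x + dt * f(x, u)) }."""
    margin = l(x_grid)
    V = margin.copy()   # terminal condition V(x, T) = l(x)
    for _ in range(n_steps):
        # Value at each successor state, linearly interpolated on the grid;
        # np.interp assigns the boundary value to successors that leave the grid.
        V_next = np.stack([np.interp(x_grid + dt * f(x_grid, u), x_grid, V)
                           for u in controls])
        V = np.minimum(margin, V_next.max(axis=0))
    return V

# Toy setup (assumed): unstable dynamics dx/dt = x + u, constraint |x| <= 1.
x_grid = np.linspace(-2.0, 2.0, 81)
controls = np.array([-0.5, 0.0, 0.5])
l = lambda x: 1.0 - np.abs(x)
f = lambda x, u: x + u

V = safety_value(x_grid, controls, l, f, dt=0.05, n_steps=40)
safe_states = x_grid[V >= 0.0]   # states from which the constraint can be kept over the horizon
```

In this toy problem the control authority is too weak to stop the drift once $|x|$ is well above $0.5$, so such states receive a negative value even though they currently satisfy the constraint. This is the kind of information a safety-guaranteeing policy needs.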