Foundations of Interpretable and Reliable Machine Learning

Question we address: How can we develop physics-informed reinforcement learning algorithms that guarantee safety and interpretability?

It is well known that policies trained with reinforcement learning (RL) on simulated robotics benchmarks (e.g. MuJoCo)
are extremely brittle and unstable: a trained policy will most likely break down after a small perturbation (e.g. poking the robot)
or after transfer to a similar task. It is often impossible to provide any safety guarantees for constraint
satisfaction, or any interpretation of how the trained policies work.

To address these issues we created State Planning Policy Reinforcement Learning.

Challenges in Computation

The State Planning Policy (SPP) approach to reinforcement learning is based on the novel principle of training an actor that
operates entirely in the state space, i.e. it maps the current state to a desired target state, $\pi\colon \mathcal{S}\to\mathcal{S}$.
Preliminary results show that the SPP-DDPG, SPP-SAC and SPP-TD3 algorithms, built on
SOTA methods, obtain returns that are at least competitive with their classical counterparts and in some cases (e.g. SPP-DDPG on Humanoid) perform significantly better than classical DDPG.
Current experimental challenges include: demonstrating the interpretability of trained policies in the Ant-Maze
environment; performing a safety-RL experiment -- training an agent under safety constraints,
i.e. staying away from unsafe regions and avoiding moving enemies; and implementing a novel safety environment -- a test-bed for
constrained and safe policies.
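To make the state-space principle concrete, here is a minimal sketch of the SPP pipeline: the actor plans a target state, and a separate inverse dynamics model converts that plan into a low-level action. The linear maps, dimensions, and `tanh` parameterization are purely illustrative assumptions, not the trained networks.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
STATE_DIM, ACTION_DIM = 4, 2
rng = np.random.default_rng(0)

def spp_actor(state, W):
    """SPP policy pi: S -> S. Returns a desired target state (toy linear sketch)."""
    return np.tanh(W @ state)

def inverse_dynamics(state, target_state, M):
    """Inverse dynamics control model g: (s, s_target) -> a.
    Maps the state-space plan to a low-level action."""
    return np.tanh(M @ np.concatenate([state, target_state]))

# Random parameters standing in for trained networks.
W = rng.normal(size=(STATE_DIM, STATE_DIM))
M = rng.normal(size=(ACTION_DIM, 2 * STATE_DIM))

state = rng.normal(size=STATE_DIM)
target = spp_actor(state, W)                  # plan entirely in state space
action = inverse_dynamics(state, target, M)   # realize the plan as an action
```

The interpretability argument rests on the intermediate product: `target` lives in the same space as the observation, so it can be inspected directly (e.g. plotted as a desired robot pose), unlike a raw torque vector.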

Challenges in Theory

There are many opportunities for work on the theoretical foundations of the developing SPP approach.
This area is as yet unexplored; below we present the most promising research directions, each aiming at an important
theorem.
Work towards theorems providing convergence guarantees of the Bellman iterates for our variant of the Q-function
that involves an inverse dynamics control model.
Establish a link between State Planning Policies and Hamilton-Jacobi-Bellman optimization in order to compute
safety-guaranteeing policies. Formal verification -- we can attempt a formal verification of trained policies, i.e. extract
rules from the neural network and mathematically verify that
the policy will satisfy the safety constraints within some fixed time horizon.
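To illustrate the kind of Bellman iterate in question, here is a hedged toy sketch: tabular value iteration on a hypothetical deterministic chain, where the SPP "action" is a desired next state that an assumed inverse-dynamics controller realizes exactly. The chain, the rewards, and the exact realizability are illustrative assumptions; the convergence itself follows from the standard $\gamma$-contraction argument, which is what the sought-after theorems would need to extend to the learned, inexact inverse dynamics case.

```python
import numpy as np

# Toy deterministic chain MDP with states 0..N-1 (illustrative assumption).
# The SPP action at state s is a desired next state t; here an assumed
# inverse-dynamics controller realizes t exactly, so the next state is t.
N, GAMMA = 5, 0.9
reward = np.zeros(N)
reward[N - 1] = 1.0  # hypothetical reward for reaching the last state

def reachable(s):
    """Target states the controller can realize from s (stay or move one step)."""
    return [t for t in (s - 1, s, s + 1) if 0 <= t < N]

# Q is indexed by (current state, desired target state).
Q = np.zeros((N, N))
for _ in range(500):  # Bellman iterates Q <- T Q
    Q_new = np.zeros_like(Q)
    for s in range(N):
        for t in reachable(s):
            # Backup through the realized target state t.
            Q_new[s, t] = reward[t] + GAMMA * max(Q[t, u] for u in reachable(t))
    if np.max(np.abs(Q_new - Q)) < 1e-10:  # gamma-contraction => convergence
        Q = Q_new
        break
    Q = Q_new

# Greedy state-space value: V(s) = max over realizable targets.
V = np.array([max(Q[s, t] for t in reachable(s)) for s in range(N)])
```

In this exact-realizability setting the operator is the usual $\gamma$-contraction in the sup norm, so the iterates converge to a unique fixed point; the open theoretical question is what survives when the inverse dynamics model is learned and only approximately realizes the planned target state.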