*Question we address*: How can we develop physics-informed reinforcement learning algorithms that guarantee safety and interpretability?

It is widely known that policies trained using reinforcement learning (RL) to solve simulated robotics problems (e.g. in MuJoCo) are
*extremely brittle and unstable*: a trained policy will most likely break down after a small perturbation (e.g. poking the robot)
or after being transferred to a similar task. It is often impossible to provide any
*safety guarantees* for constraint satisfaction, or any interpretation of how the trained policies work.

##### To address these issues we created State Planning Policy Reinforcement Learning (SPP-RL).

##### Challenges in Theory

There are many opportunities for work on the *theoretical foundations* of the developing SPP approach.
This area is largely unexplored; below we present the research directions we consider most promising, each aiming at an
important theorem.

- Work towards theorems providing convergence guarantees of the *Bellman iterates* for our variant of the Q-function,
  which involves an inverse dynamics control model (see the Q-function and the sketch below);

- Establish a link between *State Planning Policies* and *Hamilton-Jacobi-Bellman* optimization in order to compute
  safety-guaranteeing policies (see the variational inequality below);

- *Formal Verification* -- attempt a formal verification of trained policies, i.e. extract rules from the neural network
  and mathematically verify that the policy will satisfy the safety constraints within some fixed time horizon;
  a toy sketch of one standard approach follows this list.
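As a toy illustration of the verification direction (a sketch only -- interval bound propagation is one standard technique, and the function and all names below are ours, not part of the project), one can push a whole box of states through a ReLU policy network and obtain sound bounds on every output:

```python
import numpy as np

def ibp_bounds(weights, biases, lo, hi):
    """Sound element-wise output bounds of a ReLU MLP over the input box [lo, hi].

    Layers compute y = W @ x + b with ReLU between them. Splitting W into
    its positive and negative parts makes each layer's interval bound sound.
    """
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        lo, hi = W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b
        if i < len(weights) - 1:              # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi
```

Covering a region of state space with such boxes and checking that every output box stays inside the allowed action set yields a conservative one-step certificate; extending it to a fixed time horizon composes this check with a verified model of the dynamics.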

For the first direction, the SPP variant of the Q-function involves a control model CM (an inverse dynamics model): the policy samples a planned state z, and the executed action is a = CM(s, z),

\[
Q^{\pi, CM}(s_t, a_t) = \mathbb{E}_{r_{i\ge t},\, s_{i>t}\sim E,\ z_{i>t}\sim \pi,\ a_i = CM(s_i, z_i)}\left[R_t \mid s_t, a_t\right].
\]
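A minimal sketch (PyTorch; all module names are hypothetical, not the project's API) of how this changes the usual one-step Bellman backup: the policy proposes a target state z, and the control model CM converts it into the action at which the critic is evaluated:

```python
import torch

def spp_bellman_target(critic, policy, control_model,
                       reward, s_next, done, gamma=0.99):
    """TD target r + gamma * Q(s', CM(s', z')) with z' ~ pi(s').

    `critic`, `policy` and `control_model` are assumed torch.nn.Modules;
    the only change versus a standard actor-critic target is that the
    actor outputs a planned state, not an action.
    """
    with torch.no_grad():
        z_next = policy(s_next)                 # planned target state z'
        a_next = control_model(s_next, z_next)  # a' = CM(s', z'), inverse dynamics
        q_next = critic(s_next, a_next)         # Q^{pi,CM}(s', a')
    return reward + gamma * (1.0 - done) * q_next
```

Any convergence guarantee for iterating this backup presumably has to account for the accuracy of CM, which is what makes the analysis nonstandard.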

For the second direction, the safety value function V of Hamilton-Jacobi reachability analysis satisfies the variational inequality

\[
0 = \min\left\{ l(x) - V(x,t),\ \frac{\partial V}{\partial t} + \max_{u\in\mathcal{U}} \nabla_x V(x,t)^{\top} f(x,u) \right\},
\]

where l is nonnegative exactly on the safe set, f is the system dynamics, and the set where V is nonnegative consists of the states from which safety can be maintained.
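A toy sketch (NumPy; the dynamics, grid, and step sizes are illustrative assumptions) of solving this variational inequality by marching backward in time on a grid, for a double integrator that must keep its position within [-1, 1]:

```python
import numpy as np

# Double integrator: xdot = v, vdot = u with |u| <= 1; constraint |x| <= 1.
xs = np.linspace(-2.0, 2.0, 161)
vs = np.linspace(-2.0, 2.0, 161)
dx = xs[1] - xs[0]
X, Vel = np.meshgrid(xs, vs, indexing="ij")

l = 1.0 - np.abs(X)   # l(x) >= 0 exactly on the safe set
V = l.copy()          # terminal condition V(., T) = l
dt = dx / 8.0         # conservative CFL-stable time step

for _ in range(3000):  # march backward from t = T
    dVdx, dVdv = np.gradient(V, dx, dx)
    ham = dVdx * Vel + np.abs(dVdv)  # max over |u| <= 1 of grad V . f(x, u)
    V = np.minimum(l, V + dt * ham)

# {V >= 0} approximates the states from which |x| <= 1 can be maintained.
```

Central differences are used here only for brevity; a serious solver would use an upwind scheme (e.g. Lax-Friedrichs), as in standard level-set toolboxes. A policy that picks the maximizing control u of the Hamiltonian is then safety-preserving on that set.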