Foundations of Interpretable and Reliable Machine Learning

Question we address: How can we develop physics-informed reinforcement learning algorithms that guarantee safety and interpretability?

It is widely known that policies trained with reinforcement learning (RL) on simulated robotics benchmarks (e.g. MuJoCo) are extremely brittle and unstable: a trained policy will most likely break down after a small perturbation (e.g. poking the robot) or after transfer to a similar task. Moreover, it is often impossible to provide any safety guarantees for constraint satisfaction, or any interpretation of how the trained policies work.
To address these issues we created State Planning Policy Reinforcement Learning.

[Figure: learning curves for DDPG and SPP-DDPG on Humanoid-v3]
Challenges in Computation
The State Planning Policy (SPP) approach to reinforcement learning is based on the novel principle of training an actor that operates entirely in the state space, i.e. maps the current state to a desired target state, $\pi\colon \mathcal{S}\to\mathcal{S}$. Preliminary results show that the SPP-DDPG, SPP-SAC and SPP-TD3 algorithms, built on top of the corresponding state-of-the-art methods, obtain returns that are at least competitive, and in some cases (e.g. SPP-DDPG on Humanoid) significantly better than the classical DDPG algorithm.
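The composition described above can be sketched in a few lines. This is an illustrative toy, not the actual SPP-RL implementation: `actor`, `control_model` and `spp_action` are hypothetical stand-ins for the learned state-space policy $\pi$ and the inverse-dynamics control model.

```python
import numpy as np

def actor(state):
    # Stand-in for the learned state-space policy pi: S -> S.
    # Toy behavior: propose a target 10% of the way toward the origin.
    return 0.9 * state

def control_model(state, target_state):
    # Stand-in for an inverse-dynamics control model CM: (s, s*) -> a.
    # Toy behavior: a proportional controller on the state error.
    return 2.0 * (target_state - state)

def spp_action(state):
    """Compose actor and control model: a = CM(s, pi(s))."""
    target = actor(state)
    return control_model(state, target)

s = np.array([1.0, -2.0])
a = spp_action(s)  # CM(s, 0.9*s) = 2*(0.9*s - s) = -0.2*s
```

The key design point is that only the actor is trained by RL; the control model translates its state-space plan into low-level actions.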
Current experimental challenges include:
  • demonstrating interpretability of the trained policies in the Ant-Maze environment,
  • performing a safety-RL experiment: training an agent that behaves under safety constraints, i.e. stays away from unsafe regions and avoids moving enemies,
  • implementing a novel safety environment, a test-bed for constrained and safe policies.

Challenges in Theory
There are many opportunities for work on the theoretical foundations of the developing SPP approach. This area is as yet unexplored; below we present the most promising research directions, each aiming at an important theorem.
  • Work towards theorems providing convergence guarantees of the Bellman iterates for our variant of the Q-function, which involves an inverse-dynamics control model.
  • Establish a link between State Planning Policies and Hamilton-Jacobi-Bellman optimization in order to compute safety-guaranteeing policies.
  • Formal verification: attempt a formal verification of trained policies, i.e. extract rules from the neural network and mathematically verify that the policy will satisfy the safety constraints within some fixed time horizon.
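One standard building block for this kind of neural-network verification is interval bound propagation (IBP). The sketch below is illustrative only (it is not the project's verification pipeline): it soundly over-approximates the output set of a small ReLU network over a box of input states, so if the computed bounds lie inside the safe region, the property holds for every state in the box.

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    # Exact interval image of the affine map x -> W @ x + b.
    center, radius = (lo + hi) / 2.0, (hi - lo) / 2.0
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def ibp_relu(lo, hi):
    # ReLU is monotone, so it maps bounds to bounds elementwise.
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def output_bounds(lo, hi, layers):
    """layers: list of (W, b) pairs; ReLU between layers, linear output.
    Returns sound elementwise lower/upper bounds on the network output."""
    for i, (W, b) in enumerate(layers):
        lo, hi = ibp_affine(lo, hi, W, b)
        if i < len(layers) - 1:
            lo, hi = ibp_relu(lo, hi)
    return lo, hi
```

Checking a safety constraint then reduces to comparing the returned bounds against the constraint thresholds, repeated step by step over the fixed time horizon.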

\[ Q^{\pi, CM}(s_t, a_t) = \mathbb{E}_{r_{i\ge t},s_{i>t}\sim E,\ z_{i>t}\sim \pi,\ a_i = CM(s_i,z_i)}{\left[R_t|s_t,a_t\right]}; \]
\[ 0 = \min\left\{l(x)-V(x,t), \frac{\partial V}{\partial t}+\max_{u\in\mathcal{U}}{\nabla_x{V^Tf(x,u)}}\right\}. \]
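The first equation above changes only how the next action enters the Bellman backup: it is produced by the control model from the actor's target state, $a' = CM(s', \pi(s'))$, rather than sampled directly. A hedged sketch of the resulting one-step TD target (all three networks are placeholders for the learned models):

```python
def td_target(reward, next_state, done, gamma, actor, control_model, target_q):
    """One-step Bellman target for Q^{pi,CM}: the next action is
    a' = CM(s', pi(s')), i.e. the control model applied to the actor's plan."""
    next_action = control_model(next_state, actor(next_state))
    return reward + gamma * (1.0 - done) * target_q(next_state, next_action)
```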

See here for more details.

I am searching for a student to join my interdisciplinary research group at the University of Warsaw, Poland.
The project is in collaboration with Prof. Henryk Michalewski (Google & University of Warsaw).
  • the position is suitable for students who want to get involved in ML/RL research,
  • full-time monthly salary of at least 6000 PLN gross, negotiable and highly dependent on the candidate's qualifications,
  • possibility of combining the position with the PhD school run at the University of Warsaw, Poland,
  • work towards results publishable at major CS conferences, aiming at CAV, ICML, NIPS, ICLR, ICRA, AAAI,
  • worldwide collaboration with renowned academic institutions (including UC San Diego, Stony Brook, Rutgers, TU Wien) and industry (Google),
  • access to a personal computer and computational resources.
Requirements:
  • an MSc degree, or MSc studies in progress,
  • passion for research,
  • proficient in Python programming,
  • knowledge of the fundamentals of machine learning, neural networks and reinforcement learning algorithms (e.g. at the level of the Deep Learning book),
  • knowledge of the fundamentals of mathematics (calculus, linear algebra, basic real analysis),
  • ability to work individually and to self-study.

Please see the detailed project description below, and reach out to me in case of any questions.
Apply by sending a CV and a motivation letter explaining why you are interested in this post to my jcyranka at gmail account.

Best, Jacek Cyranka