Visualization of long-term model predictions from four state-of-the-art probabilistic system identification methods. The observed system output time series (blue) is displayed together with each model's predictive distribution (mean ± 2 std, red). Results from the proposed Multi-Step Gaussian Process (MSGP) [ ] optimization scheme are shown in the bottom-right plot.

Andreas Doerr (Project Leader),
Sebastian Trimpe (Project Leader),
Stefan Schaal,
Christian Daniel
(Bosch Center for Artificial Intelligence),
Duy Nguyen-Tuong
(Bosch Center for Artificial Intelligence),
Marc Toussaint
(Machine Learning & Robotics Lab)

Model-based Reinforcement Learning (RL) algorithms make efficient use of the observed system interaction data by constructing a model of the underlying dynamics. Their data-efficiency has been shown to improve greatly over that of model-free methods (e.g. policy gradient or value-function-based approaches). At the same time, incorporating uncertainty is essential to mitigate the effects of sparse and non-i.i.d. data and to prevent model bias.

Learning probabilistic predictive models from time-series data on real systems is, however, a challenging task due to noise and delays, unobserved system states, and complex, non-linear, and potentially discontinuous dynamics (e.g. friction and stiction in joints).

In this branch of work, we strive to improve the model learning process by leveraging structure from the subsequent RL problem. The RL problem is expressed as minimizing the expected, discounted cost-to-go, given by

\begin{equation}

J = \min_{\theta_\pi} \mathbb{E}_{p(y_{1:T} \mid x_0, \theta_\pi)} \left[ \sum_{t=0}^{T} \gamma^t c(x_t, u_t) \right] \,,

\label{eq:cost}

\end{equation}

with cost function $c(\cdot, \cdot)$ on system state $x_t$ and action $u_t$. The expectation is taken over the predicted system outputs $y_{1:T}$ over a finite horizon $T$ when starting in the initial state $x_0$ and executing a policy $\pi$, parametrized by $\theta_\pi$.
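The expectation in the cost-to-go is typically approximated by sampling rollouts of the (probabilistic) dynamics model under the policy. The following minimal sketch illustrates such a Monte Carlo estimate; the linear dynamics, feedback policy, and quadratic cost are hypothetical stand-ins, not the models used in the cited work.

```python
import numpy as np

# Hypothetical stand-ins for the learned probabilistic dynamics model,
# the feedback policy pi(x; theta), and the stage cost c(x, u).
def step(x, u, rng):
    return 0.9 * x + u + 0.05 * rng.standard_normal()   # noisy transition sample

def policy(x, theta):
    return -theta * x                                   # linear feedback policy

def cost(x, u):
    return x**2 + 0.1 * u**2                            # quadratic stage cost

def expected_discounted_cost(x0, theta, T=50, gamma=0.95, n_samples=200, seed=0):
    """Monte Carlo estimate of J = E[ sum_{t=0}^T gamma^t c(x_t, u_t) ]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        x, J = x0, 0.0
        for t in range(T + 1):
            u = policy(x, theta)
            J += gamma**t * cost(x, u)
            x = step(x, u, rng)     # sample the next state from the model
        total += J
    return total / n_samples

J = expected_discounted_cost(x0=1.0, theta=0.5)
```

The policy parameters $\theta_\pi$ can then be optimized against this sampled objective, e.g. with a gradient-free or reparametrized-gradient optimizer.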

In [ ], we exploit three main ideas to improve model learning in particular for model-based RL:

- Optimize for long-term predictions given a feedback policy.
- Restrict model learning to the input manifold reachable by the specific policy.
- Incorporate the approximations made for computing the expected discounted cost into the model learning.
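The first two ideas can be illustrated with a toy sketch: instead of fitting the model to one-step transitions, the model is rolled out over the horizon under the feedback policy and fit to the resulting multi-step error. The one-parameter linear model, fixed policy gain, and grid search below are hypothetical simplifications for illustration, not the GP machinery of MSGP.

```python
import numpy as np

# Hypothetical one-parameter model x_{t+1} = a * x_t + u_t, fit by minimizing
# the T-step rollout error under a fixed feedback policy u_t = -gain * x_t.
def rollout(a, x0, T, gain):
    xs = [x0]
    for _ in range(T):
        u = -gain * xs[-1]
        xs.append(a * xs[-1] + u)
    return np.array(xs)

def multi_step_loss(a, observed, gain):
    pred = rollout(a, observed[0], len(observed) - 1, gain)
    return float(np.mean((pred - observed) ** 2))

# Generate data from a "true" system (a_true = 0.8) under the same policy,
# so the training inputs lie on the manifold reachable by that policy.
rng = np.random.default_rng(0)
a_true, gain, T = 0.8, 0.3, 20
x = [1.0]
for _ in range(T):
    x.append((a_true - gain) * x[-1] + 0.01 * rng.standard_normal())
observed = np.array(x)

# Simple grid search over the single model parameter.
grid = np.linspace(0.5, 1.0, 501)
a_hat = grid[np.argmin([multi_step_loss(a, observed, gain) for a in grid])]
```

Because the loss accumulates error over the whole rollout, the fitted parameter is tuned specifically for long-term prediction under this policy.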

The proposed MSGP model-learning framework enables robust, iterative RL on a real-world robotic manipulator without prior knowledge of the system. At the same time, it achieves state-of-the-art predictive performance on a benchmark of synthetic and real-world datasets.

Publications

**Probabilistic Recurrent State-Space Models**
*arXiv e-prints*, January 2018 (article)

**Optimizing Long-term Predictions for Model-based Policy Search**
*Proceedings of Machine Learning Research*, vol. 78, pp. 227-238 (Editors: Sergey Levine, Vincent Vanhoucke, Ken Goldberg), 1st Annual Conference on Robot Learning, November 2017 (conference), Accepted