In classical reinforcement learning, agents accept arbitrary short-term loss
for long-term gain when exploring their environment. This is infeasible for
safety-critical applications such as robotics, where even a single unsafe
action may cause system failure or harm the environment. In this work, we
address the problem of safely exploring finite Markov decision processes
(MDPs). We define safety in terms of an a priori unknown safety constraint
that depends on states and actions and satisfies certain regularity
conditions expressed via a Gaussian process prior. We develop a novel
algorithm, SAFEMDP, for this task and prove that it completely explores the
safely reachable part of the MDP without violating the safety constraint.
Moreover, the algorithm explicitly considers reachability when exploring the
MDP, ensuring that it does not get stuck in any state with no safe way out.
We demonstrate our method on digital terrain models for the task of
exploring an unknown map with a rover.
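
To make the abstract's ingredients concrete, the sketch below shows one way the pieces could fit together; it is an illustrative approximation under stated assumptions, not the paper's implementation. It combines a Gaussian process posterior over the unknown safety function (an RBF kernel is assumed), a lower-confidence-bound test against a safety threshold h, and a breadth-first reachability filter reflecting the requirement that the agent never enter a state with no safe way out. All function names and parameters (lengthscale, noise, beta) are hypothetical choices for illustration.

```python
import numpy as np

def gp_posterior(X_obs, y_obs, X_query, lengthscale=1.0, noise=0.01):
    """GP regression posterior mean/std with an RBF kernel (assumed; zero prior mean)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * d2 / lengthscale**2)

    K = k(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = k(X_query, X_obs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mu = Ks @ alpha                                # posterior mean at query states
    v = np.linalg.solve(L, Ks.T)
    var = 1.0 - np.sum(v**2, axis=0)               # posterior variance (unit prior variance)
    return mu, np.sqrt(np.maximum(var, 0.0))

def safe_states(mu, sigma, h, beta=2.0):
    """Mark states whose GP lower confidence bound exceeds the safety threshold h."""
    return mu - beta * sigma >= h

def reachable(safe, start, adjacency):
    """BFS over the safe set: keep only safe states reachable from `start`."""
    keep = np.zeros_like(safe)
    stack = [start]
    while stack:
        s = stack.pop()
        if keep[s] or not safe[s]:
            continue
        keep[s] = True
        stack.extend(adjacency[s])
    return keep

# Toy example: a 1-D chain of 10 states whose hidden safety value is observed
# at the first three states; the agent starts at state 0.
states = np.linspace(0, 9, 10)[:, None]
f = np.sin(states[:, 0])                           # hidden safety function
obs_idx = np.array([0, 1, 2])
mu, sigma = gp_posterior(states[obs_idx], f[obs_idx], states)
safe = safe_states(mu, sigma, h=-0.5)
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
print(reachable(safe, start=0, adjacency=adjacency))
```

The reachability filter is the part that distinguishes this setting from plain constrained bandits: a state counts as explorable only if it is both plausibly safe under the GP and connected to the start through other safe states.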