A formal methods approach to interpretable reinforcement learning for robotic planning


Science Robotics  18 Dec 2019:
Vol. 4, Issue 37, eaay6276
DOI: 10.1126/scirobotics.aay6276
  • Fig. 1 Hot dog cooking and serving task setup.

    (A) The experimental setup consists of a seven-DOF Jaco arm and a Baxter robot. Objects in the scene are tracked by a motion capture system, and the status of a ready-to-serve hot dog is detected by a camera running Darknet. (B) Simulation environment in the V-REP simulator used for training. (C and D) Information perceived by each robot during the experiment (displayed in RViz). Tracked items include the three plates, the grill, and the ketchup. The safe regions and the serve region are defined by the task specification.

  • Fig. 2 FSPA for the hot dog cooking and serving tasks.

    (A) HotdogCooked(s). (B) HotdogServed(s).

  • Fig. 3 Example execution trace of the hot dog cooking and serving task.

    Each snapshot represents a change in the FSPA (change of a predicate truth value and/or transition on the FSPA). The gray and red paths trace the steps necessary to trigger a transition on the FSPA (for either robot) shown in Fig. 2 (A and B). The FSPA state along with the corresponding edge guard is provided on top of each path.

  • Fig. 4 Performance evaluation and comparison.

    (A) The minimum CBF value against training time for Jaco. At any given time, the system is safe if the minimum of the CBF values is greater than zero. The red curves show the case where CBFs are not enforced during training but their values are recorded for comparison; in this case, an episode ends when the minimum CBF value drops below zero, and that value is given to the RL agent as a penalizing terminal reward. (B) The minimum CBF value against training time for Baxter. (C) Performance comparison of the final policies for Jaco. Here, FSPA denotes vanilla FSPA-guided RL. Each policy is executed to produce 30 trajectories. A trajectory satisfies the task (success) if its robustness evaluated against the task specification is greater than zero. The average success rate is reported. (D) Performance comparison of the final policies for Baxter.
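The safety criterion above (safe whenever the minimum CBF value stays above zero) can be sketched in a few lines. This is a minimal illustration assuming simple distance-based candidate barrier functions h_i(s) = ‖s − p_i‖² − δ² per obstacle region; the CBFs in the paper are instead derived from the FSPA edge guards, and the function names here are invented for the sketch.

```python
def barrier_values(s, obstacle_centers, delta):
    """Distance-based candidate CBFs: h_i(s) = ||s - p_i||^2 - delta^2.
    h_i(s) > 0 iff position s lies outside obstacle region i."""
    return [(s[0] - p[0])**2 + (s[1] - p[1])**2 - delta**2
            for p in obstacle_centers]

def is_safe(trajectory, obstacle_centers, delta):
    """A trajectory is safe iff the minimum CBF value is greater than
    zero at every step -- the criterion plotted in Fig. 4 (A and B)."""
    return all(min(barrier_values(s, obstacle_centers, delta)) > 0
               for s in trajectory)
```

When CBFs are enforced, the controller filters actions so this minimum never crosses zero; when they are only recorded (the red curves), the same quantity serves as a terminal penalty.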

  • Fig. 5 Learning progress evaluation at each FSPA state.

    (A) Baxter. (B) Jaco. The y axis shows the FSPA states corresponding to Fig. 2 (A and B). At fixed time intervals during training, the latest policy is evaluated for its ability to transition out of the current FSPA state. This is done by setting the FSPA state and executing the policy for a number of time steps (100 for Baxter and 200 for Jaco). Thirty trajectories are collected for each evaluation. The cell color corresponds to the average time for the transition to happen, and the cell labels show the success rates. q3 in (B) is omitted because it corresponds to TimePassed(s, 600), which is not learned.

  • Fig. 6 Pictorial representation of the overall approach.

    The user-provided task specification is first checked symbolically for feasibility against a knowledge base of generally true statements [using a model-checking tool such as lomap (43)]. If feasible, the specification ϕtask ∧ ϕK is transformed into an FSPA that provides guidance (reward) to an RL agent, encouraging satisfaction of the specification. The FSPA also provides constraints to the FSPA-guided CBF, which safeguards the decisions made by the RL agent against violating the specification while minimally interfering with the agent.
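One way the FSPA can "provide guidance (reward)" is to reward progress toward enabling an outgoing, non-trap edge of the current automaton state, measured by a quantitative robustness of the edge guard. The sketch below is an illustrative assumption, not the paper's exact reward; all names (rho_near, guard_robustness, fspa_reward) are invented here, and a negated predicate ¬ψ is encoded as a literal returning −ρ.

```python
import math

def rho_near(s, center, delta):
    """Signed satisfaction of psi = (||s - center|| < delta):
    positive inside the region, negative outside."""
    return delta - math.hypot(s[0] - center[0], s[1] - center[1])

def guard_robustness(s, guard):
    """Guard in disjunctive normal form (list of clauses, each a list of
    literal functions): max over clauses, min over literals per clause."""
    return max(min(lit(s) for lit in clause) for clause in guard)

def fspa_reward(s, outgoing_guards):
    """Robustness of the 'easiest' outgoing, non-trap edge guard;
    positive exactly when some transition out of the state is enabled."""
    return max(guard_robustness(s, g) for g in outgoing_guards)
```

For the navigation example of Fig. 7, the outgoing guards at q0 would be b(q0, q1) and b(q0, qf), so the reward pulls the robot toward region A or B while the CBF separately blocks C and D.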

  • Fig. 7 A robot navigation example to illustrate the use of the FSPA-augmented MDP and FSPA-guided CBF.

    The robot moves in a 2D world with regions A, B, and C and obstacle D. The robot's state is its position, denoted s = probot. The robot is to satisfy the task ϕex = (F ψA ∨ F ψB) ∧ F ψC ∧ (¬ψC U (ψA ∨ ψB)) ∧ G ¬ψD, where ψi = (‖probot − pi‖ < δ), i ∈ {A, B, C, D}, and pi is the center coordinate of region i. (A) The robot's initial position and automaton state, q0 (shown in light gray). The CBF blocks out regions C and D (shown in dark gray) to prevent violation of the specification. (B) The robot visits region A as directed by the FSPA-augmented MDP reward and transitions to q1. The CBF now opens up C but still blocks D. (C) The robot visits region C to complete the task. The edge guards in the automaton above are defined as b(q0, q0) = ¬ψA ∧ ¬ψB ∧ ¬ψC ∧ ¬ψD, b(q0, q1) = (ψA ∧ ¬ψC ∧ ¬ψD) ∨ (ψB ∧ ¬ψC ∧ ¬ψD), b(q0, qf) = (ψA ∧ ψC ∧ ¬ψD) ∨ (ψB ∧ ψC ∧ ¬ψD), b(q0, qTr) = ψC ∨ ψD, b(q1, qf) = ψC ∧ ¬ψD, and b(q1, qTr) = ψD. (D) Pictorial depiction of the FSPA-guided CBF. (E) The resulting policy at q0, i.e., π(s, q0). (F) The resulting policy at q1, i.e., π(s, q1).
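The automaton dynamics in this example can be made concrete with a short sketch: the predicates ψi are evaluated from the robot's position, and the edge guards listed in the caption pick the next automaton state. The region centers and δ below are made-up values for illustration, and where guards overlap [e.g., b(q0, qf) and b(q0, qTr) when ψC holds together with ψA or ψB] this sketch gives the progress/accepting edge priority, which is an assumption since the caption does not order them.

```python
import math

DELTA = 0.5
# Illustrative region centers (not from the paper):
CENTERS = {"A": (0.0, 2.0), "B": (2.0, 2.0), "C": (2.0, 0.0), "D": (1.0, 1.0)}

def psi(p):
    """psi_i holds when the robot is within DELTA of region i's center."""
    return {i: math.hypot(p[0] - c[0], p[1] - c[1]) < DELTA
            for i, c in CENTERS.items()}

def step_automaton(q, p):
    """One FSPA transition using the edge guards from the caption."""
    v = psi(p)
    A, B, C, D = v["A"], v["B"], v["C"], v["D"]
    if q == "q0":
        if ((A and C) or (B and C)) and not D:
            return "qf"                  # b(q0, qf)
        if (A or B) and not C and not D:
            return "q1"                  # b(q0, q1)
        if C or D:
            return "qTr"                 # b(q0, qTr)
        return "q0"                      # b(q0, q0)
    if q == "q1":
        if C and not D:
            return "qf"                  # b(q1, qf)
        if D:
            return "qTr"                 # b(q1, qTr)
        return "q1"
    return q                             # qf and qTr are absorbing
```

Running this on the trajectory of panels (A) to (C) reproduces q0 → q1 (on reaching A) → qf (on reaching C), while stepping into D from either state falls into the trap state qTr, which is exactly the event the FSPA-guided CBF prevents.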

Supplementary Materials

  • Supplementary Materials

    The PDF file includes:

    • Text
    • Fig. S1. FSPA for example Grasp task.
    • Fig. S2. FSPA for template formulas.
    • Table S1. Predicate definitions.


    Other Supplementary Material for this manuscript includes the following:

    • Movie S1 (.mp4 format). Demonstration of proposed technique in simulation.
    • Movie S2 (.mp4 format). Execution of learned policy on the physical robots.
