Research Article | ARTIFICIAL INTELLIGENCE

Learning agile and dynamic motor skills for legged robots

Science Robotics  16 Jan 2019:
Vol. 4, Issue 26, eaau5872
DOI: 10.1126/scirobotics.aau5872
  • Fig. 1 Creating a control policy.

    In the first step, we identify the physical parameters of the robot and estimate uncertainties in the identification. In the second step, we train an actuator net that models complex actuator/software dynamics. In the third step, we train a control policy using the models produced in the first two steps. In the fourth step, we deploy the trained policy directly on the physical system. (A minimal, hypothetical code sketch of this pipeline appears after the figure list.)

  • Fig. 2 Quantitative evaluation of the learned locomotion controller.

    (A) The discovered gait pattern for 1.0 m/s forward velocity command. LF, left front leg; RF, right front leg; LH, left hind leg; RH, right hind leg. (B) The accuracy of the base velocity tracking with our approach. (C to E) Comparison of the learned controller against the best existing controller, in terms of power efficiency, velocity error, and torque magnitude, given forward velocity commands of 0.25, 0.5, 0.75, and 1.0 m/s.

  • Fig. 3 Evaluation of the trained policy for high-speed locomotion.

    (A) Forward velocity of ANYmal. (B) Joint velocities. (C) Joint torques. (D) Gait pattern.

  • Fig. 4 A learned recovery controller deployed on the real robot.

    The learned policy successfully recovers from a random initial configuration in less than 3 s.

  • Fig. 5 Training control policies in simulation.

    The policy network maps the current observation and the joint state history to the joint position targets. The actuator network maps the joint state history to the joint torque, which is used in the rigid-body simulation. The state of the robot consists of the generalized coordinate q and the generalized velocity u. The state of a joint consists of the joint velocity and the joint position error ϕ* − ϕ, i.e., the joint position target minus the current joint position. (A minimal, hypothetical code sketch of this data flow appears after the figure list.)

  • Fig. 6 Validation of the learned actuator model.

    The measured torque and the predicted torque from the trained actuator model are shown. The “ideal model” curve is computed assuming an ideal actuator (i.e., zero communication delay and zero mechanical response time) and is shown for comparison. (A) Validation set. Data from (B) a command-conditioned policy experiment with 0.75 m/s forward command velocity and (C) its corresponding policy network output. Data from (D) a high-speed locomotion policy experiment with 1.6 m/s forward command velocity and (E) its corresponding policy network output. Note that the measured ground truth in (A) is nearly hidden because the predicted torque from the trained actuator network accurately matches the ground-truth measurements. Test data were collected at one of the knee joints. (A hypothetical version of this validation check is sketched after the figure list.)

  • Movie 1. Summary of the results and the method.
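
As a reading aid for Fig. 1, the sketch below outlines the four-step procedure as a minimal, hypothetical Python skeleton. The function names, data structures, and placeholder return values are illustrative assumptions, not the authors' implementation; only the order of the steps and what each step hands to the next is taken from the figure.

```python
# Hypothetical outline of the four-step pipeline in Fig. 1.
# All names and values below are placeholders for illustration.

def identify_physical_parameters():
    """Step 1: system identification of the robot, with uncertainty estimates."""
    return {"link_mass_kg": 2.0, "link_mass_uncertainty_kg": 0.2}  # illustrative values

def train_actuator_net(actuator_logs):
    """Step 2: supervised training of a network that models actuator/software dynamics."""
    return lambda joint_state_history: 0.0  # placeholder torque model

def train_policy(rigid_body_model, actuator_model):
    """Step 3: reinforcement learning in simulation, using the models from steps 1 and 2."""
    return lambda observation: [0.0] * 12  # placeholder: 12 joint position targets

def deploy(policy):
    """Step 4: run the trained policy directly on the physical robot."""
    print("deploying policy:", policy)

if __name__ == "__main__":
    rigid_body_model = identify_physical_parameters()
    actuator_model = train_actuator_net(actuator_logs=[])
    policy = train_policy(rigid_body_model, actuator_model)
    deploy(policy)
```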
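The data flow in Fig. 5 can also be made concrete. The PyTorch sketch below is a minimal, hypothetical rendering: the layer widths, activations, history length, and observation size are assumptions for illustration, not the published architecture. What it does take from the figure is the wiring: a joint's state is its velocity together with the position error ϕ* − ϕ; the policy network maps the current observation plus a joint-state history to joint position targets ϕ*; and the actuator network maps each joint's state history to the torque applied in the rigid-body simulation.

```python
import torch
import torch.nn as nn

N_JOINTS = 12   # ANYmal has 12 actuated joints
HISTORY = 3     # length of the joint-state history (illustrative assumption)
OBS_DIM = 60    # size of the observation vector (illustrative assumption)

# Policy network: (observation, joint-state history) -> joint position targets phi*.
policy_net = nn.Sequential(
    nn.Linear(OBS_DIM + HISTORY * 2 * N_JOINTS, 256), nn.Tanh(),
    nn.Linear(256, 128), nn.Tanh(),
    nn.Linear(128, N_JOINTS),
)

# Actuator network: one joint's state history -> torque for that joint.
actuator_net = nn.Sequential(
    nn.Linear(HISTORY * 2, 32), nn.Softsign(),
    nn.Linear(32, 32), nn.Softsign(),
    nn.Linear(32, 1),
)

# One entry of the joint-state history: per joint, (velocity, position error phi* - phi).
phi = torch.zeros(N_JOINTS)            # measured joint positions (placeholder data)
phi_dot = torch.zeros(N_JOINTS)        # measured joint velocities (placeholder data)
phi_star_prev = torch.zeros(N_JOINTS)  # previous joint position targets (placeholder data)
newest = torch.stack([phi_dot, phi_star_prev - phi], dim=-1)  # shape (N_JOINTS, 2)
history = torch.stack([newest] * HISTORY)                     # shape (HISTORY, N_JOINTS, 2)

# One illustrative control step.
obs = torch.randn(OBS_DIM)                                  # placeholder observation
phi_star = policy_net(torch.cat([obs, history.flatten()]))  # new joint position targets
torque = actuator_net(history.permute(1, 0, 2).reshape(N_JOINTS, -1)).squeeze(-1)
# `torque` is what the rigid-body simulation would apply at the joints.
```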
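Finally, the comparison in Fig. 6 amounts to checking, on held-out data, how well the trained actuator network predicts the measured torque, with an idealized delay-free actuator as a baseline. The sketch below is a hypothetical version of that check: the ideal-actuator baseline is written as a plain PD law with made-up gains, the error metric is a simple RMSE, and the data arrays and `actuator_net` argument are stand-ins, not the authors' evaluation code.

```python
import torch

def validate_actuator_model(actuator_net, history, measured_torque,
                            position_error, velocity, kp=50.0, kd=0.1):
    """Compare the learned actuator model and an idealized actuator against
    measured torque on a held-out set (cf. Fig. 6A). Shapes: history is
    (N, HISTORY * 2); the other arrays are (N,). kp, kd are made-up PD gains."""
    with torch.no_grad():
        predicted = actuator_net(history).squeeze(-1)  # learned model
    ideal = kp * position_error - kd * velocity        # zero delay, zero response time
    rmse_learned = torch.sqrt(torch.mean((predicted - measured_torque) ** 2))
    rmse_ideal = torch.sqrt(torch.mean((ideal - measured_torque) ** 2))
    return rmse_learned.item(), rmse_ideal.item()

# Illustrative usage with random placeholder data and a dummy model.
dummy_net = torch.nn.Linear(6, 1)  # stands in for the trained actuator network
n = 100
errors = validate_actuator_model(dummy_net,
                                 history=torch.randn(n, 6),
                                 measured_torque=torch.randn(n),
                                 position_error=torch.randn(n),
                                 velocity=torch.randn(n))
print("RMSE (learned model, ideal model):", errors)
```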

Supplementary Materials

  • robotics.sciencemag.org/cgi/content/full/4/26/eaau5872/DC1

    Section S1. Nomenclature

    Section S2. Random command sampling method used for evaluating the learned command-conditioned controller

    Section S3. Cost terms for training command-conditioned locomotion and high-speed locomotion tasks

    Section S4. Cost terms for training recovery from a fall

    Fig. S1. Base velocity tracking performance of the learned controller while following random commands.

    Fig. S2. Base velocity tracking performance of the best existing method while following random commands.

    Fig. S3. Sampled initial states for training a recovery controller.

    Table S1. Command distribution for training command-conditioned locomotion.

    Table S2. Command distribution for training high-speed locomotion.

    Table S3. Initial state distribution for training both the command-conditioned and high-speed locomotion.

    Movie S1. Locomotion policy trained with a learned actuator model.

    Movie S2. Random command experiment.

    Movie S3. Locomotion policy trained with an analytical actuator model.

    Movie S4. Locomotion policy trained with an ideal actuator model.

    Movie S5. Performance of a learned high-speed policy.

    Movie S6. Performance of a learned recovery policy.
