Research Article | Human-Robot Interaction

Personalized machine learning for robot perception of affect and engagement in autism therapy


Science Robotics  27 Jun 2018:
Vol. 3, Issue 19, eaao6760
DOI: 10.1126/scirobotics.aao6760
  • Fig. 1 Overview of the key stages (sensing, perception, and interaction) during robot-assisted autism therapy.

    Data from three modalities (audio, visual, and autonomic physiology) were recorded using unobtrusive audiovisual sensors and sensors worn on the child’s wrist, providing the child’s heart rate, skin conductance [electrodermal activity (EDA)], body temperature, and accelerometer data. The focus of this work is robot perception, for which we designed a personalized deep learning framework that can automatically estimate the levels of the child’s affective states and engagement. These estimates can then be used to optimize the child-robot interaction and to monitor the therapy progress (see Interpretability and utility). The images were obtained by using Softbank Robotics software for the NAO robot.

  • Fig. 2 PPA-net.

    The feature layer performs feature fusion using (supervised) auto-encoders designed to reduce noise and handle missing features; a minimal code sketch of this layer follows the figure list. The inward and outward arrows depict the encoding (in orange)/decoding (in green) of the observed input features (in blue). At the context layer, behavioral scores of the child’s mental, motor, and verbal ability are used to augment the input features with expert knowledge [quantified by the Childhood Autism Rating Scale, CARS (34)]. Personalization of the network is achieved using demographic information (culture and gender), followed by individual network layers for each child. The inference layer performs the child-specific estimation of valence, arousal, and engagement levels. The activations of the hidden nodes (in orange) are learned during network training (see Results).

  • Fig. 3 Data analysis and results.

    (A) The fraction of data present across the different modalities, both individually and concurrently. (B) Heat maps of the joint distributions of the valence, arousal, and engagement levels coded by human experts. Large differences in these patterns are present at the culture and individual levels. (C) Clustering of the children from C1 and C2 using t-SNE, an unsupervised dimensionality reduction technique, applied to the auto-encoded features (see Effects of model personalization). (D) Intraclass correlation coefficient (ICC) scores per child, C1 (n = 17) and C2 (n = 18), for valence (V), arousal (A), and engagement (E) estimation; a sketch of the ICC computation follows the figure list. Compared with the GPA-net (in gray), the performance of the PPA-net (in black) improved at all three levels (culture, gender, and individual) of the model hierarchy. Bottom: Sorted improvements in ICC performance (ΔICC) between the PPA-net and the GPA-net for each child.

  • Fig. 4 Interpretability and utility.

    (A) Interpretability can be enhanced by examining the influence of the input features on the output target, here the estimated engagement level. The relative importance scores (y axis) are shown for input features from each behavioral modality (x axis). These were obtained with the DeepLift (38) tool, which assigns negative/positive values when an input feature drives the output toward −1/+1; a simplified attribution sketch follows the figure list. (B) Estimated engagement, arousal, and valence levels of the child as the therapy progresses. These were obtained by applying the learned PPA-net to the held-out data of the target child. We also plotted the corresponding signals measured from the child’s wrist: the movement intensity derived from accelerometer readings (ACC), blood-volume pulse (BVP), and EDA. (C) Summary of the therapy in terms of the average ± SD levels of affect and engagement within each phase of the therapy: (1) pairing, (2) recognition, and (3) imitation.

  • Fig. 5 The learning of the PPA-net.

    (A) The supervised AE performs feature smoothing by dealing with missing values and noise in the input while preserving the discriminative information in the subspace h0, constrained by the CoF0. The learning operators in the PPA-net, (B) learn, (C) nest, and (D) clone, are used for layer-wise supervised learning, learning of the subsequent vertical layers, and horizontal expansion of the network, respectively; a sketch of these operators closes the code examples after the figure list. (E) The group-level GPA-net is first learned by sequentially increasing the network depth using learn and nest and is then used to initialize the personalized PPA-net weights at the culture, gender, and individual levels (using clone). (F) The network personalization is then accomplished via fine-tuning steps I and II (Materials and Methods).
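
A minimal sketch, in PyTorch, of the supervised auto-encoder idea behind the feature layer (Figs. 2 and 5A): reconstruction smooths the fused multimodal input and fills in missing values, while a small prediction head keeps the code h0 discriminative, in the spirit of the CoF0 constraint. This is our illustration, not the authors' implementation; the class name, layer sizes, and the 0.5 loss weight are arbitrary choices.

```python
# Minimal sketch (not the authors' code): a supervised auto-encoder that
# fuses noisy multimodal features into a subspace h0 while a regression
# head keeps h0 predictive of valence, arousal, and engagement.
import torch
import torch.nn as nn

class SupervisedAE(nn.Module):
    def __init__(self, in_dim, code_dim, out_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Tanh())
        self.decoder = nn.Linear(code_dim, in_dim)
        self.head = nn.Linear(code_dim, out_dim)  # "CoF0"-style constraint

    def forward(self, x):
        h0 = self.encoder(x)
        return self.decoder(h0), self.head(h0), h0

# Toy data: 64 frames, 40 fused audio/visual/physiology features, 3 targets.
x = torch.randn(64, 40)
x[torch.rand_like(x) < 0.2] = 0.0  # simulate missing feature values
y = torch.randn(64, 3)

model = SupervisedAE(in_dim=40, code_dim=16, out_dim=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
for step in range(200):
    x_hat, y_hat, _ = model(x)
    # Reconstruction smooths/fills the features; the supervised term keeps
    # h0 discriminative. The 0.5 weight is an illustrative choice.
    loss = mse(x_hat, x) + 0.5 * mse(y_hat, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```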
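
A sketch of the ICC agreement score reported per child in Fig. 3D, here the common ICC(3,1) two-way consistency variant (this excerpt does not specify which variant the authors used). The t-SNE projection of panel C can be reproduced with scikit-learn's sklearn.manifold.TSNE applied to the auto-encoded features.

```python
# Minimal sketch (our variant choice): ICC(3,1) agreement between a human
# coder and the model's estimates, one way to produce per-child scores
# like those in Fig. 3D.
import numpy as np

def icc_3_1(y: np.ndarray) -> float:
    """y: (n_samples, k_raters) matrix of ratings."""
    n, k = y.shape
    grand = y.mean()
    bms = k * np.sum((y.mean(axis=1) - grand) ** 2) / (n - 1)  # between targets
    jms = n * np.sum((y.mean(axis=0) - grand) ** 2) / (k - 1)  # between raters
    ss_err = np.sum((y - grand) ** 2) - bms * (n - 1) - jms * (k - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)

rng = np.random.default_rng(0)
coder = rng.normal(size=200)                  # human-coded engagement
model = coder + 0.3 * rng.normal(size=200)    # model estimates that track it
print(icc_3_1(np.column_stack([coder, model])))  # near 1 = strong agreement
```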
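
A crude stand-in for the DeepLift (38) attribution in Fig. 4A: DeepLift propagates contribution scores relative to a reference input, whereas the sketch below uses gradient × input, which likewise yields signed per-feature scores that are negative/positive when a feature pushes the estimated engagement toward −1/+1. The toy regressor and the modality slice boundaries are ours, for illustration only.

```python
# Crude stand-in (not DeepLift itself): gradient x input gives signed
# per-feature scores for a toy engagement regressor.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 16), nn.Tanh(), nn.Linear(16, 1))
x = torch.randn(32, 40, requires_grad=True)  # frames x fused features

model(x).sum().backward()
scores = (x.grad * x.detach()).mean(dim=0)   # signed importance per feature

# Averaging scores within each modality's feature slice gives per-modality
# bars like Fig. 4A. These slice boundaries are invented for illustration.
slices = {"face": slice(0, 20), "body": slice(20, 30),
          "audio": slice(30, 36), "physiology": slice(36, 40)}
for name, sl in slices.items():
    print(name, scores[sl].mean().item())
```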
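
Last, a structural sketch of the learn, nest, and clone operators from Fig. 5, showing only the freeze/stack/copy mechanics; learn is a placeholder here, and the actual layer-wise losses and fine-tuning steps I and II are described in Materials and Methods.

```python
# Structural sketch (ours) of the Fig. 5 operators: "learn" fits one layer,
# "nest" stacks a new layer on frozen lower layers, and "clone" copies group
# weights to initialize culture/gender/child-specific branches.
import copy
import torch.nn as nn

def learn(layer, frozen_stack):
    # Placeholder for layer-wise supervised training of `layer` on the
    # outputs of the frozen layers below it (losses omitted here).
    return layer

def nest(stack, new_layer):
    for p in stack.parameters():
        p.requires_grad = False  # freeze what is already learned
    return nn.Sequential(*stack, learn(new_layer, stack))

def clone(group_layer, n_branches):
    # Horizontal expansion: each branch starts from the group-level weights
    # and is later fine-tuned on its own subpopulation or child.
    return [copy.deepcopy(group_layer) for _ in range(n_branches)]

# Grow the group-level GPA-net depth-wise with learn + nest ...
gpa = nn.Sequential()
for in_dim, out_dim in [(40, 32), (32, 16)]:
    gpa = nest(gpa, nn.Sequential(nn.Linear(in_dim, out_dim), nn.Tanh()))

# ... then clone its top layer into per-child branches for fine-tuning
# (steps I and II): 17 + 18 children across the two cultures.
child_branches = clone(gpa[-1], n_branches=35)
```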

Supplementary Materials

  • robotics.sciencemag.org/cgi/content/full/3/19/eaao6760/DC1

    Note S1. Details on model training and alternative approaches.

    Note S2. Data set.

    Note S3. Feature processing.

    Note S4. Data coding.

    Fig. S1. Empirical cumulative distribution function of ICC and MSE.

    Fig. S2. The learning of the networks.

    Fig. S3. PPA-net: The performance of the visual (face and body), audio, and physiology features.

    Table S1. Comparisons with alternative approaches.

    Table S2. Summary of the child participants.

    Table S3. Summary of the features.

    Table S4. The coding criteria.

    References (53–56)

  • Supplementary Materials

    Supplementary Material for:

    Personalized machine learning for robot perception of affect and engagement in autism therapy

    Ognjen Rudovic*, Jaeryoung Lee, Miles Dai, Björn Schuller, Rosalind W. Picard

    *Corresponding author. Email: orudovic@mit.edu

    Published 27 June 2018, Sci. Robot. 3, eaao6760 (2018)
    DOI: 10.1126/scirobotics.aao6760

    This PDF file includes the notes, figures, tables, and references listed above.
