SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

Site Hu, Takato Horii

#871 of 3111 · Robotics
Tournament Score: 1458±44 · Win Rate: 73% (16 wins, 6 losses, 22 matches)
Rating: 4.8/10

Abstract

Explainable robots require not only successful task execution but also the ability to expose their internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SADP: Subgoal-Aware Diffusion Policy for Explainable Robots

1. Core Contribution

SADP introduces a framework that bridges explainability and task performance in robot manipulation by embedding human-interpretable subgoal structure directly into a diffusion policy. The key insight is that natural-language subgoals—generated automatically by foundation models—should serve as both the structural backbone for policy execution and the explanation mechanism for users. The framework has two main components: (1) an automated data generation pipeline that produces demonstrations annotated with subgoal descriptions and completion labels using LLMs/VLMs, and (2) a diffusion policy conditioned on both task-level and subgoal-level descriptions, augmented with a binary completion prediction head.

The contribution is primarily integrative rather than fundamentally novel in any single dimension. The data generation pipeline extends prior work (TARAD/Hu et al.), the diffusion policy builds on DP3, and the use of foundation models for task decomposition is well-established. The novelty lies in how these components are combined to create an interpretable-by-design policy that doesn't sacrifice performance.

2. Methodological Rigor

Strengths:

  • The experimental setup includes both simulation (6 RLBench tasks, 3 seeds) and real-world validation (3 tasks on UR5e), providing reasonable evidence for the approach's viability.
  • The ablation study is well-designed, testing both the removal of subgoal conditioning and the use of oracle completion signals. The finding that removing subgoal conditioning actually *degrades* performance below the TARAD baseline is informative—it reveals a structural mismatch when the shared representation encodes subgoal information but the action branch cannot exploit it.
  • The qualitative analysis of completion score trajectories (Figures 5 and 6) effectively demonstrates the diagnostic utility of subgoal-level signals.

Weaknesses:

  • The evaluation scale is modest: only 30 demonstrations per task for training, 20 episodes for evaluation per checkpoint, and a limited task repertoire. Standard deviations are relatively high for some tasks (e.g., SlideBlockToTarget: 7.52%, MeatOffGrill: 7.88%), making it difficult to claim statistically significant improvements over TARAD in several cases.
  • The average improvement over TARAD is marginal (82.8% vs. 82.0% in simulation), and TARAD actually outperforms SADP on OpenDrawer (78.3% vs. 75.0%) and PutItemInDrawer (73.9% vs. 71.1%). The paper's claim that "SADP achieves higher task success rates" is weakly supported given these results.
  • No formal user study evaluates the explainability claims. The paper argues for interpretability but provides no evidence that users actually find the subgoal signals useful, comprehensible, or trust-inducing.
  • The completion prediction threshold (τ = 0.2) is empirically set with limited justification, and the sensitivity analysis is absent.
  • The paper reports "average of the top three success rates" per seed rather than overall averages, which inflates the reported numbers and makes comparison with other works difficult.
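The concern about the top-three metric can be made concrete with a minimal sketch. The per-checkpoint success rates below are hypothetical, not values from the paper, and `top_k_mean` is an illustrative helper name. The point is only that averaging the best k checkpoints can never yield a lower number than averaging all of them.

```python
from statistics import mean


def top_k_mean(success_rates, k=3):
    """Average of the k highest per-checkpoint success rates."""
    best = sorted(success_rates, reverse=True)[:k]
    return mean(best)


# Hypothetical success rates for six checkpoints of one seed
# (illustrative only; these are NOT numbers from the paper).
checkpoints = [0.60, 0.75, 0.80, 0.70, 0.85, 0.65]

overall = mean(checkpoints)       # plain average over every checkpoint
top3 = top_k_mean(checkpoints)    # the "top three" style of reporting

print(f"overall mean: {overall:.3f}")
print(f"top-3 mean:   {top3:.3f}")
print(f"inflation:    {top3 - overall:.3f}")
```

Because the top-k average is bounded below by the overall average, the gap between the two is one rough indicator of how much checkpoint selection contributes to a reported score.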

3. Potential Impact

The paper addresses a genuine gap: most imitation learning policies are opaque during execution, and adding interpretability post-hoc may not faithfully reflect the decision process. The idea of building interpretability into the policy architecture is conceptually appealing and relevant to HRI applications.

However, the practical impact may be limited by several factors:

  • The approach requires foundation model access during data generation, adding cost and complexity.
  • The completion prediction mechanism is binary and simplistic; the authors themselves acknowledge it as a major failure source and suggest continuous progress estimation as future work.
  • The framework is demonstrated only on relatively simple manipulation tasks (3-8 subgoals). Scaling to truly complex, long-horizon tasks with dozens of subgoals remains unvalidated.
  • Without user studies, the claimed explainability benefits remain theoretical.
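To show why a binary, thresholded completion signal can be fragile, here is a minimal sketch of subgoal tracking. It assumes the policy moves to the next subgoal once the completion score for the active subgoal reaches τ; the review reports τ = 0.2 but does not describe the exact rule, so the advancement logic, the function name `track_subgoals`, and the score trace are all assumptions for illustration.

```python
TAU = 0.2  # threshold value quoted in the review; how it is applied is assumed


def track_subgoals(subgoals, completion_scores, tau=TAU):
    """Return (active subgoal per step, whether all subgoals were completed).

    Assumed rule: advance to the next subgoal whenever the completion
    score of the currently active subgoal is >= tau.
    """
    idx = 0
    active_log = []
    for score in completion_scores:
        if idx >= len(subgoals):
            break  # every subgoal has already been marked complete
        active_log.append(subgoals[idx])
        if score >= tau:
            idx += 1
    return active_log, idx == len(subgoals)


# Hypothetical subgoal list and completion-score trace (not from the paper).
subgoals = ["reach handle", "grasp handle", "pull drawer"]
scores = [0.05, 0.10, 0.35, 0.08, 0.50, 0.12, 0.90]

for tau in (0.2, 0.1):
    log, done = track_subgoals(subgoals, scores, tau=tau)
    print(f"tau={tau}: {len(log)} steps logged, finished={done}")
    print("  ", log)
```

With the same score trace, lowering the threshold makes the tracker leave the first subgoal one step earlier, which is the kind of sensitivity that a threshold analysis would quantify.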

4. Timeliness & Relevance

The paper is timely in several respects: (1) diffusion policies are rapidly gaining traction in robot learning; (2) foundation models are increasingly used for robot data generation; (3) explainability in robotics is receiving growing attention as robots are deployed in human-facing applications. The intersection of these three trends is relatively underexplored, giving SADP some novelty in positioning.

However, the paper appears somewhat disconnected from the rapidly advancing VLA model landscape (OpenVLA, π0, CoT-VLA), which is beginning to address similar issues through chain-of-thought reasoning and hierarchical execution within much larger architectures. SADP's lightweight approach may be practical but could quickly be superseded.

5. Strengths & Limitations

Key Strengths:

  • Clean integration of subgoal awareness into diffusion policies without requiring manual annotation
  • Demonstration that interpretability and task performance are not necessarily at odds
  • Qualitative evidence that completion scores provide meaningful diagnostic signals for failure analysis
  • The ablation removing subgoal conditioning provides an important negative result about partial integration

Notable Limitations:

  • Marginal quantitative improvements that often fall within error bars
  • No user study to validate explainability claims—a critical gap for a paper centering on explainable robots
  • The "evaluation metric" of top-3 success rates is non-standard and potentially misleading
  • Limited analysis of when/why completion prediction fails, despite acknowledging it as the main failure mode
  • The subgoal decomposition is fixed at inference time and cannot adapt to unexpected situations
  • No comparison with hierarchical RL methods or VLA-based approaches that also address long-horizon execution
  • 30 demonstrations per task is a very small dataset; generalization claims are hard to support
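The remark that the gains fall within error bars can be sanity-checked with a back-of-the-envelope binomial calculation. The sketch below uses the figures quoted in this review (82.8% vs. 82.0% average success, 20 evaluation episodes, 3 seeds, top three checkpoints). Treating all episodes as independent Bernoulli trials and pooling them is a simplifying assumption, not the paper's analysis.

```python
import math


def binomial_se(p, n):
    """Standard error of a success rate p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Figures quoted in the review.
sadp, tarad = 0.828, 0.820
gap = sadp - tarad

# One checkpoint evaluated on 20 episodes.
se_single = binomial_se(tarad, 20)

# Optimistic pooling: 20 episodes x 3 checkpoints x 3 seeds, assumed independent.
se_pooled = binomial_se(tarad, 20 * 3 * 3)

print(f"gap between methods:     {gap:.3f}")
print(f"SE, single 20-ep eval:   {se_single:.3f}")   # roughly 0.086
print(f"SE, pooled 180 episodes: {se_pooled:.3f}")   # roughly 0.029
print("gap within one pooled SE:", gap < se_pooled)
```

Even under the optimistic pooling assumption, the 0.8-point average gap is well below one standard error of a single method's estimate, and the standard error of a difference between two methods would be larger still.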

Overall Assessment

SADP presents a reasonable engineering contribution that combines several existing components (diffusion policies, foundation model data generation, subgoal decomposition) in a coherent framework. The interpretable-by-design philosophy is well-motivated, but the execution falls short of the paper's ambitions. The quantitative evidence for performance improvement is weak, and the explainability claims lack empirical validation through user studies. The paper would benefit substantially from a user study, more rigorous statistical analysis, and evaluation on more complex tasks.

Rating: 4.8/10
Significance 5 · Rigor 4.5 · Novelty 4.5 · Clarity 6.5

Generated May 19, 2026

Comparison History (22)

    vs. CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation
    gpt-5.2 · 5/20/2026

    Paper 2 (SADP) has higher estimated impact due to a more novel and timely combination of diffusion policies, foundation-model-generated subgoal annotations, and built-in interpretability—addressing a broadly relevant bottleneck in long-horizon robot learning (data scarcity + explainability). Its approach is likely transferable across many manipulation platforms and tasks, impacting robotics, imitation learning, HRI, and trustworthy AI. Paper 1 is strong and rigorous for humanoid whole-body control and modular planning, but its contributions are more domain-specific (humanoids/EE-root interface) and may have narrower cross-field adoption.

    vs. NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints
    claude-opus-4.6 · 5/19/2026

    Paper 1 (SADP) addresses a more fundamental challenge in robot learning—combining explainability with high task performance through subgoal-aware diffusion policies. It introduces a novel framework leveraging foundation models for automatic subgoal annotation, addresses the scarcity of subgoal-level supervision, and demonstrates that built-in interpretability can coexist with strong performance. This has broader impact across imitation learning, explainable AI, and long-horizon manipulation. Paper 2 (NORM-Nav) is solid but more incremental, integrating LLM-parsed constraints into existing costmap planners for socially-aware navigation, a narrower contribution with less methodological novelty.

    vs. Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping
    claude-opus-4.6 · 5/19/2026

    Paper 2 (SADP) addresses a broader and more timely research gap at the intersection of explainable AI and robot learning, leveraging foundation models for subgoal-aware policy learning. This combines multiple high-impact trends (diffusion policies, foundation models, explainability) in a novel way with broader applicability across manipulation tasks. Paper 1 (Mono-Hydra++) is a strong systems paper with solid engineering contributions for monocular scene graph construction, but is more incremental—combining existing components (DINOv3, VIO, volumetric fusion) into a pipeline. Paper 2's core insight that built-in interpretability can coexist with high performance has wider implications for trustworthy robotics.

    vs. Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning
    claude-opus-4.6 · 5/19/2026

    Paper 2 (XDiffuser) offers a more fundamental and broadly applicable contribution by introducing extrinsic graph-based search to guide diffusion planning, addressing a core limitation of compositional diffusion models for long-horizon tasks. Its ability to handle unseen combinatorial tasks (multi-agent coordination, TSP-style reasoning) at test time via classical algorithms demonstrates greater generality and novelty. Paper 1 (SADP) makes a solid contribution to explainable robotics but is more incremental, combining existing components (foundation models, diffusion policies) for subgoal annotation. Paper 2's broader applicability across planning domains suggests higher cross-field impact.

    vs. Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation
    gemini-3.1 · 5/19/2026

    Paper 1 integrates highly influential current trends—foundation models and diffusion policies—to address the critical challenge of explainability in robotics. Its approach to autonomously generating subgoal-level supervision offers a scalable solution to dataset limitations, potentially impacting multiple areas of imitation learning and human-robot interaction. Paper 2 presents a practical and effective method for 3D scene graph generation, but its scope is comparatively narrower and relies on more established paradigms, making Paper 1's methodological innovations more likely to spark widespread follow-up research.

    vs. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
    gpt-5.2 · 5/19/2026

    Paper 2 (SADP) has higher estimated impact due to combining three timely directions—diffusion policies, foundation-model-generated data, and built-in interpretability via subgoals—into a broadly applicable framework for long-horizon manipulation. It addresses a major bottleneck (lack of subgoal supervision) with a scalable data-generation pipeline and yields user-facing benefits (progress monitoring/diagnosis) alongside performance gains, increasing real-world deployment relevance. GuidedVLA is novel in attention-head specialization, but relies on manual auxiliary signals and may generalize less broadly across tasks/domains than a subgoal-centric, explainability-oriented policy design.

    vs. DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
    gemini-3.1 · 5/19/2026

    Paper 1 addresses a critical bottleneck in developing generalist robot foundation models: multi-task scaling and avoiding task-specific overfitting during RL fine-tuning. By proposing a framework for cross-task feature representation, it directly contributes to the highly impactful pursuit of general-purpose Vision-Language-Action (VLA) models. While Paper 2 offers valuable contributions to explainability and long-horizon manipulation, Paper 1's focus on cross-task scaling principles has broader methodological implications for the foundational architecture and training paradigms of large-scale robotic models.

    vs. SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation
    claude-opus-4.6 · 5/19/2026

    SADP addresses the broadly important intersection of explainability and robot learning, proposing a novel framework that integrates foundation models for subgoal-annotated demonstration generation with diffusion policies. Its contribution—showing that built-in interpretability can coexist with high performance—has broader implications across robotics, HRI, and trustworthy AI. While SEDualVLN achieves SOTA on VLN-CE benchmarks, it is more incremental, combining known paradigms (dual-system, spatial mapping) in a specific navigation domain. SADP's methodological novelty (subgoal-aware diffusion, foundation model-generated supervision) and real-world validation suggest wider cross-field impact.

    vs. How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
    claude-opus-4.6 · 5/19/2026

    Paper 2 (DeMiAn) has higher potential impact because it addresses a fundamental scaling bottleneck in robot learning—extracting more signal from existing data without collecting new demonstrations. Its approach is validated at significantly larger scale (1M+ clips, 50K videos), introduces a generalizable multi-aspect annotation framework applicable across different policy architectures (VLA and world-action models), and demonstrates improvements on composite and OOD tasks. Paper 1 (SADP) makes a solid contribution on explainability via subgoal conditioning, but its scope is narrower and experiments are smaller-scale. DeMiAn's positioning as a practical scaling lever gives it broader applicability across the field.

    vs. No Plan, Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task
    claude-opus-4.6 · 5/19/2026

    Paper 2 demonstrates higher potential scientific impact due to its cross-disciplinary bridging between robotics, cognitive science, and clinical neuroscience. It provides a novel theoretical insight—that a reactive robotics model without planning captures human cognitive failures better than planning models for impaired populations—suggesting deep structural parallels between robotic and biological systems. This has broad implications for understanding cognition, clinical assessment tools, and embodied AI theory. Paper 1, while solid engineering work combining foundation models with diffusion policies for explainability, represents a more incremental contribution within robot learning.

    vs. "I'm Not Mad, Just Focused": Understanding Human Emotions in Human-Robot Collaboration
    gemini-3.1 · 5/19/2026

    Paper 1 introduces a novel approach to combining foundation models with diffusion policies for robotic manipulation, addressing the critical challenges of explainability and long-horizon task execution. Its methodological innovation in autonomously generating subgoal-annotated data and improving task success rates has broader implications for scalable, interpretable robot learning compared to Paper 2's specific application of VLMs for emotion recognition in human-robot collaboration.

    vs. Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement
    gemini-3.1 · 5/19/2026

    Paper 2 addresses a critical bottleneck in robotic imitation learning—data scarcity for subgoals—by leveraging foundation models. Its combination of diffusion policies and explainability for long-horizon tasks aligns with major trends in embodied AI and has broader, more impactful applications compared to Paper 1's narrow focus on specific mechanical actions (toppling) in tabletop planning.

    vs. Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions
    claude-opus-4.6 · 5/19/2026

    Paper 1 addresses a broader and more timely challenge at the intersection of foundation models, diffusion policies, and explainable AI for robotics—all rapidly growing areas. Its contribution of integrating subgoal-level interpretability directly into policy learning is novel and broadly applicable across robot learning. Paper 2 makes a solid contribution to AV safety validation but addresses a narrower domain. Paper 1's framework connecting foundation models to structured robot behavior with built-in explainability has higher potential to influence multiple research communities and inspire follow-up work.

    vs. DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
    claude-opus-4.6 · 5/19/2026

    SADP introduces a novel framework combining foundation models, subgoal-aware conditioning, and diffusion policies to achieve both high performance and built-in explainability—a relatively underexplored intersection. It addresses the important challenge of interpretable robot decision-making with a principled approach (subgoal structure from foundation models), validated in both simulation and real-world settings. While DexJoCo provides a valuable benchmark for dexterous manipulation, benchmarks typically have narrower impact unless they become widely adopted. SADP's contributions span explainable AI, imitation learning, and foundation model integration, giving it broader cross-field relevance and higher novelty.

    vs. Motion-Uncertainty-Aware Next-Best-View Planning for Moving Object Reconstruction
    gpt-5.2 · 5/19/2026

    Paper 2 has higher potential impact due to its novelty in combining foundation-model–generated subgoal annotations with diffusion policies to achieve built-in interpretability for long-horizon manipulation, a timely and fast-growing area in robotics/AI. It offers broad applicability across imitation learning, human-robot interaction, and reliable deployment via progress monitoring/diagnostics, and aligns with current interest in foundation models and explainability. Paper 1 is solid and rigorous but more specialized (planar rigid-motion reconstruction/NBV), with narrower cross-field reach and less general real-world adoption potential.

    vs. From a Single Demonstration to a General Policy for Contact-Rich Manipulation
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher impact due to its strong novelty (constraint-based representation enabling one-shot generalization in contact-rich, multi-stage tasks), clear real-world relevance, and extensive real-robot validation across seven tasks with high success. Its methodological framing around environmental constraints is broadly applicable (robotics, manipulation, LfD, control, compliant interaction) and addresses a central bottleneck—generalization under contact and unmodeled dynamics. Paper 1 is timely and valuable for explainability and diffusion policies, but depends on foundation-model-generated subgoal labels and is more incremental within imitation/diffusion-policy trends.

    vs. Generalizable and Actionable Parts Pose Estimation with Symmetry Annotation-Free Learning Strategy
    gemini-3.1 · 5/19/2026

    Paper 2 has higher potential scientific impact due to its integration of cutting-edge foundation models with diffusion policies to solve the critical challenges of long-horizon robotic manipulation and explainability. By automating the generation of subgoal annotations and creating built-in interpretable policies, it addresses major data scarcity bottlenecks in imitation learning. Its approach has broader applicability across embodied AI and human-robot interaction, whereas Paper 1 focuses on a more specialized, albeit important, problem of pose estimation and symmetry.

    vs. A Dexterous and Compliant Gripper With Soft Hydraulic Actuation for Microgravity Manipulation
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher scientific impact due to its novelty in combining foundation-model-generated subgoal annotations with diffusion policies to achieve built-in interpretability, addressing a timely and widely relevant problem in robot learning (long-horizon manipulation, monitoring, explainability). Its applicability spans many robots and tasks beyond a specific platform, and it evaluates in both simulation and real hardware. Paper 1 is valuable engineering for microgravity manipulation but is more domain-specific (Astrobee/ISS) and thus likely narrower in breadth and downstream adoption.

    vs. Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting
    gemini-3.1 · 5/19/2026

    Paper 2 addresses fundamental challenges in robot learning—explainability, long-horizon manipulation, and data scarcity—by leveraging foundation models to enhance diffusion policies. Its approach to built-in interpretability has broader applicability across imitation learning and human-robot interaction, whereas Paper 1 focuses on a more specialized, albeit novel, domain of deformable object manipulation and clay sculpting.

    vs. Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher impact due to stronger timeliness (diffusion policies, foundation models, interpretability), broader cross-field relevance (robot learning, HCI/explainability, LLM-generated data, long-horizon manipulation), and clearer real-world applicability via both simulation and UR5e results. Its core idea—using foundation models to generate subgoal-annotated demos and training subgoal-aware diffusion policies—could generalize across many tasks and platforms. Paper 1 is novel and rigorous with polynomial-time optimal planning, but it targets a narrower tabletop block setting and may have more limited transfer beyond tightly packed uniform grids.