PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

Peizheng Guo, Jingyao Wang, Changwen Zheng, Wenwen Qiang

May 19, 2026

arXiv:2605.19580v1

cs.RO (primary)

#1141 of 3197 · Robotics

Tournament Score: 1438 ± 42 (scale 1000 to 1800)

Win Rate: 62%

Rating: 5.8 / 10

Significance 6.5 · Rigor 4.5 · Novelty 6 · Clarity 6.5

Abstract

Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PAPO-VLA

1. Core Contribution

PAPO-VLA introduces a planning-aware policy optimization framework for Vision-Language-Action models. The central thesis is that actions within a manipulation trajectory serve dual functional roles—planning (task-oriented decision points like transitioning from approach to grasp) and execution (dense continuous motions connecting these decisions). The paper argues that planning actions disproportionately affect task success and therefore deserve amplified optimization emphasis.

The method has three components: (1) identifying planning actions via action variation magnitude combined with trajectory outcome gating; (2) estimating their importance through causal sufficiency (does preserving the action help success?) and causal necessity (does perturbing it cause failure?); and (3) incorporating this importance into GRPO advantage estimation as a per-action bonus term.

This addresses a genuine limitation of standard GRPO, which assigns uniform trajectory-level advantages to all timesteps regardless of their structural importance.
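The contrast between uniform and per-action advantages can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the additive form `A + eta * w_t`, the coefficient name `eta`, and the `importance` mapping are assumptions, since the review only states that importance enters as a per-action bonus term.

```python
from statistics import mean, pstdev

def grpo_advantages(returns, eps=1e-8):
    """Group-normalized advantage: one scalar per trajectory in the group."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]

def broadcast_uniform(advantage, horizon):
    """Standard GRPO: every timestep inherits the trajectory-level advantage."""
    return [advantage] * horizon

def add_planning_bonus(advantage, horizon, importance, eta=0.5):
    """Hypothetical per-action bonus: A_t = A + eta * w_t at planning steps.

    `importance` maps timestep -> weight w_t in [0, 1]. Steps absent from the
    map are treated as execution actions and keep the plain advantage.
    """
    return [advantage + eta * importance.get(t, 0.0) for t in range(horizon)]

# Group of four rollouts with binary success rewards (illustrative values).
returns = [1.0, 0.0, 1.0, 0.0]
adv = grpo_advantages(returns)

# First trajectory, five timesteps: uniform credit vs. planning-aware credit.
uniform = broadcast_uniform(adv[0], horizon=5)
weighted = add_planning_bonus(adv[0], horizon=5, importance={1: 0.8, 3: 0.4})
```

In this sketch only timesteps 1 and 3 receive extra weight. Every other step keeps the trajectory-level signal, which matches the abstract's statement that the whole trajectory is still optimized by trajectory-level feedback.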

2. Methodological Rigor

Strengths in formulation: The causal framework (Definition 1) is well-motivated, drawing on Pearl's notions of sufficiency and necessity. The harmonic mean aggregation (Eq. 10) is a reasonable choice for requiring both properties simultaneously.
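A small numeric check shows why a harmonic mean enforces both properties at once. The sketch assumes an unweighted two-term harmonic mean over scores in [0, 1]; the exact form of Eq. 10 is not reproduced here, and the function names are illustrative.

```python
def harmonic_importance(sufficiency, necessity):
    """Harmonic mean of two scores in [0, 1]; zero if either score is zero."""
    if sufficiency + necessity == 0:
        return 0.0
    return 2 * sufficiency * necessity / (sufficiency + necessity)

def arithmetic_importance(sufficiency, necessity):
    """Plain average, shown only for comparison."""
    return (sufficiency + necessity) / 2

balanced = harmonic_importance(0.8, 0.8)   # both properties hold -> 0.8
lopsided = harmonic_importance(0.9, 0.1)   # sufficient, barely necessary -> 0.18
```

For the lopsided case the arithmetic mean would still report 0.5, whereas the harmonic mean drops to 0.18, so an action that scores well on only one criterion receives little emphasis.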

Concerns about implementation: The paper is notably vague about how causal sufficiency and necessity are actually computed in practice. Equations 8 and 9 involve expectations over trajectory completions conditioned on action prefixes—computing these requires rolling out the policy from perturbed states, which is extremely expensive in simulation. The paper does not clearly specify: (a) how many rollouts are used to estimate these expectations, (b) how perturbations ā_t are generated ("feasible small perturbation" is underspecified), (c) the computational overhead relative to standard GRPO. This is a significant gap—the theoretical framework is clean, but the practical approximation details are largely absent from the main text.
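To make the missing details concrete, here is one way such expectations could be approximated by Monte Carlo rollouts. Everything in it is an assumption made for illustration: the proxies (sufficiency as the success rate when the action is kept, necessity as the failure rate when it is perturbed), the rollout budget, and the toy simulator `toy_rollout` are not taken from the paper.

```python
import random

def estimate_success(rollout_fn, prefix, n_rollouts, rng):
    """Monte Carlo estimate of P(success | action prefix)."""
    wins = sum(rollout_fn(prefix, rng) for _ in range(n_rollouts))
    return wins / n_rollouts

def sufficiency_and_necessity(rollout_fn, prefix, action, perturbed, n_rollouts, rng):
    """Two estimates per planning action: keep the action, then perturb it."""
    p_keep = estimate_success(rollout_fn, prefix + [action], n_rollouts, rng)
    p_perturb = estimate_success(rollout_fn, prefix + [perturbed], n_rollouts, rng)
    sufficiency = p_keep          # keeping the action: how often does the task succeed?
    necessity = 1.0 - p_perturb   # perturbing the action: how often does the task fail?
    return sufficiency, necessity

def toy_rollout(prefix, rng):
    """Stand-in simulator: success is likely only if the last action is 'grasp'."""
    p_success = 0.9 if prefix[-1] == "grasp" else 0.2
    return rng.random() < p_success

rng = random.Random(0)
s, n = sufficiency_and_necessity(
    toy_rollout, ["approach"], "grasp", "grasp_offset", n_rollouts=400, rng=rng
)
```

Even this toy version needs `2 * n_rollouts` policy completions per planning action, which is why the rollout count and the perturbation scheme matter for both reproducibility and cost.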

Planning action identification: The use of L1 action variation (Eq. 6) combined with outcome gating (Eq. 7) is heuristic but reasonable. However, the top-k selection introduces a hyperparameter whose sensitivity is not thoroughly analyzed. The outcome gate g(τ) biases selection toward successful trajectories, which means planning actions are primarily identified from good trajectories—this could limit learning from failures.
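The selection rule can be sketched as follows. The sketch assumes a soft gate (weight 1.0 for successful trajectories and a hypothetical `fail_gate` below 1 for failed ones) and top-k selection across the whole rollout group; the paper's exact forms of Eq. 6 and Eq. 7 may differ.

```python
def action_variation(actions):
    """L1 change between consecutive actions; the first step has no predecessor."""
    scores = [0.0]
    for prev, cur in zip(actions, actions[1:]):
        scores.append(sum(abs(c - p) for c, p in zip(cur, prev)))
    return scores

def select_planning_steps(group, k, fail_gate=0.25):
    """group: list of (actions, success). Returns the top-k (traj_idx, t) pairs."""
    scored = []
    for i, (actions, success) in enumerate(group):
        gate = 1.0 if success else fail_gate   # hypothetical outcome gate g(tau)
        for t, v in enumerate(action_variation(actions)):
            scored.append((gate * v, i, t))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(i, t) for _, i, t in scored[:k]]

# Two-dimensional toy actions: (end-effector displacement, gripper command).
succeeded = [(0.0, 0.0), (0.1, 0.0), (0.1, 1.0), (0.2, 1.0)]
failed = [(0.0, 0.0), (0.5, 0.0), (1.0, 1.0), (1.0, 1.0)]
group = [(succeeded, True), (failed, False)]

gated_pick = select_planning_steps(group, k=1)                   # [(0, 2)]
ungated_pick = select_planning_steps(group, k=1, fail_gate=1.0)  # [(1, 2)]
```

With the gate active, the gripper-closing step of the successful trajectory is chosen even though the failed trajectory contains a larger raw action change. That is the bias toward successful rollouts described above, and it also shows how both `k` and the gate strength shape what the optimizer emphasizes.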

Ablation quality: The ablation study (Table 3) is minimal, showing only the removal of sufficiency and/or necessity components. Missing ablations include: the effect of top-k, the impact of the outcome gate alone, comparison of different importance aggregation strategies, and analysis of planning action identification quality.

3. Potential Impact

The paper addresses an important problem—structured credit assignment in VLA policy optimization. If the approach generalizes well, it could improve sample efficiency and reliability of RL-based VLA fine-tuning, which is an active area. The planner/executor decomposition is intuitive and could inspire related work on hierarchical credit assignment for robotic control.

However, the practical impact is constrained by: (a) the method's computational cost (multiple rollouts per planning action for causal estimation), (b) its reliance on simulation environments for the counterfactual evaluations, which makes real-world deployment challenging, and (c) its dependence on GRPO, which ties it to a specific optimization framework.
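A back-of-the-envelope count illustrates the cost concern in (a). All numbers below are hypothetical, since the paper reportedly does not state its group size, the number of planning actions per trajectory, or the rollouts per estimate.

```python
def extra_rollouts(group_size, planning_steps_per_traj, rollouts_per_estimate):
    """Counterfactual rollouts added per GRPO group.

    Assumes each planning action needs two estimates: one with the action
    kept (sufficiency) and one with it perturbed (necessity).
    """
    estimates_per_action = 2
    return (group_size * planning_steps_per_traj
            * estimates_per_action * rollouts_per_estimate)

baseline = 8   # hypothetical: rollouts a standard GRPO group already runs
extra = extra_rollouts(group_size=8, planning_steps_per_traj=3, rollouts_per_estimate=4)
overhead = extra / baseline   # 192 extra rollouts, i.e. 24x the baseline count
```

The ratio overstates wall-clock cost somewhat, because counterfactual completions start mid-trajectory and are shorter than full rollouts. They do, however, require resetting the environment to an intermediate state, which is straightforward in simulation and difficult on a physical robot.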

4. Timeliness & Relevance

The paper is timely. VLA models are rapidly proliferating (OpenVLA, π0, etc.), and post-training optimization is an active frontier (RIPT-VLA, TGRPO appeared in 2025). The application of GRPO-style optimization to embodied agents is nascent, and structured credit assignment within this paradigm is a genuine gap. The connection to causal reasoning in robotics is also trending.

5. Strengths & Limitations

Key Strengths:

Strong empirical results: On LIBERO, the method achieves 0.96 average success rate versus 0.87 for the next-best method (Nora) and 0.81 for TGRPO, which is a substantial improvement, especially on LIBERO-Long (0.94 vs. 0.74).

Conceptual clarity: The planner/executor decomposition is intuitive and well-articulated.

Multi-benchmark evaluation: Results on both LIBERO and RoboTwin2.0 across different horizon lengths strengthen the claims.

Causal framework: The use of sufficiency and necessity provides principled justification for importance weighting beyond simple heuristics.

Notable Limitations:

Implementation details gap: The most critical weakness. How the causal quantities are estimated in practice—number of rollouts, perturbation generation, computational cost—is insufficiently described. This severely limits reproducibility.

Baseline fairness concerns: The comparison on LIBERO may not be fully fair. OpenVLA-OFT (0.91) uses a different fine-tuning approach; the base model and training data for "Ours" are not clearly specified in the main text. The improvement over TGRPO (0.81→0.96) is large enough to warrant deeper analysis of what drives it.

Limited ablation depth: No analysis of planning action quality (are the identified actions truly at phase transitions?), no visualization of which timesteps receive high importance, no computational cost comparison.

Scalability questions: The counterfactual rollout requirements could be prohibitive for real-robot training or more complex environments.

Real-world validation: All experiments are in simulation. No real-robot experiments are reported.

Writing quality: Generally clear but the paper reads as somewhat incremental—the ideas are natural extensions of existing work (GRPO + credit assignment + causal reasoning), combined competently but without deep theoretical insight.

6. Additional Observations

The margin on LIBERO-Long (0.94 vs. the next-best 0.86 from OpenVLA-OFT) is impressive and supports the claim that planning-aware optimization helps particularly on multi-stage tasks. However, the RoboTwin results are more mixed: on some individual tasks (e.g., Handover Mic: 86.7 vs. π0's 96.0), the method underperforms baselines, suggesting the approach is not universally beneficial.

The paper is an arXiv preprint (May 2026) and has not been peer reviewed. The η sensitivity analysis is mentioned, but its results are not shown in the main text.

Rating: 5.8 / 10

Significance 6.5 · Rigor 4.5 · Novelty 6 · Clarity 6.5

Generated May 20, 2026

Comparison History (21)

vs. Conflict-Aware Active Perception and Control in 3D Gaussian Splatting Fields via Control Barrier Functions

gemini-3.1 · 5/21/2026

Paper 2 addresses the reliability of Vision-Language-Action (VLA) models, a rapidly growing and highly impactful area in general-purpose robotics. By improving planning capabilities within foundation models using causal reasoning and RL optimization, it offers broader applicability across diverse robotic tasks. While Paper 1 presents a novel integration of 3D Gaussian Splatting and control theory for active perception, Paper 2's focus on enhancing the reasoning of large robotic models aligns more closely with current transformative trends in AI and robotics, promising a wider scientific and practical impact.

vs. Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer

claude-opus-4.6 · 5/21/2026

Paper 2 addresses a fundamental question in dexterous manipulation—whether joint sensing alone suffices for complex in-hand manipulation—with a novel Proprioceptive Transformer architecture and strong real-world results (3.1x speed improvement). It challenges the prevailing assumption that vision/tactile sensing is necessary, has clear real-world applicability to any robotic hand, and introduces a paradigm shift in how proprioceptive data can be leveraged. Paper 1, while methodologically interesting, is more incremental—refining VLA optimization with planning-aware weighting—and its impact is narrower within the VLA training pipeline.

vs. Reinforcement Learning for Risk Adaptation via Differentiable CVaR Barrier Functions

gemini-3.1 · 5/21/2026

Paper 2 addresses Vision-Language-Action (VLA) models, a highly active and rapidly growing area at the intersection of foundation models and robotics. Its novel use of causal inference to optimize planning actions within VLAs offers a broader potential impact on embodied AI. While Paper 1 presents a solid methodological contribution to safe navigation, Paper 2's focus on improving the reliability of generalized VLA policies is more timely and applicable to a wider range of complex manipulation tasks.

vs. Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

gemini-3.1 · 5/21/2026

Paper 2 addresses the reliability of Vision-Language-Action (VLA) models, a rapidly expanding and highly influential frontier in embodied AI. By introducing causal reasoning to identify and prioritize planning actions during optimization, it offers a novel, broadly applicable methodological advancement. While Paper 1 presents a rigorous and innovative approach for contact-rich manipulation, Paper 2's focus on generalist language-guided robots gives it higher potential for widespread impact, timeliness, and cross-disciplinary relevance in the current AI landscape.

vs. GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

gemini-3.1 · 5/21/2026

Paper 1 introduces a highly novel integration of 3D Gaussian Splatting into Vision-Language-Action models, addressing the critical lack of 3D geometric and temporal grounding in current robotic policies. By using 3D Gaussians solely for training supervision and discarding them at inference, it brilliantly avoids computational overhead during closed-loop control. The exceptional empirical results (e.g., 98.4% on LIBERO) highlight its immediate practical potential. While Paper 2 offers a valuable RL optimization strategy, Paper 1 represents a more paradigm-shifting architectural innovation bridging cutting-edge 3D vision and robotic manipulation, indicating broader and more disruptive scientific impact.

vs. Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

gemini-3.1 · 5/21/2026

Demo-JEPA tackles the highly challenging and broadly applicable problem of cross-embodiment imitation learning. By decoupling demonstration intent from embodiment-specific execution using a JEPA-based world model, it eliminates the need for shared action spaces or heuristic retargeting. This state-centric approach offers a more scalable and flexible paradigm for robotic learning from human or heterogeneous robot demonstrations compared to the more incremental optimization improvements for VLA models presented in Paper 1.

vs. AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

gpt-5.2 · 5/20/2026

PAPO-VLA introduces a more broadly applicable optimization perspective—explicitly separating planning vs. execution within VLA policies and weighting updates by causal importance of planning actions—potentially improving reliability across many tasks, models, and datasets. This is timely for closed-loop, long-horizon robotics where small decision errors cascade. AffordVLA is valuable but more specialized to perception/representation (affordance alignment) and may depend on teacher quality and affordance transferability. PAPO-VLA’s method-level contribution is likely to generalize beyond manipulation perception, giving wider cross-field impact.

vs. Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

gemini-3.1 · 5/20/2026

Paper 1 offers a broader algorithmic contribution to the rapidly expanding field of Vision-Language-Action models. By framing actions as either planning or executing and introducing causal metrics to estimate their importance during policy optimization, it addresses a fundamental challenge in embodied AI. While Paper 2 demonstrates impressive hardware results for humanoids, its methodological contribution (a state-dependent discriminator gate) is more incremental. Paper 1's approach has wider theoretical implications and broader applicability across diverse robotic manipulation tasks.

vs. Learning-Based Adaptive Control for Surgical Robotic Exposure Task on Deformable Tissues

claude-opus-4.6 · 5/20/2026

PAPO-VLA addresses a fundamental challenge in the rapidly growing VLA/foundation model space for robotics, proposing a general optimization framework applicable across diverse manipulation tasks. Its conceptual contribution—distinguishing planning vs. execution actions and using causal reasoning for importance weighting—has broad applicability beyond any single domain. Paper 1, while solid, addresses a narrower surgical robotics problem with incremental methodological advances. The VLA optimization space is seeing explosive growth, giving Paper 2 higher potential for citations, follow-up work, and cross-field influence.

vs. ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

gemini-3.1 · 5/20/2026

Paper 2 introduces a pioneering benchmark and simulator for vision-language manipulation in soft robotics, a highly underexplored area compared to rigid arms. By providing a scalable dataset and tasks for this novel intersection, it acts as a foundational testbed that will catalyze future research. While Paper 1 offers a valuable algorithmic optimization for VLA models, Paper 2's introduction of a new domain to VLA research offers greater novelty, broader real-world applications (e.g., medical, confined spaces), and higher long-term community impact.

vs. RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

gpt-5.2 · 5/20/2026

Paper 2 (RoHIL) targets a concrete, deployment-critical failure mode—illumination-induced distribution shift—validated on real robots across multiple tasks, with an offline adaptation recipe that avoids extra robot interaction and mitigates catastrophic forgetting. Its combination of synthetic relighting, replay-based retention, and anchored regularization is timely and broadly relevant to sim2real/domain adaptation, continual learning, and practical HIL-RL deployment. Paper 1 is conceptually novel for VLA reliability via planning-action importance, but likely narrower in immediate real-world impact and harder to validate causally at scale than RoHIL’s clear deployment gains.

vs. DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

gpt-5.2 · 5/20/2026

Paper 2 (DEFLECT) targets a pervasive, deployment-critical failure mode—asynchronous inference delay—showing dramatic baseline collapse and offering a near drop-in, fully offline fix using label-free counterfactual preferences and a flow-matching likelihood-ratio surrogate. Its real-world relevance (latency-robust control), strong application potential across many VLA/robot stacks, and demonstrated transfer to real-scale models and multiple real-robot tasks suggest broader and more immediate impact. Paper 1’s planning-aware credit assignment is novel and useful, but is less tightly tied to a widely shared systems bottleneck and appears less broadly generalizable.

vs. REBAR: Reference Ethical Benchmark for Autonomy Readiness

claude-opus-4.6 · 5/20/2026

PAPO-VLA addresses a core technical challenge in VLA models—improving reliability through planning-aware optimization—with a well-defined methodology combining causal reasoning with policy optimization. It has immediate practical applications in robotics and contributes to the rapidly growing VLA/embodied AI field. While REBAR addresses an important gap in ethical evaluation of autonomous systems, it is more framework-oriented and its impact depends heavily on adoption. PAPO-VLA's technical contributions (causal importance estimation, planning action identification) are more generalizable and methodologically novel within a highly active research area.

vs. Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

claude-opus-4.6 · 5/20/2026

PAPO-VLA addresses a fundamental challenge in VLA model optimization by introducing a novel planning-aware framework that distinguishes planning from execution actions using causal reasoning. This targets the rapidly growing VLA/foundation models for robotics field, which has broad impact potential. Paper 2 presents a useful but more incremental contribution—using neural networks to warm-start trajectory optimization for UAV-UGV coordination—which is a narrower application domain with more established methodological patterns (learning-based warm starting). PAPO-VLA's conceptual novelty and relevance to the trending VLA research area give it higher impact potential.

vs. CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

gpt-5.2 · 5/20/2026

Paper 2 (CANINE) likely has higher scientific impact due to clear, high-value real-world application (assistive navigation for visually impaired users), strong timeliness, and broader societal/clinical relevance. It integrates adaptive instruction (knowledge tracing + foundation-model-based error diagnosis) and provides human-subject evidence including controlled, retention, and real-user case studies, suggesting methodological rigor and translational potential. Paper 1 is technically novel for VLA reliability, but its impact may be narrower (robot learning methodology) and less immediately validated in real-world deployments.

vs. ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

claude-opus-4.6 · 5/20/2026

PAPO-VLA addresses a fundamental challenge in Vision-Language-Action models—a rapidly growing field at the intersection of foundation models and robotics. Its novel decomposition of VLA policies into planning and execution components, combined with causal importance estimation integrated into GRPO optimization, offers a broadly applicable methodological contribution. Paper 1 introduces a niche RL benchmark inspired by game NPCs with limited generalizability beyond its specific domain. PAPO-VLA's relevance to the booming VLA/foundation-model-for-robotics community gives it significantly higher potential impact.

vs. Bilateral Teleoperation with Compliant 6-DOF Pose-and-Force Sensing

gpt-5.2 · 5/20/2026

Paper 1 likely has higher impact due to novelty and broad relevance: it introduces a planning-aware optimization framework for VLA policies, combining identification of planning actions with causal importance (necessity/sufficiency) and integrating this into policy optimization—concepts applicable across many embodied AI/robot learning settings. Its potential applications span general-purpose language-conditioned robotics and reliability of foundation-model-based control, a timely, fast-growing area. Paper 2 is methodologically solid and practically useful, but is more engineering-specific (teleoperation hardware/control architecture) with narrower cross-field influence.

vs. Multi-Session Ground Texture SLAM in Low-Dynamic Environments

claude-opus-4.6 · 5/20/2026

PAPO-VLA addresses a fundamental challenge in the rapidly growing VLA model space, proposing a novel planning-aware optimization framework that distinguishes planning from execution actions using causal reasoning. This targets a high-impact area (foundation models for robotics) with broad applicability. Paper 2, while solid, addresses a narrower niche (ground texture SLAM in low-dynamic environments) with incremental contributions (applying KL divergence for loop closure). The breadth of impact, timeliness given the VLA model boom, and methodological novelty of Paper 1 give it significantly higher potential impact.

vs. ElasticFlow: One-Step Physics-Consistent Policy with Elastic Time Horizons for Language-Guided Manipulation

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact due to a more broadly enabling contribution: a physics-consistent, distillation-free one-step diffusion-style policy that directly addresses a major bottleneck (latency) while claiming strong performance on multiple standard benchmarks and long-horizon tasks. The Elastic Time Horizons idea targets a general temporal mismatch problem across robotic tasks, potentially influencing both embodied AI and generative-model acceleration. Paper 1 is novel in reliability via planning-action attribution, but its impact may be narrower and more dependent on the specific VLA optimization pipeline.

vs. Topology-Optimized Pneumatic Soft Actuator: Design and Experimental Validation

gpt-5.2 · 5/20/2026

Paper 2 has higher estimated scientific impact due to its timeliness and broad relevance to rapidly advancing vision-language-action robotics, with potential spillover to RL, causality-aware learning, and policy optimization. Its planning/execution decomposition and causal importance weighting are conceptually novel and likely reusable across tasks and model classes, with clear real-world applications in reliable manipulation. Paper 1 is methodologically rigorous and valuable for soft robotics design, but its impact is narrower and more incremental (extending existing topology optimization to 3D) with a more domain-specific audience.