PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

Shizhe Chen, Paul Pacaud, Cordelia Schmid

May 20, 2026 · arXiv:2605.21414v1

cs.RO · cs.CV

#763 of 3713 · Robotics

Tournament Score: 1471 ± 44

Win Rate: 67%

Rating: 6.8 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 7.5

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PointACT

1. Core Contribution

PointACT addresses a well-recognized limitation of current VLA models: their reliance on 2D visual representations, which constrains spatial reasoning for precise robotic manipulation. The paper proposes a dual-system architecture where a frozen pretrained VLM backbone handles semantic understanding while a dedicated "point-action expert" integrates hierarchical 3D point cloud features directly into action decoding. The key technical novelty is the multi-scale point-action interaction mechanism using bottleneck window self-attention, where action tokens serve as information bottlenecks that aggregate local geometric context from spatially-windowed point cloud partitions across multiple PTv3 encoder stages. This design achieves linear computational scaling with respect to the number of windows while allowing dense geometric conditioning of evolving action tokens.
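As a rough illustration of the bottleneck idea described above, the sketch below copies a small set of action tokens into each fixed-size window of point features, runs plain single-head self-attention inside each window, and averages the per-window action outputs. Everything in it is an assumption made for illustration: the function name, the absence of learned projections, the mean aggregation, and the pre-ordered fixed-size windows are not taken from the paper, whose actual layer may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bottleneck_window_attention(points, actions, window_size):
    """Hypothetical sketch of window self-attention with shared action tokens.

    points:  (N, d) point features, assumed already ordered so that
             consecutive rows are spatial neighbours.
    actions: (A, d) action tokens, copied into every window.
    Returns (new_points, new_actions) with the same shapes as the inputs.
    """
    n, d = points.shape
    a = actions.shape[0]
    if n % window_size != 0:
        raise ValueError("this sketch assumes N is a multiple of window_size")
    num_windows = n // window_size

    windows = points.reshape(num_windows, window_size, d)
    shared = np.broadcast_to(actions, (num_windows, a, d))
    tokens = np.concatenate([shared, windows], axis=1)  # (W, A + w, d)

    # Self-attention restricted to each window (no learned projections here).
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    out = softmax(scores, axis=-1) @ tokens  # (W, A + w, d)

    # Action tokens act as the bottleneck: average their per-window outputs
    # so that each one summarises geometry from the whole scene.
    new_actions = out[:, :a].mean(axis=0)
    new_points = out[:, a:].reshape(n, d)
    return new_points, new_actions
```

Because points only attend within their own window, the action tokens are the only route by which information moves between windows in this sketch, which is the sense in which they serve as a bottleneck.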

The paper also provides a systematic comparison of where to inject 3D information in VLAs—backbone-level (monolithic) vs. action-expert-level (dual-system)—which is a useful architectural study for the community.

2. Methodological Rigor

The experimental design is generally thorough. The paper evaluates on two complementary benchmarks (LIBERO for short-horizon delta actions; RLBench for long-horizon keypoint actions), covers both regression and classification action heads, and includes real-robot experiments on two platforms (SO-100 and UR5).

Strengths in rigor:

Controlled ablations comparing monolithic vs. dual-system injection, multi-scale vs. single-scale features, and naive concatenation vs. the proposed interaction mechanism (Tables III, IV)

Reproduction of baselines (EO1, ACT3D) under identical evaluation protocols for fair comparison on RLBench

100 evaluation episodes per task on RLBench (vs. 20 in some prior work), acknowledging execution variability

Variance analysis across seeds (82.33 ± 0.65 on RLBench)

Weaknesses in rigor:

LIBERO comparisons are complicated by varying training setups across methods (different pretraining data, multi-view cameras, number of training tasks). The paper acknowledges this, but it still weakens the comparative claims.

Real-robot experiments use only 10 test episodes per task with 20-50 training demonstrations—small sample sizes that limit statistical significance.

The frozen VLM vs. finetuned VLM comparison is somewhat confounded: EO1 finetunes the VLM backbone while PointACT keeps it frozen, making direct attribution of improvements to 3D integration vs. training strategy difficult.

The monolithic VLA baseline (EO1 + Point) concatenates point tokens from only the final PTv3 layer, while PointACT uses multi-scale features, making the monolithic vs. dual-system comparison partially confounded with the multi-scale mechanism.
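To make the sample-size concern about 10-episode real-robot evaluations concrete, the sketch below computes a Wilson score interval for a binomial success rate. The 8-out-of-10 outcome used in the usage note is a hypothetical example, not a result from the paper, and the 95% level (z = 1.96) is an assumed choice.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical outcome: 8 successes in 10 episodes.
low, high = wilson_interval(8, 10)
print(f"8/10 successes: 95% interval roughly {low:.2f} to {high:.2f}")
```

With only 10 episodes, the interval for this hypothetical outcome spans roughly 0.49 to 0.94, so two methods that differ by one or two successes per task cannot be separated with confidence.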

3. Potential Impact

The work has moderate-to-high practical relevance:

Immediate applicability: With RGB-D cameras becoming standard on robotic platforms, the method addresses a practical integration challenge. The approach works with readily available depth sensors.

Architectural insight: The finding that backbone-level 3D injection can harm pretrained VLM representations (Table III, RLBench: 73.2 → 18.6 for EO1 + Point) is an important negative result that could save the community significant effort in exploring unproductive directions.

Pretrained 3D encoders: The demonstration that PTv3 pretrained on out-of-domain building-level scenes still provides useful geometric priors (Figure 5) is encouraging, suggesting that as manipulation-specific 3D pretraining data grows, further gains are possible.

Cross-benchmark consistency: Gains on both LIBERO (delta actions) and RLBench (keypoint actions) suggest the approach generalizes across action spaces.

The improvement on RLBench-10Tasks over EO1 (73.2 → 82.3, about 9 percentage points, which the abstract reports as 10%) is substantial, particularly given the challenging nature of the benchmark.
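The two success rates quoted above (73.2 and 82.3) can be read either as an absolute gain in percentage points or as a relative gain over the baseline. The small helper below, whose name is chosen here for illustration, computes both so the distinction is explicit.

```python
def improvement(baseline, new):
    """Return (absolute gain in points, relative gain in percent)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


abs_gain, rel_gain = improvement(73.2, 82.3)
print(f"absolute: {abs_gain:.1f} points, relative: {rel_gain:.1f}%")
```

For these figures the absolute gain is about 9.1 points and the relative gain is about 12.4%.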

4. Timeliness & Relevance

This work is highly timely. VLAs are experiencing rapid development (π0, GR00T, OpenVLA, EO1), and the 2D-to-3D gap is widely acknowledged as a bottleneck. The concurrent emergence of strong 3D pretrained models (PTv3, etc.) creates a natural opportunity for this type of integration. The paper positions itself well within the current discourse on how to augment VLAs with spatial reasoning without disrupting pretrained knowledge.

5. Strengths & Limitations

Key Strengths:

Clean architectural design with clear separation of concerns (frozen VLM for semantics, trainable point-action expert for geometry)

The bottleneck window self-attention is computationally efficient (linear in windows) and conceptually elegant

Only ~300M trainable parameters for the action expert, vs. 1-3B for many baselines

Comprehensive ablation studies that isolate the contribution of each design choice

Multi-platform real-robot validation with different action spaces
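The efficiency claim in the strengths above can be sanity-checked with a simple count of attention score entries. The sketch assumes, for illustration only, that every window attends over its own points plus a copy of the action tokens; the function names and the example sizes are hypothetical and not taken from the paper.

```python
import math


def full_attention_pairs(num_points, num_actions):
    """Score entries when all point and action tokens attend to each other."""
    total = num_points + num_actions
    return total * total


def windowed_attention_pairs(num_points, num_actions, window_size):
    """Score entries when attention is restricted to fixed-size windows,
    each of which also contains a copy of the action tokens."""
    num_windows = math.ceil(num_points / window_size)
    return num_windows * (window_size + num_actions) ** 2


for n in (1024, 2048, 4096):
    print(n, full_attention_pairs(n, 8), windowed_attention_pairs(n, 8, 64))
```

Under this assumption, doubling the number of points doubles the number of windows and therefore the windowed cost, while the full-attention cost grows roughly fourfold.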

Notable Limitations:

The method requires calibrated RGB-D cameras, limiting deployment flexibility compared to pure RGB methods

Real-world results show sensitivity to point cloud quality (transparent objects cause failures, as noted for the Close Drawer task)

No evaluation on truly diverse real-world tasks at scale; real experiments involve only 3 tasks per platform

The paper does not address sim-to-real transfer or domain adaptation

Limited analysis of failure modes related to the 3D representation itself (e.g., what happens with very sparse or noisy point clouds beyond the sensitivity analysis)

The method inherits limitations of behavior cloning (no failure recovery), as acknowledged in the failure analysis

Missing Comparisons:

No comparison with DepthVLA [66] or 3DS-VLA [38] on the same benchmarks, which are the most directly competing 3D-aware VLA methods.

Overall Assessment

PointACT makes a solid engineering and empirical contribution to the active area of 3D-aware VLA design. The multi-scale point-action interaction mechanism is well-motivated, efficiently designed, and convincingly validated through ablations. The systematic comparison of 3D integration strategies provides useful architectural guidance. However, the novelty is primarily in the integration design rather than in fundamentally new algorithmic concepts. The real-world validation, while present, is limited in scale. The work represents a meaningful incremental advance that could influence how future VLA systems incorporate 3D geometry.

Rating: 6.8 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 7.5

Generated May 21, 2026

Comparison History (24)

Lost vs. Sparse Compositional Flow Matching by geometric assembly from motion primitives

Paper 2 introduces a fundamental shift in embodied trajectory generation by combining flow matching with compositional motion primitives directly in physical space. This addresses the sample inefficiency of monolithic generators and offers broad applicability across various embodied AI domains (manipulators, mobile robots). While Paper 1 provides a strong architectural improvement for VLA models using 3D data, Paper 2's methodological innovation in structured generative modeling presents a more paradigm-shifting approach with higher potential for widespread impact across the field.

gemini-3.1-pro-preview · May 25, 2026

Lost vs. Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Paper 2 addresses a critical bottleneck in the rapid deployment of humanoid robots: cross-embodiment transfer. By enabling the reuse of whole-body tracking models with only 1% of the original compute and data, it offers a highly scalable and cost-effective paradigm. While Paper 1 provides a strong methodological improvement for 3D manipulation, Paper 2's potential to dramatically accelerate the adoption and development of diverse humanoid platforms gives it a higher potential for broad scientific and industry impact.

gemini-3.1-pro-preview · May 25, 2026

Won vs. Direct Dynamic Retargeting for Humanoid Imitation Learning from Videos

PointACT addresses the fundamental limitation of 2D representations in VLA models by integrating hierarchical 3D point cloud representations, which is highly relevant to the rapidly growing VLA/foundation model community. It demonstrates strong empirical gains (10%+ on RLBench) and offers broadly applicable insights about coupling 3D geometry with 2D semantics. While Paper 1 makes a solid contribution to humanoid imitation learning with its direct dynamic retargeting approach, Paper 2 targets a larger and more active research area (general-purpose robotic manipulation via foundation models), has broader applicability across manipulation tasks, and its architectural innovations are more likely to influence subsequent work in the VLA space.

claude-opus-4-6 · May 25, 2026

Won vs. GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Paper 2 addresses a fundamental limitation in current VLA models—lack of 3D spatial reasoning—by directly integrating multi-scale point cloud representations into the action decoding process. While Paper 1 offers an innovative human-robot interaction approach using gestures, Paper 2's focus on fine-grained geometric grounding is more broadly applicable to core robotic manipulation challenges. Furthermore, Paper 2 demonstrates strong methodological rigor with systematic evaluations on established benchmarks (RLBench and LIBERO), making its architectural contributions highly likely to influence future foundational models in robotics.

gemini-3.1-pro-preview · May 22, 2026

Lost vs. Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action

SOMA addresses a more fundamental and underexplored limitation of VLA models—operating when task-relevant objects are out of the camera's field of view. This is a highly practical real-world constraint that most existing work ignores. The spatial memory framework introduces a novel architectural concept (persistent memory with construction, refinement, and retrieval) validated on real-world tasks including dual-arm scenarios. PointACT, while solid, addresses the more incremental problem of integrating 3D point clouds into VLAs, which has been explored in various forms. SOMA's problem framing is more novel and has broader implications for deploying robots in realistic, partially observable environments.

claude-opus-4-6 · May 22, 2026

Lost vs. Nautilus: From One Prompt to Plug-and-Play Robot Learning

Paper 1 offers higher potential scientific impact because it addresses a critical, universal bottleneck in robotics research: engineering fragmentation. By providing an LLM-driven, plug-and-play harness for cross-validating policies, simulators, and hardware, NAUTILUS functions as foundational infrastructure. While Paper 2 presents a strong architectural advancement for 3D-aware VLA models, Paper 1's tool could accelerate the entire field's workflow, similar to how unified frameworks revolutionized deep learning. Foundational tools that lower barriers to entry and standardize evaluation typically achieve broader, cross-cutting impact than specific algorithmic improvements.

gemini-3.1-pro-preview · May 21, 2026

Won vs. Concurrent Prehensile and Nonprehensile Manipulation: A Practical Approach to Multi-Stage Dexterous Tasks

PointACT addresses a fundamental limitation of VLA models by integrating 3D point cloud representations into action decoding, with broad implications for the rapidly growing VLA field. Its systematic evaluation on standard benchmarks (LIBERO, RLBench) with 10% improvements over SOTA, comprehensive ablations, and demonstration that hierarchical 3D geometry coupling matters provide foundational insights applicable across many robotic manipulation settings. Paper 2, while practically valuable with real-world dexterous manipulation results, addresses a narrower problem with a more specialized retrieve-align-execute pipeline that may have less generalizable impact across the robotics community.

claude-opus-4-6 · May 21, 2026

Won vs. Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

PointACT addresses a fundamental limitation in VLA models by integrating 3D point cloud representations into action decoding, demonstrating significant improvements (10%+ success rate gains) on established benchmarks (LIBERO, RLBench). It tackles a core challenge in robotic manipulation with broad applicability. Q-SpiRL, while novel in combining quantum computing with spiking networks for RL, is evaluated only on simple grid-world environments, limiting real-world impact. The quantum advantage remains unclear, and quantum hardware constraints limit near-term practical deployment. PointACT's contributions are more immediately impactful for the robotics community.

claude-opus-4-6 · May 21, 2026

Won vs. Fault-Tolerant, Rigidity-Preserving Control of Inflatable Truss Robots

Paper 1 addresses a major limitation in general-purpose robotic manipulation by integrating 3D spatial awareness into Vision-Language-Action (VLA) models. Given the rapid growth and broad applicability of foundation models in robotics, this approach has significantly higher potential for widespread impact across the field compared to Paper 2, which focuses on a highly specialized hardware paradigm (inflatable truss robots).

gemini-3.1-pro-preview · May 21, 2026

Lost vs. BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

BlockVLA addresses a fundamental efficiency bottleneck in VLA deployment (inference latency) with a novel architectural bridge between autoregressive and diffusion paradigms. The 3.3x inference acceleration and faster training convergence have broad implications for real-time robotic control. While PointACT's 3D-aware approach is valuable, BlockVLA's contribution is more foundational—it introduces a new computational paradigm (block diffusion for VLAs) that could be combined with various representation improvements including 3D awareness. The efficiency gains are critical for practical deployment, giving it broader potential impact.

claude-opus-4-6 · May 21, 2026
