The history of robotics is largely a history of programming: specify the task, encode the motion, define the workspace boundaries, repeat for every variation. This approach produced extraordinary precision — modern industrial robots weld, paint, and assemble with sub-millimetre repeatability — but at a cost. Every change to the task required an engineer. Every new environment required re-commissioning. Every object with different geometry required new programming. The robot was a sophisticated tool, not an adaptive system.
Vision-Language-Action (VLA) models are the first serious architectural attempt to change that at the model level. A VLA takes in a camera feed and a natural language instruction and outputs motion — the reasoning and the acting happen in the same system, not in separate pipelines stitched together. The robot does not execute a programme. It interprets a goal and figures out how to achieve it.
The evidence that something structural is shifting: ICLR 2026, one of the top machine learning conferences, received 164 VLA paper submissions — up from a single submission in 2024. That two-year trajectory is one of the fastest research acceleration curves in modern AI. AgiBot, the Chinese humanoid company that deployed 10,000 robots by March 2026, presented ACoT-VLA at CVPR 2026 — an action chain-of-thought model that reasons directly in the language of motor commands. The research community and the production deployment community are converging on the same architecture.
This article explains what VLA models are and how they work, maps the current model landscape, documents what they can and cannot do in 2026, and describes the engineering challenges that sit between a benchmark result and a factory floor.
What a VLA Model Is — and What It Is Not
The terminology is worth unpacking carefully because it is used loosely. A Vision-Language Model (VLM), such as GPT-4o or Gemini, perceives images and text and generates text. It has no mechanism for physical action. A Vision-Language-Action Model (VLA) extends this architecture with an action generation head: it takes visual observations and language instructions as input and outputs motor commands (joint torques, end-effector trajectories, or discrete action tokens). The physical world is both the input and the output domain.
The foundational architecture was established by Google DeepMind’s RT-2 in mid-2023. Its largest variant was a 55-billion-parameter PaLI-X model fine-tuned on real robot demonstration data, with robot actions encoded as discrete tokens and predicted the same way as text tokens. RT-2 demonstrated two properties that made it foundational: it could perform tasks it had never been explicitly trained on by transferring knowledge from its language pre-training, and it could execute multi-step reasoning using chain-of-thought prompting. It was also expensive, slow, and closed-source, which is why the field moved so fast to improve on it.
The key conceptual shift from classical robot programming is generalisation. A classically programmed robot handles the specific task it was programmed for. A VLA-equipped robot can generalise to variations it was not explicitly shown — different object positions, different lighting conditions, variations in instruction phrasing — by drawing on the contextual knowledge embedded in its vision and language pre-training. The robot understands the task rather than executing a fixed sequence of motions.
Key distinction: VLMs perceive and describe the world in language. VLAs perceive and act on the world in motor commands. The addition of an action output head is the architectural difference that puts a model in a robot body.
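The distinction can be stated as a difference in type signatures. The sketch below is illustrative only: `Observation`, `Action`, the two protocols, and `HoldStillPolicy` are hypothetical names invented for this example, and the 7-DoF end-effector action layout is an assumption rather than the interface of any model named in this article.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Observation:
    """One control-step input: camera pixels plus the instruction text."""
    image: List[List[List[int]]]  # H x W x 3 RGB values, 0-255
    instruction: str


@dataclass
class Action:
    """Hypothetical 7-DoF end-effector command (assumed layout)."""
    delta_xyz: List[float]  # translation delta
    delta_rpy: List[float]  # rotation delta (roll, pitch, yaw)
    gripper: float          # 0.0 = open, 1.0 = closed


class VLM(Protocol):
    """Perceives images and text, generates text."""
    def describe(self, obs: Observation) -> str: ...


class VLA(Protocol):
    """Perceives images and text, generates a motor command."""
    def act(self, obs: Observation) -> Action: ...


class HoldStillPolicy:
    """Placeholder VLA: ignores its input and commands no motion."""
    def act(self, obs: Observation) -> Action:
        return Action([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0.0)


obs = Observation(image=[[[0, 0, 0]]], instruction="pick up the red block")
action = HoldStillPolicy().act(obs)
```

The inputs are the same for both kinds of model; only the return type changes, which is the sense in which the action head is the architectural difference.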
How They Work: The Three-Layer Architecture
Understanding VLA mechanics requires understanding three components that every modern VLA combines, in different proportions and with different design choices.
1. The Vision-Language Backbone
The backbone is a pre-trained VLM — typically a 3B to 7B parameter model — that provides visual and linguistic understanding. It processes camera frames (from one or more perspectives) and the language instruction, generating a rich contextual representation of the task. Stanford’s OpenVLA uses a Llama 2 language backbone with a visual encoder that fuses DINOv2 and SigLIP features. NVIDIA’s GR00T N1.7 was pre-trained on over 20,000 hours of human egocentric video — the idea being that watching humans manipulate objects teaches the model physical intuitions about grip force, object deformation, and hand-eye coordination that are difficult to encode manually. The backbone is the reasoning engine. It transforms raw observations into a structured understanding of what needs to happen.
2. The Action Generation Head
The action head translates the backbone’s representation into concrete motor commands. This is where the most active architectural innovation is happening in 2026. Three approaches dominate:
- Discrete tokenisation: Actions are quantised into discrete bins and predicted as tokens — the same way RT-2 and OpenVLA work. Simple and effective, but inherently lossy for continuous motion.
- Diffusion / flow-matching: Actions are generated by a diffusion process or continuous normalising flow, producing smoother trajectories. Physical Intelligence’s π0 uses flow-matching; dVLA (ICLR 2026) achieves 96.4% on LIBERO using discrete diffusion with multimodal chain-of-thought.
- Action chain-of-thought: The model explicitly reasons in the action space before committing to motion — generating a coarse trajectory plan before executing fine motor control. ACoT-VLA (CVPR 2026) shows this approach achieves state-of-the-art on long-horizon benchmarks, particularly in the LIBERO-Long suite.
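To see why discrete tokenisation is lossy, consider a minimal sketch of uniform binning. The bin count of 256 and the normalised action range of [-1, 1] are assumptions chosen for illustration, and `tokenize` / `detokenize` are hypothetical helpers, not functions from any of the models above.

```python
N_BINS = 256  # assumed number of bins per action dimension


def tokenize(value, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    width = (high - low) / n_bins
    return min(int((clipped - low) / width), n_bins - 1)


def detokenize(token, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


action = [0.1234, -0.5, 0.9999]            # continuous command
tokens = [tokenize(a) for a in action]     # what the model would predict
recovered = [detokenize(t) for t in tokens]

# Round-trip error is bounded by half a bin width.
max_error = max(abs(a - r) for a, r in zip(action, recovered))
half_bin = (1.0 - (-1.0)) / N_BINS / 2
print(tokens, recovered, max_error, half_bin)
```

Every value inside a bin collapses to the same token, so the recovered command can differ from the original by up to half a bin width. Diffusion and flow-matching heads avoid this by producing continuous values directly.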
3. The Data Foundation
VLA models learn from demonstration data: recordings of a robot or a human performing tasks, paired with language descriptions of the goal. The scale and diversity of this data are currently the primary determinants of a model’s generalisation capability. OpenVLA was trained on 970,000 robot episodes from the Open X-Embodiment dataset, a collaboration between 21 institutions covering 22 different robot embodiments. GR00T N1.7 combined egocentric human video, real robot trajectories, and NVIDIA-generated synthetic data from Cosmos simulations. The open-world generalisation that π0.5 demonstrates, performing in entirely new environments without retraining, is a direct product of training data diversity, not architectural cleverness alone.
The 2026 VLA Model Landscape
The table below maps the primary VLA models available in 2026 — from the foundational closed models that established the field to the open-source and research-stage architectures that are pushing the frontier:
| Model | Developer | Parameters | Access | Key Characteristic |
| --- | --- | --- | --- | --- |
| RT-2 | Google DeepMind | 55B | Closed | Pioneered the VLA paradigm (2023); fine-tuned PaLI-X on robot demos |
| OpenVLA | Stanford (open-source) | 7B | Open | 970k episodes; outperforms RT-2 with 7x fewer parameters |
| π0 (pi-zero) | Physical Intelligence | ~3B | Weights public | Flow-matching action generation; strong dexterity; folds laundry |
| π0.5 | Physical Intelligence | ~3B | Closed | Open-world generalisation; new environments without retraining |
| GR00T N1 / N1.7 | NVIDIA | 3B (N1.7) | Open (N1.7) | 20,000+ hrs egocentric video; Isaac Sim integration; NemoClaw NL interface |
| ACoT-VLA | AgiBot / CVPR 2026 | Undisclosed | Research | Action chain-of-thought; reasons directly in action space; LIBERO-Long SOTA |
| dVLA | ICLR 2026 | Undisclosed | Research | Diffusion VLA with multimodal CoT; 96.4% on LIBERO benchmark |
| LingBot-VLA | Ant Group | Undisclosed | Research | Depth-enhanced; SOTA on GM-100 cross-platform benchmark (17.3% SR) |
| SmolVLA | Hugging Face | 450M | Open | Comparable to 7B VLAs at 450M params; flow-matching; async inference |
| NanoVLA | Academic | ~100M | Research | 52x faster inference on Jetson-class edge hardware; 98% fewer parameters |
Sources: NVIDIA Research, Stanford OpenVLA, Physical Intelligence, AgiBot / CVPR 2026, Hugging Face, ICLR 2026 submissions, Moritz Reuss ICLR 2026 VLA analysis. Access status as of May 2026.
Two observations about this landscape are worth emphasising. First, OpenVLA’s benchmark result, outperforming the closed RT-2 (55B parameters) by 16.5 percentage points in absolute task success rate across 29 tasks with 7x fewer parameters, demonstrated that parameter count was not the primary variable in the 2023–2024 generation of models; architecture and training data mattered more. This finding collapsed the assumption that closed, compute-intensive models from large labs would maintain a permanent capability lead.
Second, the efficiency research wave is delivering usable results. Hugging Face’s SmolVLA achieves performance comparable to 7B VLAs with only 450 million parameters, using flow-matching and asynchronous inference, and NanoVLA reports 52x faster inference than prior state-of-the-art models with 98% fewer parameters. The trajectory is toward smaller models, on-device inference, and open weights.
“By 2030, nearly 13 million robots will be in circulation globally. Most of them are still built the old way. Vision-Language-Action models are the first serious attempt to change that at the model level.” — Roboflow, April 2026
The Deployment Gap: Why Benchmark Results Do Not Equal Factory Performance
The most important insight in a16z’s January 2026 analysis of physical AI — “The Physical AI Deployment Gap” — is a constraint that every practitioner deploying VLAs in real environments already knows: manipulation tasks require robot control at 20–100 Hz. A 7B-parameter VLA running on edge hardware achieves inference at 10–20 Hz at best. That gap — between what the task requires and what the current model can deliver — is the central engineering problem of VLA deployment in 2026.
The physics of the constraint are unambiguous. A robot arm tracking a moving object, adjusting grip force in response to tactile feedback, or maintaining balance during dynamic locomotion cannot wait 50–100 milliseconds between control decisions. An LLM that takes two seconds to start streaming is an inconvenience. A robot that takes two seconds to react is a safety hazard.
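The rates quoted above can be converted into per-decision time budgets with simple arithmetic. The sketch below only restates the article's figures; `period_ms` is a hypothetical helper name.

```python
def period_ms(rate_hz: float) -> float:
    """Time between consecutive decisions at a given rate, in milliseconds."""
    return 1000.0 / rate_hz


required_hz = (20.0, 100.0)  # control rates quoted for manipulation tasks
achieved_hz = (10.0, 20.0)   # inference rates quoted for a 7B VLA on edge hardware

required_period = tuple(period_ms(r) for r in required_hz)
achieved_period = tuple(period_ms(r) for r in achieved_hz)

# Worst case: the most demanding task paired with the slowest inference.
worst_shortfall = max(required_hz) / min(achieved_hz)
# Best case: the least demanding task paired with the fastest inference.
best_shortfall = min(required_hz) / max(achieved_hz)

print(required_period, achieved_period, worst_shortfall, best_shortfall)
```

On these figures the task allows 10 to 50 ms per decision while the model needs 50 to 100 ms, so the model keeps up only at the edge of both ranges (20 Hz inference against a 20 Hz task) and falls short by up to a factor of ten elsewhere.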
Three engineering approaches are dominating 2026 production deployments as partial solutions to this constraint:
- Action chunking: Instead of predicting one action per inference step, the policy emits a chunk of 8–50 future actions at once and the robot executes them open-loop while the next chunk computes. This dramatically reduces effective inference frequency requirements and improves motion smoothness.
- Asynchronous dual-system architecture: A heavyweight VLM backbone runs at 5–10 Hz for planning and re-planning, while a lightweight diffusion or flow-matching action expert produces motor commands at 50–100 Hz conditioned on the latest plan. NVIDIA GR00T and Figure AI’s Helix architecture formalise this pattern.
- Quantisation, distillation, and edge silicon: Model precision and size are reduced to fit the compute envelope of on-robot hardware, Jetson Orin-class devices in particular. On the decoding side, OpenVLA-OFT achieves a 26x speedup through parallel decoding and action chunking optimisation.
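Action chunking, the first of these patterns, can be illustrated with a small tick-based simulation. This is a sketch under stated assumptions, not any vendor's control loop: `stub_policy` is a hypothetical stand-in for a VLA forward pass, inference latency is modelled as a fixed number of control ticks, and the executor simply requests the next chunk once the queue is short enough to hide that latency.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def required_inference_hz(control_hz: float, chunk_size: int) -> float:
    """Inference rate needed if every chunk is executed in full."""
    return control_hz / chunk_size


def stub_policy(first_id: int, chunk_size: int) -> List[int]:
    """Hypothetical VLA call: returns a chunk of consecutive action ids."""
    return list(range(first_id, first_id + chunk_size))


def run(control_ticks: int, chunk_size: int,
        latency_ticks: int) -> Tuple[List[int], int, int]:
    """Simulate the loop; returns (executed actions, inference calls, stalled ticks)."""
    queue: Deque[int] = deque()
    pending: Optional[Tuple[int, List[int]]] = None  # (arrival tick, chunk)
    next_id = 0
    executed: List[int] = []
    calls = 0
    stalls = 0
    for tick in range(control_ticks):
        # A chunk requested earlier becomes available once its latency has elapsed.
        if pending is not None and pending[0] <= tick:
            queue.extend(pending[1])
            pending = None
        # Request the next chunk early enough to hide the inference latency.
        if pending is None and len(queue) <= latency_ticks:
            pending = (tick + latency_ticks, stub_policy(next_id, chunk_size))
            next_id += chunk_size
            calls += 1
        if queue:
            executed.append(queue.popleft())  # open-loop execution of one action
        else:
            stalls += 1  # nothing to execute: the arm would have to hold position
    return executed, calls, stalls


executed, calls, stalls = run(control_ticks=100, chunk_size=16, latency_ticks=5)
print(calls, stalls, required_inference_hz(50.0, 8), required_inference_hz(50.0, 50))
```

With these assumed numbers the loop stalls only while the first chunk is being computed, whereas `chunk_size=1` with the same latency leaves the arm idle on most ticks. At a 50 Hz control rate, chunks of 8 to 50 actions bring the required inference rate down to between 6.25 Hz and 1 Hz. The trade-off is that every action in a chunk is based on an observation that is already several ticks old when it executes.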
The sim-to-real gap is the parallel deployment challenge. The ICRA 2026 workshop on VLA pipelines was explicit in its diagnosis: “most pipelines lack rigorous data curation, principled training regimes, and standardized evaluation, leading to inconsistent performance and safety concerns on physical robots.” Models that achieve 95%+ success rates on LIBERO or SIMPLER benchmarks — the standard simulation evaluation platforms — frequently underperform in real factory conditions due to sensor noise, variable lighting, unexpected contact dynamics, and environmental variation not represented in the training distribution.
The Eva-VLA robustness evaluation framework tested leading VLA models — including OpenVLA, UniVLA, and π0.5 — under real-world physical variations and found “severe vulnerabilities across leading VLA models” when exposed to distribution shifts that are routine in production environments. Mid-task recovery remains the hardest unsolved problem: a VLA handles unexpected variation better than any prior approach, but when something goes wrong — a grasp that slips, an object that shifts mid-pick — getting back on track is still a weak point.
“Research papers can run inference on clusters and report results, but production deployments require running on the hardware that fits in — and can be powered by — the actual robot deployed.” — Oliver Hsu, a16z, January 2026
The Open vs. Closed Ecosystem Debate
The strategic question that will shape the VLA landscape over the next three years is not which architecture performs best on benchmarks — it is whether the model ecosystem remains open or closes into proprietary moats.
The closed-model argument is straightforward: training data is the primary competitive moat, not architecture. Humanoid companies are locking up trillions of teleop frames as proprietary datasets — Figure AI’s deployment at BMW generates continuous demonstration data that feeds back into model improvement. Physical Intelligence’s π0.5 is not open-source. The company that accumulates the most diverse, high-quality real-world demonstration data across the most robot embodiments will build a generalisation advantage that architecture innovation alone cannot overcome.
The open-model counter-argument is also strong: the Open X-Embodiment dataset, a collaboration between 21 institutions, provided the training foundation for OpenVLA, which outperformed Google’s 55B closed model. Hugging Face’s LeRobot project is building open datasets, open training pipelines, and open model weights specifically to prevent the robotics field from recapitulating the closed-model dynamics of commercial LLMs. NVIDIA’s decision to release GR00T N1.7 as open weights reflects a platform strategy: own the simulation environment (Isaac Sim) and the developer ecosystem while distributing the model.
The ICLR 2026 analysis by Moritz Reuss identifies a third concern: an emerging gap between frontier research (conducted at companies with access to large proprietary datasets and compute) and academic research (constrained by publicly available datasets and evaluation environments). If the most important VLA advances happen inside closed industrial labs and are not published, the academic community loses the ability to contribute meaningfully — and the field bifurcates between practitioners with access and researchers without it.
What VLA Models Mean for Industrial Robot Deployments in 2026
The question that most matters for an industrial operator reading this article is not about model architecture — it is about deployment readiness. Here is an honest assessment across three dimensions.
Task generalisation is real, but the scope is narrower than it sounds. A VLA-equipped robot can handle variation within its training distribution — different object positions, different instruction phrasings, some degree of lighting and environmental change. It cannot reliably handle categorically new tools, fundamentally different contact dynamics, or task types outside its training domain. The BMW/Figure deployment and the AgiBot/Longcheer deployment both succeeded by choosing tasks within the model’s demonstrated competence envelope. Task selection remains as important for VLA-equipped robots as for classically programmed ones.
Language-controlled robots are available today — at research and pilot scale. NVIDIA’s NemoClaw, part of the GR00T N1.7 release, allows developers to navigate robots using plain natural language commands — “move two metres forward and rotate 45 degrees” — translating instructions into executable Python scripts sent to Isaac Sim in real time. Caterpillar’s Cat AI Assistant, powered by NVIDIA Nemotron, brings natural language interaction directly into heavy equipment. The programming interface that previously required specialist engineers has been replaced by conversation — in select, well-defined deployment contexts.
The efficiency gap is the primary barrier to broad production deployment. A 7B VLA model running at 10–20 Hz on edge hardware is inadequate for dynamic manipulation requiring tight feedback loops. The three engineering patterns — action chunking, dual-system architecture, and edge-optimised models — are genuine partial solutions, not complete ones. The efficient VLA research subfield is accelerating, with SmolVLA and NanoVLA demonstrating that the parameter count required for capable VLA behaviour is falling quickly. The trajectory is toward on-robot inference at adequate control frequencies — but for most manipulation tasks, that threshold has not yet been crossed reliably in production environments.
- 164: VLA paper submissions at ICLR 2026, up from 1 in 2024 (Moritz Reuss analysis)
- 16.5 percentage points: OpenVLA’s margin over RT-2 (55B) in absolute task success rate, with 7x fewer parameters (Stanford, June 2024)
- 52x: NanoVLA inference speedup on Jetson-class hardware versus prior SOTA VLAs, with 98% fewer parameters
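One way to make the edge-hardware constraint concrete is to estimate how much memory the weights alone occupy at different precisions. The sketch below is back-of-envelope arithmetic: it counts weights only (no activations, caches, or runtime overhead), takes 1 GB as 1e9 bytes, and uses the parameter counts quoted in this article; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


models = {"7B VLA": 7e9, "SmolVLA (450M)": 450e6}

for name, n_params in models.items():
    for bits in (16, 8, 4):  # half precision, then two common quantisation widths
        print(f"{name} at {bits}-bit: {weight_memory_gb(n_params, bits):.2f} GB")
```

A 7B model needs roughly 14 GB for 16-bit weights and 3.5 GB at 4-bit, while a 450M model stays under 1 GB even at 16-bit. That is part of why quantisation and smaller architectures are being pursued together.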
Three Developments Worth Watching in H2 2026
The ICRA 2026 competition on VLA pipelines for mobile manipulation is the most practically significant evaluation event of the year. AIRoA’s data engine — releasing approximately 10,000 hours of real-world robot data alongside the competition — will provide the most rigorous public benchmark of cross-environment generalisation that the field has produced. The results will directly inform which architectures and training recipes are closing the sim-to-real gap most effectively.
The Physical AI market is projected by MarketsandMarkets at $1.5 billion in 2026 growing to $15.24 billion by 2032 at a 47.2% CAGR — the fastest projected CAGR of any AI infrastructure segment. As the investment flows into VLA training infrastructure and edge silicon designed for on-robot inference, the efficiency gap that currently limits production deployment will close. The question is pace, not direction.
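As a quick consistency check on the projection, the growth rate implied by the two endpoint figures can be computed directly. This only verifies that the quoted numbers agree with each other; it says nothing about whether the forecast is right. `implied_cagr` is a hypothetical helper.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


start_2026 = 1.5    # USD billions, as quoted
end_2032 = 15.24    # USD billions, as quoted
years = 2032 - 2026

cagr = implied_cagr(start_2026, end_2032, years)
projected_2032 = start_2026 * (1 + 0.472) ** years

print(f"implied CAGR: {cagr:.1%}, 2032 value at 47.2%: {projected_2032:.2f}")
```

The endpoints imply a rate that rounds to the quoted 47.2%, so the three figures are mutually consistent over a six-year horizon.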
The open-weights ecosystem — OpenVLA, GR00T N1.7, LeRobot, and the emerging SmolVLA/NanoVLA efficiency models — is developing faster than most observers predicted. Hugging Face’s approach of building open training pipelines, datasets, and evaluation frameworks specifically for VLAs is reducing the barrier to fine-tuning a capable VLA for a new task from months of specialised work to weeks for teams with moderate ML expertise. The developer base that will deploy VLAs in production at industrial scale is being built now — in open-source, not in closed labs.
The Bottom Line
VLA models are not the endpoint of robot intelligence — they are the architecture that makes the next ten years of progress possible. The shift from task-specific programming to generalised policy learning is structural, not cyclical, and the research acceleration demonstrated at ICLR 2026 confirms that the field is moving at a pace that most industrial observers have not yet fully registered.
For practitioners: the deployment gap between benchmark results and factory performance is real and should be engineered around explicitly. Action chunking, dual-system architectures, and careful task scoping are not workarounds — they are the current best practice for production VLA deployment. The models that work in 2026 work because the deployment architecture was designed to match the model’s actual capabilities, not the marketing narrative.
For strategists: the open/closed ecosystem dynamic will determine who captures value as VLA models mature. The companies accumulating the most diverse real-world demonstration data — across robot embodiments, environments, and tasks — are building the moat that architecture innovation alone cannot overcome. The data flywheel is the competitive advantage, and the companies running the most diverse real deployments are building it the fastest.
Key Sources
- Moritz Reuss — State of VLA Research at ICLR 2026
- AgiBot — ACoT-VLA: Action Chain-of-Thought (CVPR 2026)
- Roboflow Blog — Vision-Language-Action Models for Robotics (Apr 2026)
- Stanford / arXiv — OpenVLA: An Open-Source Vision-Language-Action Model
- NVIDIA Research — Isaac GR00T N1: Open Foundation Model for Humanoid Robots
- NVIDIA Blog — Physical AI Open Models for Robots and Autonomous Systems
- a16z — The Physical AI Deployment Gap (Jan 2026)
- Hyscaler — Vision-Language-Action Guide 2026
- Exxact Blog — VLA Models Powering Robotics
- Internet Pros — VLA Models 2026: Robotics Foundation Models
- arXiv — dVLA: Diffusion VLA with Multimodal Chain-of-Thought
- arXiv — Eva-VLA: Evaluating VLA Robustness Under Real-World Variations
- arXiv — Efficient VLA Models for Embodied AI (Survey)
- arXiv — NanoVLA: Lightweight VLA for Edge Devices
- Wikipedia — Vision-Language-Action Model (architecture history)
- ICRA 2026 Workshop — From Data to Decisions: VLA Pipelines for Real Robots
- AIRoA — ICRA 2026 VLA Competition Announcement (Feb 2026)
- AI Weekly — World Models News: Physical AI Market Projection (Apr 2026)
- ScienceDirect — Multimodal Fusion with VLA Models: Systematic Review (Dec 2025)






