Overview
While the dominant AI paradigm is organized around language and code, adjacent fields are entering a scaling regime of their own. Three domains — robot learning, autonomous science, and new interfaces — share a common substrate of technical primitives and are mutually reinforcing. This represents the next phase transition for AI: grounding intelligence in physical reality.
The thesis (a16z, Oliver Hsu): the areas with the greatest upside potential are those that benefit from the same scaling dynamics driving language AI, but sit one step removed — close enough to inherit infrastructure and research momentum, distant enough to require non-trivial additional work. This distance creates both a natural moat and a richer problem space.
Five Shared Primitives
1. Learned Representations of Physical Dynamics
The ability to learn compressed, general-purpose representations of how the physical world behaves — how objects move, deform, collide, and respond to force.
Three architectural paths are converging:
- Vision-Language-Action models (VLAs) — Extend pretrained vision-language models with action decoders. Amortize the cost of learning to see across internet-scale pretraining. Examples: Physical Intelligence's π₀, Google DeepMind's Gemini Robotics, NVIDIA's GR00T N1.
- World Action Models (WAMs) — Build on video diffusion transformers pretrained on internet-scale video, inheriting physical dynamics priors. NVIDIA's DreamZero achieves zero-shot generalization and cross-embodiment transfer.
- Native embodied foundation models — Generalist's GEN-1, trained from scratch on 500K+ hours of real-world physical interaction data from wearable devices. Not a fine-tuned VLM or WAM — a first-class foundation model for physical interaction.
Spatial intelligence (e.g., World Labs) fills a representation gap all three share: none explicitly model 3D scene structure.
2. Architectures for Embodied Action
Translating understanding into reliable physical action requires solving: intent-to-motor-command mapping, long-horizon coherence, real-time latency, and learning from experience.
- Dual-system hierarchy — Slow, powerful VLM for scene understanding (System 2) + fast visuomotor policy for real-time control (System 1). Standard pattern in GR00T N1, Gemini Robotics, Figure's Helix.
- Flow matching / diffusion action heads — Pioneered by π₀, displacing discrete tokenization. Treats action generation as denoising, producing physically smoother trajectories.
- RL post-training on VLAs — Physical Intelligence's RECAP method on π*₀.₆: trains a value function estimating success probability from any intermediate state, then conditions the VLA to select high-advantage actions. Results: folds laundry across 50 novel garment types, runs continuously for hours. Doubles throughput, halves failure rates vs. imitation-only.
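The flow-matching/diffusion idea above can be sketched in a few lines: an action chunk is generated by starting from Gaussian noise and integrating a velocity field toward the data distribution. A toy sketch only — the closed-form field below stands in for the neural network a real VLA would learn, and all names are illustrative, not π₀'s actual API:

```python
import numpy as np

def velocity_field(a_t, t, target):
    # In a real VLA this is a network conditioned on images/language;
    # here, a closed-form field pointing from the noisy action toward a
    # fixed "expert" action chunk (the flow-matching optimum for one sample).
    return (target - a_t) / (1.0 - t + 1e-8)

def sample_action_chunk(target, steps=50, dim=7, seed=0):
    """Generate an action chunk by Euler-integrating the velocity field."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(dim)   # start from pure Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + velocity_field(a, t, target) * dt  # one denoising step
    return a
```

The integration loop traces the straight-line interpolation path between noise and data that flow matching trains against; swapping in a learned, observation-conditioned velocity field gives the same loop shape used in practice.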
3. Simulation and Synthetic Data
In language, the internet solved the data problem. In the physical world, simulation solves it. The modern stack: physics-based engines + photorealistic rendering + procedural environment generation + world foundation models bridging sim-to-real. If the bottleneck shifts from collecting real data to designing virtual environments, the cost curve collapses — simulation scales with compute, not human labor.
4. Expanding the Sensory Manifold
AI's sensory access is expanding rapidly beyond vision and language:
- AR devices — Continuous first-person video of human-environment interaction
- Voice-first AI wearables — Higher-bandwidth context for language AI
- Silent speech interfaces (e.g., Wispr Flow) — Detect tongue/vocal cord movements without sound
- Brain-computer interfaces — Neuralink (implanted patients, iterating), Synchron (endovascular Stentrode), Echo Neurotechnologies (speech restoration), Nudge (new neural interface platform). BrainGate has decoded inner speech from motor cortex. The BISC chip packs 65,536 electrodes into a wireless implant.
- Tactile sensing — Entering embodied AI architectures as first-class input
- Digital olfaction — Wearable displays with millisecond response, smell models paired with visual AI
Each device category is also a data-generation platform that feeds models across domains.
5. Closed-Loop Agentic Systems
Orchestrating perception, reasoning, and action into sustained autonomous systems over long time horizons. Three requirements beyond digital agents: (1) embodiment in the experimental/operational loop, (2) long-horizon persistence with memory and safety monitoring, (3) closed-loop adaptation based on physical outcomes, not just textual feedback.
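The three requirements can be made concrete with a toy closed loop: persistent memory of every trial, proposals adapted from measured outcomes rather than textual feedback, and a shrinking search radius over many cycles. A minimal sketch, assuming a simulated noiseless "experiment" with a hidden optimum (all names and the objective are hypothetical):

```python
def run_experiment(x):
    # Stand-in for a physical experiment: yield of a process whose
    # hidden optimum is at x = 0.7 (purely illustrative; real runs are noisy).
    return -(x - 0.7) ** 2

def closed_loop(cycles=20):
    memory = []          # requirement 2: persistent record of (condition, outcome)
    x, step = 0.0, 0.4   # initial condition and search radius
    for _ in range(cycles):
        # requirement 1: embodiment — run conditions around the current best
        for cand in (x - step, x, x + step):
            memory.append((cand, run_experiment(cand)))
        # requirement 3: adapt from physical outcomes, not descriptions of them
        x, _ = max(memory, key=lambda m: m[1])
        step *= 0.7      # narrow the search as evidence accumulates
    return x
```

A real self-driving lab replaces the scalar probe with hypothesis generation and instrument control, but the hypothesis-experiment-analysis-revision cycle has this same closed-loop shape.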
Three Frontier Domains
Robotics
The most demanding consumer of all five primitives simultaneously. General-purpose manipulation (e.g., folding a towel) requires physics priors, continuous motor control, simulation training data, tactile feedback, and closed-loop error recovery.
Key insight: as learned policies become standard, value migrates from mechanical systems to models, training infrastructure, and data flywheels. But robotics also feeds back: every real-world trajectory is training data for better world models.
Central remaining challenge: reliability at scale. Even 95% per-step success compounds to only ~60% success on a 10-step task chain.
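The compounding arithmetic behind that claim is worth making explicit, since it also gives the per-step reliability a target chain success rate demands (function names are mine, not from the source):

```python
def chain_success(p_step, n_steps):
    # Probability every step in an n-step chain succeeds,
    # assuming independent per-step success probability.
    return p_step ** n_steps         # 0.95 ** 10 ≈ 0.599

def required_step_success(p_chain, n_steps):
    # Per-step success rate needed to hit a target chain success rate.
    return p_chain ** (1.0 / n_steps)  # 90% over 10 steps needs ≈ 98.95%/step
```

The inverse is the sobering part: a 90% success rate on a 10-step task requires roughly 99% reliability at every single step.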
Autonomous Science
Self-driving laboratories combine all five primitives most completely. They require physics/chemistry representations, embodied action (pipetting, positioning), simulation (pre-screening experiments), expanded sensing (spectroscopy, chromatography), and the most demanding agentic orchestration — multi-cycle hypothesis-experiment-analysis-revision workflows over hours or days.
Key differentiator as data engine: every experiment produces physically grounded, experimentally validated training signals — structured, causal, empirically verified data that physical reasoning models need most and can get from no other source.
Companies: Periodic Labs (materials science), Medra (life sciences).
New Interfaces
Extending AI into direct coupling with human perception and the body's own signals. Devices along the spectrum from AR glasses to implantable BCIs collectively form increasingly high-bandwidth channels between human physical experience and AI.
The installed base of AI wearables is becoming a distributed data-collection network for physical-world AI, instrumenting human physical experience at a scale previously impossible.
The Flywheel
These three domains are mutually enabling:
- Robotics → Science: Manipulation capabilities transfer directly to lab automation
- Science → Robotics: Scientific data provides structured training data; materials discovery improves robot hardware
- Interfaces → Robotics: AR/EMG/BCI data about human motor intent trains better robot learning systems
- Robotics → Interfaces: Better embodied AI enables more natural human-robot collaboration
- Science → Interfaces: Novel sensors and materials enable better interface devices
- Interfaces → Science: New sensing modalities enable different scientist-machine interactions
Connections
- World Models — Spatial intelligence and world models as a shared primitive
- Y Combinator AI Thesis — YC's 2026 RFS calls out spatial reasoning as a major opportunity
Sources
- "Frontier Systems for the Physical World" — Oliver Hsu, a16z (link)