Overview
While the dominant AI paradigm is organized around language and code, adjacent fields are entering a scaling regime of their own. Three domains — robot learning, autonomous science, and new interfaces — share a common substrate of technical primitives and are mutually reinforcing. This represents the next phase transition for AI: grounding intelligence in physical reality.
The thesis (a16z, Oliver Hsu): the areas with the greatest upside potential are those that benefit from the same scaling dynamics driving language AI, but sit one step removed — close enough to inherit infrastructure and research momentum, distant enough to require non-trivial additional work. This distance creates both a natural moat and a richer problem space.
Five Shared Primitives
1. Learned Representations of Physical Dynamics
The ability to learn compressed, general-purpose representations of how the physical world behaves — how objects move, deform, collide, and respond to force.
Three architectural paths are converging:
- Vision-Language-Action models (VLAs) — Extend pretrained vision-language models with action decoders. Amortize the cost of learning to see across internet-scale pretraining. Examples: Physical Intelligence's π₀, Google DeepMind's Gemini Robotics, NVIDIA's GR00T N1.
- World Action Models (WAMs) — Build on video diffusion transformers pretrained on internet-scale video, inheriting physical dynamics priors. NVIDIA's DreamZero achieves zero-shot generalization and cross-embodiment transfer.
- Native embodied foundation models — Generalist's GEN-1, trained from scratch on 500K+ hours of real-world physical interaction data from wearable devices. Not a fine-tuned VLM or WAM — a first-class foundation model for physical interaction.
Spatial intelligence (e.g., World Labs) fills a representation gap all three share: none explicitly model 3D scene structure.
2. Architectures for Embodied Action
Translating understanding into reliable physical action requires solving: intent-to-motor-command mapping, long-horizon coherence, real-time latency, and learning from experience.
- Dual-system hierarchy — Slow, powerful VLM for scene understanding (System 2) + fast visuomotor policy for real-time control (System 1). Standard pattern in GR00T N1, Gemini Robotics, Figure's Helix.
- Flow matching / diffusion action heads — Pioneered by π₀, displacing discrete tokenization. Treats action generation as denoising, producing physically smoother trajectories.
- RL post-training on VLAs — Physical Intelligence's RECAP method on π*₀.₆: trains a value function estimating success probability from any intermediate state, then conditions the VLA to select high-advantage actions. Results: folds laundry across 50 novel garment types, runs continuously for hours. Doubles throughput, halves failure rates vs. imitation-only.
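The flow-matching/diffusion idea above can be sketched in a few lines: an action chunk is generated by starting from Gaussian noise and integrating a velocity field toward the data distribution. A toy sketch only — the closed-form field below stands in for the neural network a real VLA would learn, and all names are illustrative, not π₀'s actual API:

```python
import numpy as np

def velocity_field(a_t, t, target):
    # In a real VLA this is a network conditioned on images/language;
    # here, a closed-form field pointing from the noisy action toward a
    # fixed "expert" action chunk (the flow-matching optimum for one sample).
    return (target - a_t) / (1.0 - t + 1e-8)

def sample_action_chunk(target, steps=50, dim=7, seed=0):
    """Generate an action chunk by Euler-integrating the velocity field."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(dim)   # start from pure Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + velocity_field(a, t, target) * dt  # one denoising step
    return a
```

The integration loop traces the straight-line interpolation path between noise and data that flow matching trains against; swapping in a learned, observation-conditioned velocity field gives the same loop shape used in practice.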
3. Simulation and Synthetic Data
In language, the internet solved the data problem. In the physical world, simulation solves it. The modern stack: physics-based engines + photorealistic rendering + procedural environment generation + world foundation models bridging sim-to-real. If the bottleneck shifts from collecting real data to designing virtual environments, the cost curve collapses — simulation scales with compute, not human labor.
4. Expanding the Sensory Manifold
AI's sensory access is expanding rapidly beyond vision and language:
- AR devices — Continuous first-person video of human-environment interaction
- Voice-first AI wearables — Higher-bandwidth context for language AI
- Silent speech interfaces (e.g., Wispr Flow) — Detect tongue/vocal cord movements without sound
- Brain-computer interfaces — Neuralink (implanted patients, iterating), Synchron (endovascular Stentrode), Echo Neurotechnologies (speech restoration), Nudge (new neural interface platform). BrainGate has decoded inner speech from motor cortex. The BISC chip packs 65,536 electrodes into a wireless implant.
- Tactile sensing — Entering embodied AI architectures as first-class input
- Digital olfaction — Wearable displays with millisecond response, smell models paired with visual AI
Each device category is also a data-generation platform that feeds models across domains.
5. Closed-Loop Agentic Systems
Orchestrating perception, reasoning, and action into sustained autonomous systems over long time horizons. Three requirements beyond digital agents: (1) embodiment in the experimental/operational loop, (2) long-horizon persistence with memory and safety monitoring, (3) closed-loop adaptation based on physical outcomes, not just textual feedback.
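The three requirements can be made concrete with a toy closed loop: persistent memory of every trial, proposals adapted from measured outcomes rather than textual feedback, and a shrinking search radius over many cycles. A minimal sketch, assuming a simulated noiseless "experiment" with a hidden optimum (all names and the objective are hypothetical):

```python
def run_experiment(x):
    # Stand-in for a physical experiment: yield of a process whose
    # hidden optimum is at x = 0.7 (purely illustrative; real runs are noisy).
    return -(x - 0.7) ** 2

def closed_loop(cycles=20):
    memory = []          # requirement 2: persistent record of (condition, outcome)
    x, step = 0.0, 0.4   # initial condition and search radius
    for _ in range(cycles):
        # requirement 1: embodiment — run conditions around the current best
        for cand in (x - step, x, x + step):
            memory.append((cand, run_experiment(cand)))
        # requirement 3: adapt from physical outcomes, not descriptions of them
        x, _ = max(memory, key=lambda m: m[1])
        step *= 0.7      # narrow the search as evidence accumulates
    return x
```

A real self-driving lab replaces the scalar probe with hypothesis generation and instrument control, but the hypothesis-experiment-analysis-revision cycle has this same closed-loop shape.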
Three Frontier Domains
Robotics
The most demanding consumer of all five primitives simultaneously. General-purpose manipulation (e.g., folding a towel) requires physics priors, continuous motor control, simulation training data, tactile feedback, and closed-loop error recovery.
Key insight: as learned policies become standard, value migrates from mechanical systems to models, training infrastructure, and data flywheels. But robotics also feeds back: every real-world trajectory is training data for better world models.
Central remaining challenge: reliability at scale. Even 95% per-step success compounds to only ~60% success on a 10-step task chain.
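The compounding arithmetic behind that claim is worth making explicit, since it also gives the per-step reliability a target chain success rate demands (function names are mine, not from the source):

```python
def chain_success(p_step, n_steps):
    # Probability every step in an n-step chain succeeds,
    # assuming independent per-step success probability.
    return p_step ** n_steps         # 0.95 ** 10 ≈ 0.599

def required_step_success(p_chain, n_steps):
    # Per-step success rate needed to hit a target chain success rate.
    return p_chain ** (1.0 / n_steps)  # 90% over 10 steps needs ≈ 98.95%/step
```

The inverse is the sobering part: a 90% success rate on a 10-step task requires roughly 99% reliability at every single step.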
Autonomous Science
Self-driving laboratories combine all five primitives most completely. They require physics/chemistry representations, embodied action (pipetting, positioning), simulation (pre-screening experiments), expanded sensing (spectroscopy, chromatography), and the most demanding agentic orchestration — multi-cycle hypothesis-experiment-analysis-revision workflows over hours or days.
Key differentiator as data engine: every experiment produces physically grounded, experimentally validated training signals — structured, causal, empirically verified data that physical reasoning models need most and can get from no other source.
Companies: Periodic Labs (materials science), Medra (life sciences).
New Interfaces
Extending AI into direct coupling with human perception and the body's own signals. Devices along the spectrum from AR glasses to implantable BCIs collectively form increasingly high-bandwidth channels between human physical experience and AI.
The installed base of AI wearables is becoming a distributed data-collection network for physical-world AI, instrumenting human physical experience at a scale previously impossible.
The Flywheel
These three domains are mutually enabling:
- Robotics → Science: Manipulation capabilities transfer directly to lab automation
- Science → Robotics: Scientific data provides structured training data; materials discovery improves robot hardware
- Interfaces → Robotics: AR/EMG/BCI data about human motor intent trains better robot learning systems
- Robotics → Interfaces: Better embodied AI enables more natural human-robot collaboration
- Science → Interfaces: Novel sensors and materials enable better interface devices
- Interfaces → Science: New sensing modalities enable different scientist-machine interactions
Connections
- World Models — Spatial intelligence and world models as a shared primitive
- Y Combinator AI Thesis — YC's 2026 RFS calls out spatial reasoning as a major opportunity
Sources
- "Frontier Systems for the Physical World" — Oliver Hsu, a16z (link)