Frontier Models — Cold Mountain Wiki

Overview

Frontier AI model development is no longer dominated by a single architectural approach. As of early 2026, major labs are experimenting with inference-time scaling — using multiple agents at generation time rather than simply training bigger models — and domain-specific specialization.

Meta's Pivot: Open to Closed (Muse Spark)

Meta released Muse Spark in April 2026, its first AI model in over a year and the first from its nine-month-old Superintelligence Labs. It represents a significant strategic shift: after years of championing open-weights models (Llama), Meta released a proprietary closed model.

Background: Meta reorganized its AI labs after critics alleged Llama 4 training data was contaminated with benchmark answers. In June 2025, Meta spent $14.3B for a 49% stake in Scale AI, brought in cofounder Alexandr Wang as chief AI officer, and launched a major hiring spree. The proprietary release has raised concerns among developers who built projects on Llama.

Architecture highlights:

Natively multimodal: text, image, speech input (up to 262K tokens); text output
Three reasoning modes: instant, thinking, contemplating
Multi-agent contemplating mode: launches multiple agents that propose solutions, refine them, and aggregate results in parallel
Post-training used RL penalizing excessive reasoning tokens ("thought compression"): model first improved by reasoning longer, then learned to compress, then extended again
Reworked pretraining approach, model architecture, optimization, and data curation — claims to match Llama 4 Maverick's capabilities with >10x less training compute
Health specialization: >1,000 physicians helped curate training data for health responses
Parameter count, architecture, and training details undisclosed

Availability: Free via meta.ai and Meta AI app; API preview for selected partners; coming to WhatsApp, Instagram, Facebook, Messenger, Ray-Ban glasses.

Benchmark Snapshot (Artificial Analysis, April 2026)

Model	Intelligence Index	Coding Index	Tokens used (index)
GPT-5.4 (xhigh reasoning)	57	57	~116M
Gemini 3.1 Pro Preview (high reasoning)	57	56	—
Claude Opus 4.6 (max reasoning)	53	—	~158M
Muse Spark (reasoning)	52	47	~59M
Claude Sonnet 4.6 (max reasoning)	—	51	—

Muse Spark's notable strengths: multimodal (CharXiv Reasoning: 86.4%, leading GPT-5.4's 82.8%) and health (HealthBench Hard: 42.8%, ahead of GPT-5.4's 40.1%). Notable weakness: coding. Strong token efficiency relative to peers.

On Humanity's Last Exam: 39.9% (thinking mode) vs. 58% (contemplating mode) — a large gap showing the value of its multi-agent inference approach.

Inference-Time Multi-Agent Scaling

An emerging pattern across labs: scaling performance at inference time by orchestrating multiple agents rather than training ever-larger single models.

Muse Spark's contemplating mode — parallel agents propose and refine solutions, aggregate results, achieve better performance at comparable latency
Kimi K2.5 (Moonshot AI) Agent Swarm — similar multi-agent inference pattern

The implication: inference-time compute is becoming a tunable lever alongside model scale and training compute. This changes the economics of frontier AI — labs can sell higher-capability tiers (more inference-time agents) as premium offerings.

Open vs. Closed Weights Tension

Meta's pivot away from open weights is described as "a significant loss for the developer community" by observers who have built on Llama. The tradeoffs:

Open weights: Enables local deployment, fine-tuning, privacy-preserving use cases, and community innovation. US developer ecosystem built significant infrastructure on Llama.
Closed API: Enables monetization, quality control, safety oversight, and competitive positioning for enterprise customers.

Meta's stated goal is to compete for business customers alongside OpenAI, Google, and Anthropic. Health specialization and multimodal capabilities align with Meta's product interests (billions of camera-equipped users, health as a top AI query category).

Sources