Self Play Labs is built around one claim: the next order of magnitude in AI capability will come from systems that learn from their own search, not from human demonstration.

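The evidence is on the record. AlphaGo and AlphaZero pushed past human-level play by playing against themselves, and AlphaZero did it without learning from human games at all. AlphaFold, which was trained on experimentally determined structures rather than through self-play, still predicted protein structures that no expert had solved. These systems went beyond imitating experts. They found strategies and solutions no expert had considered.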

This principle generalizes far beyond games and protein structure. Self-play is how intelligence escapes the ceiling of its training data — and we think it's now possible to make it work in the messy, open-ended domains where intelligence actually matters.

The bottleneck

Scaling language models has produced enormous gains, but it carries a hard constraint: a model trained purely by imitation is bounded by what it imitates. No matter how much data or compute you throw at it, the system converges to the distribution it was trained to approximate.

The question isn't whether to scale compute — it's where. The next order of magnitude in capability comes from systems that generate their own training signal at inference time: searching, verifying their own reasoning, and extracting structure from the problems they fail to solve.

What we mean by self-play

We use "self-play" to mean any setting where a system improves by generating its own experience — not just two-player games. Three patterns matter most:

Verification-driven search. Systems that propose solutions and learn to evaluate their own reasoning steps, using that evaluation to guide deeper search. The key insight from process reward models and step-level verification is that knowing where reasoning goes wrong is far more useful than knowing whether the final answer is right.
Recursive self-improvement. Systems that use their current capabilities to generate harder training problems, better evaluation criteria, or more efficient search strategies — bootstrapping beyond their initial training distribution.
Open-ended exploration. Systems that don't just solve fixed benchmarks but discover new problems worth solving, transfer skills across domains, and expand their own capability frontier without predefined objectives.
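
To make the first pattern concrete, here is a minimal Python sketch of verification-driven search. It is a toy under stated assumptions, not a description of any real system: the task (reach a target integer using "+3" and "*2" steps from a positive start), the names `OPS`, `step_score`, `verified_beam_search`, and `apply_path`, and the hand-written scoring rule standing in for a learned step-level verifier are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy task (assumption for illustration): reach a target integer from a
# positive start value. Each "reasoning step" applies one operation.
OPS: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def step_score(value: int, target: int) -> float:
    """Hypothetical stand-in for a learned step-level verifier.

    It scores an intermediate state, not just a final answer. Both
    operations only increase a positive value, so overshooting the target
    is an unrecoverable dead end and scores -inf.
    """
    if value > target:
        return float("-inf")
    return -float(target - value)


def apply_path(start: int, path: List[str]) -> int:
    """Replay a list of operation names from the start value."""
    value = start
    for name in path:
        value = OPS[name](value)
    return value


def verified_beam_search(
    start: int, target: int, beam_width: int = 3, max_steps: int = 10
) -> Optional[List[str]]:
    """Beam search in which the step-level score decides what to keep."""
    if start == target:
        return []
    beam: List[Tuple[int, List[str]]] = [(start, [])]
    for _ in range(max_steps):
        candidates: List[Tuple[float, int, List[str]]] = []
        for value, path in beam:
            for name, op in OPS.items():
                nxt = op(value)
                score = step_score(nxt, target)
                if score > float("-inf"):  # drop steps the verifier rejects
                    candidates.append((score, nxt, path + [name]))
        if not candidates:
            return None  # every partial solution hit a dead end
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [(v, p) for _, v, p in candidates[:beam_width]]
        for value, path in beam:
            if value == target:
                return path
    return None


if __name__ == "__main__":
    print(verified_beam_search(1, 20, beam_width=3))
    print(verified_beam_search(1, 20, beam_width=1))
```

In this toy, a beam of width 1 follows the best-looking step each time and dead-ends at 19, while a beam of width 3 keeps enough alternatives alive to reach 20. The point of the sketch is only that step-level scores tell the search which partial solutions to drop early. A final-answer check alone could not do that.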

Four research directions

Our work is organized around four research directions:

Self-experience. A system that only learns from human-generated data is bounded by human performance. We study how systems generate their own learning signal — through search, simulation, and self-evaluation — and how to make that signal dense and reliable enough to push capability past the training distribution.

Superthinking. Inference-time compute is wasted unless the system has learned to direct it. A model that scales depth without judgment burns tokens; a model that has learned to evaluate its own intermediate steps, prune dead-end paths, and allocate depth where the return is highest converts compute into capability. This is a learned skill, not a fixed algorithm — the system has to acquire an internal value function over its own reasoning process.
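
As a rough illustration of directing a fixed compute budget with a value estimate, the sketch below runs best-first search under a cap on node expansions and prunes states whose estimated value falls below a threshold. Everything in it is an assumption made for illustration: the toy task, the names `budgeted_best_first`, `guided_value`, and `uninformed_value`, and above all the hand-written value function, which stands in for the learned one the paragraph above describes.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, Optional, Tuple

TARGET = 20  # toy goal (assumption): reach 20 from 1 using +3 and *2


def expand(x: int) -> Iterable[int]:
    """Successor states: one more reasoning step from x."""
    return [x + 3, x * 2]


def guided_value(x: int) -> float:
    """Hypothetical stand-in for a learned value function."""
    return float("-inf") if x > TARGET else -float(TARGET - x)


def uninformed_value(x: int) -> float:
    """Baseline: rejects overshoots but ranks all other states equally."""
    return float("-inf") if x > TARGET else 0.0


def budgeted_best_first(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    expand_fn: Callable[[Hashable], Iterable[Hashable]],
    value: Callable[[Hashable], float],
    budget: int,
    prune_below: float = float("-inf"),
) -> Tuple[Optional[Hashable], int]:
    """Expand the highest-value state first; stop after `budget` expansions.

    Returns (goal_state_or_None, expansions_used).
    """
    tie = count()  # insertion order breaks ties between equal values
    frontier = [(-value(start), next(tie), start)]
    seen = {start}
    expansions = 0
    while frontier and expansions < budget:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, expansions
        expansions += 1
        for child in expand_fn(state):
            if child in seen:
                continue
            seen.add(child)
            v = value(child)
            if v < prune_below:
                continue  # dead-end path: spend no further compute on it
            heapq.heappush(frontier, (-v, next(tie), child))
    return None, expansions


if __name__ == "__main__":
    goal = lambda x: x == TARGET
    print(budgeted_best_first(1, goal, expand, guided_value, 50, -100.0))
    print(budgeted_best_first(1, goal, expand, uninformed_value, 50, -100.0))
```

With the same budget and the same pruning rule, the guided run reaches the goal in fewer expansions than the uninformed run, which degenerates into breadth-first search. The gap is small in a toy this size. What the sketch shows is only where a value estimate enters the search loop.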

Superoptimization. Most optimization stays close to where it started. We study how systems discover entirely new basins of performance — solutions no gradient-descent trajectory from the human prior would reach. This needs qualitatively different exploration, not marginally better optimization: structured search that escapes local optima and finds regions of the loss landscape no one has visited.

Discovery. The real test of a self-play system is whether it finds things nobody asked it to find. We study open-ended generalization: systems that transfer learned skills to novel domains, generate their own curricula, and expand the frontier of what they can do without explicit human specification.

Why now

Three things have changed in the last two years that make this research tractable:

First, process-level verification works. The literature on process reward models (PRMs) and step-level rewards (2023–2025) has shown that models can learn to evaluate individual reasoning steps with enough fidelity to guide search. This is the foundation for self-play in reasoning: without reliable self-evaluation, the system has no training signal.

Second, inference-time scaling laws are real. Spending more compute at test time — through search, chain-of-thought, and iterative refinement — produces consistent capability gains. But the returns depend entirely on how that compute is allocated. The system needs a learned policy for directing its own thinking, not just more tokens.
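
One idealized way to see why allocation matters is sketched below. It assumes that each sample on a problem succeeds independently with a fixed probability p and that a perfect verifier recognizes any correct sample, so n samples solve the problem with probability 1 - (1 - p)^n. Both assumptions, and the names `solve_prob`, `expected_solved`, and `greedy_allocate`, are illustrative and do not come from any measured system.

```python
from typing import List


def solve_prob(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def expected_solved(probs: List[float], alloc: List[int]) -> float:
    """Expected number of problems solved under a given sample allocation."""
    return sum(solve_prob(p, n) for p, n in zip(probs, alloc))


def greedy_allocate(probs: List[float], budget: int) -> List[int]:
    """Hand out samples one at a time to the problem with the largest
    marginal gain. Each extra sample has diminishing returns in this
    model, which is why a greedy rule is a reasonable choice here."""
    alloc = [0] * len(probs)
    for _ in range(budget):
        gains = [
            solve_prob(p, n + 1) - solve_prob(p, n)
            for p, n in zip(probs, alloc)
        ]
        best = max(range(len(probs)), key=gains.__getitem__)
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    probs = [0.9, 0.1]  # one easy problem, one hard problem (made-up values)
    budget = 10
    uniform = [budget // 2, budget // 2]
    greedy = greedy_allocate(probs, budget)
    print("uniform:", uniform, expected_solved(probs, uniform))
    print("greedy: ", greedy, expected_solved(probs, greedy))
```

With these made-up numbers, the greedy rule sends most of the ten samples to the hard problem and solves more problems in expectation than an even split, using exactly the same total compute.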

Third, open-ended methods are maturing. Work on intrinsic motivation, curiosity-driven exploration, and unsupervised environment design has moved from toy domains to settings complex enough to matter. The question is no longer whether open-ended learning is possible, but whether it can be made efficient and directed enough to produce discoveries in hard domains.

What we're building

Self Play Labs is a research lab first. But the infrastructure required to do this research at scale doesn't exist as a usable system anywhere — and that's the binding constraint on the field.

We're building it: environments that generate training signal through search, frameworks for process-level verification, and tooling for open-ended exploration in structured domains. Our focus is mathematics, code synthesis, scientific reasoning, and optimization problems where the search landscape has learnable structure.

We plan to release this infrastructure as a platform. The bottleneck in self-play research today isn't ideas — it's the engineering required to run these systems at scale. We're building that so researchers can spend their time on the science, not the plumbing.

If this is the kind of research you want to follow — or contribute to — reach out at hello@selfplay.computer. We publish our research notes here and we're building in the open.

