Beyond Imitation

Introducing Axion: blazingly fast self-play co-training for generators and verifiers, made to scale continual, superhuman RL systems.

We are advancing frontier RL research to build continual, self-improving superhuman systems capable of solving problems beyond human level through autonomous improvement and discovery. At the same time, we’re creating accessible infrastructure so that researchers and developers can leverage these tools to pursue their own ambitious goals.

Self Play Lab is built around a direct research thesis: imitation gives models a prior, but not a mechanism for sustained improvement beyond the data distribution. The next scaling regime needs systems that manufacture training signal by acting inside environments with measurable outcomes. In that regime, every attempt is not just an answer; it is a rollout trace, a verification event, and a possible update to the next policy.

The strongest environment-driven systems share this structure. A model proposes. An environment applies pressure. A verifier turns the result into signal. The training loop converts that signal into behavior that survives the next evaluation. Axion is built for making that loop programmable, repeatable, and large enough to matter.

Figure 01

The Next Scaling Regime

Illustrative, not empirical: the core claim is qualitative. Once environments produce reliable feedback, additional compute can be spent on rollouts, verification, and post-training instead of imitation alone.

Why the Timing Changed

The timing changed because base models are no longer blank policies. They can write code, operate tools, decompose tasks, search locally, and generate plausible candidate solutions across many domains. That turns self-play from cold-start exploration into guided search: the model begins with a useful prior, then the environment supplies the pressure needed to move past it.

The infrastructure picture has also changed. Structured generation, parallel sampling, sandboxed execution, evaluator pipelines, and test-time search are becoming standard ingredients rather than one-off research systems. The bottleneck is shifting from “can the model produce anything useful?” to “can we run enough verified attempts to turn experience into a training distribution?”

That is where the current stack is still weak. Serving, rollout orchestration, verification, reward construction, dataset assembly, and RL updates usually live in separate systems with brittle glue between them. Long-running self-play makes this worse: generators drift, verifiers need to sharpen, environments expose edge cases, and the training loop has to preserve the signal instead of flattening it into logs.

The Self-Play RL Stack

The minimal stack has five moving parts.

First, capable agents generate candidate actions from a strong prior. Second, verifiable environments evaluate those attempts against tests, constraints, tools, proofs, simulators, or other objective interfaces. Third, rollout infrastructure runs large batches of attempts and keeps the full trace: prompt, action, state, failure mode, verifier output, and reward. Fourth, generators and verifiers co-train so the search distribution and evaluation pressure evolve together. Fifth, post-training and RL convert the resulting traces into durable capability.

When the loop is healthy, the system does not wait for a fixed dataset to define the frontier. It generates attempts, finds counterexamples, hardens verifiers, expands the curriculum, and turns the next batch of failures into training data. Self-play is useful here because it makes the curriculum endogenous: the system can create the problems it is almost, but not yet, able to solve.

This is why verifiable rewards matter. Outcome-based supervision is only as good as the environment that produces it. If the verifier is shallow, the agent learns shortcuts. If rollout throughput is low, the curriculum starves. If traces are poorly structured, post-training loses the information that made the attempt valuable.

Figure 02

Capability Increase from Specialized Self-Play RL

Hypothetical trajectory: specialized agents do not improve automatically. The claim is that reliable verifiers, high-throughput rollouts, and post-training loops make each iteration more likely to produce useful capability gains.

Axion

Self Play Lab is building Axion to make this research loop operational. The goal is not another dashboard around model calls. The goal is infrastructure for running self-play RL experiments where agents, environments, generators, verifiers, rollout workers, and post-training jobs are part of one system.

Axion is our first product in this direction, with the beta opening on June 20, 2026. It connects capable agents, co-trained generators and verifiers, high-throughput rollout infrastructure, and post-training loops for verifiable environments. Researchers should be able to spend their time designing objectives, environments, and evaluation pressure, not rebuilding orchestration machinery for every experiment.

The first-order bet is that verifiable environments are the next high-leverage interface for RL research. When an environment can reliably say what happened, teams can run more trials, compare interventions cleanly, mine failure modes, and train against signal that is stronger than preference alone.

Pretraining remains the prior. Axion is for what comes after: the loop that turns attempts into evidence, evidence into updates, and updates into agents that can keep improving inside the domains they are trained to master.

← Back to research hello@selfplay.computer →

Introducing Self Play Lab

Beyond Imitation

Why the Timing Changed

The Self-Play RL Stack

Axion