Inaugural post
This post lays out the North Star of Curation Labs Research. It is not a manifesto. It is a refined statement of the view that has organized our work for the past year — the view that has us building what we are building.
The bet, in one breath. Model capability is becoming a commodity. The supply of signals models can actually learn from is not. The decisive frontier for the next decade is the learnable-signal frontier — and the decisive capability is the discipline of extracting, processing, and compressing signals out of raw interaction. Compute is downstream of signal.
1. The bottleneck is not capability. It is the supply of signals models can learn from.
The conversation about AI in 2026 has narrowed onto one question: will systems learn to improve themselves? We think the question is being asked at the wrong layer.
Sit with the strangest result the field produced this year. In June 2026, Anthropic published internal data on how AI is changing the work of building AI [1: Anthropic, “Recursive Self-Improvement”, June 2026]. Two numbers from that post matter more than the headline. The same Claude that, in twelve months, went from a 3× to a 52× speedup on the task of making training code run faster, lapping skilled human researchers, still loses badly when asked to make the judgment calls research depends on. Where humans had previously chosen badly, the model beat the human 64% of the time. Where humans had chosen well, the model won only ~20%.
Superhuman at one task. Sub-graduate at the other.
Anthropic’s own framing calls the missing ingredient “research taste.” We do not think taste is the right name. The shape in the data is exactly what you would predict if you read it through the lens of the field that produced most of the recent leaps in AI reasoning. And the principle that explains it decides which of the three futures the same Anthropic post sketched out we actually get.
The principle, in one sentence: AI capabilities arrive in the order learnable signals become available for them. How hard the underlying problem looks has very little to do with which problems fall first.
Everything that follows is what that principle implies — for which work compounds, where the moat sits, and what Curation Labs Research is building toward.
2. The loop, and the two dimensions that decide whether it learns
Reinforcement learning is the engine behind nearly every recent leap in machine reasoning, and its mechanism fits on a napkin. An agent takes an action in an environment. The environment returns a new state and a reward signal. The agent updates and the loop runs again. Everything about how fast it learns flows from the quality of that signal.
flowchart LR
    A((Agent)) -- action --> E[Environment]
    E -- state --> A
    E -- reward --> A
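The loop above can be sketched in a few lines of Python. This is a toy illustration, not anyone's production setup: `CoinEnv`, `AveragingAgent`, and `run_loop` are hypothetical names, and the payoff probabilities, exploration rate, and step count are assumptions chosen only to make the loop visible.

```python
import random


class CoinEnv:
    """Hypothetical two-action environment: action 1 pays off more often."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.t = 0

    def step(self, action):
        # The reward is a direct consequence of the action just taken.
        p = 0.8 if action == 1 else 0.2  # assumed payoff probabilities
        reward = 1.0 if self.rng.random() < p else 0.0
        self.t += 1
        return self.t, reward  # (new state, reward)


class AveragingAgent:
    """Tracks a running average reward per action and mostly picks the best."""

    def __init__(self, n_actions=2, epsilon=0.1, seed=1):
        self.rng = random.Random(seed)
        self.epsilon = epsilon  # assumed exploration rate
        self.counts = [0] * n_actions
        self.values = [0.0] * n_actions

    def act(self, state):
        # Explore occasionally, otherwise exploit the current best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental mean: the estimate moves toward each new reward.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def run_loop(steps=2000):
    env, agent = CoinEnv(), AveragingAgent()
    state = 0
    for _ in range(steps):
        action = agent.act(state)          # agent acts
        state, reward = env.step(action)   # environment returns state and reward
        agent.update(action, reward)       # agent updates, loop repeats
    return agent.values


values = run_loop()
```

After enough steps the agent's estimate for action 1 ends up above its estimate for action 0, purely because the reward it receives is tied to what each action did.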
Reward signals vary along two dimensions that decide whether the loop can learn from them at all.
Reality Groundedness. Does the signal measure a consequence of this action in the world, or does it measure something else? A stopwatch reading on a code change is grounded: it measures what that change did. A proof-checker’s verdict is grounded: it measures whether the proof actually closed. A 5-star user rating is not grounded: it measures what someone thought, not what happened. “Did this research direction work out” is barely grounded: by the time an answer arrives, the result is smeared across hundreds of intervening decisions and contaminated by luck.
Delivery Latency. How quickly does the measurement arrive after the action? A stopwatch reading lands in seconds. A test-suite result lands in minutes. A research outcome arrives in weeks. A signal that lands slowly is harder to learn from even when grounded, because the learner is acting on stale feedback.
A signal that is both grounded and quickly delivered is learnable. A signal that fails either test is sparse or poisoned, and no amount of compute turns it into a teacher.
The 2×2 of learnable signals. A signal is learnable when it is grounded in the world AND quickly delivered. The top-right is where models actually improve. The bottom-left is where they plateau, no matter how much compute is thrown at them.
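One way to make the 2×2 concrete is a small classifier over the two dimensions. This is a sketch under assumptions: `Signal`, `quadrant`, and the one-hour threshold are illustrative choices, groundedness is flattened to a yes/no flag even though the text treats it as a matter of degree, and the label for the two mixed cells is a placeholder because the text only names the top-right and bottom-left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    grounded: bool          # does it measure a consequence of this action?
    latency_seconds: float  # time between the action and the measurement


# Assumption: one hour separates "quickly delivered" from "slow".
FAST_THRESHOLD_SECONDS = 3600.0


def quadrant(signal, threshold=FAST_THRESHOLD_SECONDS):
    """Place a signal in the 2x2 of groundedness and delivery latency."""
    fast = signal.latency_seconds <= threshold
    if signal.grounded and fast:
        return "top-right (learnable)"
    if not signal.grounded and not fast:
        return "bottom-left"
    return "off-diagonal (fails one dimension)"


examples = [
    Signal("stopwatch on a code change", True, 5.0),
    Signal("test suite", True, 300.0),
    Signal("did this research direction work out", False, 21 * 24 * 3600.0),
]
labels = {s.name: quadrant(s) for s in examples}
```

The stopwatch and the test suite land in the top-right; the research-direction outcome, weeks late and treated here as ungrounded, lands in the bottom-left.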
Now the Anthropic puzzle is no longer a puzzle. “Make this code faster” sits in the top-right of this 2×2: every attempt produces a measurement that is grounded (the stopwatch does not argue) and quickly delivered (you read the result in minutes). A model can run thousands of attempts a day. Of course it became superhuman. “What should we try next” sits in the bottom-left: the answer arrives weeks later (high latency), smeared across hundreds of decisions and contaminated by luck (poor groundedness), and reasonable experts disagree on what the answer even was once it lands. Same building, same model, and exactly opposite signals.
The model is not bad at research judgment because judgment is magic. It is bad at it because nobody has converted research judgment into a top-right signal.
3. Why the whole recent history of AI fits this shape
Once you see this, the past three years stop looking like a sequence of breakthroughs and start looking like a sequence of signal conversions.
Every recent leap is a domain getting pulled from the bottom-left into the top-right. Board games via self-play. Formal math via Lean as oracle. Coding via test suites. Reasoning via verifier-as-reward. Terminal use is in flight. Research direction and scientific judgment have not yet been converted.
Board games fell first because the signals came pre-built into the structure of the game. Win and loss are grounded by definition, and self-play keeps the loop running at compute speed. (The algorithmic feat behind AlphaZero and MuZero was not making the signal denser, since the final outcome remained the only reward, but learning to assign credit across long horizons against a sparse but perfectly grounded signal.)
Formal mathematics fell when AlphaProof [2: Google DeepMind, “AI achieves silver-medal standard solving International Mathematical Olympiad problems”, July 2024] carried the AlphaZero/MuZero architecture, MCTS plus a learned value network, onto Lean proofs. The grounded oracle was now a proof-checker rather than a game result, but the algorithm was the one that had already taken Go and chess: credit assignment across sparse rewards, scaled by self-play.
Reasoning broke open when DeepSeek-R1 showed something genuinely new [3: DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”, arXiv:2501.12948, January 2025]: you could skip teaching the method entirely and just reward correct answers. A grounded signal (math/code correctness), delivered fast (verified per attempt), was sufficient to induce the method on its own.
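To illustrate what "just reward correct answers" means mechanically, here is a minimal outcome-only reward function. It is a sketch, not the DeepSeek-R1 implementation: the `Answer:` marker, the exact-match comparison, and the function names are all assumptions made for the example. The point is only that nothing in the reward inspects how the answer was reached.

```python
import re


def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion, reference):
    """1.0 if the final answer matches the reference exactly, else 0.0.

    The intermediate reasoning is never graded, only the outcome.
    """
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


good = outcome_reward("6 times 7 is 42.\nAnswer: 42", "42")
bad = outcome_reward("6 times 7 is 41.\nAnswer: 41", "42")
missing = outcome_reward("I am not sure.", "42")
```

Here `good` is 1.0 while `bad` and `missing` are 0.0: the signal is grounded in correctness and available as soon as the attempt finishes.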
Coding fell as test suites became the grader [4: SWE-bench, a software engineering benchmark graded by test suites]: software engineering turned out to ship with its own oracle.
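A test suite acting as grader can be sketched as a function that turns pass/fail results into a reward. This is an illustrative toy, not how SWE-bench or any specific benchmark scores submissions; `grade_with_tests`, the fraction-passed scoring rule, and the sample candidates are assumptions.

```python
def grade_with_tests(candidate, test_cases):
    """Return the fraction of test cases the candidate function passes.

    test_cases is a list of (args_tuple, expected_result) pairs.
    """
    if not test_cases:
        raise ValueError("need at least one test case to grade against")
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(test_cases)


# Hypothetical suite for a sorting task.
tests = [
    (([3, 1, 2],), [1, 2, 3]),
    (([],), []),
    (([2, 2, 1],), [1, 2, 2]),
]


def buggy_sort(xs):
    return list(xs)  # returns the input unsorted


def crashing_sort(xs):
    return xs[10]  # raises IndexError on every test input


correct_score = grade_with_tests(sorted, tests)
buggy_score = grade_with_tests(buggy_sort, tests)
crash_score = grade_with_tests(crashing_sort, tests)
```

The correct implementation scores 1.0, the buggy one passes only the empty-list case, and the crashing one scores 0.0, all without a human in the loop.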
Terminal and computer use are falling now, as the field figures out how to grade those outcomes automatically. The domains where signal quality stays low (strategy, scientific direction, qualitative judgment) sit stubbornly on the far side of the line, even as the length of tasks models can complete keeps doubling on the near side [5: METR, “Measuring AI Ability to Complete Long Tasks”, March 2025].
The pattern is not “capability emerges.” The pattern is “a signal that was previously bottom-left got pulled into the top-right, and capability followed.”
This is also, read carefully, the argument of The Era of Experience [6: David Silver and Richard Sutton, “Welcome to the Era of Experience”, Google DeepMind]. Systems trained on human approval, a bottom-left signal, hit a ceiling because a student can never exceed the grader it is flattering. The next era belongs to grounded signals: signals measured in the world. Silver and Sutton stop there. The next move is ours.
4. The learnable-signal frontier
So let us name the boundary explicitly. The learnable-signal frontier is the line between judgments that have been converted into top-right signals and judgments that have not.
Everything inside the frontier, machines will master. Everything outside it, they will plateau on, no matter how much compute is burned. This reframes the central question of AI for the next decade. The question is not how smart models get — they will keep getting smarter, and that has stopped being the interesting variable. The question is how fast the learnable-signal frontier moves, and what is doing the moving.
The candid answer today is: human researchers, mostly by hand, one domain at a time. Someone figures out how to grade math. Someone figures out how to grade code. Someone figures out how to grade browsing. Each push outward is a piece of human ingenuity. The frontier moves at the speed of human cleverness about signals.
If that is all that ever happens, we live in a particular future. Machines saturate every domain we hand-convert and plateau at the edge of every one we cannot. The economy is reshaped, astonishingly so, and the work stops where the conversions stop. The strategic question for the rest of the decade is whether the conversion itself can be made a capability — and whether that capability can be made to compound.
The learnable-signal frontier as it has moved since 2018. Inside the line: domains where someone figured out a grader and machines mastered the territory shortly after. Outside: judgments still un-graded. The dashed continuation past today is the open question of §5.
5. The decisive capability: extract, process, compress
Here is the move that has organized our research program.
Reinforcement learning has spent thirty years building a sophisticated craft for signal infrastructure. Reward design, reward shaping, learned reward models, curriculum design, exploration bonuses, credit assignment, self-play: every one of these is a tool for extracting, processing, or compressing signals out of an environment until a learner can act on them. The craft works. Almost every recent leap in machine reasoning sits on top of it. What it is not, currently, is a discipline. The accumulated knowledge lives in researchers’ heads and in one-off code that ships inside individual projects. Each new domain re-discovers the same patterns from scratch. There is no shared substrate for capturing signal-design patterns, no composable tooling that outlasts the team that built it, no systematic way to verify that a reward design is grounded rather than gameable.
Our bet is that signal infrastructure can be made composable, inspectable, and accumulable: software substrate that captures the patterns of signal extraction and refinement, that can be carried from one project to the next, and that turns tacit researcher intuition into explicit artifacts other researchers can build on. This is not a bet on language models replacing reward engineers. Reward design failures are exactly the kind of subtle, adversarial, rare-event problem where naive automation produces the gameable proxies it was supposed to prevent. The state of signal infrastructure today is something like the state of software engineering before version control, or proofs before Lean: competent practitioners producing work that lives only inside the project it was made for.
That substrate operates on three classes of work:
Extraction. Most of what a learner does inside an instrumented environment generates a torrent of side effects, intermediate states, log lines, and partial outputs. Almost all of that stream is noise. A small fraction carries information about whether the action helped. The substrate’s job is to make finding that fraction (identifying which parts of the interaction record actually tie action to consequence) into composable, reusable tooling rather than ad hoc per-project code.
Processing. A raw extracted signal is rarely directly usable. It may be biased, non-stationary, gameable, or measured against the wrong outcome. The substrate’s job is to make the work of shaping the raw signal into one that scores well on both dimensions (debiasing, sanity-checking, deciding when to trust and when to discard) inheritable across projects, with patterns that accumulate rather than re-emerge each time.
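A small sketch of what one processing step might look like: discard readings that cannot be trusted, clip the rest to a plausible range, and standardize what remains. The function name, the clipping bounds, and the choice of standardization are assumptions for illustration; a real pipeline would need domain-specific checks.

```python
import math
from statistics import mean, pstdev


def process_rewards(raw, low=-10.0, high=10.0):
    """Drop unusable readings, clip to [low, high], then standardize.

    The bounds are assumed plausibility limits, not recommended values.
    """
    # Sanity check: keep only finite numeric readings.
    finite = [r for r in raw if isinstance(r, (int, float)) and math.isfinite(r)]
    if not finite:
        return []
    # Clip so a single extreme reading cannot dominate the update.
    clipped = [min(max(r, low), high) for r in finite]
    mu = mean(clipped)
    sigma = pstdev(clipped)
    if sigma == 0.0:
        return [0.0 for _ in clipped]  # no variation means no learning signal
    # Subtract the mean and rescale so rewards are comparable across batches.
    return [(r - mu) / sigma for r in clipped]


processed = process_rewards([1.0, float("nan"), 3.0, float("inf")])
```

For this input the NaN and infinite readings are discarded and the remaining two values become -1.0 and 1.0.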
Compression. The most important operation, and the least discussed. The signals we care most about — “did this make the next model better,” “did this scientific direction pay off,” “did this strategy work” — are bottom-left by nature. They take weeks to resolve, they blend with luck, they cost a training run to read out. The third capability is constructing dense proxies for sparse outcomes: fast, cheap signals that correlate well with the slow expensive one. Anthropic’s “make this code faster” benchmark is such a proxy — a compressed stand-in for “did this make the next model better.” MCTS plus a learned value network is the same operation on a different surface: a sparse terminal reward compressed into dense per-position values a policy can learn from. Compression is what AlphaZero, AlphaProof, and “make this code faster” all share underneath.
Compression is where the leverage sits, because compression is the only operation that can pull a bottom-left signal into the top-right.
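The idea of compressing a sparse terminal reward into dense per-state values can be shown on a toy problem. The sketch below deliberately swaps in something far simpler than MCTS plus a learned value network: tabular every-visit Monte Carlo averaging on a symmetric random walk. The environment, the function name, and the episode count are assumptions for illustration.

```python
import random


def estimate_values(n_states=7, episodes=5000, seed=0):
    """Every-visit Monte Carlo value estimates for a symmetric random walk.

    States 0 and n_states - 1 are terminal. Only reaching the right end
    pays a reward of 1.0; every other step pays nothing.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_states
    visits = [0] * n_states
    for _ in range(episodes):
        state = rng.randrange(1, n_states - 1)  # random non-terminal start
        visited = []
        while 0 < state < n_states - 1:
            visited.append(state)
            state += rng.choice((-1, 1))
        outcome = 1.0 if state == n_states - 1 else 0.0  # sparse terminal reward
        for s in visited:  # credit the final outcome to every state on the path
            totals[s] += outcome
            visits[s] += 1
    # Terminal states are never recorded as visited, so they stay at 0.0.
    return [totals[s] / visits[s] if visits[s] else 0.0 for s in range(n_states)]


values = estimate_values()
```

The only reward in this environment arrives at the very end of an episode, yet the resulting table gives every intermediate state a score that rises toward the rewarding end (the exact values for this walk are 1/6 through 5/6). That table is a dense stand-in for the sparse outcome, which is the operation the text calls compression.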
The three operations the signal-infrastructure discipline runs on. Extraction ties action to consequence. Processing shapes the raw signal. Compression — the most leveraged of the three — is what pulls a bottom-left signal into the top-right, by constructing dense proxies for sparse outcomes.
A brief analogy, since the strategic shape of this matters. Humans pulled away from every other species not on the strength of any single brain, but by accumulating a stock of tools and notations that compounded across generations. Each generation inherited the last one’s instruments and renovated them. The story we are telling about AI is structurally the same, with one substitution: the instruments being accumulated are signals. A discipline that learns to manufacture them — and eventually to manufacture them automatically — is a discipline whose progress compounds the way no individual model’s capability ever could.
Signal infrastructure is one half of a complete learning system. The other half, how a learner holds what it learned and applies it at the right moment, is a research frontier of its own. We will treat it in a separate post.
6. The teacher’s ceiling
There is a catch sharp enough to be the central problem of the next decade, and most current environment-manufacturing pipelines walk right into it.
A recent paper, Endless Terminals [7: “Endless Terminals”, arXiv:2601.16443], generates thousands of agent training tasks with no human annotation. Small models trained inside the pipeline jumped from single-digit to majority success rates. The pipeline is impressive. It also has a filter the authors are explicit about: it keeps a generated task only if a current frontier model can already convert it into a learnable signal.
It is a school where the teachers can only write exams for material they already understand. Magnificent for pulling every student up to teacher level — fast, cheap, infinitely scalable. Structurally incapable of producing a graduate who has learned something no teacher knows. Today’s environment factories manufacture catch-up in industrial quantities. They do not manufacture frontier.
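The ceiling is easy to see in a toy version of such a filter. The sketch below is not the Endless Terminals pipeline; `filter_tasks`, the factoring tasks, and the bounded solver standing in for "a current frontier model" are all hypothetical.

```python
def filter_tasks(tasks, solver, verifier):
    """Keep only the tasks the current solver can already pass."""
    kept = []
    for task in tasks:
        attempt = solver(task)
        if attempt is not None and verifier(task, attempt):
            kept.append(task)
    return kept


def bounded_factor_solver(limit):
    """Stand-in 'teacher': finds a factor only by trying divisors up to limit."""
    def solve(n):
        for f in range(2, min(limit, n - 1) + 1):
            if n % f == 0:
                return f
        return None
    return solve


def is_nontrivial_factor(n, f):
    return 1 < f < n and n % f == 0


# Hypothetical generated tasks: find a nontrivial factor of each number.
tasks = [15, 77, 221, 10403]  # 3*5, 7*11, 13*17, 101*103
kept = filter_tasks(tasks, bounded_factor_solver(limit=10), is_nontrivial_factor)
```

With a teacher limited to divisors up to 10, `kept` is `[15, 77]`. The two harder tasks are perfectly gradable by the verifier but never reach the student, because the filter depends on the teacher solving them first.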
And self-improvement is, by definition, a frontier problem. A system improving on itself must learn things its creators cannot yet do, which means grading problems no existing intelligence can grade. The teacher’s ceiling is the structural thing that has to be broken.
There is a real reason it can be. Checking is very often easier than doing. We can verify a proof we could not have produced; a solution can be brutally hard to find and trivial to confirm. That asymmetry is the crack through which a signal-manufacturing discipline can run ahead of the capability it is grading. The day a system can produce trustworthy signals for problems no existing model can solve, the teacher’s ceiling is gone.
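The asymmetry between checking and doing shows up even in a few lines of code. The sketch below uses subset sum as an assumed example: verifying a proposed answer takes one pass over it, while the naive search tries subsets one by one and can take exponentially many attempts.

```python
from itertools import combinations


def verify_subset(numbers, target, indices):
    """Check a proposed answer: distinct valid indices whose values hit target."""
    return (
        len(set(indices)) == len(indices)
        and all(0 <= i < len(numbers) for i in indices)
        and sum(numbers[i] for i in indices) == target
    )


def find_subset(numbers, target):
    """Brute-force search over all subsets; exponential in len(numbers)."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None


numbers = [3, 34, 4, 12, 5, 2]
solution = find_subset(numbers, 9)           # expensive to produce
is_valid = verify_subset(numbers, 9, solution)  # cheap to confirm
```

Here `solution` is `[2, 4]` (the values 4 and 5). A grader only needs `verify_subset`, which means it can score answers to instances it would have no practical way of solving itself.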
Two frontiers diverge today. Whether the gap closes or widens depends on whether signal infrastructure can be made to run ahead of capability rather than behind it.
This sharpens the three futures Anthropic sketched into two regimes split by one structural barrier. If signal infrastructure only moves as fast as humans extend it, we live in an astonishing, economy-reshaping, bounded world indefinitely. If it can be made to self-extend — extraction, processing, and compression running ahead of capability — the barrier dissolves. That boundary is not a capability threshold. It is whether the learnable-signal frontier can be pushed beyond the capability frontier.
7. The accelerator is the steering wheel
One more consequence, because it is unavoidable.
Every manufactured signal is a proxy for what we actually want. Optimize hard against a proxy and you get the gap between proxy and intent: a failure with its own taxonomy [8: Manheim and Garrabrant, “Categorizing Variants of Goodhart’s Law”, arXiv:1803.04585, 2018] and a long ledger of agents that found the loophole instead of the goal [9: Krakovna et al., “Specification gaming: the flip side of AI ingenuity”, Google DeepMind]. Industrialize signal production and you industrialize Goodhart’s law. Reward hacking is already cited alongside compute as a first-order obstacle to scaling RL [10: SemiAnalysis, “Scaling Reinforcement Learning: Environments, Reward Hacking, Agents, Scaling Data”].
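A toy optimization makes the proxy gap visible. Everything in this sketch is assumed for illustration: `intent` stands for what we actually want, `proxy` is a measured score that tracks it at first but also pays for sheer size, and `hill_climb` is a deliberately simple optimizer. It is not meant to represent any particular category in the cited taxonomy.

```python
def intent(x):
    """What we actually want: x close to 1.0."""
    return -(x - 1.0) ** 2


def proxy(x):
    """Measured score: tracks intent at first, but also pays for sheer size."""
    return intent(x) + 3.0 * x


def hill_climb(score, x=0.0, step=0.1, iters=200):
    """Move to the best neighbor under `score` until no neighbor improves."""
    for _ in range(iters):
        best = max((x - step, x, x + step), key=score)
        if best == x:
            break
        x = best
    return x


x_intent = hill_climb(intent)  # settles near 1.0
x_proxy = hill_climb(proxy)    # settles near 2.5

gap = intent(x_intent) - intent(x_proxy)  # what optimizing the proxy cost us
```

Starting from the same point, the optimizer that climbs the proxy ends up with a higher proxy score and a markedly worse value of the thing the proxy was supposed to measure. The harder the optimization pressure, the more of that gap gets realized.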
This is not a side concern. The layer that sets the learning rate is the layer that sets the misalignment rate. They are the same layer. Whoever designs the signals decides not just how fast a system learns, but what it learns to want, and what it can reach while learning it.
This reframes safety in a way we find strategically clarifying. Safety is not a tax on capability. It is the same discipline as capability, looked at from a different direction. Reward design decides what the system is pulled toward. Capability scoping decides what it can touch. Observability decides what we can audit. All three are properties of the signal layer, not properties of the model. Building the discipline of grounded, auditable signal infrastructure is the safety program. It is the same program as the capability program.
One concrete grounding is worth naming. “What we actually want” is hard to specify in the abstract. Aligning with a specific user, the person a tool is being built for, is something a signal substrate can grade directly: their feedback, their continued use, their reported outcomes. The Darwinian Harness [11: Darwinian Harness, Curation Labs’ toolkit for user-grounded signal infrastructure across Claude Code, Codex, and Cursor] we are building takes this seriously, with user-grounded signals as the substrate’s source of truth. It is one operational form of the discipline, and it is where our own work sits.
8. A scorecard for 2028
A thesis this confident should pay rent. Here are predictions you can hold us to. Reopen this in two years.
- Capability gains will track learnable-signal supply more tightly than compute supply. The leaderboard gaps that matter will trace to who holds the richest set of gradable domains, not to who holds the largest clusters. If frontier compute keeps producing leaps in domains with no new signal machinery, we are wrong — intelligence has generalized past its graders.
- The fire alarm for self-improvement has a specific sound. Ignore demos of agents doing impressive things. Watch for the first credible result in which a model designs the grading for a problem at or beyond the frontier — proposes an open problem, constructs the signal, and the signal holds up under independent scrutiny. The day machines can write trustworthy exams for things nobody knows, the teacher’s ceiling is gone.
- If progress stalls, it stalls with a signature. Models will not get dumber: verifiable domains will continue to saturate on schedule while unverifiable ones (research direction, strategy, judgment) sit flat year after year. A persistent, widening gap between graded and ungraded domains is the stall’s true face. A narrowing gap means the compounding story is intact.
- The next major AI safety incidents will be signal-layer failures. A manufactured reward gamed at scale; a generated curriculum that taught the loophole; an adaptive grader captured by its student. If the worst incidents come from somewhere this lens cannot reach, the lens is smaller than we claim.
From minds to schools
The history of AI has been told as a story about minds — bigger models, deeper networks, more compute. The era now opening will be remembered differently. The action is moving from the student to the school: the signals it learns from, the environments those signals live in, and the discipline of building and refining both. Models will keep getting smarter, and that part is no longer the interesting variable. What is undecided is whether the manufacture of learnable signals can be made into a discipline that runs ahead of capability rather than behind it.
The research program
Curation Labs Research is focused on the signal-manufacturing frontier. The argument is that the rate-limiter on AI capability over the next decade is the supply of grounded, quickly delivered, compressible signals. Model intelligence will keep getting better; signal infrastructure is what decides what that intelligence becomes capable of.
A note on the lineage. Jeff Clune’s AI-Generating Algorithms [12: Jeff Clune, “AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence”, arXiv:1905.10985, 2019] named “learning environments” as one of three pillars that must be automated; POET [13: Wang, Lehman, Clune, Stanley, “Paired Open-Ended Trailblazer (POET)”, arXiv:1901.01753, 2019] co-evolved agents and the worlds that challenge them; the unsupervised environment design literature [14: Dennis et al., “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”, arXiv:2012.02096, 2020] and LLM-generated environment curricula [15: “LLM-generated environment curricula”, arXiv:2411.01775, 2024] formalized curricula that track a learner’s frontier; Reward Is Enough [16: Silver, Singh, Precup, Sutton, “Reward Is Enough”, Artificial Intelligence, 2021] is the maximalist case that reward alone can induce intelligence; The Era of Experience [6: David Silver and Richard Sutton, “Welcome to the Era of Experience”, Google DeepMind] is the case for grounded signals as a category. Our debt to all of them should be obvious; our disagreements (and our manufacturing claim) are in the text.