Belief Engineering

New paper

This post accompanies SeekerGym, our newly published paper with Google DeepMind on information-seeking under partial observability. Read the full paper. Code and benchmark: [Code Link].

We handed AI systems a document and asked a simple thing: find everything in it. The best of them recovered under half of what mattered, and reported high confidence anyway.

To see why, follow a reporter.

Every answer changes the question

A tip lands in your inbox. A mid-size logistics company is quietly laying people off, the source says. You make one call to confirm it.

The call does not confirm the story. It dissolves it. The layoffs are real, but they are a symptom: the company lost its largest contract three weeks ago, and nobody has reported why. Three calls later you are chasing a regulator, a filing, and a second company you had never heard of. Your first question was “are there layoffs?” Your current question is “why did a regulator let a contract collapse that took two companies down with it?”

You are a different reporter than the one who opened the file. Each answer reshaped what mattered, and with it, the frontier of what you knew you were still missing. The skill that carries you is not memory. It is knowing which question to ask next, and that depends entirely on how well you can name what you are missing.

A machine that seeks information faces the same job. It fails in a specific, fixable way, and SeekerGym was built to measure both the failure and the fix.

The seeking loop.

The loop every searcher runs: a belief about what is missing drives a question, the answer reveals a fragment, the fragment reshapes the belief. Everything in this post comes down to the quality of that belief.
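The loop can be written down in a few lines. What follows is a minimal sketch, not code from the paper: the `Belief` class, the `WORLD` lookup table, and the `ask`, `answer`, `update`, and `seek` functions are all hypothetical names invented for illustration, and the toy world simply replays the reporter's story.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    found: list = field(default_factory=list)    # fragments surfaced so far
    missing: list = field(default_factory=list)  # open questions: the named gaps


# Hypothetical toy world: each question yields a fragment plus new leads.
WORLD = {
    "are there layoffs?": ("layoffs are real", ["why did the contract end?"]),
    "why did the contract end?": ("a regulator intervened", ["what did the filing say?"]),
    "what did the filing say?": ("a second company was involved", []),
}


def ask(belief):
    # The belief about what is missing drives the next question.
    return belief.missing[0]


def answer(question):
    # The world reveals one fragment and, possibly, new leads.
    return WORLD.get(question, ("no answer", []))


def update(belief, question, result):
    # The fragment reshapes the belief: one gap closes, new gaps may open.
    fragment, leads = result
    missing = [q for q in belief.missing if q != question]
    missing += [lead for lead in leads if lead not in missing]
    return Belief(found=belief.found + [fragment], missing=missing)


def seek(belief, max_steps=10):
    for _ in range(max_steps):
        if not belief.missing:
            break  # nothing left that the agent knows how to name
        question = ask(belief)
        belief = update(belief, question, answer(question))
    return belief


final = seek(Belief(missing=["are there layoffs?"]))
print(final.found)
```

Note where the loop stops: when `missing` is empty. That is not the same as having found everything. It only means the agent has run out of gaps it can name, which is why the quality of the belief matters so much.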

You can’t see what you didn’t ask

A foundation model runs this loop half-blind. It acts in a world it sees only in fragments, the pieces its own queries happen to return, with no view of how much is still out there. Decision theory calls this partial observability. The catch is that we mostly build and test these systems as if the world were fully observable: single-turn answers over a complete context, or agent loops that stuff the entire transcript back in and trust attention to sort it out.

Partial observability.

SeekerGym makes the blindness concrete. The agent is given only a document’s abstract and has to recover the rest by asking, one question at a time. It sees only what it surfaces, and never how much still remains hidden.
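To make the setup tangible, here is a toy environment in the same spirit. It is a sketch under loud assumptions: `ToyDocumentEnv` and its `query` method are hypothetical and are not SeekerGym's actual interface, and the word-overlap retrieval is a stand-in for whatever retriever a real system would use. The point it illustrates is the asymmetry: the agent can read the abstract and the result of each query, but nothing tells it how many passages exist.

```python
class ToyDocumentEnv:
    """Hypothetical environment: the agent sees the abstract and query results only."""

    def __init__(self, abstract, passages):
        self.abstract = abstract          # visible to the agent
        self._passages = list(passages)   # hidden; the count is never exposed

    def query(self, question):
        """Return the hidden passage sharing the most words with the question,
        or None if no passage shares any word."""
        q_words = set(question.lower().split())
        best, best_score = None, 0
        for passage in self._passages:
            score = len(q_words & set(passage.lower().split()))
            if score > best_score:
                best, best_score = passage, score
        return best


env = ToyDocumentEnv(
    abstract="A logistics company is laying people off.",
    passages=[
        "the company lost its largest contract",
        "a regulator reviewed the filing",
        "a second company also closed",
    ],
)
print(env.query("which contract was lost"))
print(env.query("weather tomorrow"))
```

A query that misses returns nothing, and a query never asked returns nothing either. From the agent's side the two are indistinguishable from an empty document.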

How you organize what you found beats how big your model is

The intuition is that feeding the model its full history should help. It does the opposite. As the trajectory grows, the signal drowns in passages already read, near-duplicate results, and the scaffolding of the search itself. What matters is how the knowledge is organized, far more than how much of it the model carries.

We tested three ways of handing the model what it had found. The raw trajectory, everything verbatim, did worst. Deduplication helped. The clear winner was a structured “oracle” belief that marks, explicitly, what has been found and what is still missing.

Three belief representations.

Three ways to hold the same knowledge. The raw trajectory is a pile; the oracle belief is a map where the gaps are as legible as the findings. For every model we tested, the performance gap between these representations was larger than the gap between different foundation models.
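The three representations are easy to contrast in code. The sketch below is illustrative only: the step format `(question, section, passage)`, the function names, and the `[found]` / `[missing]` markers are assumptions made for this example, not the paper's format. The `outline` argument stands in for the privileged knowledge that makes the oracle an oracle: it knows which sections exist.

```python
def raw_trajectory(steps):
    """Everything verbatim: every question and every returned passage."""
    return "\n".join(f"Q: {q}\nA: {passage}" for q, _, passage in steps)


def deduplicated(steps):
    """Each distinct passage once, in the order first seen."""
    seen, kept = set(), []
    for _, _, passage in steps:
        if passage not in seen:
            seen.add(passage)
            kept.append(passage)
    return "\n".join(kept)


def structured_belief(steps, outline):
    """A map over the outline: what was found, and what is still missing."""
    found = {}
    for _, section, passage in steps:
        found.setdefault(section, passage)
    lines = []
    for section in outline:
        if section in found:
            lines.append(f"[found] {section}: {found[section]}")
        else:
            lines.append(f"[missing] {section}")
    return "\n".join(lines)


# Made-up steps for illustration: (question, section, passage).
steps = [
    ("are there layoffs?", "Layoffs", "Staff were cut in March."),
    ("how many layoffs?", "Layoffs", "Staff were cut in March."),
    ("why?", "Contract", "The largest contract was lost."),
]
outline = ["Layoffs", "Contract", "Regulator"]

print(raw_trajectory(steps))
print(deduplicated(steps))
print(structured_belief(steps, outline))
```

Only the last form contains a line for the regulator. The first two are silent about it, and silence gives the next question nothing to aim at.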

The lesson is uncomfortable for a field that races to scale. The deduplicated trajectory and the oracle are both far shorter than the raw one, yet only the oracle wins by a wide margin, so less text is not what does the work. The oracle is a compact but privileged map: it spends its space marking what is missing, which is exactly what the next question needs. Compression of the right kind is understanding. A good belief is hard to vary, because you cannot rearrange it without losing the one thing that drives the next question.

It recovers under half, and doesn’t know it

Let the agent explore thoroughly, build the best belief it can, and stop. How much did it actually find? The best approaches recovered 42.5% of the target passages on Wikipedia articles and 29.2% on ML survey papers. In both settings more than half went undiscovered, and the agent had no ground truth to grade itself against. An agent that finds 40% of the relevant material and reports high confidence is more dangerous than one that finds 40% and says so.

SeekerGym closes that gap with calibrated completeness. At the end of a search the agent estimates how much of the relevant content it found, and we wrap that estimate in an interval with a guaranteed coverage rate.

Completeness interval.

Instead of a single number, the agent reports a calibrated range: “between 35% and 55% of what is relevant, with 90% confidence.” That is a statement a human can actually act on.
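One standard way to get an interval with a coverage guarantee is split conformal prediction. This post does not spell out the paper's exact procedure, so treat the following as an assumed stand-in rather than the SeekerGym method. The function names and the calibration numbers are hypothetical. The idea: on past searches where the true completeness is known, measure how far off the agent's estimate was, then widen new estimates by a suitably high quantile of those errors.

```python
import math


def conformal_radius(estimates, truths, alpha=0.1):
    """Half-width such that estimate +/- radius covers the truth with
    probability at least 1 - alpha, assuming new searches are exchangeable
    with the calibration searches."""
    scores = sorted(abs(e - t) for e, t in zip(estimates, truths))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: fall back to the trivial interval
    return scores[k - 1]


def completeness_interval(estimate, radius):
    """Clip the interval to the valid range of a completeness fraction."""
    return max(0.0, estimate - radius), min(1.0, estimate + radius)


# Made-up calibration data for illustration only:
# the agent's self-estimates and the true completeness on nine past searches.
cal_estimates = [0.50, 0.60, 0.40, 0.70, 0.55, 0.45, 0.65, 0.35, 0.50]
cal_truths = [0.42, 0.48, 0.38, 0.51, 0.50, 0.40, 0.47, 0.30, 0.44]

radius = conformal_radius(cal_estimates, cal_truths, alpha=0.1)
low, high = completeness_interval(0.45, radius)
print(f"between {low:.0%} and {high:.0%} of what is relevant, with 90% confidence")
```

The guarantee is marginal, meaning it holds on average over searches rather than for each one individually, and it depends on the calibration searches resembling the ones the agent will face. An overconfident agent is not made accurate by this step. It is made honest, because its large past errors translate directly into a wide interval.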

Why it matters

A document search is the laboratory, but the shape is everywhere: a coding assistant in an unfamiliar codebase, a medical AI reading a patient’s history, any system that sees fragments and must decide when to trust its own picture. Two things make such a system trustworthy. A belief that holds its shape across the search, so memory does not dissolve into noise. And an honest signal of how much it still has not seen, so confidence tracks competence. SeekerGym shows both are buildable and measurable.

What is still missing is a way to judge the quality of what gets found, not only how much of it. That is where the work goes next.


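The full paper, code, and benchmark are at arXiv. For the formal treatment of partially observable Markov decision processes (POMDPs), the decision-theoretic model behind partial observability, see Kaelbling, Littman & Cassandra’s foundational 1998 paper, “Planning and Acting in Partially Observable Stochastic Domains.”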