Belief Engineering

New paper

This post accompanies SeekerGym, our newly published paper with Google DeepMind on information-seeking under partial observability. Read the full paper. Code and benchmark: [Code Link].

We handed AI systems a document and asked a simple thing: find everything in it. The best of them recovered under half of what mattered, and reported high confidence anyway.

To see why, follow a reporter.

Every answer changes the question

A tip lands in your inbox. A mid-size logistics company is quietly laying people off, the source says. You make one call to confirm it.

The call does not confirm the story. It dissolves it. The layoffs are real, but they are a symptom: the company lost its largest contract three weeks ago, and nobody has reported why. Three calls later you are chasing a regulator, a filing, and a second company you had never heard of. Your first question was “are there layoffs?” Your current question is “why did a regulator let a contract collapse that took two companies down with it?”

You are a different reporter than the one who opened the file. Each answer reshaped what mattered, and with it, the frontier of what you knew you were still missing. The skill that carries you is not memory. It is knowing which question to ask next, and that depends entirely on how well you can name what you are missing.

A machine that seeks information faces the same job. It fails in a specific, fixable way, and SeekerGym was built to measure both the failure and the fix.

The seeking loop.

The loop every searcher runs: a belief about what is missing drives a question, the answer reveals a fragment, the fragment reshapes the belief. Everything in this post comes down to the quality of that belief.

You can’t see what you didn’t ask

A foundation model runs this loop half-blind. It acts in a world it sees only in fragments, the pieces its own queries happen to return, with no view of how much is still out there. Decision theory calls this partial observability. The catch is that we mostly build and test these systems as if the world were fully observable: single-turn answers over a complete context, or agent loops that stuff the entire transcript back in and trust attention to sort it out.

Partial observability.

SeekerGym makes the blindness concrete. The agent is given only a document’s abstract and has to recover the rest by asking, one question at a time. It sees only what it surfaces, and never how much still remains hidden.

How SeekerGym works

Take a document, a Wikipedia article or an ML survey, and cut it into passages. Show the agent only the abstract. Its job is to recover as much of the rest as it can by asking questions, one at a time, to a retrieval system that returns the passages matching each query (those above a fixed embedding-similarity threshold). The agent cannot browse, cannot see the table of contents, cannot list the headings. Its reward is retrieval completeness: how much of the target content it uncovers.
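The retrieval rule can be sketched in a few lines. This is a minimal illustration and not SeekerGym's implementation: the function names and the toy 2-d vectors are assumptions, and a real setup would embed text with an embedding model. The 0.65 default is the cosine threshold reported for the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, passage_vecs, threshold=0.65):
    """Return the indices of every passage whose similarity exceeds the threshold."""
    return [i for i, p in enumerate(passage_vecs) if cosine(query_vec, p) > threshold]

# Toy 2-d "embeddings" (hypothetical stand-ins for real passage embeddings).
passages = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
hits = retrieve((1.0, 0.1), passages)
print(hits)  # indices of the passages this query surfaces
```

A query can surface several passages or none at all, and the agent gets no signal about the passages that fell below the threshold.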

POMDP, formally

Formally, this is a Partially Observable Markov Decision Process (POMDP), the tuple $(S, A, O, T, \Omega, R, \gamma)$:

S (State space): The hidden state. In SeekerGym, the document’s target passages plus a binary vector tracking which have been retrieved.

A (Action space): Natural language queries the agent can issue to the retrieval system.

O (Observation space): Passages returned for a query, determined by embedding similarity above a threshold (0.65 cosine similarity in our experiments).

T (Transition function): Deterministic and monotonic. Once a passage is retrieved, it stays retrieved. The hidden state only changes by revealing more.

$\Omega$ (Observation function): Maps (state, action) pairs to observations. A query either surfaces a passage (if similarity exceeds the threshold) or returns nothing for it.

R (Reward): Retrieval completeness, how much of the target content the agent has uncovered.

$\gamma$ (Discount factor): Controls the trade-off between finding information now and finding it later.
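The components above map onto a small environment class. The sketch below is a toy under stated assumptions: the class name `SeekerEnv` is hypothetical, keyword overlap stands in for the embedding-similarity threshold, and how the reward is spread across steps (and discounted) is left out because the post does not specify it.

```python
from dataclasses import dataclass, field

@dataclass
class SeekerEnv:
    """Toy information-seeking environment (illustrative, not the benchmark code)."""
    passages: list                                 # hidden target passages (part of S)
    retrieved: set = field(default_factory=set)    # which indices have been revealed

    def step(self, query):
        """Issue a natural-language query (A) and return the matching indices (O)."""
        words = set(query.lower().split())
        # Assumption: keyword overlap replaces the similarity threshold.
        observed = [i for i, p in enumerate(self.passages)
                    if words & set(p.lower().split())]
        # Transition (T): deterministic and monotonic, indices are only ever added.
        self.retrieved |= set(observed)
        return observed

    def completeness(self):
        """Reward quantity (R): fraction of target passages uncovered so far."""
        return len(self.retrieved) / len(self.passages)

env = SeekerEnv(passages=[
    "golden retriever breed temperament",
    "origin cross breeding",
    "marjoribanks acquired nous",
])
env.step("history of spaniels")            # matches nothing
env.step("Marjoribanks breeding program")  # surfaces two passages
print(env.completeness())
```

The agent only ever receives the return value of `step`; the total number of passages, and therefore the true completeness, stays hidden from it.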

The difficulty: maintaining an exact belief distribution over this combinatorial state space is intractable. A document with $n$ passages has $2^{n}$ possible configurations of that binary vector, so exact Bayesian belief updates are infeasible for any realistic size.
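A worked count makes the blow-up concrete. The choice of $n = 30$ is only illustrative. An exact belief assigns a probability to every configuration, so it needs one number per configuration, minus one because the probabilities sum to 1:

$$
2^{30} - 1 = 1{,}073{,}741{,}823 \approx 1.07 \times 10^{9}
$$

Each additional passage doubles that count.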

How you organize what you found beats how big your model is

The intuition is that feeding the model its full history should help. It does the opposite. As the trajectory grows, the signal drowns in passages already read, near-duplicate results, and the scaffolding of the search itself. What matters is how the knowledge is organized, far more than how much of it the model carries.

We tested three ways of handing the model what it had found. The raw trajectory, everything verbatim, did worst. Deduplication helped. The clear winner was a structured “oracle” belief that marks, explicitly, what has been found and what is still missing.

Three belief representations.

Three ways to hold the same knowledge. The raw trajectory is a pile; the oracle belief is a map where the gaps are as legible as the findings. The distance between them was larger, across every model we tested, than the distance between different foundation models.

The three representations, in full

Raw trajectory gives the model everything that happened, verbatim:

Turn 1 Query: "What is the history of Golden Retriever?"
Turn 1 Result: No relevant passages found.
Turn 2 Query: "How did Marjoribanks begin his breeding program at Guisachan?"
Turn 2 Result: [Passage 3] In the 1860s Marjoribanks acquired "Nous," a yellow Flat-coated Retriever...
Turn 3 Query: "Golden Retriever breed characteristics"
Turn 3 Result: [Passage 3] In the 1860s Marjoribanks acquired "Nous," a yellow Flat-coated Retriever...

Passage 3 appears twice because two queries hit it. The model has to wade through failed attempts and duplicates to reconstruct where it stands.

Deduplicated trajectory strips the repetition:

Retrieved passages:
[Passage 3] In the 1860s Marjoribanks acquired "Nous," a yellow Flat-coated Retriever...

Oracle belief uses the document’s true structure, marking what has been found and what has not:

[1] Breed and temperament: [MISSING]
[2] Origin and cross-breeding: [MISSING]
[3] Acquiring Nous: In the 1860s Marjoribanks acquired "Nous," a yellow Flat-coated Retriever...

Now the holes are explicit, and the agent can reason about what the missing passages might hold. The oracle belief is an upper bound, since a real agent never gets the document’s structure for free.
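All three views can be generated from the same search log, which makes the point that they differ in organization and not in information gathered. The sketch below is illustrative: the function names, the `(query, results)` log format, and the `outline` argument (the privileged document structure only the oracle gets) are assumptions, not the paper's code.

```python
NOUS = 'In the 1860s Marjoribanks acquired "Nous," a yellow Flat-coated Retriever...'

# Each turn is (query, [(passage_id, text), ...]); this log format is hypothetical.
turns = [
    ("What is the history of Golden Retriever?", []),
    ("How did Marjoribanks begin his breeding program at Guisachan?", [(3, NOUS)]),
    ("Golden Retriever breed characteristics", [(3, NOUS)]),
]
# Privileged structure: passage ids and titles, available only to the oracle.
outline = [(1, "Breed and temperament"), (2, "Origin and cross-breeding"), (3, "Acquiring Nous")]

def raw_trajectory(turns):
    """Everything that happened, verbatim, including misses and repeats."""
    lines = []
    for t, (query, results) in enumerate(turns, start=1):
        lines.append(f'Turn {t} Query: "{query}"')
        if results:
            lines.extend(f"Turn {t} Result: [Passage {pid}] {text}" for pid, text in results)
        else:
            lines.append(f"Turn {t} Result: No relevant passages found.")
    return "\n".join(lines)

def deduplicated(turns):
    """Each retrieved passage once; queries and failed attempts dropped."""
    seen = {}
    for _, results in turns:
        for pid, text in results:
            seen.setdefault(pid, text)
    body = [f"[Passage {pid}] {text}" for pid, text in sorted(seen.items())]
    return "\n".join(["Retrieved passages:"] + body)

def oracle_belief(turns, outline):
    """One line per passage in the true outline, with gaps marked explicitly."""
    found = {pid: text for _, results in turns for pid, text in results}
    return "\n".join(f"[{pid}] {title}: {found.get(pid, '[MISSING]')}" for pid, title in outline)

print(oracle_belief(turns, outline))
```

Only `oracle_belief` needs the outline, which is exactly the information a real agent does not have and would have to infer.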

The lesson is uncomfortable for a field that races to scale. The deduplicated trajectory and the oracle are both far shorter than the raw one, yet only the oracle wins by a wide margin, so less text is not what does the work. The oracle is a compact but privileged map: it spends its space marking what is missing, which is exactly what the next question needs. Compression is understanding. A good belief is hard to vary, because you cannot rearrange it without losing the one thing that drives the next question.

It recovers under half, and doesn’t know it

Let the agent explore thoroughly, build the best belief it can, and stop. How much did it actually find? The best approaches recovered 42.5% of the target passages on Wikipedia articles and 29.2% on ML survey papers. More than half went undiscovered, and the agent had no ground truth to grade itself against. An agent that finds 40% of the relevant material and reports high confidence is more dangerous than one that finds 40% and says so.

SeekerGym closes that gap with calibrated completeness. At the end of a search the agent estimates how much of the relevant content it found, and we wrap that estimate in an interval with a guaranteed coverage rate.

Completeness interval.

Instead of a single number, the agent reports a calibrated range: “between 35% and 55% of what is relevant, with 90% confidence.” That is a statement a human can actually act on.

How we calibrate completeness

We wrap the agent’s point estimate with conformal prediction, a method that turns any estimate into an interval with a guaranteed coverage rate. The guarantee is distribution-free: it holds whatever the model’s internals, as long as the calibration data is exchangeable with the test data. After calibration, every model hit the target 90% coverage, meaning the true completeness fell inside the predicted interval at least nine times in ten. Larger reasoning models needed the least correction, a hint that scale buys some self-awareness, though calibration was necessary across the board.
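One common recipe is split conformal prediction, sketched below. The post does not say which nonconformity score was used, so the absolute-residual score, the function names, and the calibration numbers here are all assumptions made for illustration.

```python
import math

def conformal_halfwidth(cal_estimates, cal_truths, alpha=0.1):
    """Half-width of a (1 - alpha) interval from held-out calibration searches.

    Score (an assumption): absolute gap between estimated and true completeness.
    """
    scores = sorted(abs(e - t) for e, t in zip(cal_estimates, cal_truths))
    n = len(scores)
    # Rank of the conformal quantile; the tiny tolerance guards against float round-off.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if k > n:
        return float("inf")  # too few calibration points to certify this alpha
    return scores[k - 1]

def completeness_interval(estimate, halfwidth):
    """Wrap a point estimate in an interval, clipped to the valid range [0, 1]."""
    return max(0.0, estimate - halfwidth), min(1.0, estimate + halfwidth)

# Hypothetical calibration set: the agent's estimates vs. ground-truth completeness.
cal_estimates = [0.5] * 9
cal_truths = [0.5, 0.45, 0.55, 0.4, 0.6, 0.35, 0.3, 0.7, 0.25]

q = conformal_halfwidth(cal_estimates, cal_truths, alpha=0.1)
low, high = completeness_interval(0.45, q)
print(f"between {low:.0%} and {high:.0%} of what is relevant, with 90% confidence")
```

The coverage guarantee is marginal, meaning it holds on average over searches drawn like the calibration set, and the interval widens automatically for models whose raw estimates are further from the truth.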

Which benchmarks model this, and which don’t

The most-watched leaderboards lean fully observable: MMLU, GPQA, and long-context “needle in a haystack” tests all hand the model a complete context and ask for a single response. Interactive benchmarks that put a model in a genuinely partial world do exist (τ-bench, WebArena, and ALFWorld-style tasks all make the agent act before it can see everything), but none isolates the variable SeekerGym is built around: how an agent should represent what it has found and what it still lacks.

Why it matters

A document search is the laboratory, but the shape is everywhere: a coding assistant in an unfamiliar codebase, a medical AI reading a patient’s history, any system that sees fragments and must decide when to trust its own picture. Two things make such a system trustworthy. A belief that holds its shape across the search, so memory does not dissolve into noise. And an honest signal of how much it still has not seen, so confidence tracks competence. SeekerGym shows both are buildable and measurable.

What is still missing is a way to judge the quality of what gets found, not only how much of it. That is where the work goes next.

The full paper, code, and benchmark are at arXiv. For the formal treatment of POMDPs, see Kaelbling, Littman & Cassandra’s foundational 1998 paper.
