smallbox

← All articles

Working with AI agents safely

How do you get a language model to produce a report you can actually stand behind?

A good report is built, not prompted

The instinct, when you want a language model to produce something careful, is to write a careful prompt. You describe the report you want, add the rules it must follow, and when the first draft invents a confident wrong detail, you add another rule. The prompt grows into a page of instruction and pleading, and the output gets longer without getting more trustworthy. The mistake isn't the wording. It's the unit of work. A report you can stand behind is not something you ask a model for in a prompt — it's something you compose, and the prompt is the smallest part of it.

A prompt isn't an instruction — it's three jobs

Behind a generated report that holds up sit three different things, each doing a different job. A worldview — how to see, the same for every subject. The evidence — what is actually true of this one, handed in rather than recalled. And a contract — what to return, and in what shape. "Prompt engineering," as it's usually described, is the craft of phrasing the request. The real craft is building those three so that the phrasing barely matters. What follows is one section on each, drawn from a system we run in production that writes structural reports on public companies.

The worldview is built once, so the model reasons instead of guessing

The first and largest part is a block of text every generation inherits unchanged: the way of seeing, fixed before the model ever meets the subject. In our case it tells the model to read a company as a coordination system — flows of money, materials, and information moving under constraints — and to describe structure, not predict prices. It carries a few commitments (describe, don't influence; this is one reading, not reality; expect to be incomplete and say so) and a working vocabulary — the economic constraint that shapes an industry, the role a company plays in the economy — framed explicitly as priors to test, never verdicts to apply.

Two things make this worth isolating. It is identical for every company, so the model holds the way of seeing before it pattern-matches the data — which is the difference between reasoning through a lens and reaching for the nearest cliché. And because it never changes, it caches: the large, stable part of the instruction is paid for once instead of on every call. The worldview is where the thinking is shaped; nothing about a specific company belongs in it.

The evidence is handed in, never recalled

A model asked what it knows about a company will answer — fluently, and sometimes from nothing. So the second part never asks. It hands the model a packet of what the system actually holds: figures recomputed from the company's own filings in plain code, with the arithmetic kept; structural patterns computed deterministically rather than guessed; and the system's own prose description marked as a hypothesis to abstract, not a fact to repeat. One rule governs the packet, and it carries more weight than any other line: absence is real — if something is not in here, it is not on file; do not assume it exists, and do not invent it. That single instruction is the seam that stops a capable, confident model from quietly supplying its own facts. What it cannot find, it has to say it cannot find.

The contract is why the report has the shape it does

The third part doesn't ask for "a report." It asks for a fixed set of sections, plus room for one more the evidence earns — and each section does a different job with the truth. The sharpest realization leads. Then what the company is. Then how the system reads its structure — marked, in the section itself, as a reading. Then only what the numbers confirm. Then a single open question — the elegant idea that is interesting but unproven, kept as a hypothesis. Then, last and mandatory, what the system cannot yet see.

The shape is not decoration; it is how the confidence is carried. A structural reading cannot dress itself up as a verified fact, because it physically lives in the section reserved for readings, beside the separate section reserved for verified numbers. An unproven idea cannot leak into the findings, because it has its own room marked "open question." The model is told to write its reasoning before its conclusion, so a verdict chosen too early can't bend the analysis to fit it — and to treat "we can't see this" as a correct answer rather than a gap to paper over. The structure does work the prose would otherwise have to be trusted to do on its own.

Restrict language last; teach by example

With the worldview, the evidence, and the contract carrying the load, the prompt itself can stay lean — and it should. A prompt teaches best by showing one good example, not by accumulating a wall of "do not"; the rules that genuinely must hold belong in code, where they can be tested, not in a paragraph the model may or may not honour. And the wording gets tightened last: gather the reasoning openly first, then compress the language at the end. Pile restrictions into the thinking step and the output flattens toward the safe and generic — careful instructions producing careful nothing. The discipline lives in the structure, so the prompt doesn't have to plead for it.

Built, not prompted

None of this makes the model more trustworthy. It makes the trust unnecessary in the places it can't be earned. The worldview makes the model reason structurally instead of grabbing a cliché; the evidence stops it inventing; the contract makes the structure carry the confidence the prose can't be trusted to carry alone. What comes out is then checked against something the model didn't produce and kept as a record you own — the other two parts of the same discipline. A report built this way is a composed object with a spine you can inspect, not a fluent paragraph you hope is right.

And the move generalizes past reports. Wherever a language model sits near a decision, the work is not writing a cleverer prompt. It is building the system the good answer comes out of.

If the question this leaves you with is about your own system — whether it can be made this legible, this answerable for what it produces — that's what a System Report is for: a written map of what's actually there, where the boundaries fall, and which parts can't yet be trusted, named out loud. The discipline on this page, turned on the system you already have.


This is the middle of a trilogy on building with a model you don't own: what stays yours when you rent the model, how its output is composed (this piece), and how you trust what it produced. All three run in production on CompanyGraph today.

Articles describe the Foundation. The Foundation Map is the thing itself — accounts, admin, email, logging, and deployment, with one real workflow running through them.

← All articles