
Essay

Why the agent wouldn't say no

Sam Carter · 21 May 2026 · 3 min read

The experiment was meant to see where the AI got things wrong.

The more interesting answer was where it refused to make a call.


What we did

We took public UK government purchase records — the kind every public buyer now has to publish under the new procurement law — and put two reviewers in front of the same set, at the same moment.

On one side, an AI was asked to read each record and pick: approve, flag for a human, or deny.

On the other, MeshQu checked the same records against a written rulebook — when each had to be published, who was allowed to sign it off, and what evidence had to be left behind.
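
To make the rulebook side concrete, here is a minimal sketch of what checking a record against written rules could look like. It is not MeshQu's implementation: the field names, the 30-day publication window, and the deny-on-any-finding policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    FLAG = "flag"
    DENY = "deny"


@dataclass
class PurchaseRecord:
    # Hypothetical fields; real published records will be shaped differently.
    awarded_on: date
    published_on: date
    value_gbp: float
    signer_limit_gbp: float
    evidence_attached: bool


# Assumed window for illustration only; the real deadline lives in the rulebook.
PUBLICATION_WINDOW = timedelta(days=30)


def rulebook_findings(record: PurchaseRecord) -> list[str]:
    """Return the name of every rule the record breaks."""
    findings = []
    if record.published_on - record.awarded_on > PUBLICATION_WINDOW:
        findings.append("published_late")
    if record.value_gbp > record.signer_limit_gbp:
        findings.append("over_signing_limit")
    if not record.evidence_attached:
        findings.append("evidence_missing")
    return findings


def rulebook_verdict(record: PurchaseRecord) -> Verdict:
    """In this sketch, any finding is a denial and no finding is an approval."""
    return Verdict.DENY if rulebook_findings(record) else Verdict.APPROVE
```

The point of the shape is that the rulebook always lands on a verdict. The AI reviewer sees the same record but is free to pick `Verdict.FLAG`.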

Both reviews were locked together into a single signed record for each decision — the kind of evidence a regulator, a board, or an auditor could replay later. One decision, two reviewers, one verifiable file.
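
One way to picture "locked together" is a record that carries both reviews, a hash of their canonical form, and a signature over that hash. The sketch below is an assumption about the general shape, not the format used in the experiment. The function names are hypothetical, and HMAC with a shared key stands in for the signature only to keep the example within the standard library; a record meant to be checked from outside the firm would need a public-key signature instead.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialise with sorted keys so the same content always hashes the same."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_decision(record_id: str, ai_review: dict, rule_review: dict, key: bytes) -> dict:
    """Bind both reviews into one body, then hash and sign that body."""
    body = {"record_id": record_id, "ai_review": ai_review, "rule_review": rule_review}
    digest = hashlib.sha256(canonical_bytes(body)).hexdigest()
    # HMAC is a stand-in for a real signature (assumption, see lead-in).
    signature = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return {"body": body, "sha256": digest, "signature": signature}


def verify_decision(sealed: dict, key: bytes) -> bool:
    """Recompute hash and signature from the body and compare with the stored ones."""
    digest = hashlib.sha256(canonical_bytes(sealed["body"])).hexdigest()
    expected = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(digest, sealed["sha256"])
            and hmac.compare_digest(expected, sealed["signature"]))
```

Because both reviews sit inside the same signed body, changing either verdict after the fact makes `verify_decision` return `False`.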


What we saw

The AI almost never said no.

MeshQu approved about half the records and denied about half. The AI approved a handful, flagged almost everything else for a human to look at, and didn't issue a single denial across the whole set.

Reading the records side by side made the pattern clear. The AI wasn't missing the issues. It was naming the same ones the rulebook did — published too late, signed off above the buyer's limit, evidence missing. It just wouldn't commit. Where the rulebook reached a verdict, the AI reached for "more context needed" every time.

The headline disagreement looked huge. The real disagreement — what each reviewer actually believed about the decision — was small.


Why it matters

A rulebook that treats missing evidence as a confirmed problem, and a cautious AI that defaults to "flag for a human", will look like they disagree on every borderline record — even when they're seeing the same thing.

So we changed one rule. Just one. Instead of "this absence is a violation", the rule now read "this absence needs a human to read". Nothing about the AI changed. The two reviewers then agreed on about ten times as many decisions as before. The rulebook had started describing the situation more accurately, and the apparent disagreement collapsed.
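
The effect of rewording a single rule can be sketched in a few lines. Everything here is illustrative: the finding names, the `absence_is_violation` switch, and the four toy records are invented for the example and are not the experiment's data or results.

```python
def rulebook_verdict(findings: set[str], absence_is_violation: bool) -> str:
    """Map a record's findings to a verdict under one of the two rule wordings."""
    hard_findings = findings - {"evidence_missing"}
    if hard_findings:
        return "deny"
    if "evidence_missing" in findings:
        # The one-rule change: a confirmed violation versus "needs a human to read".
        return "deny" if absence_is_violation else "flag"
    return "approve"


def agreement_rate(ai_verdicts: list[str], findings_per_record: list[set[str]],
                   absence_is_violation: bool) -> float:
    """Fraction of records where the AI and the rulebook reach the same verdict."""
    matches = sum(
        ai == rulebook_verdict(findings, absence_is_violation)
        for ai, findings in zip(ai_verdicts, findings_per_record)
    )
    return matches / len(ai_verdicts)


# Toy inputs, invented for illustration only.
toy_findings = [set(), {"evidence_missing"}, {"evidence_missing"}, {"published_late"}]
toy_ai = ["approve", "flag", "flag", "flag"]

before = agreement_rate(toy_ai, toy_findings, absence_is_violation=True)
after = agreement_rate(toy_ai, toy_findings, absence_is_violation=False)
print(before, after)  # 0.25 0.75 on this toy data
```

The AI's verdicts are identical in both runs. Only the rulebook's wording moves, and the agreement rate moves with it.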

That's the finding the headline number hid. The harder problem isn't whether the AI is capable. It's how the rulebook is worded.


What's next

This was the first experiment. The next one gives the AI the rulebook and the precedent it's supposed to be reading against, then measures how its caution moves once it isn't reasoning from training data alone. That experiment is already being built; the records will be published alongside the paper, and you'll be able to verify each one yourself.

When a regulator, board, or customer asks how a specific AI-augmented decision was made, the answer can't be a screenshot, a vendor brochure, or a re-run six months later under a different model. It has to be a record made at the moment of decision, signed in place, and verifiable from outside the firm that made it.

The full paper is below — method, results, the one-rule change, the verification flow, and every record we ran.

