Newer Isn't Better
We benchmarked 25 LLMs on structured regulatory output. Across Anthropic, DeepSeek, and Google, newer models regressed. The reason is instructive — and the subagent trap is worse than you think.
The thing nobody tells you about model upgrades
Here is something that should be more widely known but isn't, possibly because it's bad for business: across three major AI labs, newer models are worse at following structured output instructions than their predecessors.
We found this by accident. We were benchmarking 25 LLMs on a regulatory risk classification task — the kind of thing enterprises actually need to do, as opposed to the kind of thing that looks good on a leaderboard — and the pattern held across Anthropic, DeepSeek, and (to a lesser degree) Google. The newer model, in each case, scored lower. And cost more. Which is, if you think about it, exactly the opposite of what the upgrade cycle promises.
The setup
We built ARA-Eval, a framework for classifying when enterprises can safely deploy autonomous AI agents in Hong Kong financial services. Each model reads 13 regulatory scenarios — loan approvals, AML screening, algorithmic trading, insurance claims denial — and classifies them across 7 risk dimensions, returning valid JSON with exact field names and A–D level ratings. No tools, no multi-turn, no hand-holding. Read the rubric, reason about the scenario, return the structure.
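To make the contract concrete, here is a minimal sketch of the shape a valid response takes. Only "decision_time_pressure" is a field name confirmed by our rubric (it resurfaces in the subagent discussion below); the other six dimension names are placeholders, not ARA-Eval's actual schema.

```python
# Illustrative response shape: 7 dimensions, each rated A-D.
# Only "decision_time_pressure" is a confirmed rubric field name;
# the remaining six are placeholders, not ARA-Eval's real schema.
example_response = {
    "decision_time_pressure": "B",
    "dimension_2": "A",
    "dimension_3": "C",
    "dimension_4": "A",
    "dimension_5": "D",
    "dimension_6": "B",
    "dimension_7": "A",
}
```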
We score with F2 (F-beta, beta=2). In a regulatory context, you'd rather have a system that occasionally cries wolf than one that sleeps through the break-in — a missed risk gate is far more dangerous than a false alarm. F2 encodes that preference: it weights recall four times as heavily as precision.
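If the "four times" sounds arbitrary, it falls straight out of the standard F-beta formula. A minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta: beta > 1 favours recall; beta = 2 weights recall
    beta^2 = 4 times as heavily as precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Swapping precision and recall shows the asymmetry:
print(f_beta(0.5, 0.9))  # ~0.78 -- high recall, low precision
print(f_beta(0.9, 0.5))  # ~0.55 -- high precision, low recall
```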
We tested 25 models via OpenRouter, each making 39 calls (13 scenarios across 3 evaluator personality archetypes). Total cost per model ranged from $0.002 to $0.22.
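The harness is deliberately unexotic. A stripped-down sketch of the call pattern against OpenRouter's OpenAI-compatible endpoint (not ARA-Eval's actual code; the scenario list, persona prompts, and rubric text are truncated placeholders, and the model ID is just one example):

```python
import os
from openai import OpenAI  # OpenRouter speaks the OpenAI wire format

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Placeholders: the real framework loads 13 scenarios and a full rubric.
scenarios = ["loan_approval", "aml_screening", "algo_trading"]
personas = {
    "compliance_officer": "You are a risk-averse Compliance Officer...",
    "cro": "You are an aggressive CRO pushing for autonomy...",
    "operations_director": "You are a neutral Operations Director...",
}
rubric = "Classify the scenario across 7 risk dimensions. Return JSON..."

results = []
for scenario in scenarios:            # 13 scenarios x 3 personas = 39 calls
    for name, persona in personas.items():
        resp = client.chat.completions.create(
            model="google/gemini-2.5-flash-lite",  # any OpenRouter model ID
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": f"{rubric}\n\nScenario: {scenario}"},
            ],
        )
        results.append((scenario, name, resp.choices[0].message.content))
```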
The headline
Gemini 2.5 Flash Lite scores 99% F2 — matching Claude Opus 4.6 on nuanced regulatory reasoning. At $0.10/M input and $0.40/M output, it costs roughly a half-cent per full evaluation run. A budget-tier model, matching Anthropic's flagship.
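The half-cent figure survives a back-of-envelope check. If a full 39-call run consumes something like 40K input and 5K output tokens (our assumption for illustration; real counts vary with scenario length), the arithmetic lands at the ~$0.006 shown in the table below:

```python
# Assumed token counts, for illustration only.
input_cost  = 40_000 / 1_000_000 * 0.10    # $0.004 at $0.10/M input
output_cost =  5_000 / 1_000_000 * 0.40    # $0.002 at $0.40/M output
print(f"${input_cost + output_cost:.3f}")  # $0.006 -- roughly half a cent
```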
That was surprising enough. Then we looked at what happened across model generations.
The regression pattern
| Model | Release | F2 | Cost/Run |
|---|---|---|---|
| Claude Haiku 3.5 | Oct 2024 | 92% | $0.09 |
| Claude Haiku 4.5 | Oct 2025 | 79% | $0.22 |
| DeepSeek v3.2 | Dec 2025 | 82% | ~$0.01 |
| DeepSeek V4 Flash | Apr 2026 | 79% | $0.015 |
| Gemini 2.5 Flash Lite | Sep 2025 | 99% | ~$0.006 |
| Gemini 3.1 Flash Lite | Mar 2026 | 92% | $0.035 |
Anthropic: 13-point drop. DeepSeek: 3-point drop. Google: 7-point drop (from an exceptional baseline). In every case, the newer model costs more and scores lower on structured output compliance.
Haiku 4.5 at $0.22 per run scores identically to DeepSeek V4 Flash at $0.015. Fifteen times the cost, same result. That's the kind of fact that should change procurement decisions. It almost certainly won't, because model selection in most enterprises is still driven by brand and benchmark rankings, not by running your own eval on your own task. But it should.
Why this is happening
The short answer is that labs optimise for the benchmarks that move public rankings, and structured output compliance doesn't move public rankings.
The longer answer is more interesting.
The benchmark landscape shifted
MMLU is saturated — frontier models hit 88–94% by late 2025, making it useless for differentiation. Labs pivoted toward Humanity's Last Exam, ARC-AGI 2, SWE-bench Pro, and GPQA Diamond. These benchmarks reward multi-step reasoning, mathematical problem-solving, and agentic tool use. They do not reward "output exactly this JSON structure with exactly these field names."
Post-training compute follows the benchmarks that move public rankings. Structured output compliance doesn't move rankings, so it doesn't get the compute. The models get smarter at reasoning and worse at following format instructions. This is, in a sense, perfectly rational behaviour — if you're an AI lab. It's less rational if you're the enterprise customer who just upgraded.
Flash-tier architecture trades compliance for efficiency
DeepSeek V4 Flash runs 284B total parameters but only 13B active (mixture-of-experts), optimised for 1M-token context at roughly 10% of v3.2's compute cost. The architecture uses on-policy distillation: domain specialists (maths, coding, instruction-following) are trained independently, then merged. Instruction-following is one specialist among many, and the merge dilutes its signal. DeepSeek's own documentation acknowledges that "alignment-sensitive behaviours might regress" under this approach.
The result: your JSON fields come back with slightly wrong names, or the model wraps output in markdown it wasn't asked for. The reasoning is fine. The packaging is broken. And if you're parsing that output programmatically — which you are, because that's the whole point of structured output — the reasoning being fine doesn't help you.
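In practice, broken packaging is detectable and partly recoverable. A defensive parsing sketch (the recovery heuristics here are ours, not anything the providers document):

````python
import json
import re

def parse_model_output(raw: str) -> dict | None:
    """Best-effort parse of a structured-output response. Covers the two
    packaging failures we hit most: JSON wrapped in an unrequested markdown
    code fence, and stray prose around the JSON object."""
    text = raw.strip()
    # Strip an unrequested ```json ... ``` wrapper, if present.
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the widest brace-delimited span, in case of stray prose.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
````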
Haiku 4.5 added extended thinking, computer use, and a 200K context window. The new capabilities consumed capacity that previously went to rigid format compliance. More tricks, less discipline. You've met people like this.
Google made a different trade-off
Gemini 3.1 Flash Lite — distilled from Gemini 3 Pro — explicitly included instruction tuning data and reinforcement learning from human feedback in its post-training mix. Google's announcement specifically called out "instruction-heavy chatbot workflows" and structured output as design targets, citing approximately 97% compliance in production labelling pipelines. That's a deliberate prioritisation most other labs didn't make at the flash tier.
It shows: Google's regression is the smallest (7 points), from the highest baseline (99%). They still regressed. But they regressed less, because they decided structured output mattered enough to spend post-training compute on it. The fact that this counts as exceptional tells you something about the state of the field.
The subagent trap
This is the finding that keeps me up at night (not literally, but close enough that the metaphor earns its place here).
Claude Haiku 4.5 scored 8% when evaluated as a Claude Code subagent. Then 79% via the direct API. Same model. Same rubric.
The subagent version understood the task but hallucinated its own dimension names — in 14 of 18 evaluations. It substituted "human_override_latency" for "decision_time_pressure," invented field names that sounded plausible but didn't match the rubric, refused once, and in one case read the source files instead of doing the eval (we scored that as a cheat). The model reasoned correctly about the regulatory scenarios but couldn't adhere to the field names in the rubric. This is instruction-following failure in an agentic context, where the model has more autonomy in how it interprets the task.
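Mechanically, this failure mode is cheap to catch. A sketch of the check, reusing the one confirmed field name (the placeholder dimensions stand in for the rest of the rubric, as earlier):

```python
# "decision_time_pressure" is confirmed by the rubric; the other names are
# placeholders for the remaining six dimensions.
RUBRIC_FIELDS = {
    "decision_time_pressure", "dimension_2", "dimension_3",
    "dimension_4", "dimension_5", "dimension_6", "dimension_7",
}

def field_drift(response: dict) -> tuple[set[str], set[str]]:
    """Return (hallucinated, missing) field names relative to the rubric."""
    got = set(response)
    return got - RUBRIC_FIELDS, RUBRIC_FIELDS - got

# The subagent substitution from above would be flagged immediately:
hallucinated, missing = field_drift({"human_override_latency": "B"})
assert "human_override_latency" in hallucinated
assert "decision_time_pressure" in missing
```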
The implication for enterprise deployments is uncomfortable: if your production system uses an orchestrator that passes tasks to subagents, your API benchmarks may be irrelevant. The agentic execution path surfaces different failure modes — ones that don't appear in direct API testing.
Benchmark in the configuration you'll deploy, not the cleanest possible harness. This sounds obvious. In practice, almost nobody does it.
The calibration problem
Of 25 models tested, only 4 were "calibrated." ARA-Eval tests each model through three personality archetypes — a risk-averse Compliance Officer, an aggressive CRO who pushes for autonomy, and a neutral Operations Director who cares about fallback paths. A calibrated model produces meaningfully different risk assessments across these three perspectives without systematically over- or under-triggering hard gates. The four that managed it:
- Claude Opus 4.6 (both API variants)
- Gemini 2.5 Flash Lite
- Qwen3 235B
Ten models were "sleepy": they systematically missed hard gates that should block autonomous action. A sleepy model classifies scenarios as safe for autonomy when they aren't — it tells you the AML screening agent can run unsupervised when the HKMA mandate says otherwise.
Here's the asymmetry that matters: a jittery model (over-triggers) creates friction. Humans override false positives and move on, annoyed but safe. A sleepy model creates liability. Nobody overrides a gate that never fired. You can't catch what you can't see.
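For concreteness, here is how the three labels might be computed from hard-gate outcomes alone. The thresholds are invented for illustration, and the persona-divergence half of the calibration definition is omitted; this is a sketch, not ARA-Eval's actual scoring logic.

```python
def classify_model(gate_results: list[tuple[bool, bool]],
                   sleepy_threshold: float = 0.2,
                   jittery_threshold: float = 0.2) -> str:
    """gate_results: (gate_should_fire, gate_did_fire) per hard-gate scenario.

    Thresholds are illustrative. A sleepy model misses gates that should
    fire (false negatives, the dangerous case); a jittery model fires
    gates that shouldn't (false positives, the merely annoying case).
    """
    should = [fired for expected, fired in gate_results if expected]
    shouldnt = [fired for expected, fired in gate_results if not expected]
    miss_rate = 1 - sum(should) / len(should) if should else 0.0
    trigger_rate = sum(shouldnt) / len(shouldnt) if shouldnt else 0.0
    if miss_rate > sleepy_threshold:
        return "sleepy"       # misses hard gates: liability
    if trigger_rate > jittery_threshold:
        return "jittery"      # over-triggers: friction
    return "calibrated"
```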
What this means if you ship structured output
Don't assume newer is better. For structured output tasks — data extraction, classification, rubric scoring, form completion — benchmark the specific version you plan to deploy against the predecessor. We found three cases across two labs where the newer model was measurably worse.
Cost is not a proxy for quality. Gemini 2.5 Flash Lite at roughly half a cent per run outperforms Claude Haiku 4.5 at $0.22 and GPT-5.4 Nano at $0.029. The cost-performance frontier has shifted; price tier no longer signals capability.
Test your actual integration path. Direct API, proxy endpoint, and subagent produce different results from the same model. We measured a 10x performance swing on Haiku 4.5 between two integration paths. The model isn't the only variable — the execution surface matters.
Distinguish calibration from accuracy. A 92% F2 model that's calibrated may be safer to deploy than a 95% model that's sleepy. False negatives in risk classification create exposure that no amount of accuracy on the happy path can offset.
Reproduce it yourself
Full results, methodology, scenarios, and reference fingerprints are open-source at ara-eval. A complete evaluation run takes 5–30 minutes per model and costs under $0.25 on OpenRouter.
The framework targets Hong Kong financial services regulation (HKMA, SFC, PCPD, PIPL), but the core finding generalises: if your production workload depends on structured output compliance, public benchmark rankings won't tell you what you need to know.
Run your own eval.
Evaluated May 2026. 25 models tested via OpenRouter API, 39 calls per model (13 scenarios × 3 personality archetypes). F2 (beta=2) weights recall 4x as heavily as precision — appropriate for risk classification where false negatives are more dangerous than false positives. All results and scoring code are open-source.