April 27, 2026 · 9 min read

Managed agents vs. loose LLM calls: the governance bill comes due either way

Anthropic's Claude Managed Agents, OpenAI's open-source Agents SDK, and DIY harnesses all face the same EU AI Act deadline. The runtime is not the hard part anymore — the evidence is.

Carlos Alvidrez

For about thirty months, the LLM-app stack was a craft project. You picked a model, wrote a prompt, glued a few tools together with a while loop, called it an agent, and shipped. The harness — the bit that runs the agent loop, manages the sandbox, executes tools, and keeps the context coherent — was something every team rebuilt from scratch.

That era ended in April 2026. Within four weeks, Anthropic launched Claude Managed Agents into public beta, OpenAI released a model-native open-source Agents SDK with sandboxing built in, and Google and Microsoft kept folding agent primitives into Copilot and Vertex. The four major labs now agree on one thing: the harness is the product. They disagree on what to charge for it.

Which means anyone building with LLMs in production now faces a choice that didn't exist in March:

  • Adopt a managed agent runtime — Anthropic at $0.08/session-hour, OpenAI's free SDK on your own infra, or Microsoft Copilot bundled into M365 — and stop maintaining a harness.
  • Stay with loose LLM calls — your own orchestration, your own retries, your own tool routing — and keep paying the engineering tax to keep that running.

The trade everyone is debating in public is velocity vs. lock-in. That debate is real, but it's a distraction from the trade that will actually cost teams money in 2026. Whichever path you pick, the EU AI Act's August 2 deadline doesn't care, your auditors don't care, and the new Linux Foundation Agentic AI Foundation doesn't care. They all want the same thing: machine-checkable evidence that your agent is operating inside the policy you say it's operating inside.

What "managed" actually moves

It's worth being precise about what a managed agent runtime gives you and what it doesn't.

| Concern | DIY loose calls | Managed agent runtime |
|---|---|---|
| Agent loop (plan → tool → observe → reflect) | You build it | Vendor runs it |
| Sandboxed code execution | You wire up Firecracker / Docker / E2B | Native, isolated per-session |
| Tool registry & routing | You write the dispatcher | YAML or natural-language config |
| Context window management | You implement compaction | Built-in |
| Cost ceiling per session | You meter manually | First-class primitive |
| Telemetry of agent steps | You instrument | Provided as event stream |
| Policy commitment evidence | You produce it | You produce it |
| Audit trail mapped to controls | You produce it | You produce it |
| Refusal rules at runtime | You enforce them | You configure them — but the burden of proof is yours |

The top six rows are real engineering work that managed runtimes legitimately remove. Anthropic's pricing — eight cents per session-hour on top of token costs — is reasonable in exchange for not maintaining your own loop. Notion, Asana, and Sentry are early adopters for a reason: shipping faster matters.

The bottom three rows are the governance bill. No managed runtime pays it for you. Read the docs carefully and you'll see why: the runtime can give you a log of every tool call, every model turn, every sandbox boundary — but it cannot give you a credential that says "this agent has committed to your policy P at maturity tier T, with evidence E, signed by an accountable issuer." That credential has to come from your governance system, not from your harness.
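
To make that concrete, here is a minimal sketch of the shape such a credential could take, using a W3C VC 2.0 envelope (one of the standards PCA composes, covered below). Every DID, URI, and field name here is an illustrative placeholder, not taken from any published spec:

```python
# A minimal sketch of a policy-commitment credential, shaped as a
# W3C VC 2.0 envelope. All DIDs, URIs, and credentialSubject field
# names are illustrative placeholders, not from a published spec.
policy_commitment = {
    "@context": ["https://www.w3.org/ns/credentials/v2"],
    "type": ["VerifiableCredential", "PolicyCommitmentAttestation"],
    "issuer": "did:web:governance.example.com",   # the accountable issuer
    "validFrom": "2026-04-27T00:00:00Z",
    "validUntil": "2026-10-27T00:00:00Z",
    "credentialSubject": {
        "id": "did:web:agents.example.com:billing-agent",   # the agent, named by DID
        "policy": "policy://example.com/data-handling/v3",  # statement P
        "maturityTier": "T4",                               # tier T
        "evidence": [                                       # evidence E
            {"type": "in-toto-attestation", "digest": "sha256:..."},
        ],
    },
    # In a real credential this would be a Data Integrity proof
    # produced with the issuer's signing key.
    "proof": {"type": "DataIntegrityProof", "proofValue": "..."},
}
```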

Bottom line

If you only remember one thing: the harness vendor logs what your agent did. You still owe everyone an answer to what your agent agreed to do. That gap is the governance bill, and it's the same in both paths.

What the EU AI Act actually wants

The August 2026 enforcement window is not abstract. For high-risk AI systems — and most production agents that touch customer data, finance, HR, or regulated content qualify — Article 12 demands automatic logging, Article 13 demands transparency to deployers, and Article 14 demands meaningful human oversight that is technically implemented, not just described in a policy document. As one compliance team put it: "In the 2026 compliance environment, screenshots and declarations are no longer sufficient — only operational evidence counts."

A managed agent runtime gives you the operational half. It does not give you the policy half:

  • Operational evidence — "At 14:32 UTC, agent X called tool Y with input Z; the call was sandboxed; the output was returned at 14:33." Managed runtimes produce this natively.
  • Policy evidence — "Agent X is bound by Statement P at maturity tier T4, codified in skill file S, approved by user A, valid until D." Managed runtimes produce none of this.

If your only artifact is the runtime log, an auditor can verify what the agent did but not whether what the agent did was permitted. That's the gap the AAIF was founded to close and the gap that Article 14 will not let you ignore after August.
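
To see the gap mechanically, put the two records side by side. A minimal sketch with illustrative field names; the event is what a runtime emits, the attestation is what your governance system has to supply:

```python
# The auditor's join: the runtime log proves what happened, the
# attestation proves what was allowed. Field names are illustrative.
operational_event = {
    "ts": "2026-04-27T14:32:00Z",
    "agent": "did:web:agents.example.com:billing-agent",
    "tool": "crm.lookup",
    "sandboxed": True,
}

policy_attestation = {
    "agent": "did:web:agents.example.com:billing-agent",
    "policy": "policy://example.com/data-handling/v3",
    "allowed_tools": {"crm.lookup", "invoice.read"},
    "valid_until": "2026-10-27T00:00:00Z",
}

def was_permitted(event: dict, attestation: dict) -> bool:
    # ISO-8601 UTC timestamps compare correctly as strings.
    return (
        event["agent"] == attestation["agent"]
        and event["tool"] in attestation["allowed_tools"]
        and event["ts"] <= attestation["valid_until"]
    )

# With only the runtime log, was_permitted() has nothing to join against.
```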

What "loose LLM calls" actually costs

The alternative — keeping your own harness, building on top of OpenAI's open-source Agents SDK or rolling your own MCP-aware loop — is sometimes pitched as the "open" path. It is more open. But the engineering cost is real, and so is the governance discipline required to avoid producing a worse audit trail than the managed alternative.

In practice, teams running loose LLM calls in production tend to under-instrument. The agent loop sprawls across services. Tool calls get logged in three different formats. Refusal rules live in scattered system prompts. When an auditor asks "show me every action this agent has taken on behalf of a customer in the last 90 days, and prove it stayed inside the documented scope," the answer is too often a manual SQL session and a written narrative.

That's not because loose-calls teams are sloppy. It's because without a forcing function, governance instrumentation always loses to feature velocity. A managed runtime is a forcing function for operational logging. Nothing about either path is a forcing function for policy attestation.

What both paths need next

Both managed and loose architectures converge on the same missing primitive: a portable, machine-checkable record of what each agent has committed to follow, signed by an accountable issuer, with evidence tiered by maturity. We've been calling this Policy Commitment Attestation (PCA), and the draft specification is heading toward AAIF alongside MCP and AGENTS.md.

PCA composes existing standards rather than inventing new ones:

| Layer | Standard | Role in PCA |
|---|---|---|
| Agent identity | W3C DID | Names the agent unambiguously |
| Attestation envelope | W3C VC 2.0 | Carries the credential, signed |
| Evidence linking | in-toto attestations | Binds evidence artifacts to claims |
| Controls mapping | NIST OSCAL | Exports as assessment-results for audit |
| Tool authorization | MCP Auth | Decides what the agent can call |
| Scope and refusal | W3C ODRL | Machine-readable boundaries |
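
Of those layers, ODRL is the one that turns refusal rules from prose into data. Here is a rough sketch of what a scope policy could look like in ODRL's JSON-LD vocabulary; the targets, actions, and DIDs are illustrative:

```python
# A sketch of machine-readable scope in the ODRL model: one permission,
# one prohibition. Targets, actions, and DIDs are illustrative.
scope_policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Policy",
    "uid": "policy://example.com/data-handling/v3",
    "permission": [{
        "assignee": "did:web:agents.example.com:billing-agent",
        "target": "urn:example:crm:customer-records",
        "action": "read",
    }],
    "prohibition": [{
        "assignee": "did:web:agents.example.com:billing-agent",
        "target": "urn:example:crm:customer-records",
        "action": "distribute",   # a refusal rule, stated as data
    }],
}
```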

The novel piece is the Commitment Maturity Ladder — six tiers from T1 Read (the agent acknowledges the policy exists) to T6 Enforced (a runtime guardrail blocks violations). Tiers are cumulative and falsifiable: a T6 claim has to satisfy every floor below it, and each floor is mechanically checkable. You don't get to wave vaguely at your governance posture; you publish a credential, and a verifier either confirms it or rejects it.
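
The cumulative property is what makes the ladder checkable by a machine. Here is a sketch of the verifier's walk up the floors; since only T1 and T6 are named above, the intermediate checks are hypothetical stand-ins for whatever the draft defines:

```python
# Sketch of the cumulative check: a claim at tier N is accepted only if
# every floor below N also passes. The T2-T5 checks are hypothetical
# stand-ins; the draft spec defines the real ones.
from typing import Callable

LADDER: list[tuple[str, Callable[[dict], bool]]] = [
    ("T1", lambda c: "policy" in c),                      # Read: policy acknowledged
    ("T2", lambda c: bool(c.get("evidence"))),            # hypothetical floor
    ("T3", lambda c: "issuer" in c),                      # hypothetical floor
    ("T4", lambda c: bool(c.get("proof"))),               # hypothetical floor
    ("T5", lambda c: bool(c.get("controls_mapping"))),    # hypothetical floor
    ("T6", lambda c: c.get("runtime_guardrail") is True), # Enforced: violations blocked
]

def verify_tier(credential: dict, claimed: str) -> bool:
    """Falsifiable: the first failing floor rejects the whole claim."""
    for tier, check in LADDER:
        if not check(credential):
            return False
        if tier == claimed:
            return True
    return False  # unknown tier name
```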

PCA is deliberately runtime-agnostic. A Claude Managed Agent at T6 looks structurally identical to a homegrown Python loop at T6 — same envelope, same evidence kinds, same OSCAL export. Which is the point: portability across runtimes is what gives the standard value. It's the same reason MCP and A2A are now Linux Foundation projects: that portability could not exist inside one vendor's stack.

So which path should you pick?

If you're starting today and your team is small, Claude Managed Agents or OpenAI's Agents SDK will save you weeks of harness work, and that's worth the price (or the constraint). Don't refuse to use them out of lock-in fear — the governance interfaces above are vendor-neutral by design, and the AAIF is actively shrinking the lock-in surface.

If you're already running loose LLM calls in production, don't migrate to a managed runtime to fix governance — that's the wrong tool for the wrong problem. Managed runtimes give you better operational telemetry; they don't produce policy attestations. Migrating won't close the audit gap; it just moves the engineering effort.

Whichever you pick, the work that doesn't go away — and that the EU AI Act, your customers' security questionnaires, and your board are all about to ask for — is producing per-agent, per-policy, signed, tiered, evidence-bearing credentials. That's where the differentiating governance work happens in 2026, and it's where Dictiva is investing.

The harness is the product. The governance is the bill. Pick your runtime on velocity grounds; build your evidence layer like every component will eventually need to be replaced — because it will.

