April 27, 2026 · 9 min read

Managed agents vs. loose LLM calls: the governance bill comes due either way

Anthropic's Claude Managed Agents, OpenAI's open-source Agents SDK, and DIY harnesses all face the same EU AI Act deadline. The runtime is not the hard part anymore — the evidence is.

Carlos Alvidrez

For about thirty months, the LLM-app stack was a craft project. You picked a model, wrote a prompt, glued a few tools together with a while loop, called it an agent, and shipped. The harness — the bit that runs the agent loop, manages the sandbox, executes tools, and keeps the context coherent — was something every team rebuilt from scratch.

That era ended in April 2026. Within four weeks, Anthropic launched Claude Managed Agents into public beta, OpenAI released a model-native open-source Agents SDK with sandboxing built in, and Google and Microsoft kept folding agent primitives into Copilot and Vertex. The four major labs now agree on one thing: the harness is the product. They disagree on what to charge for it.

Which means anyone building with LLMs in production now faces a choice that didn't exist in March:

  • Adopt a managed agent runtime — Anthropic at $0.08/session-hour, OpenAI's free SDK on your own infra, or Microsoft Copilot bundled into M365 — and stop maintaining a harness.
  • Stay with loose LLM calls — your own orchestration, your own retries, your own tool routing — and keep paying the engineering tax to keep that running.

The trade everyone is debating in public is velocity vs. lock-in. That debate is real, but it's a distraction from the trade that will actually cost teams money in 2026. Whichever path you pick, the EU AI Act's August 2 deadline doesn't care, your auditors don't care, and the new Linux Foundation Agentic AI Foundation doesn't care. They all want the same thing: machine-checkable evidence that your agent is operating inside the policy you say it's operating inside.

What "managed" actually moves

It's worth being precise about what a managed agent runtime gives you and what it doesn't.

| Concern | DIY loose calls | Managed agent runtime |
|---|---|---|
| Agent loop (plan → tool → observe → reflect) | You build it | Vendor runs it |
| Sandboxed code execution | You wire up Firecracker / Docker / E2B | Native, isolated per-session |
| Tool registry & routing | You write the dispatcher | YAML or natural-language config |
| Context window management | You implement compaction | Built-in |
| Cost ceiling per session | You meter manually | First-class primitive |
| Telemetry of agent steps | You instrument | Provided as event stream |
| Policy commitment evidence | You produce it | You produce it |
| Audit trail mapped to controls | You produce it | You produce it |
| Refusal rules at runtime | You enforce them | You configure them — but the burden of proof is yours |

The top six rows are real engineering work that managed runtimes legitimately remove. Anthropic's pricing — eight cents per session-hour on top of token costs — is reasonable in exchange for not maintaining your own loop. Notion, Asana, and Sentry are early adopters for a reason: shipping faster matters.

The bottom three rows are the governance bill. No managed runtime pays it for you. Read the docs carefully and you'll see why: the runtime can give you a log of every tool call, every model turn, every sandbox boundary — but it cannot give you a credential that says "this agent has committed to your policy P at maturity tier T, with evidence E, signed by an accountable issuer." That credential has to come from your governance system, not from your harness.
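
To make that concrete, here is a minimal sketch of the shape such a credential could take, using a W3C VC 2.0 envelope (one of the standards PCA composes, covered below). Every DID, URI, and field name here is an illustrative placeholder, not taken from any published spec:

```python
# A minimal sketch of a policy-commitment credential, shaped as a
# W3C VC 2.0 envelope. All DIDs, URIs, and credentialSubject field
# names are illustrative placeholders, not from a published spec.
policy_commitment = {
    "@context": ["https://www.w3.org/ns/credentials/v2"],
    "type": ["VerifiableCredential", "PolicyCommitmentAttestation"],
    "issuer": "did:web:governance.example.com",   # the accountable issuer
    "validFrom": "2026-04-27T00:00:00Z",
    "validUntil": "2026-10-27T00:00:00Z",
    "credentialSubject": {
        "id": "did:web:agents.example.com:billing-agent",   # the agent, named by DID
        "policy": "policy://example.com/data-handling/v3",  # statement P
        "maturityTier": "T4",                               # tier T
        "evidence": [                                       # evidence E
            {"type": "in-toto-attestation", "digest": "sha256:..."},
        ],
    },
    # In a real credential this would be a Data Integrity proof
    # produced with the issuer's signing key.
    "proof": {"type": "DataIntegrityProof", "proofValue": "..."},
}
```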

Bottom line

If you only remember one thing: the harness vendor logs what your agent did. You still owe everyone an answer to what your agent agreed to do. That gap is the governance bill, and it's the same in both paths.

What the EU AI Act actually wants

The August 2026 enforcement window is not abstract. For high-risk AI systems — and most production agents that touch customer data, finance, HR, or regulated content qualify — Article 12 demands automatic logging, Article 13 demands transparency to deployers, and Article 14 demands meaningful human oversight that is technically implemented, not just described in a policy document. As one compliance team put it: "In the 2026 compliance environment, screenshots and declarations are no longer sufficient — only operational evidence counts."

A managed agent runtime gives you the operational half. It does not give you the policy half:

  • Operational evidence — "At 14:32 UTC, agent X called tool Y with input Z; the call was sandboxed; the output was returned at 14:33." Managed runtimes produce this natively.
  • Policy evidence — "Agent X is bound by Statement P at maturity tier T4, codified in skill file S, approved by user A, valid until D." Managed runtimes produce none of this.

If your only artifact is the runtime log, an auditor can verify what the agent did but not whether what the agent did was permitted. That's the gap the AAIF was founded to close and the gap that Article 14 will not let you ignore after August.
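
To see the gap mechanically, put the two records side by side. A minimal sketch with illustrative field names; the event is what a runtime emits, the attestation is what your governance system has to supply:

```python
# The auditor's join: the runtime log proves what happened, the
# attestation proves what was allowed. Field names are illustrative.
operational_event = {
    "ts": "2026-04-27T14:32:00Z",
    "agent": "did:web:agents.example.com:billing-agent",
    "tool": "crm.lookup",
    "sandboxed": True,
}

policy_attestation = {
    "agent": "did:web:agents.example.com:billing-agent",
    "policy": "policy://example.com/data-handling/v3",
    "allowed_tools": {"crm.lookup", "invoice.read"},
    "valid_until": "2026-10-27T00:00:00Z",
}

def was_permitted(event: dict, attestation: dict) -> bool:
    # ISO-8601 UTC timestamps compare correctly as strings.
    return (
        event["agent"] == attestation["agent"]
        and event["tool"] in attestation["allowed_tools"]
        and event["ts"] <= attestation["valid_until"]
    )

# With only the runtime log, was_permitted() has nothing to join against.
```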

What "loose LLM calls" actually costs

The alternative — keeping your own harness, building on top of OpenAI's open-source Agents SDK or rolling your own MCP-aware loop — is sometimes pitched as the "open" path. It is more open. But the engineering cost is real, and so is the governance discipline required to avoid producing a worse audit trail than the managed alternative.

In practice, teams running loose LLM calls in production tend to under-instrument. The agent loop sprawls across services. Tool calls get logged in three different formats. Refusal rules live in scattered system prompts. When an auditor asks "show me every action this agent has taken on behalf of a customer in the last 90 days, and prove it stayed inside the documented scope," the answer is too often a manual SQL session and a written narrative.

That's not because loose-calls teams are sloppy. It's because without a forcing function, governance instrumentation always loses to feature velocity. A managed runtime is a forcing function for operational logging. Nothing about either path is a forcing function for policy attestation.

What both paths need next

Both managed and loose architectures converge on the same missing primitive: a portable, machine-checkable record of what each agent has committed to follow, signed by an accountable issuer, with evidence tiered by maturity. We've been calling this Policy Commitment Attestation (PCA), and the draft specification is heading toward AAIF alongside MCP and AGENTS.md.

PCA composes existing standards rather than inventing new ones:

| Layer | Standard | Role in PCA |
|---|---|---|
| Agent identity | W3C DID | Names the agent unambiguously |
| Attestation envelope | W3C VC 2.0 | Carries the credential, signed |
| Evidence linking | in-toto attestations | Binds evidence artifacts to claims |
| Controls mapping | NIST OSCAL | Exports as assessment-results for audit |
| Tool authorization | MCP Auth | Decides what the agent can call |
| Scope and refusal | W3C ODRL | Machine-readable boundaries |
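
Of those layers, ODRL is the one that turns refusal rules from prose into data. Here is a rough sketch of what a scope policy could look like in ODRL's JSON-LD vocabulary; the targets, actions, and DIDs are illustrative:

```python
# A sketch of machine-readable scope in the ODRL model: one permission,
# one prohibition. Targets, actions, and DIDs are illustrative.
scope_policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Policy",
    "uid": "policy://example.com/data-handling/v3",
    "permission": [{
        "assignee": "did:web:agents.example.com:billing-agent",
        "target": "urn:example:crm:customer-records",
        "action": "read",
    }],
    "prohibition": [{
        "assignee": "did:web:agents.example.com:billing-agent",
        "target": "urn:example:crm:customer-records",
        "action": "distribute",   # a refusal rule, stated as data
    }],
}
```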

The novel piece is the Commitment Maturity Ladder — six tiers from T1 Read (the agent acknowledges the policy exists) to T6 Enforced (a runtime guardrail blocks violations). Tiers are cumulative and falsifiable: a T6 claim has to satisfy every floor below it, and each floor is mechanically checkable. You don't get to wave vaguely at your governance posture; you publish a credential, and a verifier either confirms it or rejects it.
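
The cumulative property is what makes the ladder checkable by a machine. Here is a sketch of the verifier's walk up the floors; since only T1 and T6 are named above, the intermediate checks are hypothetical stand-ins for whatever the draft defines:

```python
# Sketch of the cumulative check: a claim at tier N is accepted only if
# every floor below N also passes. The T2-T5 checks are hypothetical
# stand-ins; the draft spec defines the real ones.
from typing import Callable

LADDER: list[tuple[str, Callable[[dict], bool]]] = [
    ("T1", lambda c: "policy" in c),                      # Read: policy acknowledged
    ("T2", lambda c: bool(c.get("evidence"))),            # hypothetical floor
    ("T3", lambda c: "issuer" in c),                      # hypothetical floor
    ("T4", lambda c: bool(c.get("proof"))),               # hypothetical floor
    ("T5", lambda c: bool(c.get("controls_mapping"))),    # hypothetical floor
    ("T6", lambda c: c.get("runtime_guardrail") is True), # Enforced: violations blocked
]

def verify_tier(credential: dict, claimed: str) -> bool:
    """Falsifiable: the first failing floor rejects the whole claim."""
    for tier, check in LADDER:
        if not check(credential):
            return False
        if tier == claimed:
            return True
    return False  # unknown tier name
```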

PCA is deliberately runtime-agnostic. A Claude Managed Agent at T6 looks structurally identical to a homegrown Python loop at T6 — same envelope, same evidence kinds, same OSCAL export. Which is the point: portability across runtimes is what gives the standard value. It's the same reason MCP and A2A are now Linux Foundation projects: that portability could not exist inside one vendor's stack.

So which path should you pick?

If you're starting today and your team is small, Claude Managed Agents or OpenAI's Agents SDK will save you weeks of harness work, and that's worth the price (or the constraint). Don't refuse to use them out of lock-in fear — the governance interfaces above are vendor-neutral by design, and the AAIF is actively shrinking the lock-in surface.

If you're already running loose LLM calls in production, don't migrate to a managed runtime to fix governance — that's the wrong tool for the wrong problem. Managed runtimes give you better operational telemetry; they don't produce policy attestations. Migrating won't close the audit gap; it just moves the engineering effort.

Whichever you pick, the work that doesn't go away — and that the EU AI Act, your customers' security questionnaires, and your board are all about to ask for — is producing per-agent, per-policy, signed, tiered, evidence-bearing credentials. That's where the differentiating governance work happens in 2026, and it's where Dictiva is investing.

The harness is the product. The governance is the bill. Pick your runtime on velocity grounds; build your evidence layer like every component will eventually need to be replaced — because it will.

