Running an AI agent in production is six services in a trench coat
By Kingsley Torlowei
Your AI agent looks like one product to your users. It's six services to your on-call.
Let's run a scenario. You run an LLM batch pipeline for hundreds of tenants — nightly enrichment, weekly re-classification, that kind of thing. Last Tuesday, tenant 47's batch silently dropped 80 items. Every layer reported success on the thing it tracked. The queue knew the job finished. The tracer knew every span completed. The framework knew the function returned — even when it returned a default three calls deep. All three were right at their layer. All three missed the question that actually mattered: did the 80 items get what they were sent to do? Nothing in the stack owns that question, because nothing in the stack is accountable for outcomes — only for execution. Silent failure isn't a bug. It's the predictable consequence of an architecture where the unit of accountability is wrong.
From the outside, your AI agent is a single product. From the inside, it's a stack of vendors holding hands and hoping nobody lets go.
Unbutton the coat
Here's a walkthrough of what's actually running when one item moves through your "agent":
- LLM provider — the brain. OpenAI, Anthropic, Bedrock. Lives behind an API you don't control.
- Retrieval — the memory. pgvector, Pinecone, an embeddings pipeline you built last quarter and haven't touched since.
- Agent framework — the orchestration. LangChain, LlamaIndex, or the homegrown loop you wrote because both of those felt wrong.
- Durable execution / queue — the muscle. Temporal, AWS Batch, Celery, SQS. Whichever one was already in the org when you started.
- Observability — the eyes. Langfuse, Phoenix, Datadog traces, three OpenTelemetry exporters fighting over the same span.
- Evals — the conscience. Braintrust, homegrown notebooks, or "we'll add that next quarter."
That's six.
Now layer in the things nobody counts: secrets, blob storage for prompts and outputs, per-tenant isolation, rate limits per provider, cost attribution back to a customer ID, retry policy that doesn't multiply your bill by three.
That's twelve.
It looks like one product because that's what you sell. It's twelve because that's what shipped.
What it costs you
The bill isn't theoretical. Every team running this stack has the same set of scars.
- Triple billing. A retry storm at the queue layer fires the same prompt three times to OpenAI before anyone notices. The tracer logs three successful calls. Accounting catches it at month-end, if at all.
- Invisible cost explosions. Your tracing layer sees every LLM call. It doesn't know they were for tenant 47. The bill spike shows up in your provider dashboard, not your tenant dashboard. Two weeks pass before someone correlates them.
- Silent partial failure. The agent framework catches an exception three calls deep and returns a default. The queue marks the job done. The item is gone. There is no replay path because nothing knows it failed.
- Degraded outputs at scale. A 10K-item batch finishes "successfully" with 200 silently degraded outputs: a tool call returned null, a retrieval came back empty, the model hallucinated a field name. You can't tell which 200 without re-running the whole batch.
- Onboarding requires a release. A new tenant needs configuration in three different YAMLs and a feature flag flip. Every onboarding is a deploy.
None of these are bugs in any individual layer. Every layer is doing its job. The failures live in the seams between them.
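To make the silent partial failure concrete, here is a minimal sketch in plain Python. Every name in it (`flaky_tool`, `enrich_swallowing`, `enrich_reporting`, `Outcome`) is hypothetical and stands in for framework and queue code; no real library is involved. The first handler swallows the exception and returns a default, so the caller only ever sees a normal return. The second returns an explicit outcome the caller can count and replay.

```python
from dataclasses import dataclass
from typing import Optional


def flaky_tool(item_id: int) -> dict:
    """Hypothetical tool call that fails for every tenth item."""
    if item_id % 10 == 0:
        raise RuntimeError(f"tool failed for item {item_id}")
    return {"id": item_id, "label": "ok"}


def enrich_swallowing(item_id: int) -> dict:
    """Framework-style handler: catches the error and returns a default."""
    try:
        return flaky_tool(item_id)
    except RuntimeError:
        return {}  # to the queue this is just another return value


@dataclass
class Outcome:
    item_id: int
    ok: bool
    value: Optional[dict] = None
    error: Optional[str] = None


def enrich_reporting(item_id: int) -> Outcome:
    """Same work, but the failure is recorded instead of hidden."""
    try:
        return Outcome(item_id, ok=True, value=flaky_tool(item_id))
    except RuntimeError as exc:
        return Outcome(item_id, ok=False, error=str(exc))


item_ids = range(1, 101)

# Every call returns normally, so every job is marked done.
swallowed = [enrich_swallowing(i) for i in item_ids]
jobs_done = len(swallowed)

# With explicit outcomes, the failed items are known by ID.
outcomes = [enrich_reporting(i) for i in item_ids]
failed_ids = [o.item_id for o in outcomes if not o.ok]

print(jobs_done, len(failed_ids))
```

Both versions do the same work and hit the same errors. The only difference is whether anything downstream can learn which items were lost.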
How it got like this
Each layer was built for a different era, and the seams show.
Queues like SQS and Celery were designed for millisecond-latency transactional web work — a user clicks a button, a worker sends an email. The work is small, cheap, idempotent, and finishes fast. Retries are free.
Durable execution engines like Temporal and AWS Step Functions were designed for deterministic data pipelines — ETL jobs, financial reconciliation, ML training. The work is deterministic, the state machine is the point, and you write your workflow as code.
Agent frameworks were written last year, mostly for prototypes. Run a notebook, chain some calls, ship a demo. Multi-tenancy was somebody else's problem.
Now stack them. You're running 30-second-to-30-minute items, against non-deterministic LLM outputs, for hundreds of tenants, with per-item cost attribution, on infrastructure designed for none of those things.
Nobody designed any layer for per-tenant accounting, per-item replay, silent partial failure, or pause-on-budget. So you assemble the layers and write the missing pieces yourself. Every quarter, a new on-call rotation discovers a new failure mode and writes a new piece of glue.
The layer that's missing
The instinct, when you've been bleeding glue code for two years, is to look for a tool that replaces all six. That instinct is wrong.
Three of the six don't need replacing. Your LLM provider is fine. Your vector store is fine. Your agent framework is fine at being a framework. Keep them.
The rest is different. Durable execution, observability, and the contract between them can't be assembled from off-the-shelf parts and stay sane. They have to share one model: items, tenants, outcomes, budgets, replays. Separate tools can only translate between models, badly, in glue code. They have to be one thing: a workload layer.
A workload has tenants. Tenants have items. Items have outcomes. Outcomes are accountable.
That contract doesn't exist in your queue. Your queue knows about jobs, not items. It doesn't know what tenant they belong to or what they cost. It doesn't know that a null field in the output is a degraded outcome, not a successful one.
That contract doesn't exist in your tracer. Your tracer knows about spans. It can tell you a call happened. It can't tell you whether the call was the right call to make, or whether the item it served got what it needed.
That contract doesn't exist in your framework. Your framework knows about chains and tools. It doesn't know about the tenant the chain is running for or the budget that tenant is on.
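One way to picture what those three layers lack is a single record that carries tenant, cost, and outcome status together. The sketch below is illustrative only and is not Papayya's API: `Status`, `ItemOutcome`, `classify`, and `summarize` are assumed names, and the rule that a null required field means "degraded" is one possible policy.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass
class ItemOutcome:
    tenant_id: str
    item_id: str
    status: Status
    cost_usd: float


def classify(output, required_fields):
    """Assumed policy: no output is a failure, a null required field is degraded."""
    if output is None:
        return Status.FAILED
    if any(output.get(name) is None for name in required_fields):
        return Status.DEGRADED
    return Status.OK


def summarize(outcomes):
    """Roll item outcomes up to per-tenant cost and status counts."""
    summary = defaultdict(
        lambda: {"cost_usd": 0.0, "ok": 0, "degraded": 0, "failed": 0}
    )
    for outcome in outcomes:
        row = summary[outcome.tenant_id]
        row["cost_usd"] += outcome.cost_usd
        row[outcome.status.value] += 1
    return dict(summary)


# Illustrative data: (tenant, item, model output, cost of the call in USD).
raw = [
    ("tenant-47", "a", {"category": "invoice", "total": 12.5}, 0.25),
    ("tenant-47", "b", {"category": "invoice", "total": None}, 0.25),
    ("tenant-47", "c", None, 0.5),
    ("tenant-12", "d", {"category": "receipt", "total": 3.0}, 0.25),
]
outcomes = [
    ItemOutcome(tenant, item, classify(output, ["category", "total"]), cost)
    for tenant, item, output, cost in raw
]
report = summarize(outcomes)
print(report["tenant-47"])
```

A queue would count all four of these as finished jobs and a tracer would show four completed calls. Only a record keyed by tenant and item can say that tenant 47 paid for three calls and got one usable result.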
The workload layer is what binds them. It's the thing that says: this run belongs to tenant 47, contains 10,000 items, has a budget of $40, must pause — not fail — if the budget breaks, must replay any single item on demand, must surface the 200 degraded outputs without re-running the 9,800 good ones.
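As a rough illustration of that contract, here is a toy in-memory run that pauses when the budget would break, keeps the unprocessed items pending, lists degraded items, and replays a single item. It is a sketch under stated assumptions, not how Papayya implements any of this: the `Run` class, its method names, the flat per-item cost, and "output is None means degraded" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical in-memory run; a real one would persist this state."""
    tenant_id: str
    budget_cents: int
    spent_cents: int = 0
    paused: bool = False
    pending: list = field(default_factory=list)   # item IDs not yet processed
    results: dict = field(default_factory=dict)   # item ID -> output

    def start(self, item_ids, handler, cost_cents):
        self.pending = list(item_ids)
        self.resume(handler, cost_cents)

    def resume(self, handler, cost_cents):
        """Process pending items; pause instead of failing at the budget."""
        self.paused = False
        while self.pending:
            if self.spent_cents + cost_cents > self.budget_cents:
                self.paused = True  # remaining items stay pending
                return
            item_id = self.pending.pop(0)
            self.results[item_id] = handler(item_id)
            self.spent_cents += cost_cents

    def degraded_items(self):
        """Assumed policy: an output of None marks a degraded item."""
        return [item_id for item_id, out in self.results.items() if out is None]

    def replay(self, item_id, handler, cost_cents):
        """Re-run one item without touching the others."""
        if self.spent_cents + cost_cents > self.budget_cents:
            self.paused = True
            return
        self.results[item_id] = handler(item_id)
        self.spent_cents += cost_cents


def flaky(item_id):
    return None if item_id == 3 else f"label-{item_id}"


def fixed(item_id):
    return f"label-{item_id}"


run = Run(tenant_id="tenant-47", budget_cents=5)
run.start(range(1, 9), flaky, cost_cents=1)
paused_after_first_pass = run.paused
pending_after_first_pass = list(run.pending)

run.budget_cents = 10            # tenant raises the budget
run.resume(flaky, cost_cents=1)  # picks up where the pause left off
degraded = run.degraded_items()
run.replay(3, fixed, cost_cents=1)
print(paused_after_first_pass, pending_after_first_pass, degraded, run.spent_cents)
```

The point of the sketch is that pause, resume, and replay all operate on the same item-level state. None of them require re-running items that already finished cleanly.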
Get that layer right and the twelve-services-in-a-trench-coat problem doesn't go away — but it stops being your problem.
What we built
Papayya is the workload layer.
We don't touch the three that don't need replacing. Bring your LLM keys. Bring your vector store. Bring whatever framework code you already wrote — our primitives compose with whatever you put inside @agent.
What we replace is the part that has to be one thing: durable execution, per-item observability, and the workload contract that ties them to tenants and items. Under the hood, that's our own durable execution engine, Postgres-backed and purpose-built for AI workloads. Ships as a free SDK you run yourself, or as a hosted runtime we run for you.
What we add is the contract.
You decorate the function you already have:
from papayya import agent, run

@agent
def enrich(item):
    context = run.retrieve(item.id)
    return call_my_llm(item, context)
You run it the way you already run it — python script.py — and you get per-item replay, per-tenant cost attribution, failure clustering, and a pause-on-budget contract. Your LLM provider hasn't changed. Your vector store hasn't changed. Your framework code keeps working.
What changed is that the run is now a workload, not a pile of spans.
The twelve-services-in-a-trench-coat thing isn't going away. Every layer is doing its job. None of them are wrong.
What's missing is Papayya. The layer that knows your agent is a workload — that your items have tenants, your tenants have budgets, your failures have shapes, and your runs have to be replayable a week from now.
That's the layer we built.