Runcap vs Langfuse vs LiteLLM: Which One Actually Stops a Runaway AI Agent?
You let a coding agent loose on a task. It loops. It re-reads the same files, re-summarizes the same context, retries the same failing call. Forty minutes later you check the provider dashboard and the run cost more than the feature was worth. You had four tools that could have told you, and none of them stopped it.
This is the gap most people do not notice until it costs them. The tools in this space look interchangeable from the outside, but they sit in three different places in the request lifecycle and solve three different problems. Here is the honest breakdown of where each one fits, and the one job that only one of them does.
The three categories
Observability (Langfuse, Helicone, LangSmith). These record what your LLM calls did after they happened. Traces, token counts, latency, cost per call, evals. They are excellent for understanding behavior over time and debugging quality. They live beside the request path: the call completes, then the data flows to the dashboard. They can alert you that a budget was crossed. They cannot reach back and block the call that crossed it, because by the time the trace exists, the call is already paid for.
Gateways (LiteLLM, OpenRouter, Portkey). These sit in the request path and route. One API surface across many providers, key management, fallbacks, caching, and per-key rate limits and budgets. Their budgets are real, but they are billing-period guardrails: spend X per key per month, reset on a window. That protects you from a leaking key over weeks. It does not estimate what this specific run will cost before you press go, and it does not hard-stop a single agent that goes into a tight, expensive loop inside its allowance.
Pre-flight cost control (Runcap). This also sits in the request path, but its job is different: estimate the cost of a run before it starts, enforce a hard ceiling that physically stops the run when spend crosses it, and cut wasted tokens out of each request on the way through. It is the only one of the three built around the moment before the money is spent.
Side by side
| Capability | Observability (Langfuse / Helicone) | Gateway (LiteLLM / OpenRouter) | Runcap |
|---|---|---|---|
| Estimate run cost before it starts | No | No | Yes (as a range) |
| Hard stop mid-run at a cap | No (alert only) | Per-key budget over time | Yes (HTTP 429 at cap) |
| Compress wasted tokens per request | No | No | Yes (lossless) |
| Delta-encode a re-read file after an edit | No | No | Yes (37.9% on a real call) |
| Post-run traces and analytics | Yes (their strength) | Basic logs | Run report + truth labels |
| Multi-provider routing and fallbacks | No | Yes (their strength) | Proxies Claude and OpenAI |
| Runs fully local, no server | Cloud or self-host | Self-host option | Yes (100% local) |
| Rescue prompt when agent is stuck | No | No | Yes |
The point of the table is not that Runcap wins every row. It does not. Langfuse will out-trace it; LiteLLM will out-route it. The point is the first three rows: estimate before, hard-stop during, compress on the way through. Those are the rows that decide whether a runaway loop costs you a dollar or a hundred. Observability tools leave all three empty, gateways cover only the second and only as a per-key budget over a billing window, and Runcap is the only column that fills all three.
How the hard stop actually works
Runcap runs a small local proxy. You point your agent's base URL at it, set a cap, and run your agent as normal. Every call flows through the proxy. Before a call is forwarded upstream, Runcap prices it against the live model rate and checks the running total. If forwarding the call would cross your cap, it never goes to the provider: the proxy returns HTTP 429 and the spend stays at zero for that call. Your agent sees a budget error, not a surprise bill.
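The check described above can be sketched in a few lines. This is a minimal illustration, not Runcap's actual code: the `CapGuard` class, the `example-model` rate table, and the choice to price a call at its worst case (input tokens plus the maximum output it could produce) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical USD rates per million tokens; a real proxy would load live rates.
RATES = {"example-model": {"input": 0.15, "output": 0.60}}


@dataclass
class CapGuard:
    cap_usd: float
    spent_usd: float = 0.0

    def price(self, model: str, input_tokens: int, max_output_tokens: int) -> float:
        """Worst-case cost of one call: all input plus the full output allowance."""
        rate = RATES[model]
        return (input_tokens * rate["input"]
                + max_output_tokens * rate["output"]) / 1_000_000

    def check(self, model: str, input_tokens: int, max_output_tokens: int):
        """Decide before forwarding. Returns (http_status, body)."""
        worst_case = self.price(model, input_tokens, max_output_tokens)
        if self.spent_usd + worst_case > self.cap_usd:
            # Blocked calls are never sent upstream, so they cost nothing.
            return 429, {"error": "budget cap reached", "spent_usd": self.spent_usd}
        return 200, {"forward": True, "worst_case_usd": worst_case}

    def record(self, actual_usd: float) -> None:
        """Add the real cost of a completed call to the running total."""
        self.spent_usd += actual_usd
```

The design point is the ordering: the price check happens before the upstream request, so a refused call adds nothing to the total and the agent receives an ordinary HTTP error it can handle.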
The one trick no other proxy does
While the call passes through, Runcap also compresses it, and one layer is genuinely unique. Coding agents read a file, change one line, then re-read it. The two copies are almost identical, so the ordinary dedup a gateway does saves nothing on the second copy. Runcap detects the near-duplicate and replaces the re-read with a lossless line-diff against the version the model already saw. The model reconstructs the current file from the diff and answers exactly as it would have on the full text.
This is not a marketing estimate. On a real OpenAI gpt-4o-mini call where the answer depended on the one changed line, the same request dropped from 1,186 prompt tokens to 737 with delta-encoding on: 37.9% off a single re-read, with OpenAI's own usage counter confirming the number and the model giving the identical correct answer. It is lossless by construction: Runcap refuses to emit a delta unless it reconstructs the original byte for byte. The full proof and a script you can run yourself are in the repo.
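To make the idea concrete, here is a minimal sketch of a lossless line delta with the refuse-unless-it-round-trips guard. It uses Python's standard `difflib` and an invented op format (`keep` / `put`); Runcap's real encoding and the way the delta is presented to the model are not shown in this article, so treat every name below as hypothetical.

```python
import difflib


def make_delta(old: str, new: str) -> list:
    """Describe `new` as ranges kept from `old` plus literal changed lines."""
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("keep", i1, i2))           # reuse old lines i1..i2
        elif j2 > j1:
            ops.append(("put", new_lines[j1:j2]))  # inserted or replaced lines
        # pure deletions need no op: the old lines are simply not kept
    return ops


def apply_delta(old: str, ops: list) -> str:
    """Rebuild the new text from the old text and the delta."""
    old_lines = old.splitlines(keepends=True)
    out = []
    for op in ops:
        if op[0] == "keep":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return "".join(out)


def encode_reread(old: str, new: str):
    """Return a delta only if it reconstructs `new` exactly, else None."""
    ops = make_delta(old, new)
    return ops if apply_delta(old, ops) == new else None


def savings_pct(tokens_before: int, tokens_after: int) -> float:
    """Percentage of prompt tokens saved, rounded to one decimal place."""
    return round((tokens_before - tokens_after) / tokens_before * 100, 1)
```

When `encode_reread` returns `None`, the caller would fall back to sending the full file. The quoted figure also checks out arithmetically: `savings_pct(1186, 737)` is (1186 - 737) / 1186, which rounds to 37.9.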
The honest claim. Runcap does not promise an exact cost oracle. Agent runs are stochastic; nobody can tell you the penny-precise cost in advance. What it gives you is a range before the run and a hard cap during it. Every number it reports carries a truth label: observed, calculated, provider_usage, or unknown. It tells you which numbers are measured and which are estimated, instead of pretending they are all the same.
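One way to picture a range estimate whose numbers carry labels is the sketch below. The four label names come from the paragraph above; everything else (the `Labeled` type, the `estimate_range` helper, the idea of bounding the run by a minimum and maximum call count, and the sample rate) is an assumption for illustration rather than Runcap's actual estimator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Truth(Enum):
    OBSERVED = "observed"
    CALCULATED = "calculated"
    PROVIDER_USAGE = "provider_usage"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Labeled:
    value: Optional[float]  # None when the number is not known
    truth: Truth


def estimate_range(min_calls: int, max_calls: int,
                   tokens_per_call: int,
                   usd_per_million: float) -> Tuple[Labeled, Labeled]:
    """Bound the run cost by call count; both ends are derived, not measured."""
    per_call = tokens_per_call * usd_per_million / 1_000_000
    return (Labeled(min_calls * per_call, Truth.CALCULATED),
            Labeled(max_calls * per_call, Truth.CALCULATED))


# Hypothetical run: 5 to 40 calls of about 2,000 tokens at $0.15 per million.
low, high = estimate_range(5, 40, 2000, 0.15)
print(f"estimated ${low.value:.4f} to ${high.value:.4f} ({low.truth.value})")
```

Keeping the label next to the value means a report can never show an estimate as if it were a measurement.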
So which should you use?
This is not a cage match. These tools stack.
- You want to understand and improve quality over time across many runs in production: use Langfuse or Helicone. That is what they are for.
- You serve many users or rotate many providers and need routing, fallbacks, and per-key billing limits: use LiteLLM or OpenRouter.
- You are a developer running a coding agent (Claude Code, Codex, Cursor) on your own key and you want to know what a run will cost and guarantee it cannot blow past a number: use Runcap. It is free, MIT-licensed, and runs entirely on your machine, so your code and tokens never touch a server.
The most common real setup is a gateway for routing, observability for after-the-fact analysis, and Runcap in front of the agent you actually let run unattended. They answer different questions: what did it do, where did it go, and how much before I let it go.
Try it
Runcap installs in one line and runs locally:
npm install -g runcap
Source and docs are on GitHub, and there is a plain-language walkthrough on the Runcap page. It is free forever for the local core. If you later want cloud sync, a hosted dashboard, and alerts across machines, there is a paid Pro layer, but the thing that stops your agent never lives behind a paywall.
The hard cap is one example of a broader principle: a serious system models the consequence of an action before it commits. The same idea shows up in making a bulk update across many locations all-or-nothing instead of letting it corrupt half of them on a failed write. I build AI cost-control and stabilization systems for solo founders and small teams. Runcap is the open-source piece of that work. If you are wiring agents into production and want a second set of eyes on where the spend and the failure modes hide, email me at kirill@launchsoloai.com with your stack and what you are running. Within 24 hours you get back either a free written teardown of that one bottleneck (what I would automate, what it would take, and a fixed price) or a straight no.