Most agencies won’t tell you what an AI agent actually costs. They’ll quote you a range so wide it’s useless (“somewhere between $50K and $500K”) and then bill hourly until you stop them. Here are our real numbers from two production agent systems we run on our own business.
Why every quote you’ve gotten is wrong
I ran a small experiment last month. I emailed seven AI agent dev shops with the same one-paragraph brief: “We need an AI agent that watches a Gmail inbox, classifies incoming leads by intent, drafts a personalized reply, and books a Calendly slot when the lead is hot. About 200 emails a week.”
The quotes came back:
- Offshore shop A: “$15,000–$45,000, 8–12 weeks”
- Offshore shop B: “$50,000+, 16 weeks, must scope discovery first ($5K)”
- US “AI agency” with VC logos: “$120,000 for an MVP, then $8K/month retainer”
- Indian dev shop: “$4,800, 2 weeks, fixed price” (this one I almost took just to see what would ship)
- Two no-reply
- One “let’s hop on a call” that I declined
The range is 25x. For the same brief. That’s not a market; that’s a fog.
The reason the fog exists is that almost none of those shops have built and run an agent system on their own business. They quote the way contractors quote a kitchen remodel: vague enough to protect themselves, scary enough to anchor you high, hourly enough to keep the meter running.
I run two multi-agent systems on Dangerous Media right now: Funding OS (evaluates accelerators, grants, and fellowships for an app we’re advising) and Run GTM (plans our weekly content, drafts it, schedules it, books our calendar). I know to the dollar what they cost. Below are the actual numbers, broken down by component, plus the math on what your agent will cost depending on what you’re trying to do.
If you want to skip the article: we built an interactive calculator that does the math live based on your inputs. The rest of this piece explains the assumptions behind it.
The honest cost breakdown by tier
Every agent build I’ve ever seen fits cleanly into one of three buckets. They map roughly to the Agent Sprint pricing tiers we sell, but the components underneath are the same whether you’re hiring us or building it in-house.
Tier 1: Simple agent ($3K–$8K build, $40–$150/mo to run)
What it is: One LLM-powered workflow with one trigger, one or two tools, and structured output. Examples: classify and reply to inbound email, summarize a Slack channel daily, score leads from a form submission.
Build components:
| Component | Hours | Cost (at $150/hr blended) |
|---|---|---|
| Discovery + prompt design | 4–6 | $600–$900 |
| Tool integrations (1–2 APIs) | 6–10 | $900–$1,500 |
| Eval set + calibration runs | 4–6 | $600–$900 |
| Deployment + monitoring | 3–5 | $450–$750 |
| Handoff + documentation | 2–3 | $300–$450 |
| Total | 19–30 hrs | $2,850–$4,500 |
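The totals in that table are nothing more than hours multiplied by the blended rate. Here is a minimal sketch of the arithmetic using the Tier 1 rows; the names (`TIER1_ROWS`, `build_cost`) are illustrative and not part of any tool we ship.

```python
# Hour ranges (low, high) per component, copied from the Tier 1 table above.
TIER1_ROWS = {
    "Discovery + prompt design": (4, 6),
    "Tool integrations (1-2 APIs)": (6, 10),
    "Eval set + calibration runs": (4, 6),
    "Deployment + monitoring": (3, 5),
    "Handoff + documentation": (2, 3),
}

BLENDED_RATE = 150  # USD per hour, the rate used in the table


def build_cost(rows, rate=BLENDED_RATE):
    """Return ((low_hours, high_hours), (low_cost, high_cost))."""
    low = sum(lo for lo, _ in rows.values())
    high = sum(hi for _, hi in rows.values())
    return (low, high), (low * rate, high * rate)


hours, cost = build_cost(TIER1_ROWS)
print(f"{hours[0]}-{hours[1]} hrs, ${cost[0]:,}-${cost[1]:,}")
# prints: 19-30 hrs, $2,850-$4,500
```

Swap in your own rows and rate to sanity-check any quote that lists hours per component.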
Monthly run cost (real, not a guess):
- LLM API: $20–$80/mo (assuming Claude Sonnet 4.6 or GPT-5 mini for a few thousand calls/month)
- Hosting (Vercel / Railway / Cloudflare Workers): $0–$20/mo
- Observability (Langfuse / Helicone free tier or $20/mo paid): $0–$20/mo
- Eval re-runs (monthly calibration set): $5–$15/mo in API cost
- Total: $40–$150/mo
A competent shop should ship Tier 1 in 7–14 days fixed price. Anyone quoting more than $10K for a Tier 1 agent is either padding hours or doesn’t know what they’re doing.
Tier 2: Mid-complexity agent ($8K–$18K build, $150–$500/mo to run)
What it is: A small pipeline of 2–4 agents handing structured output to each other, hitting 3–6 tools, with persistent state across runs. Examples: investor evaluation pipeline (Funding OS sits here), a multi-step support triage system, a sales-research agent that enriches and qualifies leads.
Build components:
| Component | Hours | Cost (at $150/hr blended) |
|---|---|---|
| Architecture (where’s the seam between agents and scripts?) | 6–10 | $900–$1,500 |
| Schema design + JSON-schema validation between agents | 4–8 | $600–$1,200 |
| Discovery + prompt engineering for each agent | 12–20 | $1,800–$3,000 |
| Tool integrations (3–6 APIs, OAuth flows) | 15–25 | $2,250–$3,750 |
| Eval set + per-agent calibration | 8–12 | $1,200–$1,800 |
| Memory layer (persistent context across runs) | 6–10 | $900–$1,500 |
| Deployment, retry/backoff, error handling | 8–12 | $1,200–$1,800 |
| Handoff + documentation | 4–6 | $600–$900 |
| Total | 63–103 hrs | $9,450–$15,450 |
Monthly run cost:
- LLM API: $80–$300/mo (more agents × more calls × longer context)
- Hosting + queue/scheduler: $20–$50/mo
- Observability (now genuinely required): $20–$50/mo
- Vector DB (if memory is semantic): $0–$50/mo (Pinecone starter, Supabase pgvector free)
- Eval / regression suite re-runs: $20–$50/mo in API cost
- Total: $140–$500/mo
Real-world example: Funding OS, which sits here, costs ~$120/mo to run with a discovery sweep every 48 hours and full evaluation runs on demand. More on that below.
Tier 3: Complex multi-agent system ($18K–$60K+ build, $400–$1,500+/mo to run)
What it is: 5+ agents, multiple orchestration patterns, often with human-in-the-loop checkpoints, multi-day asynchronous flows, more than one external system to write back to. Examples: a full GTM workflow that researches, plans, drafts, schedules, and books meetings; a multi-agent support system with escalation paths; an investor-fit pipeline that auto-drafts applications.
Build components:
| Component | Hours | Cost (at $150/hr blended) |
|---|---|---|
| Architecture + orchestration design | 12–20 | $1,800–$3,000 |
| Schema design across N agents | 8–14 | $1,200–$2,100 |
| Prompt engineering (5+ agents) | 25–45 | $3,750–$6,750 |
| Tool integrations (6+ APIs, often custom) | 30–60 | $4,500–$9,000 |
| Eval framework + per-agent calibration sets | 15–25 | $2,250–$3,750 |
| Memory + state management | 12–20 | $1,800–$3,000 |
| Observability + alerting | 10–15 | $1,500–$2,250 |
| Human-in-the-loop approval flows | 8–15 | $1,200–$2,250 |
| Retry, fallback, rate-limit handling | 10–15 | $1,500–$2,250 |
| Handoff, documentation, training | 6–10 | $900–$1,500 |
| Total | 136–239 hrs | $20,400–$35,850 |
Monthly run cost:
- LLM API: $250–$900/mo
- Hosting + orchestration (Inngest, Trigger.dev, Temporal): $50–$200/mo
- Observability (Langfuse Pro, Helicone, or DataDog): $50–$200/mo
- Vector DB + caching layer: $50–$150/mo
- Eval suite + regression runs: $50–$100/mo
- Total: $450–$1,550/mo
Real-world example: Run GTM sits in the lower half of this band on build cost and comes in under it on run cost: ~$280/mo, five agents, ~50 runs/week. More below.
The Funding OS numbers (real)
Funding OS is a Tier 2 system that evaluates accelerators, grants, and fellowships for fit. Five-stage pipeline: discovery → eligibility (script) → strategic fit (agent) → submission readiness (agent) → ROI synthesis (agent). 19 agents, 12 skills, 18 scripts, 10 templates, 7 schemas.
What it cost to build (founder-hours, not billed):
- ~85 hours of build time over 3 weeks
- At a $150/hr blended rate, replacement cost: ~$12,750
- Cash out of pocket during build: ~$140 (API spend during development + a Pinecone account I didn’t end up using)
What it costs per month to run:
| Line item | Monthly cost | Notes |
|---|---|---|
| Claude API (Sonnet 4.6 for most calls, Haiku for cheap classification) | ~$85 | ~3M input tokens, ~700K output tokens/mo |
| Hosting (Cloudflare Workers + cron) | $0 | Within free tier |
| Supabase (pgvector for memory) | $0 | Within free tier so far |
| Langfuse (observability) | $20 | Pro tier, worth it |
| Eval re-runs (calibration set against 15 historical programs, weekly) | $15 | All Claude API |
| Total | ~$120/mo | |
Token economics, in detail: a full evaluation pass on one program runs ~6 Claude calls. About 4 of those are Sonnet-class (~8K input tokens, ~1.5K output tokens each). The other 2 are Haiku-class (eligibility-style classification, cheap). Cost per program evaluated: ~$0.06 to $0.09.
We evaluate ~30 programs per discovery sweep, sweeping every 48 hours. That’s ~450 evals/month, plus reruns when the founder pushes back on a score and we recalibrate. Total: roughly 2,800 LLM calls/month, $85 in API spend.
What drives the cost up if you’re not careful:
- Letting Sonnet do work Haiku can do (the single biggest cost leak: eligibility classification on Sonnet is 5x more expensive than it needs to be)
- Not caching the discovery sweep results (we cache for 36 hours, this alone saves ~40% on monthly API)
- Forgetting to truncate context. Persistent memory is great until an agent pulls 30K tokens of irrelevant history into every call.
What we’d ship differently if rebuilding for a client at this tier: the same architecture, plus a $50/mo Inngest tier so we don’t have to maintain cron + queue logic ourselves. Net delivered cost to client: $170–$190/mo to run, ~$12K–$14K to build under our Sprint terms.
The Run GTM numbers (real)
Run GTM is a Tier 3 system. Five agents that plan our content week, draft posts in our voice, schedule them to Obsidian → Drive → Gmail → Calendar, and queue prospect DMs. It writes back to four external systems. Human approval is required at two checkpoints (final post approval, calendar booking).
Build cost (founder-hours):
- ~140 hours of build time over 5 weeks
- Replacement cost at $150/hr: ~$21,000
- Cash out of pocket during build: ~$380 (API spend + Trigger.dev cloud during development)
Monthly run cost:
| Line item | Monthly cost | Notes |
|---|---|---|
| Claude API (Sonnet 4.6 for drafting, Haiku for routing/classification, Opus for the weekly plan) | ~$180 | ~7M input, ~1.4M output tokens/mo |
| OpenAI API (one fallback path + voice transcription for outbound clips) | ~$25 | |
| Trigger.dev (orchestration) | $25 | Hobby tier |
| Langfuse Pro | $20 | |
| Vercel (front-end approval UI) | $0 | Within free tier |
| Resend (transactional email for approval flows) | $10 | |
| Vector DB (Supabase pgvector) | $0 | Free tier |
| Eval / regression runs (weekly) | $20 | All Claude API |
| Total | ~$280/mo | |
Per workflow run: ~$0.85 to $1.40 in API cost depending on how much drafting it does. ~50 runs/week. The Opus weekly-plan call is the single most expensive line item (~$0.40/call) but it runs once a week, not 50x.
The cost optimization that mattered most: routing the right model to the right job. An early version of Run GTM used Sonnet for everything, including routing decisions (“does this message need the drafter agent or the scheduler agent?”). Switching routing to Haiku cut API spend by 38% with zero quality degradation. That single change is the difference between a sustainable $280/mo and a painful $450/mo.
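The routing idea fits in a lookup table. The sketch below is not the Run GTM code; the task names, the tier labels, and the choice of a mid-tier default are all assumptions made for illustration, and the tier labels are placeholders rather than real API model identifiers.

```python
# Hypothetical task-to-tier routing table. The tier names stand in for
# whatever model IDs your provider exposes; they are not real API identifiers.
ROUTES = {
    "routing": "haiku",          # "which agent handles this message?"
    "classification": "haiku",
    "drafting": "sonnet",
    "weekly_plan": "opus",
}

DEFAULT_TIER = "sonnet"  # assumption: unknown task types get the mid tier


def pick_model(task_type: str) -> str:
    """Return the cheapest tier we trust for this kind of task."""
    return ROUTES.get(task_type, DEFAULT_TIER)


def spend_after_cut(monthly_spend: float, cut: float) -> float:
    """Monthly spend after a fractional reduction (0.38 means 38%)."""
    return monthly_spend * (1 - cut)


print(pick_model("routing"))              # haiku
print(round(spend_after_cut(450, 0.38)))  # 279
```

The second call reproduces the numbers above: a 38% cut on $450/mo lands at roughly $280/mo.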
The hidden costs nobody quotes you on
The build + API numbers above are the costs you can see. Here are the ones offshore quotes systematically omit, ranked by how often they bite.
1. Eval / QA infrastructure (~10–25% of total build)
You can’t ship an agent without a way to measure if it got worse last Tuesday. That means a calibration set (15–50 hand-labeled examples), a regression-run harness, and the discipline to actually run them. The shops quoting you $4,800 for a Tier 1 agent are skipping this entirely. Their agent will silently degrade by week 6 and they’ll bill you to “fix it.”
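A regression harness can be very small. The sketch below assumes a classification-style agent and a 5-point accuracy tolerance; `toy_classifier` is a stand-in so the example runs without an API key, and the four examples are far fewer than the 15–50 a real calibration set needs.

```python
from typing import Callable, Iterable, Tuple


def accuracy(classify: Callable[[str], str],
             calibration_set: Iterable[Tuple[str, str]]) -> float:
    """Fraction of hand-labeled examples the agent labels correctly."""
    examples = list(calibration_set)
    if not examples:
        raise ValueError("calibration set is empty")
    hits = sum(1 for text, label in examples if classify(text) == label)
    return hits / len(examples)


def has_regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when accuracy dropped more than `tolerance` below the baseline."""
    return current < baseline - tolerance


# Stand-in classifier so the sketch runs offline; a real one calls the agent.
def toy_classifier(text: str) -> str:
    return "hot" if "pricing" in text.lower() else "cold"


CALIBRATION_SET = [
    ("Can you send pricing for 20 seats?", "hot"),
    ("Unsubscribe me please", "cold"),
    ("What is your pricing model?", "hot"),
    ("Just browsing, thanks", "cold"),
]

score = accuracy(toy_classifier, CALIBRATION_SET)
print(score, has_regressed(score, baseline=0.95))  # 1.0 False
```

Run it on a schedule and after every prompt change, and fail the deploy when `has_regressed` returns true.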
2. Prompt versioning + rollback (~$0 in tools, ~5–8% of hours)
Every prompt is a config file under version control with semver tags. Otherwise you change a sentence on Tuesday, performance drops on Wednesday, and you have no idea why. Most teams use a folder of .md files and a CI check; some use Langfuse or PromptLayer. The discipline costs maybe 3 hours of build time. Skipping it costs you 20 hours of “why is the agent acting weird” later.
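One way to make the CI check concrete, as a sketch only: it assumes a convention (ours for this example, not a standard) that the first line of every prompt file reads `version: X.Y.Z`, and it flags files that don’t follow it.

```python
import re
from pathlib import Path

# Assumed convention: the first line of each prompt file is "version: X.Y.Z".
VERSION_LINE = re.compile(r"^version:\s*(\d+)\.(\d+)\.(\d+)\s*$")


def parse_version(prompt_text: str):
    """Return (major, minor, patch) from the first line, or None if missing."""
    lines = prompt_text.splitlines()
    first_line = lines[0] if lines else ""
    match = VERSION_LINE.match(first_line)
    return tuple(int(part) for part in match.groups()) if match else None


def unversioned_prompts(prompt_dir: str):
    """List the .md prompt files in prompt_dir that lack a version line."""
    return sorted(
        str(path)
        for path in Path(prompt_dir).glob("*.md")
        if parse_version(path.read_text(encoding="utf-8")) is None
    )


print(parse_version("version: 1.4.2\nYou are a lead classifier..."))  # (1, 4, 2)
print(parse_version("You are a lead classifier..."))                  # None
```

In CI, a non-empty result from `unversioned_prompts("prompts")` (a hypothetical folder name) would fail the build.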
3. Rate-limit handling and provider fallbacks (5–10% of hours)
Anthropic and OpenAI both have rate limits that you will hit at production scale, usually on a Monday morning when traffic spikes. A real agent build has retry-with-backoff, a fallback model (Sonnet primary, GPT-5 mini fallback), and a circuit breaker for when both providers are degraded. None of the cheap quotes include this. You find out the system was missing it during your first outage.
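Those three pieces (retry with backoff, fallback model, circuit breaker) can be sketched in one small class. This is an illustration under assumptions, not a provider SDK: each provider is any callable that takes a prompt and returns text, and the class name, thresholds, and delays are invented for the example. Real code should catch only rate-limit and server errors rather than every exception.

```python
import random
import time


class CircuitOpen(Exception):
    """Raised while calls are paused because every provider kept failing."""


class FallbackClient:
    def __init__(self, providers, max_retries=3, base_delay=0.5,
                 failure_threshold=5, cooldown=60.0, sleep=time.sleep):
        self.providers = providers            # ordered list, primary first
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.sleep = sleep                    # injectable so tests need not wait
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, prompt):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpen("all providers degraded; try again later")
            self.opened_at = None             # cooldown over: allow a trial call
        for provider in self.providers:
            for attempt in range(self.max_retries):
                try:
                    result = provider(prompt)
                    self.consecutive_failures = 0
                    return result
                except Exception:
                    if attempt < self.max_retries - 1:
                        # Exponential backoff with jitter before retrying.
                        delay = self.base_delay * (2 ** attempt)
                        self.sleep(delay + random.uniform(0, delay))
        # Every provider exhausted its retries.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
        raise RuntimeError("all providers failed")


def flaky_primary(prompt):
    raise TimeoutError("rate limited")


def healthy_fallback(prompt):
    return f"fallback reply to: {prompt}"


client = FallbackClient([flaky_primary, healthy_fallback], sleep=lambda s: None)
print(client.call("classify this lead"))  # fallback reply to: classify this lead
```

The breaker matters most when both providers are down: instead of every queued job burning its full retry budget, calls fail fast until the cooldown passes.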
4. Observability, and the cost of not having it (5–10% of build, $20–$200/mo)
Langfuse, Helicone, or DataDog with LLM tracing. Without it, debugging a production failure means scrolling through CloudWatch logs trying to reconstruct what the agent decided. With it, you click on a failed run and see the full call tree. This is non-optional for Tier 2+ and the shops that skip it are signing you up for a 6-month tail of “why did the agent do that” tickets.
5. Memory bloat (silent cost, can 5–10x your API bill)
Agents with persistent memory will, by default, drag every prior interaction into every new call. A six-month-old conversation can balloon a $0.05 call into a $0.40 call. The fix is context windowing, summarization passes, and TTL on memory. None of this is hard. All of it is invisible until your API bill arrives.
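Two of those three fixes, the TTL and the context window, fit in a few lines. The sketch below is a simplified illustration: `MemoryItem`, the 30-day TTL, the 2,000-token budget, and the four-characters-per-token estimate are all assumptions, and a summarization pass is left out.

```python
import time
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    created_at: float  # Unix timestamp in seconds


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


def select_context(items, max_tokens=2000, ttl_seconds=30 * 24 * 3600, now=None):
    """Keep the newest unexpired items that fit within the token budget."""
    now = time.time() if now is None else now
    fresh = [m for m in items if now - m.created_at <= ttl_seconds]
    fresh.sort(key=lambda m: m.created_at, reverse=True)  # newest first
    kept, used = [], 0
    for item in fresh:
        cost = estimate_tokens(item.text)
        if used + cost > max_tokens:
            break
        kept.append(item)
        used += cost
    return list(reversed(kept))  # chronological order for the prompt


DAY = 24 * 3600
now = 1_000 * DAY
history = [
    MemoryItem("six-month-old thread " * 50, now - 180 * DAY),
    MemoryItem("last week's decision", now - 7 * DAY),
    MemoryItem("yesterday's follow-up", now - 1 * DAY),
]
print([m.text for m in select_context(history, now=now)])
# ["last week's decision", "yesterday's follow-up"]
```

The six-month-old thread never reaches the model, so it never reaches the bill.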
6. Retries on transient failures (~3–5% of hours)
JSON parse failures, schema validation failures, 503s from upstream APIs. Without retry + structured-output enforcement, your agent fails ~2–4% of the time silently. With it, that drops to <0.3%. That is the difference between “this agent works” and “this agent works enough to leave running unattended.”
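Structured-output enforcement with retry can look something like the sketch below. The two-field schema is hypothetical, `generate` is any callable that takes a prompt and returns the model’s raw text, and feeding the rejection reason back into the next attempt is one design choice among several.

```python
import json

# Hypothetical schema: the fields this example agent must return.
REQUIRED_FIELDS = {"intent": str, "confidence": float}


class InvalidOutput(Exception):
    pass


def parse_and_validate(raw: str) -> dict:
    """Parse model output as JSON and check the fields we depend on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidOutput("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise InvalidOutput(f"missing or mistyped field: {field}")
    return data


def call_with_validation(generate, prompt: str, max_attempts: int = 3) -> dict:
    """Call `generate` until its output validates, feeding errors back."""
    last_error = None
    for _ in range(max_attempts):
        hint = f"\n\nPrevious output was rejected: {last_error}" if last_error else ""
        try:
            return parse_and_validate(generate(prompt + hint))
        except InvalidOutput as exc:
            last_error = exc
    raise InvalidOutput(f"gave up after {max_attempts} attempts: {last_error}")


# Simulated model: chatty non-JSON first, valid JSON on the retry.
replies = iter(['Sure! Here is the JSON:', '{"intent": "hot", "confidence": 0.92}'])
result = call_with_validation(lambda prompt: next(replies), "Classify this email")
print(result)  # {'intent': 'hot', 'confidence': 0.92}
```

When every attempt fails, the function raises instead of returning a guess, which turns a silent failure into one your monitoring can see.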
7. Founder onboarding / handoff (~5–8% of hours)
The agent ships, the founder doesn’t know how to operate it, the founder pings the agency every time something feels off. A real handoff is: a Loom of the system end-to-end, a one-page operator runbook, and 30 days of Slack support. The cheap quotes leave this out, which is why they end with a six-month retainer “to keep things running.”
Add it up: the components above are 25–40% of total build cost. They’re the difference between a quote that ships a working system and a quote that ships a tech demo.
Why the cheap quotes are mathematically impossible
Let’s do the math on the $4,800 / 2-week quote for the Gmail-classification agent.
To ship that agent at production quality you need, at minimum:
- ~6 hours of prompt design + eval-set creation
- ~10 hours of Gmail + Calendly + reply-drafting integration
- ~5 hours of structured-output validation + retries
- ~4 hours of deployment, monitoring, error handling
- ~3 hours of handoff / docs
That’s 28 hours. At $4,800, the shop is billing $171/hour, which would be reasonable for a US shop, but the shops quoting this rate are offshore at $25–$50/hour. Which means they’re not actually spending 28 hours on it. Which means they’re skipping the eval set, the retries, the monitoring, and the handoff.
What you get for $4,800: a Cursor-generated script that hits the Gmail API, throws an OpenAI call at the body, and writes back a reply. It works on day one. It will be silently wrong on day forty. By the time you notice, the shop is unresponsive.
The economics of agent dev shops are simple: a real Tier 1 agent is ~25 founder hours of senior work. At a sustainable US blended rate that’s ~$4,500–$6,000. Anyone quoting meaningfully less is either operating at a loss or shipping something that isn’t production. Anyone quoting meaningfully more (in the $15K+ range for Tier 1) is padding for retainer capture.
Year-1 total cost of ownership (the number that actually matters)
The build cost is what you ask about. The TCO is what you actually pay. Here’s the realistic Year-1 math by tier, assuming you ship and run for 12 months:
| Tier | Build | API + infra (12 mo) | Maintenance buffer (~10% of build) | Year-1 TCO |
|---|---|---|---|---|
| Simple (Tier 1) | $4,500 | $720–$1,800 | $450 | $5,670–$6,750 |
| Mid (Tier 2) | $12,000 | $1,680–$6,000 | $1,200 | $14,880–$19,200 |
| Complex (Tier 3) | $25,000 | $5,400–$18,000 | $2,500 | $32,900–$45,500 |
Compare that to:
- One enterprise-shop quote we beat last month: $120,000 build + $8,000/mo retainer = $216,000 Year-1 for what was, on inspection, a Tier 2 system.
- The Indian-shop quote of $4,800 with no monitoring, eval, or retries: realistic Year-1 with rebuild = $4,800 + $9,000 to fix it in month 5 = $13,800 for a Tier 1.
The honest middle is the right answer. Most agent work fits cleanly in $4.5K–$25K with monthly run costs in the low hundreds. Anything dramatically higher is rent extraction. Anything dramatically lower is a tech demo with your logo on it.
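The Year-1 table is build cost plus twelve months of run cost plus a 10% maintenance buffer. A minimal sketch of that formula follows; the monthly bands are the ones implied by the table’s 12-month column (for example, $720–$1,800 over 12 months is $60–$150 a month), and the function name is ours for this example.

```python
def year_one_tco(build, monthly_low, monthly_high, maintenance_rate=0.10, months=12):
    """Return (low, high) Year-1 total cost of ownership in USD."""
    maintenance = build * maintenance_rate
    return (build + monthly_low * months + maintenance,
            build + monthly_high * months + maintenance)


# Build price and monthly run band implied by each row of the table above.
TIERS = {
    "Simple (Tier 1)": (4_500, 60, 150),
    "Mid (Tier 2)": (12_000, 140, 500),
    "Complex (Tier 3)": (25_000, 450, 1_500),
}

for name, (build, low, high) in TIERS.items():
    tco_low, tco_high = year_one_tco(build, low, high)
    print(f"{name}: ${tco_low:,.0f}-${tco_high:,.0f}")
# Simple (Tier 1): $5,670-$6,750
# Mid (Tier 2): $14,880-$19,200
# Complex (Tier 3): $32,900-$45,500
```

Plug in any quote you’ve received, retainer included as monthly cost, to see its Year-1 number on the same footing.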
Run the numbers on your own agent
We built an interactive calculator that takes your inputs (number of agents, daily query volume, model tier, whether you need eval / observability / human-in-the-loop) and outputs a real-numbers estimate based on the same coefficients we use to scope our own Agent Sprints.
→ Open the AI Agent Cost Calculator
It uses the same Anthropic and OpenAI per-token pricing we use internally (Claude Sonnet 4.6 at $3/M input, $15/M output, plus Haiku and Opus tiers; GPT-5 pricing parity), the same build-hour coefficients, and the same hidden-cost overhead. Output is a Year-1 TCO with a band, not a single number, because honest estimates have bands.
What we charge, and why we tell you the math first
We sell the Agent Sprint at three fixed prices: $4,500 (Sprint), $12,000 (Stack), $25,000+ (System). Those numbers map exactly to the three tiers above. We tell you the cost math first because:
- If you’re a fit for an Agent Sprint, you’ll know it from the math. We don’t have to sell you.
- If you’re not (if you’ve got the in-house chops to build it yourself), the calculator will tell you that too. The math doesn’t change based on who’s holding the keyboard.
- Most of our competitors won’t show you the math. That’s the actual moat.
We have two production agent systems running our own agency right now. The numbers above are real. If your numbers don’t roughly land in the same place, you’re either getting a better deal than us or you’re getting fleeced. Either way, you’ll know.
If you’re ready to scope an agent for your business: book an Agent Audit. 45 minutes, free, you keep the doc whether or not you hire us. We’ll map the 3 workflows in your business that are highest-ROI for agentification, and you’ll leave with the actual cost band , not a fog.