Most agencies selling AI agents have never shipped one. Here’s the architecture of one we run in production, and the parts that broke before they worked.
The problem
A founder I know had a spreadsheet. It tracked accelerators, grants, and fellowships her startup might be a fit for. Forty-three rows. Each one needed: an eligibility check, a strategic-fit read, a guess at how much work the application would take, and a deadline. She was the only person qualified to do that work, and she was also the only person building the product.
So the spreadsheet did what spreadsheets do. It rotted. New programs opened, old ones closed, deadlines passed, and the rows that mattered most got buried under the rows that didn’t.
I have the same problem at Dangerous Media. So does every founder I talk to. The shape of the problem isn’t “we don’t have a list.” It’s: the work of triaging the list is judgment work, and judgment work doesn’t scale with a script.
That’s the gap Funding OS was built to fill.
Why a multi-agent system, and not just a script or one big prompt
Before writing a line of code I had to answer one question honestly: could this be a Zapier flow plus a single LLM call?
If yes, I should build that. Multi-agent systems are heavier, more expensive, and harder to debug than a well-placed function call. The default move is the cheap one.
I went through the candidates:
Option 1: A script that scrapes program pages and dumps them into a sheet. Solves the list problem. Doesn’t solve the triage problem. The reason the spreadsheet rotted wasn’t a missing list; it was missing judgment. Rejected.
Option 2: One Claude call per program, with a fat prompt that does everything. This was the seductive one. One prompt, structured output, done. I built it. It failed for a specific reason: when you ask a single LLM call to discover, validate, score, and write in one shot, it cuts corners on the parts that need rigor (the eligibility check, the deadline parse) to leave room for the parts it’s better at (the narrative). The score was confident and the underlying facts were sometimes wrong. That’s the worst possible failure mode for a tool a founder is supposed to act on.
Option 3: A pipeline of specialized agents, each with one job, each handing structured output to the next. Slower to build. Easier to trust. Each stage has a job small enough to verify. The expensive parts (synthesis, drafting) only run on inputs that have already passed cheaper checks.
I picked Option 3.
The rule that fell out of this: deterministic before generative. Scoring and validation are scripts. Synthesis and drafting are agents. If a step can be tested with assert, it should not be a prompt.
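To make the "testable with assert" test concrete, here is a minimal sketch of a deterministic gate. The `Program` fields, the `check_eligibility` name, and the example values are hypothetical illustrations, not Funding OS's actual code.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Program:
    # Hypothetical fields, chosen for illustration only.
    name: str
    accepted_stages: FrozenSet[str]
    deadline: date


def check_eligibility(program: Program, startup_stage: str, today: date) -> Optional[str]:
    """Return None if the program passes, else a one-line reason for rejection."""
    if today > program.deadline:
        return f"deadline passed on {program.deadline.isoformat()}"
    if startup_stage not in program.accepted_stages:
        return f"stage '{startup_stage}' not accepted"
    return None


# Because the gate is a pure function, plain asserts are enough to test it.
example = Program("Example Accelerator", frozenset({"pre-seed", "seed"}), date(2025, 6, 30))
assert check_eligibility(example, "seed", date(2025, 6, 1)) is None
assert check_eligibility(example, "seed", date(2025, 7, 1)) == "deadline passed on 2025-06-30"
assert check_eligibility(example, "series-a", date(2025, 6, 1)) == "stage 'series-a' not accepted"
```

Nothing here needs a model call, which is the point: the same inputs always give the same verdict, and a wrong verdict shows up as a failing assert rather than a plausible-sounding paragraph.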
The architecture
The current shape, at the architecture level:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Discovery   │───▶│ Eligibility  │───▶│  Strategic   │
│   (agent)    │    │   (script)   │    │ Fit (agent)  │
└──────────────┘    └──────────────┘    └──────────────┘
                           │                   │
                           ▼                   ▼
                    pass/fail gate       35-pt scoring
                           │                   │
                           └─────────┬─────────┘
                                     ▼
                          ┌──────────────────────┐
                          │ Submission readiness │
                          │       (agent)        │
                          └──────────────────────┘
                                     │
                                     ▼
                          ┌──────────────────────┐
                          │   ROI synthesis +    │
                          │    ranked output     │
                          └──────────────────────┘
In English: an agent goes out and finds candidate programs (incubators, accelerators, grants, fellowships) the app could be a fit for. A script, not an agent, runs hard eligibility checks: stage, geography, sector, deadline still open. Programs that fail are dropped, with a one-line “why.” Survivors get handed to a strategic-fit agent that scores them on a fixed rubric (out of 35). A submission-readiness agent estimates the application effort against the materials the founder already has on file (out of 35). A final synthesis step combines everything into a 100-point score and a ranked list with reasoning attached.
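The control flow in that paragraph can be sketched in a few lines. This is an illustration only: `triage`, the stub gate, and the stub scorer are hypothetical stand-ins, and the real stages are far richer than a name lookup.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def triage(
    candidates: Iterable[str],
    gate: Callable[[str], Optional[str]],
    score: Callable[[str], int],
) -> Tuple[List[Tuple[int, str]], List[Tuple[str, str]]]:
    """Run the cheap gate on everything; run the expensive scorer on survivors only."""
    ranked: List[Tuple[int, str]] = []
    dropped: List[Tuple[str, str]] = []
    for name in candidates:
        reason = gate(name)
        if reason is not None:
            dropped.append((name, reason))  # keep the one-line "why"
            continue
        ranked.append((score(name), name))
    ranked.sort(reverse=True)  # highest score first
    return ranked, dropped


# Stand-ins for the real stages (both hypothetical).
CLOSED = {"Old Grant"}
FAKE_SCORES = {"Accelerator A": 71, "Fellowship B": 84}
scored_calls: List[str] = []


def stub_gate(name: str) -> Optional[str]:
    return "deadline passed" if name in CLOSED else None


def stub_score(name: str) -> int:
    scored_calls.append(name)  # record which programs reached the expensive stage
    return FAKE_SCORES[name]


ranked, dropped = triage(["Accelerator A", "Old Grant", "Fellowship B"], stub_gate, stub_score)
```

In this toy run the scorer is never called for "Old Grant", which is the property that matters: rejected programs cost nothing beyond the gate.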
The 100-point scale is: hard eligibility (pass/fail) + strategic fit (35) + submission readiness (35) + ROI to pursue (30). Every score has the agent’s reasoning attached, and every claim carries one of five labels (verified_internal, verified_external, founder_supplied_unverified, agent_inference, or missing_support), so when the founder reads the output, she knows which numbers to trust and which to push back on.
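A rough sketch of how that score and those labels could be represented follows. The label names and point caps come from the text; the `ScoreCard` and `Claim` classes, the choice to return `None` for ineligible programs, and the rule that only `verified_*` claims count as trusted are my assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class ClaimLabel(str, Enum):
    # Label names are taken from the text above.
    VERIFIED_INTERNAL = "verified_internal"
    VERIFIED_EXTERNAL = "verified_external"
    FOUNDER_SUPPLIED_UNVERIFIED = "founder_supplied_unverified"
    AGENT_INFERENCE = "agent_inference"
    MISSING_SUPPORT = "missing_support"


@dataclass(frozen=True)
class Claim:
    text: str
    label: ClaimLabel


@dataclass(frozen=True)
class ScoreCard:
    eligible: bool             # hard eligibility: pass/fail, contributes no points
    strategic_fit: int         # 0-35
    submission_readiness: int  # 0-35
    roi: int                   # 0-30
    claims: Tuple[Claim, ...] = ()

    def total(self) -> Optional[int]:
        """Return the 100-point score, or None if the program failed the gate."""
        if not self.eligible:
            return None
        parts = ((self.strategic_fit, 35), (self.submission_readiness, 35), (self.roi, 30))
        for value, cap in parts:
            if not 0 <= value <= cap:
                raise ValueError(f"component {value} outside 0-{cap}")
        return sum(value for value, _ in parts)

    def needs_review(self) -> List[Claim]:
        """Claims that are not verified (assumption: only verified_* is trusted)."""
        trusted = {ClaimLabel.VERIFIED_INTERNAL, ClaimLabel.VERIFIED_EXTERNAL}
        return [claim for claim in self.claims if claim.label not in trusted]


card = ScoreCard(
    eligible=True, strategic_fit=28, submission_readiness=21, roi=24,
    claims=(
        Claim("Deadline is still open", ClaimLabel.VERIFIED_EXTERNAL),
        Claim("Program favors B2B SaaS", ClaimLabel.AGENT_INFERENCE),
    ),
)
print(card.total())  # 28 + 21 + 24 = 73
```

Keeping the label on the claim itself, rather than in a separate report, means a ranked list can show the unverified claims next to the score they influenced.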
That last detail, the claim labels, is the one I’d ship first if I were rebuilding this from scratch. It’s the difference between an AI tool that founders use once and one they actually act on.
How the agents talk to each other
Each agent’s output is a JSON document that matches a schema. The next agent in the pipeline gets that document as input. No agent reads another agent’s free-form output. No agent infers what the previous agent meant.
Three rules that keep this from collapsing:
Structured I/O everywhere. Schemas are checked at the boundaries. A malformed handoff fails loudly instead of silently corrupting the next stage. (We use JSON Schema for this: pip install jsonschema, then validate every payload.)
The cheap gate runs first. Eligibility is a script. It rejects ~60% of candidates before any expensive agent sees them. That alone cut the per-run cost by more than half and made the system fast enough to run nightly.
Persistent memory per project. Each agent in the pipeline carries the founder’s app context across runs: what was decided last week, what was rejected and why, which materials exist. This is the thing most “AI agent” demos skip. Without it, the agents re-litigate the same decisions every run and the founder loses trust by week two.
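For the first rule, a boundary check with the `jsonschema` package might look like the sketch below. The schema, its field names, and the `hand_off` helper are hypothetical; only `jsonschema.validate` and `ValidationError` are the library's real API.

```python
import json

from jsonschema import ValidationError, validate

# Hypothetical schema for the eligibility -> strategic-fit handoff.
ELIGIBILITY_OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "program_id": {"type": "string"},
        "eligible": {"type": "boolean"},
        "reason": {"type": "string"},
    },
    "required": ["program_id", "eligible", "reason"],
    "additionalProperties": False,
}


def hand_off(raw_payload: str, schema: dict) -> dict:
    """Parse and validate one stage's output; raise instead of passing bad data on."""
    payload = json.loads(raw_payload)          # raises json.JSONDecodeError on malformed JSON
    validate(instance=payload, schema=schema)  # raises ValidationError on schema mismatch
    return payload


good = hand_off('{"program_id": "p1", "eligible": true, "reason": "all checks passed"}',
                ELIGIBILITY_OUTPUT_SCHEMA)

try:
    # "yes" is a string, not a boolean, so this handoff fails loudly.
    hand_off('{"program_id": "p2", "eligible": "yes", "reason": ""}', ELIGIBILITY_OUTPUT_SCHEMA)
    rejected = False
except ValidationError:
    rejected = True
```

Setting `additionalProperties` to false is a design choice worth considering here: it stops an upstream agent from smuggling free-form text into the next stage under an unexpected key.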
The hard parts
Here’s where I got it wrong, in chronological order.
I tried to make discovery an agent first. It was slow, expensive, and unreliable. The same agent would find the same program twice and miss two others. The fix was boring: a deterministic discovery sweep against a known set of sources, then an agent layer on top that interpreted what was found. The agent’s job got smaller. It got more reliable.
The token bill that nearly killed it on Day 3. The first end-to-end run on Certcy cost $47. The second run cost $112. The third, with no architectural change, cost $238. I assumed it was the agent count growing. It wasn’t. Every run, the synthesis agent was pulling more memory into context “for completeness,” the strategic-fit agent it called was inheriting that bloated context, and the focus-group agent that one called was inheriting it again. 1M-token context windows let you make this mistake silently. Three layers deep, a single Funding OS run was reading the equivalent of 200 pages of irrelevant prior-run history every time it answered a question. The fix had two parts: a hard per-run budget cap (the pipeline halts at $50 by default and requires an explicit override above that), and a structured summary pass between agents, so every handoff is a summary object, never a context dump. Run cost on Certcy is now $1.20 average, $4.80 worst-case. The principle: just because the context window is big doesn’t mean you should fill it. Big context is for one agent looking at one deep problem, not for memory bleed between agents that should be talking through narrow interfaces.
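Both halves of that fix are small enough to sketch. The $50 default is from the text; everything else (`RunBudget`, `HandoffSummary`, `summarize`, the fact and character limits, and the idea that the caller reports each call's dollar cost) is a hypothetical illustration rather than the production code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when a run would cross its cap without an explicit override."""


@dataclass
class RunBudget:
    cap_usd: float = 50.0         # default cap quoted in the text
    allow_override: bool = False  # must be set explicitly to go above the cap
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one call's cost, halting before the cap is crossed."""
        projected = self.spent_usd + cost_usd
        if projected > self.cap_usd and not self.allow_override:
            raise BudgetExceeded(f"run would reach ${projected:.2f}, cap is ${self.cap_usd:.2f}")
        self.spent_usd = projected


@dataclass(frozen=True)
class HandoffSummary:
    program_id: str
    decision: str
    key_facts: Tuple[str, ...]


def summarize(program_id: str, decision: str, facts: Sequence[str],
              max_facts: int = 5, max_chars: int = 200) -> HandoffSummary:
    """Build a bounded summary so downstream agents never inherit full history."""
    trimmed = tuple(fact[:max_chars] for fact in facts[:max_facts])
    return HandoffSummary(program_id, decision, trimmed)
```

The useful property is that the summary has a fixed upper size no matter how much history exists upstream, so cost stops scaling with the number of prior runs.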
The scoring rubric drifted. First version of the strategic-fit agent gave 30+/35 to almost everything. The agent was being generous because the prompt was vague about what a bad fit looked like. The fix was a calibration set: 15 historical programs the founder had explicit opinions about. We score those every run and check the distribution. If the average creeps up, we re-anchor the rubric. This is the closest thing the system has to a unit test.
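A minimal version of that check could look like this. It assumes the founder's opinions have been turned into numeric reference scores, and the `calibration_drift` and `needs_reanchor` names, the 2-point tolerance, and the three-program example are all hypothetical.

```python
from statistics import mean
from typing import Dict


def calibration_drift(current: Dict[str, int], reference: Dict[str, int]) -> float:
    """Mean of (current - reference) over the calibration set; positive means generous."""
    if current.keys() != reference.keys():
        raise ValueError("calibration set and reference set must cover the same programs")
    return mean(current[name] - reference[name] for name in reference)


def needs_reanchor(drift: float, tolerance: float = 2.0) -> bool:
    """Flag the rubric for re-anchoring when average scores creep above tolerance."""
    return drift > tolerance


# Three programs instead of fifteen, to keep the example short.
reference_scores = {"prog_a": 30, "prog_b": 12, "prog_c": 20}
this_run_scores = {"prog_a": 33, "prog_b": 18, "prog_c": 26}

drift = calibration_drift(this_run_scores, reference_scores)  # (3 + 6 + 6) / 3 = 5
print(needs_reanchor(drift))  # True: the agent is scoring about 5 points too high
```

Comparing per-program differences rather than raw averages keeps the check meaningful if the calibration set itself changes over time.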
The focus group that told us what we wanted to hear. The focus-group agent runs 10 consumer-persona archetypes against application copy before submission. The first version looked great: fast, structured, and the founder loved the feedback. Every persona scored everything 4 out of 5, with thoughtful critique. The system “worked.” Then I noticed all 10 personas sounded suspiciously like me. Different demographics on paper, but they reasoned the same way, valued the same things, used the same vocabulary. Of course they did: I had written the persona archetypes. The focus group wasn’t pressure-testing the copy. It was reflecting the founder back to himself at 10x volume. That is the most dangerous kind of broken: confident output that confirms what you already thought. The fix was painful: throw out the founder-written personas and replace them with archetypes seeded from anonymized real user research (interview transcripts, support tickets, churn surveys). The next focus-group run on the same application materials scored a 2.4 average and surfaced a specific tone problem the founder couldn’t see because he was the one writing it. That run was the most useful thing the system has ever produced.
None of these are clever. They’re the kind of things that only show up when you actually ship the system and watch it fail in real conditions.
What it does in production today
Current state, as of this writing:
- 19 agents, 12 skills, 18 scripts, 10 templates, 7 schemas
- Discovery, eligibility, scoring, submission readiness, and synthesis all live
- Auto-apply with safety guardrails for low-stakes programs (a human still approves anything material)
- Market intelligence layer (PESTEL, SWOT, TAM/SAM/SOM, competitor scan) feeds the strategic-fit agent
- A focus-group agent with 10 consumer-persona archetypes pressure-tests positioning before applications go out
- A learning loop that compares predicted score to outcome and re-calibrates the rubric over time
- Portfolio mode: the same pipeline runs across multiple apps and ranks them against each other, not just in isolation
The first real test was on Certcy, our cert-prep SaaS. The system found 7 programs, ranked 4 as apply_now, and surfaced PearX S26 as the highest priority because the deadline was closest. None of that was hand-curated. The founder’s job dropped from “spend Saturday triaging accelerators” to “spend 20 minutes reviewing a ranked list with reasoning attached.”
What this means if you have a similar workflow problem
Funding OS is a triage system. The shape is general: a backlog of opportunities that need judgment to rank, where the judgment is repeatable but not scriptable, and where the cost of missing the right one is high.
If you have any of these in your business, the same architecture applies:
- Lead qualification. Inbound is noisy. Every lead deserves a judgment call. Most teams pay an SDR to make it badly.
- Vendor or hire evaluation. Same shape: long list, structured rubric, judgment per row, ranked output.
- Content or product idea triage. What to build next. What to write next. What to kill.
- Customer support escalation routing. Most tickets are routine. The 5% that aren’t are the ones that hurt you if they’re miscategorized.
- Compliance review on outbound copy. Repeatable rubric, structured decisions, the cost of a miss is real.
The architectural moves are the same every time:
- Cheap deterministic gate first. Reject the obviously-out before spending tokens on it.
- One agent, one job. Structured handoffs between them.
- Claim labels on every output. Founders need to know which facts to trust.
- A calibration set. So the agents don’t drift.
- An escalation protocol. So action falls out the end of the pipeline, not just data.
You don’t have to build all five at once. We didn’t. The version of Funding OS that gave Certcy its first real ranked list had two agents and three scripts. Everything else got added when we watched it fail.
If this sounds like your problem
We built Funding OS for ourselves because the alternative, a founder doing triage work on a Saturday, wasn’t going to scale.
If you have a workflow in your business that looks like this (repeatable judgment work, real cost to getting it wrong, currently being done by you on a Saturday), we can build the agent version of it in 14 days, fixed price.
See how the Agent Sprint works →
No pitch on the call. We’ll map three workflows in your business that are ready for this treatment, ranked by ROI. You keep the doc whether or not we work together.