Why I'm building Flakestorm
Over the past year, I've watched AI agents go from experimental demos to production systems handling real user requests. I've seen teams deploy agents that work perfectly in demos, only to fail catastrophically when real users interact with them. The gap between "it works on my machine" and "it works for all users" has never been wider, and the stakes have never been higher.
This isn't just a theoretical problem. I've watched a startup lose its biggest client because its customer service agent started giving incorrect pricing information when users phrased questions slightly differently. I've seen a healthcare company's agent leak sensitive data because it didn't properly handle prompt injection attacks. I've witnessed teams spend months perfecting prompts, only to discover their agent breaks when users make typos, use aggressive language, or simply ask questions in unexpected ways.
The pattern is always the same: developers build agents that work beautifully in demos, pass all their unit tests, and then fail spectacularly in production. The problem isn't that developers are careless—it's that we're using the wrong tools for the job.
The "Happy Path" Fallacy
Current AI development tools are built around a fundamental assumption: that we can predict how users will interact with our agents. We write test cases for the "happy path" where users ask questions exactly as we expect. We optimize for the demo scenario where everything goes perfectly. We ship when our agent gives the right answer to our carefully crafted test prompts.
But LLMs are fundamentally non-deterministic. An agent that works perfectly on Monday with temperature=0.7 might fail on Tuesday with the same input. And users don't follow happy paths: they make typos, they get aggressive, they lie, they attempt prompt injections, and they interact in ways we never anticipated.
I've seen agents that handle "What's the weather?" perfectly but break when users ask "whats the weather" (missing apostrophe) or "WHATS THE WEATHER???" (all caps, multiple question marks). I've watched agents that work great in English but fail completely when users mix languages or use slang. I've witnessed agents that are polite and helpful until users get frustrated and use aggressive language—then they break down entirely.
The reality is that production is chaos. Users don't read documentation. They don't follow best practices. They don't care about your carefully crafted prompts. They just want their problem solved, and if your agent can't handle their particular way of asking, they'll move on to a competitor who can.
What's Missing in the Ecosystem
When I started looking for tools to solve this problem, I found a landscape full of observability tools and evaluation frameworks, but nothing that actively tests agents for robustness before deployment.
Observability tools like LangSmith are excellent for understanding what happened after your agent failed in production. They give you beautiful traces, detailed logs, and comprehensive analytics. But by the time you see those traces, the damage is already done. Your users have experienced the failure. Your reputation has taken a hit. Your business has lost trust.
Evaluation libraries like RAGAS focus on academic metrics—context precision, answer relevance, faithfulness. These are valuable for research, but they don't tell you if your agent will break when a user makes a typo or attempts a prompt injection. They test for quality, not for robustness.
Prompt testing tools like PromptFoo let you test prompts across different LLMs and compare outputs. But they require you to manually create test cases. They don't automatically generate adversarial inputs. They don't actively try to break your agent.
What's missing is a tool that takes a chaos engineering approach to AI agents: a tool that actively attacks your agent to find vulnerabilities before they reach production. A tool that gives you a quantitative measure of reliability, not just "it worked in the demo."
The Inspiration: Chaos Engineering for AI
The idea for Flakestorm came from chaos engineering, the practice of intentionally injecting failures into distributed systems to test their resilience. Companies like Netflix use chaos engineering to ensure their systems can handle unexpected failures gracefully. They run tools like Chaos Monkey that randomly kill servers, inject network latency, and simulate outages to find weaknesses before they impact users.
Why don't we do the same for AI agents? Why don't we actively try to break our agents before users do? Why don't we inject adversarial inputs, semantic variations, noise, and prompt injections to find vulnerabilities before they reach production?
That's exactly what Flakestorm does. Instead of running one test case, Flakestorm takes a single "Golden Prompt"—a prompt that represents your agent's core functionality—and generates 50+ adversarial mutations:
- Semantic perturbations: Uses local LLMs to rephrase inputs while preserving user intent. "What's the weather?" becomes "Can you tell me about the current weather conditions?"
- Noise injection: Adds typos, extra spaces, random characters, and formatting issues that real users introduce.
- Tone shifts: Converts polite requests into aggressive demands, friendly questions into hostile interrogations.
- Prompt injections: Attempts injection attacks drawn from the OWASP Top 10 for LLM Applications to see if your agent can be manipulated.
- Encoding obfuscation: Tests how your agent handles Unicode, emojis, and special characters.
- Context manipulation: Tries to confuse your agent with irrelevant information or contradictory instructions.
Each mutation is run against your agent, and the responses are checked against your defined invariants—latency limits, JSON validity, semantic similarity, PII detection, safety checks. The result is a Robustness Score: a mathematical measure (0.0-1.0) that quantifies how well your agent handles the unexpected.
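To make that loop concrete, here is a minimal Python sketch of the mutate, run, and check cycle described above. It is illustrative only and not Flakestorm's actual API: the toy noise mutator, the two example invariants, and the `call_agent` callable are stand-ins for the real mutation engine and your own agent.

```python
import json
import random
import time

def noisy_variants(golden_prompt: str, n: int = 50) -> list[str]:
    # Toy noise-injection mutator: drop or swap-case one character per variant.
    # A real mutation engine would also do semantic rewrites, tone shifts,
    # prompt injections, encoding obfuscation, and context manipulation.
    variants = []
    for _ in range(n):
        chars = list(golden_prompt)
        i = random.randrange(len(chars))
        chars[i] = "" if random.random() < 0.5 else chars[i].swapcase()
        variants.append("".join(chars))
    return variants

def passes_invariants(response: str, latency_s: float) -> bool:
    # Example invariants: answer within 2 seconds and return valid JSON.
    if latency_s > 2.0:
        return False
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def robustness_score(golden_prompt: str, call_agent) -> float:
    # Fraction of mutated prompts whose responses satisfy every invariant.
    mutations = noisy_variants(golden_prompt)
    passed = 0
    for prompt in mutations:
        start = time.monotonic()
        response = call_agent(prompt)  # your agent: HTTP endpoint, Python function, chain, ...
        if passes_invariants(response, time.monotonic() - start):
            passed += 1
    return passed / len(mutations)
```

The score is simply the fraction of mutations that satisfied every invariant, which is why a score of 0.87 means 87% of the adversarial inputs were handled correctly.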
Why This Matters: Real-World Impact
I've talked to dozens of teams building AI agents, and the stories are consistent. A fintech company's agent started giving incorrect loan information when users phrased questions with typos. A healthcare company's agent leaked patient data because it didn't properly handle prompt injections. An e-commerce company's agent broke down when users used aggressive language, leading to lost sales and customer complaints.
These aren't edge cases—they're the reality of production. Users are unpredictable. They don't follow your carefully crafted user flows. They don't read your documentation. They just interact with your agent in whatever way feels natural to them, and if your agent can't handle it, they'll go somewhere else.
But here's the thing: these failures are preventable. If teams had tested their agents with adversarial inputs before deployment, they would have found these vulnerabilities. They would have discovered that their agent breaks with typos, or that it's vulnerable to prompt injection, or that it can't handle aggressive language.
That's what Flakestorm enables. It's not about finding bugs—it's about finding vulnerabilities before they impact users. It's about proving that your agent is robust, not just functional. It's about shipping with confidence, knowing that your agent can handle the chaos of production.
The Technical Challenge: Building for Everyone
Building Flakestorm has been a technical challenge from day one. How do you generate meaningful adversarial mutations? How do you test agents that might be HTTP endpoints, Python functions, or LangChain chains? How do you make it fast enough to run in CI/CD pipelines? How do you make it accessible to developers who don't want to pay for API costs?
The answer is a local-first architecture. Flakestorm runs entirely on your machine using Ollama, so you can generate thousands of mutations without paying for API calls. It uses local LLMs for semantic perturbation, so you're not sending your prompts to external services. It's designed to work offline, so you can test your agents even when you don't have internet access.
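As a rough illustration of what local semantic perturbation can look like, the sketch below asks a locally running Ollama model to paraphrase a prompt over Ollama's standard REST endpoint on port 11434. The model name, the rewrite instruction, and the surrounding code are assumptions for the example, not Flakestorm's internal implementation.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def semantic_variant(prompt: str, model: str = "llama3") -> str:
    """Ask a local model to rephrase a prompt without changing its intent."""
    instruction = (
        "Rewrite the following user message with different wording but the same intent. "
        "Return only the rewritten message.\n\n" + prompt
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": instruction, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

# Generate a few paraphrases of a golden prompt, entirely offline.
if __name__ == "__main__":
    for _ in range(3):
        print(semantic_variant("What's the weather in Berlin today?"))
```

Everything stays on your machine: the only network call is to localhost.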
But local execution has trade-offs. Running 50+ mutations sequentially can take 5-10 minutes. That's fine for development, but it's too slow for CI/CD pipelines. That's why we're building a cloud platform that runs mutations in parallel on cloud GPUs, giving you 10-20x faster performance. The same tests that take 10 minutes locally can run in 30 seconds in the cloud.
The key is choice. Developers can start with the open-source version, test locally for free, and upgrade to cloud when they need speed or advanced features. Teams can run tests in CI/CD without worrying about API costs. Enterprises can deploy on-premise for security and compliance.
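Whichever tier you run in, the agent under test can usually be reduced to the same prompt-in, text-out callable used in the earlier sketch, whether it is an HTTP endpoint, a plain Python function, or a LangChain chain. The adapter below is hypothetical: the URL, payload shape, and response field are placeholders for your own service.

```python
import requests

def http_agent(prompt: str) -> str:
    # Wrap an HTTP-based agent behind a prompt-in, text-out interface.
    # The URL, request payload, and "reply" field are placeholders.
    resp = requests.post(
        "http://localhost:8000/chat",
        json={"message": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["reply"]

# A plain Python function already fits this interface, and a LangChain
# runnable can be wrapped the same way, for example:
#   def chain_agent(prompt: str) -> str:
#       return str(chain.invoke({"input": prompt}))
```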
The Vision: Making Robustness the Standard
My vision for Flakestorm is simple: I want robustness testing to be as standard for AI agents as unit testing is for traditional software. I want developers to run Flakestorm before every deployment, just like they run their test suite. I want CI/CD pipelines to block PRs when robustness scores drop, just like they block PRs when tests fail.
I want teams to have quantitative evidence that their agents are production-ready. Not "it worked in the demo." Not "it passed our test cases." But "it has a robustness score of 0.87, meaning it handled 87% of adversarial inputs correctly."
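Gating a pipeline on a score like that is ultimately just a threshold comparison that any CI system can run. The script below is a hypothetical sketch: the report path and JSON field name are assumptions about how a run's results might be exported.

```python
import json
import sys

THRESHOLD = 0.85  # block the merge if robustness drops below this

def main(report_path: str = "robustness_report.json") -> None:
    # Hypothetical report format: {"robustness_score": 0.87, ...}
    with open(report_path) as f:
        score = json.load(f)["robustness_score"]
    print(f"Robustness score: {score:.2f} (threshold {THRESHOLD})")
    if score < THRESHOLD:
        sys.exit(1)  # a non-zero exit fails the CI job and blocks the PR

if __name__ == "__main__":
    main()
```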
I want Flakestorm to be the tool that prevents the failures I've seen teams experience. I want it to be the tool that gives developers confidence to ship. I want it to be the tool that makes AI agents more reliable, more trustworthy, and more production-ready.
But this isn't just about building a tool—it's about changing how we think about AI agent development. It's about recognizing that robustness is just as important as functionality. It's about understanding that "it works" isn't enough—we need to know that it works reliably, even when things go wrong.
The Journey Ahead
We're still early in this journey. The open-source version of Flakestorm is available today, and it's already helping developers test their agents locally. We're building the cloud platform to make it faster and more accessible. We're adding more mutation types, more invariant checks, and more integrations.
But the real work is ahead of us. We need to prove that robustness testing makes a difference. We need to show teams that investing in reliability pays off. We need to build a community of developers who care about making AI agents more robust.
If you're building AI agents and you've experienced the frustration of agents that work in demos but fail in production, I'd love to have you join us. Try the open-source version. Join our Discord community. Share your stories. Help us make Flakestorm better.
Together, we can build a future where AI agents are robust, reliable, and production-ready. We can make "it works" mean "it works reliably, even when things go wrong." We can change how the industry thinks about AI agent development.
The future of AI isn't just about making agents that work—it's about making agents that work reliably, even when things go wrong. That's what Flakestorm is for, and that's why I'm building it.