Chaos Engineering for AI Agents in Production
Flakestorm stress-tests AI agents with adversarial inputs to expose brittle behavior before it reaches production. Catch prompt injections, edge cases, and reliability failures early, not after users complain.
Open source • No signup
Zero setup • Early access • Coming soon
Built for teams deploying AI agents to production environments.
"Flakestorm catches robustness failures that evals miss by applying input mutations (typos, formatting, tone shifts), revealing hidden vulnerabilities before production."
Flakestorm Cloud is being built for teams running AI agents in production: zero setup, real LLM API testing, CI/CD gating, historical reports, and team workflows.
Join the early access waitlist to help shape it.
Why This Matters
Static evals and happy-path tests don't catch production failures.
AI agents break under real-world conditions, including malformed input, hostile users, partial instructions, API latency, retries, and model variance.
Flakestorm applies chaos engineering principles to AI agents, actively attacking them before deployment so failures surface early, when they're cheap to fix.
What Flakestorm Does
Flakestorm generates structured adversarial variations of a reference prompt and runs them against your agent to expose brittle behavior and failure modes.
Each run produces:
- A robustness score
- Detailed, reproducible failure reports
- Concrete prompt variations that triggered the failure
Failures often surface as timeouts, schema violations, non-deterministic behavior, or instruction leakage, not obviously wrong answers.
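Each run's robustness score can be pictured, in its simplest form, as the fraction of mutation runs that satisfy every invariant. The sketch below is a hypothetical illustration of that idea, not Flakestorm's actual scoring formula:

```python
# Hypothetical sketch: a robustness score as the fraction of mutation
# runs that satisfy every invariant. Flakestorm's real scoring may
# weight failure types (timeouts, schema violations, leakage) differently.

def robustness_score(results: list[dict]) -> float:
    """results: one dict per mutation run, e.g. {"latency_ms": 850, "valid_json": True}."""
    def passes(r: dict) -> bool:
        # Mirror the example invariants: latency under 2000 ms, valid JSON output.
        return r["latency_ms"] <= 2000 and r["valid_json"]

    if not results:
        return 0.0
    return sum(passes(r) for r in results) / len(results)

runs = [
    {"latency_ms": 850, "valid_json": True},   # pass
    {"latency_ms": 2400, "valid_json": True},  # timeout -> fail
    {"latency_ms": 900, "valid_json": False},  # schema violation -> fail
    {"latency_ms": 1200, "valid_json": True},  # pass
]
print(robustness_score(runs))  # 0.5
```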
Flakestorm supports a local open-source path for quick validation and proofs of concept, while the cloud platform is designed for scalable, production-grade testing.
How It Works
Define your golden prompts and invariants in a simple YAML file. Flakestorm handles mutation generation and assertion checking.
- ✓ 24 mutation types: Comprehensive robustness testing across prompt and system layers
- Core prompt-level attacks (8): Paraphrase, noise, tone shift, prompt injection, encoding attacks, context manipulation, length extremes, custom
- Advanced prompt-level attacks (7): Multi-turn attacks, advanced jailbreaks, semantic similarity attacks, format poisoning, language mixing, token manipulation, temporal attacks
- System/Network-level attacks (9): HTTP header injection, payload size attacks, content-type confusion, query parameter poisoning, request method attacks, protocol-level attacks, resource exhaustion, concurrent patterns, timeout manipulation
- ✓ Invariant assertions: Latency, JSON validity, semantic similarity
- ✓ Open source: Run locally for quick validation and proofs of concept; Flakestorm Cloud adds scalable, production-grade testing
- ✓ Beautiful reports: Interactive HTML with pass/fail matrices
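To make the mutation types above concrete, here is a minimal, hypothetical sketch of two prompt-level mutations (noise and tone shift). Flakestorm's real mutators, such as paraphrase, are typically LLM-generated and more sophisticated:

```python
import random

# Hypothetical illustrations of two prompt-level mutation types.
# Not Flakestorm's actual implementation.

def noise_mutation(prompt: str, rate: float = 0.1, seed: int = 42) -> str:
    """Randomly swap adjacent characters to simulate typos."""
    rng = random.Random(seed)  # seeded for reproducible runs
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def tone_shift_mutation(prompt: str) -> str:
    """Wrap the request in a hostile, demanding tone."""
    return f"Listen, I'm in a hurry. {prompt} NOW, and don't waste my time."

golden = "Book a flight to Paris for next Monday"
print(noise_mutation(golden))
print(tone_shift_mutation(golden))
```

A robust agent should produce equivalent behavior for the golden prompt and all of its mutations; divergence is what gets flagged in the failure report.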
```yaml
version: "1.0"

agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
  timeout: 30000

model:
  provider: "ollama"
  name: "qwen3:8b"
  base_url: "http://localhost:11434"

mutations:
  count: 10
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

golden_prompts:
  - "Book a flight to Paris for next Monday"
  - "What's my account balance?"

invariants:
  - type: "latency"
    max_ms: 2000
  - type: "valid_json"
```

Quick Start
1. Install

```shell
# Install Ollama first
$ brew install ollama
$ ollama serve
$ ollama pull qwen3:8b

# Install Flakestorm
$ pip install flakestorm
```

2. Initialize & Run
```shell
# Create config file
$ flakestorm init

# Run tests
$ flakestorm run

# View report
Report saved to: ./reports/...
```

How Flakestorm Is Different
Most LLM testing tools focus on:
- Static test cases
- Pass/fail correctness
- Golden prompts
Flakestorm focuses on:
- Adversarial variation
- Behavior under stress
- Measuring robustness over time
Built for production AI agents, where failures are gradual, inconsistent, and hard to reproduce.
FlakeStorm vs Manual Testing vs Other Tools
| Feature | FlakeStorm | Manual Testing | Other Tools |
|---|---|---|---|
| Test Generation | Automatic mutation generation (24 types) | Manual test case creation | Configurable, requires setup |
| Coverage | Systematic adversarial variations (typos, formatting, tone shifts, injections) | Limited to human imagination and time | Varies by tool, often requires manual configuration |
| Scalability | Runs 50+ mutations per prompt automatically | Time-intensive, doesn't scale | Scales but requires configuration per test |
| Robustness Scoring | Mathematical score (0.0-1.0) for quantifiable reliability | Subjective pass/fail judgments | Pass/fail or custom metrics, no unified score |
| Reproducibility | Deterministic, reproducible test runs | Inconsistent, human-dependent | Reproducible with proper configuration |
| CI/CD Integration | Built for blocking PRs with score thresholds | Not suitable for automation | Available but requires setup |
| Time to Test | Minutes for comprehensive testing | Hours to days for thorough coverage | Varies, typically faster than manual |
| Cost | Open-source (free) or Cloud | High labor costs, time-intensive | Varies, often requires paid plans or API costs |
| Hidden Vulnerabilities | Catches failures evals miss (edge cases, subtle failures) | Misses non-obvious failure modes | Depends on tool configuration and coverage |
| Local Execution | Runs entirely locally with Ollama, zero API costs | Requires manual execution environment | Often requires external APIs (costs money) |
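As a sketch of the CI/CD gating idea, a pipeline step could run Flakestorm and fail the build when the robustness score drops below a threshold. The GitHub Actions step and the `--min-score` flag below are assumptions for illustration, not documented Flakestorm options:

```yaml
# Hypothetical GitHub Actions step; the --min-score flag is an
# assumption, not a documented Flakestorm CLI option.
- name: Robustness gate
  run: |
    pip install flakestorm
    flakestorm run --min-score 0.9  # fail the job below 0.9
```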
Use Cases
- Pre-deployment validation of AI agents
- CI quality gates for prompt and model changes
- Detecting prompt injection and instruction leakage
- Regression testing agent behavior across versions
Coming Soon: Flakestorm Cloud
We're building a hosted version for teams who want zero setup, collaboration, and CI/CD integration. Join the waitlist to get early access.
- Run tests at scale across agents and versions
- Share robustness reports with your team
- CI/CD integration
- Historical tracking and comparisons
- Hosted execution and storage
Get early access • Zero setup
