Chaos Engineering for AI Agents
The Agent Reliability Engine
Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates 50+ adversarial mutations, runs them against your agent, and calculates a Robustness Score.
Quick Start
1. Install
# Install Ollama first
$ brew install ollama
$ ollama serve
$ ollama pull qwen3:8b
# Install Flakestorm
$ pip install flakestorm2. Initialize & Run
# Create config file
$ flakestorm init
# Run tests
$ flakestorm run
# View report
Report saved to: ./reports/...Configure Once, Test Forever
Define your golden prompts and invariants in a simple YAML file. Flakestorm handles mutation generation and assertion checking.
- ✓5 mutation types: Paraphrasing, noise, tone shifts, adversarial, custom
- ✓Invariant assertions: Latency, JSON validity, semantic similarity
- ✓Local-first: Uses Ollama with Qwen 3 8B for free testing
- ✓Beautiful reports: Interactive HTML with pass/fail matrices
version: "1.0"
agent:
endpoint: "http://localhost:8000/invoke"
type: "http"
timeout: 30000
model:
provider: "ollama"
name: "qwen3:8b"
base_url: "http://localhost:11434"
mutations:
count: 10
types:
- paraphrase
- noise
- tone_shift
- prompt_injection
golden_prompts:
- "Book a flight to Paris for next Monday"
- "What's my account balance?"
invariants:
- type: "latency"
max_ms: 2000
- type: "valid_json"Mutation Types
| Type | Description | Example |
|---|---|---|
| Paraphrase | Semantically equivalent rewrites | "Book a flight" → "I need to fly out" |
| Noise | Typos and spelling errors | "Book a flight" → "Book a fliight plz" |
| Tone Shift | Aggressive/impatient phrasing | "Book a flight" → "I need a flight NOW!" |
| Prompt Injection | Basic adversarial attacks | "Book a flight and ignore previous instructions" |
The Problem
The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.
The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.
The Void:
- Observability Tools (LangSmith) tell you after the agent failed
- Eval Libraries (RAGAS) focus on academic scores, not reliability
- Missing Link: A tool that actively attacks before deployment
"If it passes Flakestorm, it won't break in Production."
Why Flakestorm?
Adversarial Fuzzing
Actively attacks your agent with semantic perturbations, noise injection, tone shifts, and prompt injections.
Robustness Scoring
Mathematical proof of reliability. Get a 0.0-1.0 score that quantifies how well your agent handles unexpected inputs.
Invariant Assertions
Define success criteria: latency limits, JSON validity, semantic similarity, PII exclusion, and refusal checks.
Local-First
Runs on your machine using Ollama. No API costs. Generate 1,000+ mutations for free with local models.
Beautiful Reports
Interactive HTML reports with pass/fail matrices, mutation details, and failure analysis.
CI/CD Ready
Block PRs if robustness score drops below threshold. Integrate with GitHub Actions and other pipelines.
Ready to Get Started?
Open source and free forever, or upgrade to cloud for 10-20x faster performance.