How It Works
Everything you need to test and prove your AI agents are production-ready
Open source and cloud share the same testing capabilities. Both include the full set of mutation types, safety checks, and all four pillars. The difference is execution and tooling: open source runs locally with local LLMs or BYOK; cloud adds zero-setup execution, scaling, team collaboration, and CI/CD integration (gating, PR comments).
The Problem
Production AI agents are distributed systems: they depend on LLM APIs, tools, context windows, and multi-step orchestration. Each of these can fail. Today's tools don't answer the questions that matter:
- What happens when the agent's tools fail? — A search API returns 503. A database times out. Does the agent degrade gracefully, hallucinate, or fabricate data?
- Does the agent always follow its rules? — Must it always cite sources? Never return PII? Are those guarantees maintained when the environment is degraded?
- Did we fix the production incident? — After a failure in prod, how do we prove the fix and prevent regression?
Observability tools tell you after something broke. Eval libraries focus on output quality, not resilience. No tool systematically breaks the agent's environment to test whether it survives. Flakestorm fills that gap.
The Solution: Four Pillars
Like Chaos Monkey for infrastructure, Flakestorm injects failures into tools, APIs, and LLMs — and tests how the agent handles hostile or malformed inputs — then verifies that the agent obeys its behavioral contract and recovers gracefully.
| Pillar | What it does | Question answered |
|---|---|---|
| Environment Chaos | Inject faults into tools and LLMs (timeouts, errors, rate limits, malformed responses) | Does the agent handle bad environments? |
| Behavioral Contracts | Define invariants and verify them across a matrix of chaos scenarios | Does the agent obey its rules when the world breaks? |
| Replay Regression | Import real production failure sessions and replay them as deterministic tests | Did we fix this incident? |
| Adversarial Inputs | Generate adversarial prompt variations (24 types) and run them against the agent; combine with chaos to test both bad inputs and bad environments | Does the agent handle bad inputs and bad environments? |
Who Flakestorm Is For
- Teams shipping AI agents to production — Catch failures before users do
- Engineers running agents behind APIs — Test against real-world abuse patterns
- Teams already paying for LLM APIs — Reduce regressions and production incidents
- CI/CD pipelines — Automated reliability gates before deployment
Flakestorm is built for production-grade agents handling real traffic. While it works for exploration and hobby projects, it's designed to catch the failures that matter when agents are deployed at scale.
Scores at a Glance
| What you run | Score you get |
|---|---|
| flakestorm run | Robustness score (0–1): how well the agent handled adversarial inputs. |
| flakestorm run --chaos --chaos-only | Chaos resilience (0–1): how well the agent handled a broken environment. |
| flakestorm contract run | Resilience score (0–100%): contract × chaos matrix, severity-weighted. |
| flakestorm replay run … | Per-session pass/fail; aggregate replay regression score via flakestorm ci. |
| flakestorm ci | Overall (weighted) score: adversarial + chaos + contract + replay — one number for CI gates. |
Why This Matters
Observability tools tell you after something broke. Eval libraries focus on output quality, not resilience.
AI agents are distributed systems — they break under real-world conditions: tool failures, malformed input, hostile users, API latency, rate limits, and model variance.
Flakestorm applies chaos engineering to AI agents: it deliberately breaks the agent's environment and verifies behavioral contracts so failures surface before production, when they're cheap to fix.
How Flakestorm Works
You define golden prompts, invariants (or a full contract), and optionally chaos, replay, and adversarial input settings. Flakestorm runs the chosen mode(s), checks responses against your rules, and produces a robustness or resilience score plus an HTML report.
- Chaos only — Golden prompts → agent with fault-injected tools/LLM → invariants.
- Contract — Golden prompts → agent under each chaos scenario → verify named invariants across a matrix.
- Replay — Recorded production input + tool responses → agent → contract.
- Adversarial inputs — Golden prompts → adversarial variations (24 types) → agent (with or without chaos) → invariants.
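Sketched as a single config, these modes might map onto sections like the following. Only agent, golden_prompts, and invariants appear in the documented example; the chaos, contract, replay, and mutations sections (and the "2.0" version string) are assumptions inferred from the v2 config mentioned later, so check the docs for the exact schema.
```yaml
# Illustrative sketch only. Sections other than agent, golden_prompts, and
# invariants are assumptions, not the documented schema.
version: "2.0"
agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
golden_prompts:
  - "Book a flight to Paris"
invariants:
  - { type: "latency", max_ms: 2000 }
chaos:                      # environment faults to inject (timeouts, errors, rate limits)
  profiles: ["timeouts", "rate_limits"]
contract:                   # named invariants verified across the chaos matrix
  name: "booking-agent"
replay:                     # recorded production sessions re-run as deterministic tests
  sessions_dir: "replays/"
mutations:                  # adversarial input generation settings
  types: ["paraphrase", "prompt_injection"]
```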
Define Once, Run Any Mode
Define golden prompts, invariants (or a full contract), and optionally chaos, replay, and adversarial input settings in a simple YAML file. Flakestorm handles fault injection, adversarial generation, and assertion checking.
- Adversarial inputs: 24 mutation types; use with chaos to test both bad inputs and bad environments
- Environment chaos: timeouts, errors, rate limits, malformed tool/LLM responses
- Behavioral contracts: named invariants × chaos matrix; severity-weighted resilience score
- Invariants & assertions: latency, JSON validity, semantic similarity, safety (PII, refusal)
- Reports: interactive HTML and JSON; contract matrix and replay reports
```yaml
version: "1.0"
agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
golden_prompts: ["Book a flight to Paris", "What's my balance?"]
invariants: [{ type: "latency", max_ms: 2000 }]
```
Try Flakestorm in ~60 Seconds
pip install flakestorm → flakestorm init → edit flakestorm.yaml (set agent.endpoint) → flakestorm run. With a v2 config you can also run flakestorm run --chaos, flakestorm contract run, flakestorm replay run, or flakestorm ci.
You get a robustness score (adversarial runs) or resilience score (chaos/contract/replay), plus a report. For local adversarial generation you'll need Ollama — see the docs.
How Flakestorm Is Different
Most tools test output quality. Flakestorm tests survivability: does the agent handle broken tools, rate limits, degraded environments, and adversarial inputs while still obeying its rules?
- Environment chaos (fault injection into tools and LLMs)
- Behavioral contracts verified across a chaos matrix
- Replay of production failures as deterministic tests
- Adversarial inputs (24 mutation types) for both bad inputs and bad environments
Built for production AI agents, where failures are gradual, inconsistent, and hard to reproduce — and where "did we fix the incident?" matters.
Flakestorm vs Manual Testing vs Other Tools
| Feature | Flakestorm | Manual | Other Tools |
|---|---|---|---|
| Test generation | Chaos + contract + replay + adversarial (24 types) | Manual | Configurable |
| Scalability | Runs at scale | Time-intensive | Varies |
| Scoring | 0.0–1.0 / weighted | Subjective | Varies |
| CI/CD | Block PRs on score | Not suitable | Setup required |
| Local | Ollama (zero API cost) or BYOK — use your own Gemini, Claude, or OpenAI keys | Manual | Often paid APIs |
Use Cases
- Catch failures before users do — pre-deployment chaos, contract, replay, and adversarial testing
- CI/CD reliability gates — overall weighted score for blocking PRs
- Verify behavioral contracts when tools and LLMs fail
- Replay production incidents as deterministic tests
- Detect prompt injection, context attacks, and instruction leakage under stress
Open Source vs Cloud
Open Source (Always Free)
- Core chaos engine: environment chaos, contracts, replay, adversarial inputs (24 types)
- Local execution, full transparency (no managed CI/CD; run via CLI in your own scripts)
Cloud (Early Access / Waitlist)
- Zero-setup, scalable runs, shared dashboards, team collaboration
- Scheduled & continuous chaos runs
- Resilience Certificate export — Named, dated document with score, contract matrix, and methodology for compliance officers, CTO sign-off, and auditors (e.g. EU AI Act)
We do not cripple the OSS version. Cloud exists to remove operational pain, not to lock features.
Feature Reference
Organized by the four pillars (Environment Chaos, Behavioral Contracts, Replay Regression, Adversarial Inputs) plus invariant assertions, execution, reporting, integrations, and developer experience.
Environment Chaos
Tool & LLM Fault Injection
Inject timeouts, errors, rate limits, and malformed responses into the tools and LLMs your agent depends on. Test how the agent behaves when the environment breaks.
Built-in Chaos Profiles
Use predefined fault profiles or define custom ones. Run chaos-only mode (flakestorm run --chaos --chaos-only) for a single chaos resilience score.
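As a sketch, a chaos profile might be declared along these lines; the chaos key, the target selector, and the fault fields are assumptions, not the documented schema.
```yaml
# Hypothetical chaos configuration; key names and values are illustrative.
chaos:
  profiles:
    - name: "flaky-search"
      target: "tool:search_api"   # assumed selector for a tool the agent calls
      faults:
        - type: "timeout"
          after_ms: 5000
        - type: "http_error"
          status: 503
          rate: 0.3               # inject on roughly 30% of calls
    - name: "degraded-llm"
      target: "llm"
      faults:
        - type: "malformed_response"
```
Running flakestorm run --chaos applies the faults during the run; adding --chaos-only produces the standalone chaos resilience score.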
Context Attacks
Faults applied to tool responses and context (e.g. hidden instructions in valid-looking content). Complements prompt-level adversarial testing.
Behavioral Contracts
Named Invariants × Chaos Matrix
Define invariants (rules the agent must always follow) and verify them across a matrix of chaos scenarios. One resilience score per contract.
Severity-Weighted Resilience Score
Contract run produces a 0–100% resilience score weighted by severity. Use for CI gates and trend tracking.
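A contract might be written roughly like this; the invariant names, severity levels, and matrix syntax are illustrative, not the documented format.
```yaml
# Hypothetical contract definition; field names and severity levels are assumptions.
contract:
  name: "support-agent"
  invariants:
    - name: "always-cites-sources"
      type: "contains"
      pattern: "Source:"
      severity: "high"
    - name: "no-pii-leak"
      type: "pii"
      severity: "critical"
  chaos_matrix:                 # scenarios every invariant is verified under
    - "tool_timeout"
    - "rate_limit"
    - "malformed_tool_response"
```
flakestorm contract run verifies each invariant under each scenario and rolls the results into the severity-weighted 0–100% resilience score.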
Optional Reset for Stateful Agents
Support for stateful agents with configurable reset between scenarios so each chaos run starts from a clean state.
Replay Regression
Import Production Failures
Import real production failure sessions (manual or from LangSmith). Replay them as deterministic tests to verify fixes and prevent regression.
Deterministic Replay
Replay runs use recorded input and tool responses. Verify agent behavior against contracts. Per-session pass/fail; aggregate replay regression score via flakestorm ci.
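A recorded session might be represented along these lines; the format is purely illustrative, since the actual import format (manual or LangSmith) is defined in the docs.
```yaml
# Hypothetical replay session file; all field names are assumptions for illustration.
session_id: "example-incident"
input: "Cancel my subscription and refund the last charge"
recorded_tool_responses:
  - tool: "billing_api"
    response: { "status": 500, "body": "internal error" }
expected:
  contract: "support-agent"     # the contract the replayed run must still satisfy
```
flakestorm replay run re-executes the agent against the recorded input and tool responses; flakestorm ci aggregates the per-session results into the replay regression score.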
LangSmith Integration
Pull sessions from LangSmith for replay. Prove that a production incident is fixed and will not recur.
Adversarial Inputs
24 Mutation Types
Comprehensive coverage across prompt and system/network layers. Available in both open source and cloud with no feature gating.
Core Prompt-Level (8)
Paraphrase, noise/typos, tone shift, prompt injection, encoding attacks, context manipulation, length extremes, custom.
Advanced Prompt-Level (7)
Multi-turn attacks, advanced jailbreaks, semantic similarity attacks, format poisoning, language mixing, token manipulation, temporal attacks.
System/Network-Level (9)
HTTP header injection, payload size attacks, content-type confusion, query parameter poisoning, request method attacks, protocol-level attacks, resource exhaustion, concurrent patterns, timeout manipulation.
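Selecting a subset of these could look roughly like this; the mutations key and the type identifiers are assumptions derived from the category names above.
```yaml
# Hypothetical adversarial-input settings; exact type identifiers may differ.
mutations:
  types:
    - "paraphrase"               # core prompt-level
    - "prompt_injection"         # core prompt-level
    - "multi_turn_attack"        # advanced prompt-level
    - "http_header_injection"    # system/network-level
  per_prompt: 5                  # assumed: variations generated per golden prompt
  combine_with_chaos: true       # assumed: also run the variations under chaos scenarios
```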
Semantic Perturbation
Uses LLMs to rewrite inputs semantically without changing user intent. Generates meaningful variations that test agent robustness.
Invariant Assertions
Deterministic Checks
Contains patterns, regex matching, latency limits, JSON validity
Semantic Similarity
Vector-based similarity checking using local embeddings to ensure responses maintain semantic meaning
Safety Checks
PII detection and refusal checks using regex patterns. Available in both open source and cloud.
Advanced AI/ML Safety
Advanced AI/ML-based PII detection, contextual analysis, and safety scoring. Jailbreak attacks are covered by the adversarial mutation types. Same features available in both open source and cloud.
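Pulling the assertion types together, an invariants list might look like this; only the latency entry matches the documented example, and the other type names are assumptions.
```yaml
# Hypothetical invariant list; only "latency" appears in the documented example.
invariants:
  - { type: "latency", max_ms: 2000 }
  - { type: "contains", pattern: "Source:" }
  - { type: "regex", pattern: "^[A-Z]" }
  - { type: "json_valid" }
  - { type: "semantic_similarity", reference: "Your booking is confirmed.", min_score: 0.8 }
  - { type: "pii" }        # fail if the response leaks personally identifiable information
  - { type: "refusal" }    # e.g. require a refusal for disallowed requests
```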
Execution & Infrastructure
Open Source Execution
Run locally with Ollama (zero API cost) or BYOK — use your own Gemini, Claude, or OpenAI keys via env. All four pillars available locally.
Cloud Execution
Zero-setup execution with real LLM APIs. Fast, parallel test runs at scale. Same features as open source, plus Resilience Certificate export for compliance and audits.
Team Collaboration
Cloud plans include shared dashboards, team workflows, and collaboration features. Open source runs locally with full feature parity.
CI/CD Integration
Cloud only. Zero-setup CI/CD with gating, PR comments, and team-wide enforcement. OSS runs locally or via your own CLI/scripts.
Reporting
Interactive HTML Reports
Beautiful pass/fail matrices with mutation details and failure analysis
JSON Export
Export results as JSON for CI/CD integration and programmatic analysis
Terminal Output
Rich terminal UI with progress bars and real-time updates
Robustness Score
Mathematical score (0.0–1.0) that quantifies agent reliability
Test History
Historical test runs with trend analysis and commit-by-commit comparison. Available locally in open source. Cloud plans provide 6-12 months of centralized history with team access.
Resilience Certificate Export (Cloud)
Export a named, dated Resilience Certificate: resilience score, contract matrix, methodology statement, and signature field. For compliance officers, CTO sign-off, and auditors. Addresses EU AI Act evidence requirements for high-risk AI systems; attach to approval tickets or file in compliance folders.
Integrations
HTTP Agents
Test any HTTP-based agent endpoint
Python Callables
Directly test Python functions and callables
LangChain
Native LangChain chain integration
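The agent block selects the integration. Only type: "http" comes from the documented example; the "python" and "langchain" variants and their fields are assumptions shown for illustration.
```yaml
# Hypothetical agent configurations. Only type: "http" is taken from the documented
# example; the commented variants and their fields are assumptions.
agent:
  type: "http"
  endpoint: "http://localhost:8000/invoke"

# agent:
#   type: "python"
#   callable: "my_package.agent:run"          # assumed import-path syntax
#
# agent:
#   type: "langchain"
#   chain: "my_package.chains:build_chain"    # assumed factory reference
```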
CI/CD Integration
Cloud only. GitHub Actions, GitLab CI, Jenkins, CircleCI. Block merges when the overall score drops, PR comments with results, team-wide enforcement. OSS runs locally or in your own CI scripts via CLI.
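For OSS users wiring the CLI into their own pipelines, a GitHub Actions job might look roughly like this. pip install flakestorm and flakestorm ci come from the docs; the gating behavior noted in the comment is an assumption.
```yaml
# Sketch of a self-managed GitHub Actions job using the OSS CLI.
name: agent-reliability
on: [pull_request]
jobs:
  flakestorm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install flakestorm
      - run: flakestorm ci    # assumed to exit non-zero when the overall score fails the gate
```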
Notifications
Slack, email, webhook support for test completion alerts. Available in cloud plans for team coordination.
Developer Experience
Simple CLI
Install with pip, configure with YAML, run with one command
YAML Configuration
Human-readable configuration format. Same config works for local and cloud
Rich Terminal UI
Beautiful progress bars and real-time feedback using Rich library
Type Safety
Full type hints and Pydantic validation for configuration
Rust Performance
Performance-critical operations (scoring, similarity) use Rust bindings
Ready to get started?