How It Works

Everything you need to test and prove your AI agents are production-ready

Open source and cloud share the same testing capabilities. Both include full mutation categories, safety checks, and all four pillars. The difference is execution and tooling: open source runs locally with local LLMs or BYOK; cloud provides zero-setup, scaling, team collaboration, and CI/CD integration (gating, PR comments).

The Problem

Production AI agents are distributed systems: they depend on LLM APIs, tools, context windows, and multi-step orchestration. Each of these can fail. Today's tools don't answer the questions that matter:

  • What happens when the agent's tools fail? — A search API returns 503. A database times out. Does the agent degrade gracefully, hallucinate, or fabricate data?
  • Does the agent always follow its rules? — Must it always cite sources? Never return PII? Are those guarantees maintained when the environment is degraded?
  • Did we fix the production incident? — After a failure in prod, how do we prove the fix and prevent regression?

Observability tools tell you after something broke. Eval libraries focus on output quality, not resilience. No tool systematically breaks the agent's environment to test whether it survives. Flakestorm fills that gap.

The Solution: Four Pillars

Like Chaos Monkey for infrastructure, Flakestorm injects failures into tools, APIs, and LLMs — and tests how the agent handles hostile or malformed inputs — then verifies that the agent obeys its behavioral contract and recovers gracefully.

| Pillar | What it does | Question answered |
| --- | --- | --- |
| Environment Chaos | Inject faults into tools and LLMs (timeouts, errors, rate limits, malformed responses) | Does the agent handle bad environments? |
| Behavioral Contracts | Define invariants and verify them across a matrix of chaos scenarios | Does the agent obey its rules when the world breaks? |
| Replay Regression | Import real production failure sessions and replay them as deterministic tests | Did we fix this incident? |
| Adversarial Inputs | Generate adversarial prompt variations (24 types) and run them against the agent; combine with chaos to test both bad inputs and bad environments | Does the agent handle bad inputs and bad environments? |

Who Flakestorm Is For

  • Teams shipping AI agents to production — Catch failures before users do
  • Engineers running agents behind APIs — Test against real-world abuse patterns
  • Teams already paying for LLM APIs — Reduce regressions and production incidents
  • CI/CD pipelines — Automated reliability gates before deployment

Flakestorm is built for production-grade agents handling real traffic. While it works for exploration and hobby projects, it's designed to catch the failures that matter when agents are deployed at scale.

Scores at a Glance

| What you run | Score you get |
| --- | --- |
| flakestorm run | Robustness score (0–1): how well the agent handled adversarial inputs. |
| flakestorm run --chaos --chaos-only | Chaos resilience (0–1): how well the agent handled a broken environment. |
| flakestorm contract run | Resilience score (0–100%): contract × chaos matrix, severity-weighted. |
| flakestorm replay run … | Per-session pass/fail; aggregate replay regression score via flakestorm ci. |
| flakestorm ci | Overall (weighted) score: adversarial + chaos + contract + replay — one number for CI gates. |
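The overall score produced by flakestorm ci can be thought of as a weighted combination of the pillar scores, each normalized to 0–1. A minimal sketch of that arithmetic in Python; the weights and the 0.8 gate threshold are illustrative assumptions, not Flakestorm's actual values:

```python
# Sketch of a weighted overall score for CI gating.
# The weights and threshold below are illustrative assumptions,
# not Flakestorm's actual values.

def overall_score(adversarial: float, chaos: float, contract: float, replay: float) -> float:
    """Combine the four pillar scores (each normalized to 0-1) into one number."""
    weights = {"adversarial": 0.25, "chaos": 0.25, "contract": 0.30, "replay": 0.20}
    scores = {"adversarial": adversarial, "chaos": chaos, "contract": contract, "replay": replay}
    return sum(weights[k] * scores[k] for k in weights)

def gate(score: float, threshold: float = 0.8) -> bool:
    """CI gate: pass only if the weighted score clears the threshold."""
    return score >= threshold

score = overall_score(adversarial=0.9, chaos=0.8, contract=0.85, replay=1.0)
print(round(score, 3), gate(score))
```

A CI pipeline would block the merge whenever gate() returns False.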

Why This Matters

AI agents are distributed systems — they break under real-world conditions: tool failures, malformed input, hostile users, API latency, rate limits, and model variance.

Flakestorm applies chaos engineering to AI agents: it deliberately breaks the agent's environment and verifies behavioral contracts so failures surface before production, when they're cheap to fix.

How Flakestorm Works

You define golden prompts, invariants (or a full contract), and optionally chaos, replay, and adversarial input settings. Flakestorm runs the chosen mode(s), checks responses against your rules, and produces a robustness or resilience score plus an HTML report.

  • Chaos only — Golden prompts → agent with fault-injected tools/LLM → invariants.
  • Contract — Golden prompts → agent under each chaos scenario → verify named invariants across a matrix.
  • Replay — Recorded production input + tool responses → agent → contract.
  • Adversarial inputs — Golden prompts → adversarial variations (24 types) → agent (with or without chaos) → invariants.
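The chaos modes above all rest on the same idea: the agent's dependencies are wrapped so they can be made to fail on demand. A minimal sketch of that pattern, assuming a tool is just a callable (illustrative, not Flakestorm's internals):

```python
import random

# Sketch of tool-level fault injection (illustrative, not Flakestorm's
# implementation): wrap a tool so a fraction of calls fail on demand.

class ToolTimeout(Exception):
    """Simulated tool timeout."""

def inject_faults(tool, failure_rate: float, rng: random.Random):
    """Wrap a tool callable so a fraction of calls raise a simulated timeout."""
    def chaotic_tool(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ToolTimeout(f"injected timeout in {tool.__name__}")
        return tool(*args, **kwargs)
    return chaotic_tool

def search_api(query: str) -> str:
    return f"results for {query!r}"

# Seeded RNG so the chaos run is reproducible across executions.
flaky_search = inject_faults(search_api, failure_rate=0.5, rng=random.Random(42))

outcomes = []
for _ in range(6):
    try:
        outcomes.append(flaky_search("flights to Paris"))
    except ToolTimeout:
        outcomes.append("TIMEOUT")
print(outcomes)
```

The test then becomes: run golden prompts through the agent while its tools are wrapped this way, and check the invariants on whatever comes back.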

Define Once, Run Any Mode

Everything lives in a single, simple YAML file: golden prompts, invariants (or a full contract), and optional chaos, replay, and adversarial input settings. Flakestorm handles fault injection, adversarial generation, and assertion checking.

  • Adversarial inputs: 24 mutation types; use with chaos to test both bad inputs and bad environments
  • Environment chaos: timeouts, errors, rate limits, malformed tool/LLM responses
  • Behavioral contracts: named invariants × chaos matrix; severity-weighted resilience score
  • Invariants & assertions: latency, JSON validity, semantic similarity, safety (PII, refusal)
  • Reports: interactive HTML and JSON; contract matrix and replay reports

A minimal flakestorm.yaml:

```yaml
version: "1.0"
agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
golden_prompts: ["Book a flight to Paris", "What's my balance?"]
invariants: [{ type: "latency", max_ms: 2000 }]
```
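The agent.endpoint above just needs to be an HTTP service the CLI can reach. Here is a minimal stand-in agent using only the standard library; the wire format (a JSON body with a prompt field, a JSON reply with a response field) is an assumption for illustration, not Flakestorm's documented schema:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal stand-in agent for http://localhost:8000/invoke.
# The JSON request/response schema here is an illustrative assumption,
# not Flakestorm's documented wire format.

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        prompt = payload.get("prompt", "")
        reply = {"response": f"echo: {prompt}"}  # real agent logic goes here
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test runs quiet
        pass

def serve(port: int = 8000):
    """Block and serve the agent; point agent.endpoint at this port."""
    HTTPServer(("localhost", port), AgentHandler).serve_forever()
```

Call serve() in one terminal, then run the CLI against the endpoint in another.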

Try Flakestorm in ~60 Seconds

pip install flakestorm → flakestorm init → edit flakestorm.yaml (agent.endpoint) → flakestorm run. With v2 config you can also run --chaos, contract run, replay run, or flakestorm ci.

You get a robustness score (adversarial runs) or resilience score (chaos/contract/replay), plus a report. For local adversarial generation you'll need Ollama — see the docs.

How Flakestorm Is Different

Most tools test output quality. Flakestorm tests survivability: does the agent handle broken tools, rate limits, degraded environments, and adversarial inputs while still obeying its rules?

  • Environment chaos (fault injection into tools and LLMs)
  • Behavioral contracts verified across a chaos matrix
  • Replay of production failures as deterministic tests
  • Adversarial inputs (24 mutation types) for both bad inputs and bad environments

Built for production AI agents, where failures are gradual, inconsistent, and hard to reproduce — and where "did we fix the incident?" matters.

Flakestorm vs Manual Testing vs Other Tools

| Feature | Flakestorm | Manual | Other Tools |
| --- | --- | --- | --- |
| Test generation | Chaos + contract + replay + adversarial (24 types) | Manual | Configurable |
| Scalability | Runs at scale | Time-intensive | Varies |
| Scoring | 0.0–1.0 / weighted | Subjective | Varies |
| CI/CD | Block PRs on score | Not suitable | Setup required |
| Local | Ollama (zero API cost) or BYOK — use your own Gemini, Claude, or OpenAI keys | Manual | Often paid APIs |

Use Cases

  • Catch failures before users do — pre-deployment chaos, contract, replay, and adversarial testing
  • CI/CD reliability gates — overall weighted score for blocking PRs
  • Verify behavioral contracts when tools and LLMs fail
  • Replay production incidents as deterministic tests
  • Detect prompt injection, context attacks, and instruction leakage under stress

Open Source vs Cloud

Open Source (Always Free)

  • Core chaos engine: environment chaos, contracts, replay, adversarial inputs (24 types)
  • Local execution, full transparency (no managed CI/CD; run via CLI in your own scripts)

Cloud (Early Access / Waitlist)

  • Zero-setup, scalable runs, shared dashboards, team collaboration
  • Scheduled & continuous chaos runs
  • Resilience Certificate export — Named, dated document with score, contract matrix, and methodology for compliance officers, CTO sign-off, and auditors (e.g. EU AI Act)

We do not cripple the OSS version. Cloud exists to remove operational pain, not to lock features.

Feature reference

Organized by the four pillars (Environment Chaos, Behavioral Contracts, Replay Regression, Adversarial Inputs) plus invariant assertions, execution, reporting, integrations, and developer experience.

Environment Chaos

Tool & LLM Fault Injection

Inject timeouts, errors, rate limits, and malformed responses into the tools and LLMs your agent depends on. Test how the agent behaves when the environment breaks.

Built-in Chaos Profiles

Use predefined fault profiles or define custom ones. Run chaos-only mode (flakestorm run --chaos --chaos-only) for a single chaos resilience score.

Context Attacks

Faults applied to tool responses and context (e.g. hidden instructions in valid-looking content). Complements prompt-level adversarial testing.

Behavioral Contracts

Named Invariants × Chaos Matrix

Define invariants (rules the agent must always follow) and verify them across a matrix of chaos scenarios. One resilience score per contract.

Severity-Weighted Resilience Score

Contract run produces a 0–100% resilience score weighted by severity. Use for CI gates and trend tracking.
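One way to read "severity-weighted": each (invariant, chaos scenario) cell in the matrix passes or fails, and failures of high-severity invariants cost more. A sketch of that arithmetic, with the severity weights and invariant names as illustrative assumptions rather than Flakestorm's exact formula:

```python
# Sketch of a severity-weighted resilience score (illustrative arithmetic,
# not Flakestorm's exact formula). Each entry is one cell of the
# invariant x chaos-scenario matrix; weights are assumed values.

SEVERITY_WEIGHT = {"critical": 3.0, "high": 2.0, "medium": 1.0}

def resilience_score(results):
    """results: list of (severity, passed) tuples; returns a 0-100 percentage."""
    total = sum(SEVERITY_WEIGHT[sev] for sev, _ in results)
    earned = sum(SEVERITY_WEIGHT[sev] for sev, passed in results if passed)
    return 100.0 * earned / total

matrix = [
    ("critical", True),   # e.g. never_return_pii under tool timeout
    ("critical", False),  # e.g. never_return_pii under malformed response
    ("high", True),       # e.g. always_cite_sources under rate limit
    ("medium", True),     # e.g. latency limit under normal load
]
print(round(resilience_score(matrix), 1))
```

A plain pass rate would score this matrix 75%; weighting drags it lower because the one failure is critical.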

Optional Reset for Stateful Agents

Support for stateful agents with configurable reset between scenarios so each chaos run starts from a clean state.

Replay Regression

Import Production Failures

Import real production failure sessions (manual or from LangSmith). Replay them as deterministic tests to verify fixes and prevent regression.

Deterministic Replay

Replay runs use recorded input and tool responses. Verify agent behavior against contracts. Per-session pass/fail; aggregate replay regression score via flakestorm ci.
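Determinism comes from substituting recorded tool responses for live calls, so the same session always produces the same tool outputs. A sketch of that substitution with a toy agent (illustrative; not Flakestorm's replay engine, and the session format is an assumption):

```python
# Sketch of deterministic replay: recorded tool responses are served back
# in order instead of hitting live APIs (illustrative, not Flakestorm's
# engine; the session format is an assumed example).

class RecordedTool:
    """Replays a fixed sequence of recorded responses for one tool."""
    def __init__(self, name, recorded_responses):
        self.name = name
        self._responses = iter(recorded_responses)

    def __call__(self, *args, **kwargs):
        return next(self._responses)

def agent(query, search):
    """Toy agent: answers from its search tool, degrades gracefully on errors."""
    result = search(query)
    if result.get("error"):
        return "Sorry, search is unavailable right now."
    return f"Found: {result['data']}"

# A recorded production session in which the search API returned a 503.
session = {
    "input": "flights to Paris",
    "tool_calls": [{"error": "503 Service Unavailable"}],
}

search = RecordedTool("search_api", session["tool_calls"])
answer = agent(session["input"], search)
print(answer)
```

Replaying the same session always exercises the same failure, so the contract check ("don't fabricate results when search is down") either reproducibly passes or reproducibly fails.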

LangSmith Integration

Pull sessions from LangSmith for replay. Prove that a production incident is fixed and will not recur.

Adversarial Inputs

24 Mutation Types

Comprehensive coverage across prompt and system/network layers. Available in both open source and cloud with no feature gating.

Core Prompt-Level (8)

Paraphrase, noise/typos, tone shift, prompt injection, encoding attacks, context manipulation, length extremes, custom.
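As an example of the simplest category, a noise/typo mutation just perturbs characters while keeping the prompt recognizable. A sketch of one such generator (illustrative; Flakestorm's actual generators are more sophisticated and partly LLM-driven):

```python
import random

# Sketch of a noise/typo mutation: swap adjacent characters at random
# (illustrative; Flakestorm's generators are more sophisticated).

def typo_mutation(prompt: str, n_swaps: int, rng: random.Random) -> str:
    """Return a copy of the prompt with n_swaps adjacent-character swaps."""
    chars = list(prompt)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(7)  # seeded so the mutated prompt is reproducible
original = "Book a flight to Paris"
mutated = typo_mutation(original, n_swaps=3, rng=rng)
print(mutated)
```

A robust agent should answer the mutated prompt the same way it answers the original.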

Advanced Prompt-Level (7)

Multi-turn attacks, advanced jailbreaks, semantic similarity attacks, format poisoning, language mixing, token manipulation, temporal attacks.

System/Network-Level (9)

HTTP header injection, payload size attacks, content-type confusion, query parameter poisoning, request method attacks, protocol-level attacks, resource exhaustion, concurrent patterns, timeout manipulation.

Semantic Perturbation

Uses LLMs to rewrite inputs semantically without changing user intent. Generates meaningful variations that test agent robustness.

Invariant Assertions

Deterministic Checks

Contains/substring checks, regex matching, latency limits, and JSON validity.
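These map to ordinary assertions over the agent's response and its measured latency. A sketch of the deterministic checks (illustrative, not Flakestorm's invariant engine):

```python
import json
import re

# Sketch of deterministic invariant checks (illustrative, not
# Flakestorm's invariant engine).

def check_contains(response: str, pattern: str) -> bool:
    return pattern in response

def check_regex(response: str, regex: str) -> bool:
    return re.search(regex, response) is not None

def check_latency(latency_ms: float, max_ms: float) -> bool:
    return latency_ms <= max_ms

def check_json_valid(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except ValueError:
        return False

response = '{"answer": "Your balance is $120", "sources": ["ledger"]}'
print(check_json_valid(response))           # JSON validity
print(check_regex(response, r"sources"))    # must cite sources
print(check_latency(850, max_ms=2000))      # the 2000 ms limit from the sample config
```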

Semantic Similarity

Vector-based similarity checking using local embeddings to ensure responses maintain semantic meaning
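Under the hood this is cosine similarity between embedding vectors. A toy sketch with hand-made vectors; a real run would embed texts with a local embedding model, and the 0.8 threshold is an illustrative assumption:

```python
import math

# Toy sketch of a semantic-similarity invariant via cosine similarity.
# The vectors and the 0.8 threshold are illustrative assumptions; real
# runs would use local embedding models.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantically_similar(vec_expected, vec_actual, threshold=0.8):
    return cosine(vec_expected, vec_actual) >= threshold

expected = [0.9, 0.1, 0.3]       # pretend embedding of the golden answer
paraphrase = [0.85, 0.15, 0.35]  # pretend embedding of a faithful paraphrase
unrelated = [0.0, 1.0, 0.0]      # pretend embedding of an off-topic reply

print(semantically_similar(expected, paraphrase))
print(semantically_similar(expected, unrelated))
```

The check passes for responses that keep the golden answer's meaning and fails for drift, even when the wording differs.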

Safety Checks

PII detection and refusal checks using regex patterns. Available in both open source and cloud.
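A sketch of what regex-based PII detection looks like; the two patterns below (email, US SSN) are common examples and an assumption about which PII classes are covered:

```python
import re

# Sketch of regex-based PII detection (illustrative patterns; which PII
# classes Flakestorm actually covers is not specified here).

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(response: str):
    """Return the PII categories found in a response."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(response)]

safe = "Your order has shipped."
leaky = "Contact jane.doe@example.com, SSN 123-45-6789."
print(detect_pii(safe))
print(detect_pii(leaky))
```

A "never return PII" invariant then simply asserts detect_pii(response) is empty for every response in the run.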

Advanced AI/ML Safety

AI/ML-based PII detection, contextual analysis, and safety scoring. Jailbreak attacks are covered by the adversarial mutation types. Same features available in both open source and cloud.

Execution & Infrastructure

Open Source Execution

Run locally with Ollama (zero API cost) or BYOK — use your own Gemini, Claude, or OpenAI keys via env. All four pillars available locally.

Cloud Execution

Zero-setup execution with real LLM APIs. Fast, parallel test runs at scale. Same features as open source, plus Resilience Certificate export for compliance and audits.

Team Collaboration

Cloud plans include shared dashboards, team workflows, and collaboration features. Open source runs locally with full feature parity.

CI/CD Integration

Cloud only. Zero-setup CI/CD with gating, PR comments, and team-wide enforcement. OSS runs locally or via your own CLI/scripts.

Reporting

Interactive HTML Reports

Beautiful pass/fail matrices with mutation details and failure analysis

JSON Export

Export results as JSON for CI/CD integration and programmatic analysis

Terminal Output

Rich terminal UI with progress bars and real-time updates

Robustness Score

Mathematical score (0.0-1.0) that quantifies agent reliability

Test History

Historical test runs with trend analysis and commit-by-commit comparison. Available locally in open source. Cloud plans provide 6-12 months of centralized history with team access.

Resilience Certificate Export (Cloud)

Export a named, dated Resilience Certificate: resilience score, contract matrix, methodology statement, and signature field. For compliance officers, CTO sign-off, and auditors. Addresses EU AI Act evidence requirements for high-risk AI systems; attach to approval tickets or file in compliance folders.

Integrations

HTTP Agents

Test any HTTP-based agent endpoint

Python Callables

Directly test Python functions and callables

LangChain

Native LangChain chain integration

CI/CD Integration

Cloud only. GitHub Actions, GitLab CI, Jenkins, CircleCI. Block merges on score drops, PR comments with results, team-wide enforcement. OSS runs locally or in your own CI scripts via CLI.

Notifications

Slack, email, webhook support for test completion alerts. Available in cloud plans for team coordination.

Developer Experience

Simple CLI

Install with pip, configure with YAML, run with one command

YAML Configuration

Human-readable configuration format. Same config works for local and cloud

Rich Terminal UI

Beautiful progress bars and real-time feedback using Rich library

Type Safety

Full type hints and Pydantic validation for configuration

Rust Performance

Performance-critical operations (scoring, similarity) use Rust bindings

Ready to get started?