Chaos Engineering for AI Agents in Production

Flakestorm stress-tests AI agents with adversarial inputs to expose brittle behavior before it reaches production. Catch prompt injections, edge cases, and reliability failures early, not after users complain.

Try Now

Open source • No signup

Zero setup • Early access • Coming soon

Built for teams deploying AI agents to production environments.

$ flakestorm run
Generating mutations... 100%
Running attacks... 100%
╭────────────────────────────────────────╮
│ Robustness Score: 87.5%                │
│ ────────────────────────────────────── │
│ Passed: 17/20 mutations                │
│ Failed: 3 (2 latency, 1 injection)     │
╰────────────────────────────────────────╯
Report saved to: ./reports/flakestorm-2024-01-15-143022.html
Featured by
LangChain Official
Flakestorm: Mutation Testing for LangChain Agents

"Flakestorm catches robustness failures evals miss through input mutations (typos, formatting, tone shifts), revealing hidden vulnerabilities before production."

View on X

Flakestorm Cloud is being built for teams running AI agents in production: zero setup, real LLM API testing, CI/CD gating, historical reports, and team workflows.
Join the early access waitlist to help shape it.

  • Open-source: Public GitHub
  • Runs locally: Your data stays with you
  • Deterministic: Reproducible tests
  • Production-first: Built for scale

Why This Matters

Static evals and happy-path tests don't catch production failures.

AI agents break under real-world conditions, including malformed input, hostile users, partial instructions, API latency, retries, and model variance.

Flakestorm applies chaos engineering principles to AI agents, actively attacking them before deployment so failures surface early, when they're cheap to fix.

What Flakestorm Does

Flakestorm generates structured adversarial variations of a reference prompt and runs them against your agent to expose brittle behavior and failure modes.

Each run produces:

  • A robustness score
  • Detailed, reproducible failure reports
  • Concrete prompt variations that triggered the failure

Failures often surface as timeouts, schema violations, non-deterministic behavior, or instruction leakage rather than obviously wrong answers.
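To make the invariant idea concrete, here is a minimal sketch in plain Python of the latency and JSON-validity checks described above. The function names are illustrative, not Flakestorm's actual API:

```python
import json

def check_latency(elapsed_ms: float, max_ms: float = 2000) -> bool:
    """Invariant: the agent must respond within max_ms milliseconds."""
    return elapsed_ms <= max_ms

def check_valid_json(response_text: str) -> bool:
    """Invariant: the agent's output must parse as JSON (catches schema drift)."""
    try:
        json.loads(response_text)
        return True
    except json.JSONDecodeError:
        return False

# A response that took 1.2s and returned well-formed JSON passes both checks.
print(check_latency(1200))                    # True
print(check_valid_json('{"balance": 42.0}'))  # True
print(check_valid_json('balance is 42'))      # False
```

Checks like these turn fuzzy "the agent got weird" reports into binary, reproducible pass/fail signals.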

Flakestorm supports a local open-source path for quick validation and proofs of concept, while the cloud platform is designed for scalable, production-grade testing.

How It Works

Define your golden prompts and invariants in a simple YAML file. Flakestorm handles mutation generation and assertion checking.

  • 24 mutation types: Comprehensive robustness testing across prompt and system layers
    • Core prompt-level attacks (8): Paraphrase, noise, tone shift, prompt injection, encoding attacks, context manipulation, length extremes, custom
    • Advanced prompt-level attacks (7): Multi-turn attacks, advanced jailbreaks, semantic similarity attacks, format poisoning, language mixing, token manipulation, temporal attacks
    • System/Network-level attacks (9): HTTP header injection, payload size attacks, content-type confusion, query parameter poisoning, request method attacks, protocol-level attacks, resource exhaustion, concurrent patterns, timeout manipulation
  • Invariant assertions: Latency, JSON validity, semantic similarity
  • Open source: Run locally for quick validation and proofs of concept; the cloud platform targets scalable, production-grade testing
  • Beautiful reports: Interactive HTML with pass/fail matrices
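As a rough illustration of two of the prompt-level mutation types listed above (a hand-rolled sketch, not Flakestorm's implementation), a noise mutation can swap adjacent characters to mimic typos, and a tone-shift mutation can rewrap the prompt in a hostile register:

```python
import random

def noise_mutation(prompt: str, swaps: int = 2, seed: int = 42) -> str:
    """Introduce typo-like noise by swapping adjacent characters."""
    rng = random.Random(seed)  # seeded so runs stay deterministic and reproducible
    chars = list(prompt)
    for _ in range(swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def tone_shift_mutation(prompt: str) -> str:
    """Rewrap the prompt in an impatient, hostile register."""
    return f"Listen, I've asked three times already. {prompt} NOW."

golden = "Book a flight to Paris for next Monday"
print(noise_mutation(golden))
print(tone_shift_mutation(golden))
```

A robust agent should produce equivalent behavior for the golden prompt and its mutations; divergence is what gets flagged in the report.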
flakestorm.yaml
version: "1.0"

agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
  timeout: 30000

model:
  provider: "ollama"
  name: "qwen3:8b"
  base_url: "http://localhost:11434"

mutations:
  count: 10
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

golden_prompts:
  - "Book a flight to Paris for next Monday"
  - "What's my account balance?"

invariants:
  - type: "latency"
    max_ms: 2000
  - type: "valid_json"
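The `agent.endpoint` in this config is whatever HTTP service your agent exposes. A minimal stand-in target, sketched with Python's standard library (the `{"prompt": ...}` request shape and JSON response here are assumptions for illustration, not Flakestorm's documented wire format):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_invoke(payload: dict) -> dict:
    """Toy agent logic: return a structured answer for the given prompt."""
    prompt = payload.get("prompt", "")
    return {"output": f"Handled: {prompt}", "status": "ok"}

class InvokeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/invoke":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(handle_invoke(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve():
    # Matches agent.endpoint in flakestorm.yaml: http://localhost:8000/invoke
    HTTPServer(("localhost", 8000), InvokeHandler).serve_forever()
```

Any framework works equally well; the only contract is that the endpoint accepts a POST and responds within the configured `timeout`.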

Quick Start

1. Install

# Install Ollama first
$ brew install ollama
$ ollama serve
$ ollama pull qwen3:8b

# Install Flakestorm
$ pip install flakestorm

2. Initialize & Run

# Create config file
$ flakestorm init

# Run tests
$ flakestorm run

# View report
Report saved to: ./reports/...

How Flakestorm Is Different

Most LLM testing tools focus on:

  • Static test cases
  • Pass/fail correctness
  • Golden prompts

Flakestorm focuses on:

  • Adversarial variation
  • Behavior under stress
  • Measuring robustness over time

Built for production AI agents, where failures are gradual, inconsistent, and hard to reproduce.

Flakestorm vs. Manual Testing vs. Other Tools

| Feature | Flakestorm | Manual Testing | Other Tools |
| --- | --- | --- | --- |
| Test generation | Automatic mutation generation (24 types) | Manual test case creation | Configurable; requires setup |
| Coverage | Systematic adversarial variations (typos, formatting, tone shifts, injections) | Limited to human imagination and time | Varies by tool; often requires manual configuration |
| Scalability | Runs 50+ mutations per prompt automatically | Time-intensive; doesn't scale | Scales, but requires configuration per test |
| Robustness scoring | Mathematical score (0.0–1.0) for quantifiable reliability | Subjective pass/fail judgments | Pass/fail or custom metrics; no unified score |
| Reproducibility | Deterministic, reproducible test runs | Inconsistent, human-dependent | Reproducible with proper configuration |
| CI/CD integration | Built for blocking PRs with score thresholds | Not suitable for automation | Available, but requires setup |
| Time to test | Minutes for comprehensive testing | Hours to days for thorough coverage | Varies; typically faster than manual |
| Cost | Open-source (free) or Cloud | High labor costs; time-intensive | Varies; often requires paid plans or API costs |
| Hidden vulnerabilities | Catches failures evals miss (edge cases, subtle failures) | Misses non-obvious failure modes | Depends on tool configuration and coverage |
| Local execution | Runs entirely locally with Ollama; zero API costs | Requires manual execution environment | Often requires external APIs (costs money) |

Use Cases

  • Pre-deployment validation of AI agents
  • CI quality gates for prompt and model changes
  • Detecting prompt injection and instruction leakage
  • Regression testing agent behavior across versions
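For the CI quality-gate use case, one simple approach is to fail the build when the robustness score drops below a threshold. This is a sketch: the `robustness_score` report field assumed here is illustrative, not a documented Flakestorm report schema:

```python
def gate(report: dict, threshold: float = 0.9) -> int:
    """Return a process exit code: 0 if the robustness score meets the threshold."""
    score = report["robustness_score"]  # assumed report field, for illustration
    if score < threshold:
        print(f"FAIL: robustness {score:.3f} < threshold {threshold:.3f}")
        return 1  # nonzero exit blocks the PR in most CI systems
    print(f"PASS: robustness {score:.3f}")
    return 0

# Example: the demo run above scored 87.5%, which a 0.9 threshold would block.
gate({"robustness_score": 0.875}, threshold=0.9)
```

Wired into CI, the nonzero exit code turns a soft robustness regression into a hard merge blocker.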

Coming Soon: Flakestorm Cloud

We're building a hosted version for teams who want zero setup, collaboration, and CI/CD integration. Join the waitlist to get early access.

  • Run tests at scale across agents and versions
  • Share robustness reports with your team
  • CI/CD integration
  • Historical tracking and comparisons
  • Hosted execution and storage

Get early access • Zero setup