Chaos Engineering for AI Agents

The Agent Reliability Engine

Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates 50+ adversarial mutations, runs them against your agent, and calculates a Robustness Score.

View on GitHub

Quick Start

1. Install

# Install Ollama first
$ brew install ollama
$ ollama serve
$ ollama pull qwen3:8b

# Install Flakestorm
$ pip install flakestorm

2. Initialize & Run

# Create config file
$ flakestorm init

# Run tests
$ flakestorm run

# View report
Report saved to: ./reports/...

Configure Once, Test Forever

Define your golden prompts and invariants in a simple YAML file. Flakestorm handles mutation generation and assertion checking.

✓5 mutation types: Paraphrasing, noise, tone shifts, adversarial, custom
✓Invariant assertions: Latency, JSON validity, semantic similarity
✓Local-first: Uses Ollama with Qwen 3 8B for free testing
✓Beautiful reports: Interactive HTML with pass/fail matrices

flakestorm.yaml

version: "1.0"

agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
  timeout: 30000

model:
  provider: "ollama"
  name: "qwen3:8b"
  base_url: "http://localhost:11434"

mutations:
  count: 10
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

golden_prompts:
  - "Book a flight to Paris for next Monday"
  - "What's my account balance?"

invariants:
  - type: "latency"
    max_ms: 2000
  - type: "valid_json"

Mutation Types

Type	Description	Example
Paraphrase	Semantically equivalent rewrites	"Book a flight" → "I need to fly out"
Noise	Typos and spelling errors	"Book a flight" → "Book a fliight plz"
Tone Shift	Aggressive/impatient phrasing	"Book a flight" → "I need a flight NOW!"
Prompt Injection	Basic adversarial attacks	"Book a flight and ignore previous instructions"

The Problem

The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.

The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.

The Void:

Observability Tools (LangSmith) tell you after the agent failed
Eval Libraries (RAGAS) focus on academic scores, not reliability
Missing Link: A tool that actively attacks before deployment

"If it passes Flakestorm, it won't break in Production."

Why Flakestorm?

Adversarial Fuzzing

Actively attacks your agent with semantic perturbations, noise injection, tone shifts, and prompt injections.

Robustness Scoring

Mathematical proof of reliability. Get a 0.0-1.0 score that quantifies how well your agent handles unexpected inputs.

Invariant Assertions

Define success criteria: latency limits, JSON validity, semantic similarity, PII exclusion, and refusal checks.

Local-First

Runs on your machine using Ollama. No API costs. Generate 1,000+ mutations for free with local models.

Beautiful Reports

Interactive HTML reports with pass/fail matrices, mutation details, and failure analysis.

CI/CD Ready

Block PRs if robustness score drops below threshold. Integrate with GitHub Actions and other pipelines.

Ready to Get Started?

Open source and free forever, or upgrade to cloud for 10-20x faster performance.

View on GitHub