← All Projects
Side Project

PromptOps — Prompt Testing for Claude Code

Claude CodePluginAI ToolingOpen Source
VIEW PROJECT →

Problem

Engineers treat prompts like throwaway text — no version control, no test suite, no regression detection. A prompt that worked last week silently degrades after a model update, and there's no tooling to catch it.

Ownership

Sole creator. Designed the command architecture, scoring rubric, cost benchmarking model, and plugin distribution strategy. Open-sourced on GitHub.

Architecture

Five user-invoked commands (/evaluate, /improve, /regression, /compare, /run) plus two auto-invoked skills (prompt-quality-check, golden-dataset-builder). All output persists to .promptops/ locally — evaluations, improved versions, regression reports, and cost comparisons. The $ARGUMENTS pattern with behavioral overrides prevents Claude from executing prompts instead of analyzing them.

Highlights

  • 1–5 scoring across five dimensions: structure, specificity, output control, error prevention, testability
  • /compare benchmarks across Claude Opus, Sonnet, Haiku, and GPT-4o with per-run and monthly cost projections
  • /regression detects improvements, risks, and regressions between prompt versions
  • Auto-invoked skill nudges prompt quality while editing — silent when solid
  • Golden dataset builder captures approved outputs as test cases over time

Deployment

Distributed as a Claude Code plugin — installable via terminal or Mac app upload. All state stored locally in .promptops/ per project. No auth, no backend, no infrastructure for v1.

Impact

Closes the gap between writing prompts and testing them. Engineers get a repeatable evaluate → improve → verify → run loop without leaving their terminal.

Let's Connect

Have a project in mind?
Let's talk.

Always interested in thoughtful engineering challenges, platform problems, and building things that matter.