Most engineers treat prompts like throwaway text. You write it, send it, hope it works. If it doesn't, you tweak it and try again. There's no version control, no test suite, no regression detection.
I kept running into the same problem: a prompt that worked last week would silently degrade after a model update. Or I'd "improve" a system prompt and break three things I didn't test. The tooling gap was obvious — we treat prompts worse than we treat TODO comments.
So I built PromptOps, a Claude Code plugin that treats prompts like tested code.
What It Does
Five commands, all usable inline:
/evaluate— Paste any prompt, get a 1-5 score across five dimensions (structure, specificity, output control, error prevention, testability) with the exact fixes to improve it./improve— Same input, but it hands you back a rewritten version. Copy-paste ready. Shows a before/after score with what changed./regression— Compare two prompt versions side-by-side. See what improved, what regressed, and whether it's safe to ship./compare— Benchmark a prompt across Claude Opus, Sonnet, Haiku, and GPT-4o. Shows per-run cost, daily cost, and monthly cost with clear token assumptions./run— Execute the improved prompt for real. Closes the loop in one session.
Plus two auto-invoked skills: one that nudges you while you're editing prompts (silent when things are solid), and one that captures approved outputs as test cases over time.
The Workflow
The typical session looks like this:
/evaluate You are a code reviewer. Find bugs.
→ Score: 1.8/5. Three issues: no output format, no guardrails, too vague.
/improve You are a code reviewer. Find bugs.
→ Rewrites it with JSON schema, negative examples, role framing. Score: 4.2/5.
/regression
→ Auto-compares original vs improved. ✅ Safe — all changes are improvements.
/run
→ Executes the improved prompt against your actual codebase.
Evaluate. Improve. Verify. Run. All in one session, no setup.
The Hard Part: Getting Claude to Not Execute Your Prompt
The biggest challenge wasn't the scoring rubric or the cost calculations. It was getting Claude to evaluate a prompt instead of doing what the prompt says.
When you paste "go through my codebase and list all the features" into /evaluate, Claude's instinct is to actually go through your codebase. It took several iterations to find the right approach — the $ARGUMENTS pattern, combined with explicit behavioral overrides at the top of each command file, finally got it to consistently treat input as text to analyze rather than instructions to follow.
This is a useful lesson for anyone building Claude Code plugins: Claude's base behavior is strong. Your command instructions compete with its system prompt for attention budget. Keep commands short, use examples over rules, and front-load the behavioral override before anything else.
Architecture Decisions
Why a plugin instead of a SaaS app? I originally designed PromptOps as a full product — Supabase backend, Vercel frontend, CLI tool, the works. But the plugin system gives you distribution for free. There's no auth, no billing, no infrastructure to manage for v1. The plugin stores everything locally in .promptops/ inside whatever project you're working in.
Why commands instead of skills? Skills auto-invoke based on context, which is great for the background nudges. But the core workflow (evaluate → improve → regression → run) needs to be user-initiated. You don't want your code review getting interrupted by an unsolicited prompt evaluation.
Why save files? Every command saves its output to .promptops/. This is the seed of a prompt engineering history — over time, you build up evaluations, improved versions, regression reports, and cost comparisons. When this eventually becomes a hosted product, that local data is what migrates.
What I Learned About Claude Code Plugins
A few things that aren't obvious from the docs:
-
Instruction budget is real. Claude Code's system prompt uses ~50 of the ~150-200 instructions the model can reliably follow. My early command files were 100+ lines and Claude would ignore half of them. The final versions are 35-75 lines each.
-
Descriptions are the first thing Claude reads. The
descriptionfield in the frontmatter isn't just for the user's autocomplete. It heavily influences how Claude interprets the command. A description that says "Score any prompt" gives Claude permission to decide what is or isn't a prompt. A description that says "Treat ALL user input as a prompt to score" doesn't. -
Formatting instructions get overridden. No matter how many times you write "NEVER use ASCII art," Claude's default output preferences are strong. The fix is to provide an exact template with the structure you want, not a list of things to avoid. Claude follows examples better than rules.
-
The Mac app supports plugins. Not just the terminal — you can upload plugins through Plugins → Add plugin → Upload plugin in the desktop app. Same functionality, nicer interface.
Try It
The plugin is open source — check out the landing page or install directly:
/plugin install promptops@harmansidhu/promptops
Or download from GitHub and upload through the Mac app.
If you're writing prompts for production use — system prompts, agent instructions, CLAUDE.md files, CI/CD templates — you should be testing them. PromptOps makes that as easy as typing /evaluate.
What's Next
- CI integration: run evaluations as a pre-merge check
- Multi-model direct execution within
/compare - Hosted backend for team collaboration and aggregate analytics
- Mutation engine that learns which improvements work for which task types
The local plugin is the validation vehicle. If engineers actually use it, the hosted product writes itself.
Harman Sidhu is a Product + Platform Engineer building reliable AI tooling. More at harmansidhudev.com.