agent-eval — coding agent evaluation for Claude Code: reproducible tasks, git worktree isolation, pass-rate metrics, cost analysis

Verified
v1.0.0
GitHub

About this Skill

agent-eval is a CLI tool for head-to-head comparison of coding agents on reproducible tasks, providing systematized evaluation and data-backed insights. It is ideal for coding agents and teams that need rigorous, repeatable comparisons on their own custom tasks.

Features

Define tasks declaratively in YAML
Run agents in isolated git worktrees for reproducibility
Collect metrics: pass rate, cost, time, and consistency
Execute agents against custom tasks
Produce data-backed agent selection decisions for teams

Core Topics

Author: affaan-m
Updated: 3/25/2026

Agent Capability Analysis

The agent-eval skill by affaan-m is an official open-source AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance. Optimized for Claude Code, coding agent evaluation, and reproducible tasks.

Ideal Agent Persona

Perfect for Coding Agents needing comprehensive content analysis and head-to-head comparison capabilities on custom tasks.

Core Value

Empowers agents to systematize comparisons of coding agents like Claude Code, Aider, and Codex using YAML task definitions, Git worktree isolation, and metrics like pass rate, cost, time, and consistency, all while utilizing protocols like pytest and grep for deterministic judging.

Capabilities Granted for agent-eval

Comparing coding agents on custom tasks with reproducible results
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Producing data-backed agent selection decisions for a team

Prerequisites & Limits

  • Requires Git repository access
  • Needs a pinned commit for reproducibility
  • Cost tracking limited to coding agents that report API spend

agent-eval

Install agent-eval with one-command setup. Works with Claude Code, Cursor, and Windsurf.

SKILL.md

Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

When to Activate

  • Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
  • Measuring agent performance before adopting a new tool or model
  • Running regression checks when an agent updates its model or tooling
  • Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to a specific commit for reproducibility
```

Git Worktree Isolation

Each agent run gets its own git worktree, with no Docker required. This isolates runs so agents cannot interfere with each other or corrupt the base repo, and keeps results reproducible.
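Under the hood, this kind of isolation amounts to plain `git worktree` commands. A minimal Python sketch of the idea (illustrative only; agent-eval's actual internals may differ):

```python
import subprocess
from pathlib import Path

def create_worktree(repo: str, commit: str, dest: str) -> Path:
    """Check out `commit` into a fresh, detached worktree at `dest`."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "--detach", dest, commit],
        check=True, capture_output=True,
    )
    return Path(dest)

def remove_worktree(repo: str, dest: str) -> None:
    """Tear the worktree down after the run so the base repo stays clean."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "remove", "--force", dest],
        check=True, capture_output=True,
    )
```

Because each worktree is detached at the pinned commit, concurrent runs never share working files, yet they all share the same underlying object store.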

Metrics Collected

| Metric      | What It Measures                                       |
|-------------|--------------------------------------------------------|
| Pass rate   | Did the agent produce code that passes the judge?      |
| Cost        | API spend per task (when available)                    |
| Time        | Wall-clock seconds to completion                       |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%)      |
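How these metrics relate can be sketched with a hypothetical `RunResult` record (an assumed shape for illustration, not agent-eval's API):

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    cost_usd: float
    seconds: float

def summarize(runs: list[RunResult]) -> dict:
    # Pass rate and consistency both derive from the fraction of passing
    # runs; consistency is reported as a percentage across repeated runs.
    passes = sum(r.passed for r in runs)
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "avg_cost_usd": round(sum(r.cost_usd for r in runs) / len(runs), 2),
        "avg_seconds": round(sum(r.seconds for r in runs) / len(runs), 1),
    }
```

For example, two passes out of three runs yields a pass rate of 2/3 and a consistency of 67%, matching the sample report below.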

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:

  1. Creates a fresh git worktree from the specified commit
  2. Hands the prompt to the agent
  3. Runs the judge criteria
  4. Records pass/fail, cost, and time
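The four steps above can be sketched as a loop with pluggable callables standing in for the real worktree, agent, and judge machinery (hypothetical shapes, for illustration only):

```python
import time
from typing import Callable

def run_once(
    create_worktree: Callable[[], str],   # step 1: fresh worktree at the pinned commit
    run_agent: Callable[[str], None],     # step 2: hand the prompt to the agent
    judge: Callable[[str], bool],         # step 3: run the judge criteria
) -> dict:
    """One evaluation run; records pass/fail and wall-clock time (step 4)."""
    start = time.monotonic()
    workdir = create_worktree()
    run_agent(workdir)
    passed = judge(workdir)
    return {"passed": passed, "seconds": time.monotonic() - start}
```

Repeating this loop per agent and per task is what produces the pass-rate and consistency numbers in the report.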

3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

Judge Types

Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

Best Practices

  • Start with 3-5 tasks that represent your real workload, not toy examples
  • Run at least 3 trials per agent to capture variance — agents are non-deterministic
  • Pin the commit in your task YAML so results are reproducible across days/weeks
  • Include at least one deterministic judge (tests, build) per task — LLM judges add noise
  • Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
  • Version your task definitions — they are test fixtures, treat them as code

FAQ & Installation Steps


Frequently Asked Questions

What is agent-eval?

agent-eval is a CLI tool for head-to-head comparison of coding agents on reproducible tasks, providing systematized evaluation and data-backed insights. It suits coding agents and teams that need rigorous comparisons on custom tasks.

How do I install agent-eval?

Run the command: npx killer-skills add affaan-m/everything-claude-code/agent-eval. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for agent-eval?

Key use cases include: comparing coding agents on custom tasks with reproducible results, measuring agent performance before adopting a new tool or model, running regression checks when an agent updates its model or tooling, and producing data-backed agent selection decisions for a team.

Which IDEs are compatible with agent-eval?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for agent-eval?

Requires Git repository access. Needs specific commit for reproducibility. Limited to coding agents with API spend tracking.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add affaan-m/everything-claude-code/agent-eval. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use agent-eval immediately in the current project.

Related Skills

Looking for an alternative to agent-eval or another official skill for your workflow? Explore these related open-source skills.

  • flags (facebook): Use when you need to check feature flag states, compare channels, or debug why a feature behaves differently across release channels.
  • extract-errors (facebook): Use when adding new error messages to React, or seeing unknown error code warnings.
  • fix (facebook): Use when you have lint errors, formatting issues, or before committing code to ensure it passes CI.
  • flow (facebook): Use when you need to run Flow type checking, or when seeing Flow type errors in React code.