agent-eval — coding agent evaluation for Claude Code: reproducible tasks, git worktree isolation, pass-rate metrics, cost analysis

Verified
v1.0.0
GitHub

About this Skill

agent-eval is a CLI tool for head-to-head comparison of coding agents on reproducible tasks, providing systematized evaluation and data-backed insights. It is ideal for coding agents and teams that need rigorous, repeatable comparisons on their own custom tasks.

Features

Define tasks declaratively in YAML
Run agents in isolated git worktrees for reproducibility
Collect metrics: pass rate, cost, time, and consistency
Execute agents against custom tasks
Produce data-backed agent selection decisions for teams

Core Topics

Author: affaan-m
Updated: 3/25/2026

Agent Capability Analysis

The agent-eval skill by affaan-m is an official open-source AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance. Optimized for Claude Code, coding agent evaluation, and reproducible tasks.

Ideal Agent Persona

Perfect for Coding Agents needing comprehensive content analysis and head-to-head comparison capabilities on custom tasks.

Core Value

Empowers agents to systematize comparisons of coding agents like Claude Code, Aider, and Codex using YAML task definitions, Git worktree isolation, and metrics like pass rate, cost, time, and consistency, all while utilizing protocols like pytest and grep for deterministic judging.

Capabilities Granted for agent-eval

Comparing coding agents on custom tasks with reproducible results
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Producing data-backed agent selection decisions for a team

Prerequisites & Limits

  • Requires Git repository access
  • Needs a pinned commit for reproducibility
  • Cost tracking limited to coding agents that report API spend

agent-eval

Install agent-eval with one-command setup. Works with Claude Code, Cursor, and Windsurf.

SKILL.md

Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

When to Activate

  • Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
  • Measuring agent performance before adopting a new tool or model
  • Running regression checks when an agent updates its model or tooling
  • Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to a specific commit for reproducibility
```

Git Worktree Isolation

Each agent run gets its own git worktree, with no Docker required. This isolates runs so agents cannot interfere with each other or corrupt the base repo, and keeps results reproducible.
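Under the hood, this kind of isolation amounts to plain `git worktree` commands. A minimal Python sketch of the idea (illustrative only; agent-eval's actual internals may differ):

```python
import subprocess
from pathlib import Path

def create_worktree(repo: str, commit: str, dest: str) -> Path:
    """Check out `commit` into a fresh, detached worktree at `dest`."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "--detach", dest, commit],
        check=True, capture_output=True,
    )
    return Path(dest)

def remove_worktree(repo: str, dest: str) -> None:
    """Tear the worktree down after the run so the base repo stays clean."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "remove", "--force", dest],
        check=True, capture_output=True,
    )
```

Because each worktree is detached at the pinned commit, concurrent runs never share working files, yet they all share the same underlying object store.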

Metrics Collected

| Metric      | What It Measures                                       |
|-------------|--------------------------------------------------------|
| Pass rate   | Did the agent produce code that passes the judge?      |
| Cost        | API spend per task (when available)                    |
| Time        | Wall-clock seconds to completion                       |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%)      |
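How these metrics relate can be sketched with a hypothetical `RunResult` record (an assumed shape for illustration, not agent-eval's API):

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    cost_usd: float
    seconds: float

def summarize(runs: list[RunResult]) -> dict:
    # Pass rate and consistency both derive from the fraction of passing
    # runs; consistency is reported as a percentage across repeated runs.
    passes = sum(r.passed for r in runs)
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "avg_cost_usd": round(sum(r.cost_usd for r in runs) / len(runs), 2),
        "avg_seconds": round(sum(r.seconds for r in runs) / len(runs), 1),
    }
```

For example, two passes out of three runs yields a pass rate of 2/3 and a consistency of 67%, matching the sample report below.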

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:

  1. Creates a fresh git worktree from the specified commit
  2. Hands the prompt to the agent
  3. Runs the judge criteria
  4. Records pass/fail, cost, and time
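The four steps above can be sketched as a loop with pluggable callables standing in for the real worktree, agent, and judge machinery (hypothetical shapes, for illustration only):

```python
import time
from typing import Callable

def run_once(
    create_worktree: Callable[[], str],   # step 1: fresh worktree at the pinned commit
    run_agent: Callable[[str], None],     # step 2: hand the prompt to the agent
    judge: Callable[[str], bool],         # step 3: run the judge criteria
) -> dict:
    """One evaluation run; records pass/fail and wall-clock time (step 4)."""
    start = time.monotonic()
    workdir = create_worktree()
    run_agent(workdir)
    passed = judge(workdir)
    return {"passed": passed, "seconds": time.monotonic() - start}
```

Repeating this loop per agent and per task is what produces the pass-rate and consistency numbers in the report.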

3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

Judge Types

Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

Best Practices

  • Start with 3-5 tasks that represent your real workload, not toy examples
  • Run at least 3 trials per agent to capture variance — agents are non-deterministic
  • Pin the commit in your task YAML so results are reproducible across days/weeks
  • Include at least one deterministic judge (tests, build) per task — LLM judges add noise
  • Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
  • Version your task definitions — they are test fixtures, treat them as code

FAQ & Installation Steps


Frequently Asked Questions

What is agent-eval?

agent-eval is a CLI tool for head-to-head comparison of coding agents on reproducible tasks, providing systematized evaluation and data-backed insights. It suits coding agents and teams that need rigorous comparisons on custom tasks.

How do I install agent-eval?

Run the command: npx killer-skills add affaan-m/everything-claude-code/agent-eval. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for agent-eval?

Key use cases include: comparing coding agents on custom tasks with reproducible results, measuring agent performance before adopting a new tool or model, running regression checks when an agent updates its model or tooling, and producing data-backed agent selection decisions for a team.

Which IDEs are compatible with agent-eval?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for agent-eval?

Requires Git repository access. Needs specific commit for reproducibility. Limited to coding agents with API spend tracking.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add affaan-m/everything-claude-code/agent-eval. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use agent-eval immediately in the current project.

Related Skills

Looking for an alternative to agent-eval or another official skill for your workflow? Explore these related open-source skills.

  • flags (facebook): Use when you need to check feature flag states, compare channels, or debug why a feature behaves differently across release channels.
  • extract-errors (facebook): Use when adding new error messages to React, or seeing unknown error code warnings.
  • fix (facebook): Use when you have lint errors, formatting issues, or before committing code to ensure it passes CI.
  • flow (facebook): Use when you need to run Flow type checking, or when seeing Flow type errors in React code.