agent-evaluation: an LLM-as-judge evaluation framework for AI-generated content quality assessment

v1.2.0

About this Skill

agent-evaluation is an LLM-as-judge evaluation framework that assesses AI-generated content quality using a weighted composite score and a structured verdict with evidence citations. It suits AI agents, such as those running in Cursor, Windsurf, or Claude Code, that need systematic quality verification of generated output.

Features

Scores AI-generated content on 5 dimensions using a 1-5 rubric
Computes a weighted composite score for comprehensive evaluation
Emits a structured verdict with evidence citations for transparency
Systematic quality verification before claiming task completion
Pairs with verification-before-completion to gate task completion on output quality

Author: oimiragieo
Updated: 3/2/2026

Agent Capability Analysis

The agent-evaluation skill by oimiragieo is an open-source community AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance.

Ideal Agent Persona

Best suited to AI agents that need systematic quality verification of generated content before claiming completion, such as agents running in Cursor, Windsurf, or Claude Code.

Core Value

Empowers agents to evaluate AI-generated content on a 1-5 rubric, producing a weighted composite score and a structured verdict with evidence citations, and integrates with LLM frameworks such as LangChain.

Capabilities Granted for agent-evaluation

Evaluating chatbot responses for accuracy
Assessing code quality generated by AutoGPT
Validating content created by AI agents for completeness and coherence

Prerequisites & Limits

  • Requires LLM-as-judge evaluation framework integration
  • Limited to 5-dimensional scoring rubric
SKILL.md

Agent Evaluation

Overview

LLM-as-judge evaluation framework that scores AI-generated content on 5 dimensions using a 1-5 rubric. Agents evaluate outputs, compute a weighted composite score, and emit a structured verdict with evidence citations.

Core principle: Systematic quality verification before claiming completion. Agent-studio currently has no way to verify agent output quality — this skill fills that gap.

When to Use

Always:

  • Before marking a task complete (pair with verification-before-completion)
  • After a plan is generated (evaluate plan quality)
  • After code review outputs (evaluate review quality)
  • During reflection cycles (evaluate agent responses)
  • When comparing multiple agent outputs

Don't Use:

  • For binary pass/fail checks (use verification-before-completion instead)
  • For security audits (use security-architect skill)
  • For syntax/lint checking (use pnpm lint:fix)

The 5-Dimension Rubric

Every evaluation scores all 5 dimensions on a 1-5 scale:

| Dimension | Weight | What It Measures |
| ------------ | ---- | ---------------- |
| Accuracy | 30% | Factual correctness; no hallucinations; claims are verifiable |
| Groundedness | 25% | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence | 15% | Logical flow; internally consistent; no contradictions |
| Completeness | 20% | All required aspects addressed; no critical gaps |
| Helpfulness | 10% | Actionable; provides concrete next steps; reduces ambiguity |

Scoring Scale (1-5)

| Score | Meaning |
| ----- | ------- |
| 5 | Excellent — fully meets the dimension's criteria with no gaps |
| 4 | Good — meets criteria with minor gaps |
| 3 | Adequate — partially meets criteria; some gaps present |
| 2 | Poor — significant gaps or errors in this dimension |
| 1 | Failing — does not meet the dimension's criteria |

Execution Process

Step 1: Load the Output to Evaluate

Identify what is being evaluated:

- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

Step 2: Score Each Dimension

For each of the 5 dimensions, provide:

  1. Score (1-5): The numeric score
  2. Evidence: Direct quote or file reference from the evaluated output
  3. Rationale: Why this score was given (1-2 sentences)
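As a sketch, one dimension's result might be captured as a record like the following. The object shape is our assumption, derived from the three required fields above; it is not a data structure the skill itself defines.

```javascript
// Hypothetical per-dimension score record (shape is illustrative only).
const accuracyScore = {
  dimension: 'Accuracy',
  score: 4, // numeric score on the 1-5 rubric scale
  // Evidence: direct quote or file:line reference from the evaluated output
  evidence: 'src/auth/session.ts:42 — claim matches the actual implementation',
  // Rationale: 1-2 sentences explaining the score
  rationale: 'All cited file references resolve; one minor count was off by one.',
};
```

Collecting five such records, one per dimension, gives the inputs needed for the composite computation in Step 3.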

Dimension 1: Accuracy

Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation

Dimension 2: Groundedness

Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns

Dimension 3: Coherence

Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order

Dimension 4: Completeness

Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included

Dimension 5: Helpfulness

Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience

Step 3: Compute Weighted Composite Score

composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
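The formula above can be sketched as a small helper. The names `WEIGHTS` and `compositeScore` are our own for illustration; they are not part of the skill's API.

```javascript
// Rubric weights from the table above (they sum to 1.0).
const WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// Weighted composite over the five 1-5 dimension scores.
function compositeScore(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dimension, weight]) => sum + scores[dimension] * weight,
    0
  );
}

// Example: a solid but not flawless output.
// 4×0.30 + 4×0.25 + 3×0.20 + 5×0.15 + 4×0.10 = 3.95 (up to floating-point rounding)
const composite = compositeScore({
  accuracy: 4,
  groundedness: 4,
  completeness: 3,
  coherence: 5,
  helpfulness: 4,
});
```

Because accuracy carries three times the weight of helpfulness, a factual error depresses the composite far more than a missing next step, which is the point of the weighting.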

Step 4: Determine Verdict

| Composite Score | Verdict | Action |
| --------------- | --------- | ------ |
| 4.5 – 5.0 | EXCELLENT | Approve; proceed |
| 3.5 – 4.4 | GOOD | Approve with minor notes |
| 2.5 – 3.4 | ADEQUATE | Request targeted improvements |
| 1.5 – 2.4 | POOR | Reject; requires significant rework |
| 1.0 – 1.4 | FAILING | Reject; restart task |
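The tier boundaries can be sketched as a small helper. The function name `verdictFor` is illustrative, not part of the skill, and the handling of scores that fall between listed bands (e.g. 4.45) is our assumption: a score lands in the tier whose lower bound it meets.

```javascript
// Hypothetical mapping from weighted composite score to verdict tier.
function verdictFor(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}
```

For example, a composite of 3.95 falls in the GOOD tier, so the action is to approve with minor notes rather than request rework.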

Step 5: Emit Structured Verdict

Output the verdict in this format:

```markdown
## Evaluation Verdict

**Output Evaluated**: [Brief description of what was evaluated]
**Evaluator**: [Agent name / task ID]
**Date**: [ISO 8601 date]

### Dimension Scores

| Dimension     | Score | Weight | Weighted Score |
| ------------- | ----- | ------ | -------------- |
| Accuracy      | X/5   | 30%    | X.X            |
| Groundedness  | X/5   | 25%    | X.X            |
| Completeness  | X/5   | 20%    | X.X            |
| Coherence     | X/5   | 15%    | X.X            |
| Helpfulness   | X/5   | 10%    | X.X            |
| **Composite** |       |        | **X.X / 5.0**  |

### Evidence Citations

**Accuracy (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Groundedness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Completeness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Coherence (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Helpfulness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

### Verdict: [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]

**Summary**: [1-2 sentence overall assessment]

**Required Actions** (if verdict is ADEQUATE or worse):

1. [Specific improvement needed]
2. [Specific improvement needed]
```

Usage Examples

Evaluate a Plan Document

```javascript
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });

// Evaluate against 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluate
```

Evaluate Agent Response Before Completion

```javascript
// Agent generates implementation summary
// Before marking task complete, evaluate the summary quality
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

Evaluate Code Review Output

```javascript
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures review is grounded in actual code evidence, not assertions
```

Batch Evaluation (comparing two outputs)

```javascript
// Evaluate output A
// Save verdict A
// Evaluate output B
// Save verdict B
// Compare composites → choose higher scoring output
```

Integration with Verification-Before-Completion

The recommended quality gate pattern:

```javascript
// Step 1: Do the work
// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification
// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });
// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

Iron Laws

  1. NO COMPLETION CLAIM WITHOUT EVALUATION EVIDENCE — If composite score < 2.5 (POOR or FAILING), rework the output before marking any task complete.
  2. ALWAYS score all 5 dimensions — never skip dimensions to save time; each dimension catches different failure modes (accuracy ≠ completeness ≠ groundedness).
  3. ALWAYS cite specific evidence for every dimension score — "Evidence: [file:line or direct quote]" is mandatory, not optional. Assertions without grounding are invalid.
  4. ALWAYS use the weighted composite — accuracy×0.30 + groundedness×0.25 + completeness×0.20 + coherence×0.15 + helpfulness×0.10. Never use a simple average.
  5. NEVER evaluate before the work is complete — evaluating incomplete outputs produces falsely low scores and wastes context budget.

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
| ------------ | ------------ | ---------------- |
| Skipping dimensions to save time | Each dimension catches different failures | Always score all 5 dimensions |
| No evidence citation per dimension | Assertions without grounding are invalid | Quote specific text or file:line for every score |
| Using simple average for composite | Accuracy (30%) matters more than helpfulness (10%) | Use the weighted composite formula |
| Only checking EXCELLENT vs FAILING | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with appropriate action per tier |
| Evaluating before work is done | Incomplete outputs score falsely low | Evaluate completed outputs only |
| Treating evaluation as binary gate | Quality is a spectrum; binary pass/fail loses nuance | Use composite score + per-dimension breakdown together |

Assigned Agents

This skill is used by:

  • qa — Primary: validates test outputs and QA reports before completion
  • code-reviewer — Supporting: evaluates code review quality
  • reflection-agent — Supporting: evaluates agent responses during reflection cycles

Memory Protocol (MANDATORY)

Before starting:

```bash
cat .claude/context/memory/learnings.md
```

Check for:

  • Previous evaluation scores for similar outputs
  • Known quality patterns in this codebase
  • Common failure modes for this task type

After completing:

  • Evaluation pattern found -> .claude/context/memory/learnings.md
  • Quality issue identified -> .claude/context/memory/issues.md
  • Decision about rubric weights -> .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.

FAQ & Installation Steps


Frequently Asked Questions

What is agent-evaluation?

agent-evaluation is an LLM-as-judge evaluation framework that assesses AI-generated content quality using a weighted composite score and a structured verdict with evidence citations. It suits AI agents, such as those running in Cursor, Windsurf, or Claude Code, that need systematic quality verification of generated output.

How do I install agent-evaluation?

Run the command: npx killer-skills add oimiragieo/agent-studio/agent-evaluation. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for agent-evaluation?

Key use cases include: Evaluating chatbot responses for accuracy, Assessing code quality generated by AutoGPT, Validating content created by AI agents for completeness and coherence.

Which IDEs are compatible with agent-evaluation?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for agent-evaluation?

Requires LLM-as-judge evaluation framework integration. Limited to 5-dimensional scoring rubric.

How To Install

  1. Open your terminal

     Open the terminal or command line in your project directory.

  2. Run the install command

     Run: npx killer-skills add oimiragieo/agent-studio/agent-evaluation. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

     The skill is now active. Your AI agent can use agent-evaluation immediately in the current project.

Related Skills

Looking for an alternative to agent-evaluation or another community skill for your workflow? Explore these related open-source skills.

  • widget-generator
  • flags (vercel)
  • zustand (lobehub)
  • data-fetching (lobehub)