analyze-eval — an AI agent skill for diagnosing Convex eval failures

v1.0.0

About this Skill

analyze-eval is a skill that diagnoses eval failures by extracting eval IDs from URLs, with support for Convex document IDs and the visualizer URL pattern. It is aimed at debugging agents that need eval failure diagnosis.

Features

Extracts eval IDs from visualizer URLs matching the pattern /experiment/$experimentId/run/$runId/$category/$evalId
Supports Convex document IDs for runs and evals, such as jn7922j1w29pdxm76bj9ps0enx80mg9e
Handles URLs like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
Diagnoses eval failures using the extracted eval ID

By get-convex
Updated: 3/8/2026

Agent Capability Analysis

The analyze-eval skill by get-convex is an open-source community AI agent skill for Claude Code and other IDE workflows. It helps agents execute tasks with better context, repeatability, and domain-specific guidance, and is focused on Convex document IDs, the visualizer URL pattern, and eval failure analysis.

Ideal Agent Persona

Debugging agents that need eval failure diagnosis.

Core Value

Empowers agents to extract eval IDs from URLs, supporting Convex document IDs and visualizer URL patterns, and diagnose eval failures using experiment and run IDs.

Capabilities Granted for analyze-eval

Debugging eval failures with specific IDs
Extracting eval IDs from visualizer URLs
Diagnosing experiment run issues

Prerequisites & Limits

  • Requires Convex document IDs and visualizer URL pattern support
  • Limited to specific URL formats

analyze-eval

Install analyze-eval, an AI agent skill for agent workflows and automation. It works with Claude Code, Cursor, and Windsurf, with one-command setup.

SKILL.md

Analyze Eval

When to use

  • User shares a URL like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
  • User asks "why did this eval fail?" or "what went wrong with this eval?"
  • User references a specific eval ID

Step 1: Extract the eval ID from the URL

The visualizer URL pattern is:

/experiment/$experimentId/run/$runId/$category/$evalId?tab=steps
  • $runId — the Convex document ID for the run (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e)
  • $evalId — the Convex document ID for the specific eval (e.g. jh73jvjz2n00gfeve1dt5h963s80mbc6)

You need the evalId to query.
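
As a sketch, both IDs can be pulled from such a URL with plain shell parameter expansion. The experiment ID (exp123) and category (queries) below are made-up placeholders; the run and eval IDs reuse the examples above:

```shell
url="https://convex-evals.netlify.app/experiment/exp123/run/jn7922j1w29pdxm76bj9ps0enx80mg9e/queries/jh73jvjz2n00gfeve1dt5h963s80mbc6?tab=steps"

path="${url%%\?*}"      # drop the ?tab=... query string
eval_id="${path##*/}"   # last path segment is the evalId

run_id="${path#*/run/}"   # keep everything after /run/
run_id="${run_id%%/*}"    # first segment of that is the runId

echo "$eval_id"   # jh73jvjz2n00gfeve1dt5h963s80mbc6
echo "$run_id"    # jn7922j1w29pdxm76bj9ps0enx80mg9e
```

Parameter expansion avoids spawning sed/awk, but any equivalent extraction works.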

Step 2: Query the debug action

Run the internal action from the evalScores/ directory. Always use --prod to query the production database (where CI writes results):

```bash
npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'
```

This returns a JSON object with:

| Field | Contents |
| --- | --- |
| eval | Name, category, evalPath, status (pass/fail + failure reason), task text |
| run | Model name, provider, experiment name, run status |
| steps | Array of step results: filesystem, install, deploy, tsc, eslint, tests — each with pass/fail/skipped and failure reason |
| outputFiles | Map of file path -> file content from the model's generated output (unzipped) |
| evalSourceFiles | Map of file path -> file content from the eval source (answer dir, grader, TASK.txt, etc.) |
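
A quick way to surface the first failure is to run the JSON through jq. The fixture below is a tiny stand-in mirroring the shape described here (the per-step "name" field is an assumption for illustration); in practice, redirect the `npx convex run` output into debug.json instead:

```shell
# Inline fixture standing in for the real debug action output.
cat > debug.json <<'EOF'
{"steps":[
  {"name":"install","status":{"kind":"passed"}},
  {"name":"tsc","status":{"kind":"failed"},"failureReason":"TS2339: Property 'send' does not exist"},
  {"name":"tests","status":{"kind":"skipped"}}
]}
EOF

# First failed step and its failure reason.
jq -r '[.steps[] | select(.status.kind == "failed")][0]
       | "\(.name): \(.failureReason)"' debug.json
# → tsc: TS2339: Property 'send' does not exist
```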

Step 3: Analyze the failure

With the data returned, compare:

  1. Which step failed? — Check steps for the first entry with status.kind === "failed". The failureReason field has the error message.
  2. What did the model generate? — Look at outputFiles for the model's code.
  3. What was expected? — Look at evalSourceFiles for the answer directory and grader test files.
  4. What was the task? — Check eval.task for the TASK.txt content.
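
The model-output vs. expected-answer comparison in points 2 and 3 starts with knowing which files exist on each side. A hedged sketch, again using an inline fixture in place of the real action output (file names here are illustrative):

```shell
# Inline fixture: outputFiles and evalSourceFiles are path -> content maps.
cat > debug.json <<'EOF'
{"outputFiles":{"convex/schema.ts":"...","convex/messages.ts":"..."},
 "evalSourceFiles":{"answer/convex/schema.ts":"...","grader.test.ts":"...","TASK.txt":"..."}}
EOF

echo "model output:"; jq -r '.outputFiles | keys[]' debug.json
echo "eval source:";  jq -r '.evalSourceFiles | keys[]' debug.json
```

From there, matching paths (e.g. the answer directory's copy of a file the model also wrote) can be diffed content against content.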

Common failure patterns:

  • eslint fail — Check the failure reason for the specific lint rule violated. Compare the model output against the answer to spot the lint issue.
  • tsc fail — TypeScript compilation error. Check the failure reason for the specific type error.
  • convex dev fail — Schema or function definition issues that prevent Convex from deploying.
  • tests fail — The grader tests didn't pass. Compare outputFiles against evalSourceFiles (look for files like grader.test.ts or answer/) to understand what the tests expected.

Step 4: Classify and report findings

Classify the failure as one of:

  • MODEL_FAULT: The model genuinely got it wrong
  • OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
  • AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
  • KNOWN_GAP: A known limitation of this eval that affects all models (e.g. the Convex API returns fields the model can't predict without being told)

Summarize:

  1. The eval name, model, and experiment
  2. Which step failed and the exact error
  3. The classification and reasoning
  4. The relevant code from the model output that caused the failure
  5. What the correct code should look like (from the answer/eval source)
  6. Whether any action is recommended (config change, task clarification, etc.)

FAQ & Installation Steps


Frequently Asked Questions

What is analyze-eval?

analyze-eval is a skill that analyzes eval failures by extracting eval IDs from URLs, with support for Convex document IDs and visualizer URL patterns. It is aimed at debugging agents that need eval failure diagnosis.

How do I install analyze-eval?

Run the command: npx killer-skills add get-convex/convex-evals. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for analyze-eval?

Key use cases include debugging eval failures with specific IDs, extracting eval IDs from visualizer URLs, and diagnosing experiment run issues.

Which IDEs are compatible with analyze-eval?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for analyze-eval?

Requires Convex document ID and visualizer URL pattern support. Limited to specific URL formats.

How To Install

  1. Open your terminal

     Open the terminal or command line in your project directory.

  2. Run the install command

     Run: npx killer-skills add get-convex/convex-evals. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

     The skill is now active. Your AI agent can use analyze-eval immediately in the current project.

Related Skills

Looking for an alternative to analyze-eval or another community skill for your workflow? Related open-source skills include widget-generator, flags (by vercel), zustand (by lobehub), and data-fetching (by lobehub).