analyze-eval — an AI agent skill for diagnosing Convex eval failures

v1.0.0

About this Skill

analyze-eval is a skill that diagnoses eval failures by extracting eval IDs from URLs, with support for Convex document IDs and the visualizer URL pattern. It is aimed at debugging agents that need eval failure diagnosis.

Features

Extracts eval IDs from visualizer URLs matching the pattern /experiment/$experimentId/run/$runId/$category/$evalId
Supports Convex document IDs for runs and evals, such as jn7922j1w29pdxm76bj9ps0enx80mg9e
Handles URLs like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
Diagnoses eval failures using the extracted eval ID

By get-convex
Updated: 3/8/2026

Agent Capability Analysis

The analyze-eval skill by get-convex is an open-source community AI agent skill for Claude Code and other IDE workflows. It helps agents execute tasks with better context, repeatability, and domain-specific guidance, and is focused on Convex document IDs, the visualizer URL pattern, and eval failure analysis.

Ideal Agent Persona

Debugging agents that need eval failure diagnosis.

Core Value

Empowers agents to extract eval IDs from URLs, supporting Convex document IDs and visualizer URL patterns, and diagnose eval failures using experiment and run IDs.

Capabilities Granted for analyze-eval

Debugging eval failures with specific IDs
Extracting eval IDs from visualizer URLs
Diagnosing experiment run issues

Prerequisites & Limits

  • Requires Convex document IDs and visualizer URL pattern support
  • Limited to specific URL formats

analyze-eval

Install analyze-eval, an AI agent skill for agent workflows and automation. It works with Claude Code, Cursor, and Windsurf, with one-command setup.

SKILL.md

Analyze Eval

When to use

  • User shares a URL like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
  • User asks "why did this eval fail?" or "what went wrong with this eval?"
  • User references a specific eval ID

Step 1: Extract the eval ID from the URL

The visualizer URL pattern is:

/experiment/$experimentId/run/$runId/$category/$evalId?tab=steps
  • $runId — the Convex document ID for the run (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e)
  • $evalId — the Convex document ID for the specific eval (e.g. jh73jvjz2n00gfeve1dt5h963s80mbc6)

You need the evalId to query.
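
As a sketch, both IDs can be pulled from such a URL with plain shell parameter expansion. The experiment ID (exp123) and category (queries) below are made-up placeholders; the run and eval IDs reuse the examples above:

```shell
url="https://convex-evals.netlify.app/experiment/exp123/run/jn7922j1w29pdxm76bj9ps0enx80mg9e/queries/jh73jvjz2n00gfeve1dt5h963s80mbc6?tab=steps"

path="${url%%\?*}"      # drop the ?tab=... query string
eval_id="${path##*/}"   # last path segment is the evalId

run_id="${path#*/run/}"   # keep everything after /run/
run_id="${run_id%%/*}"    # first segment of that is the runId

echo "$eval_id"   # jh73jvjz2n00gfeve1dt5h963s80mbc6
echo "$run_id"    # jn7922j1w29pdxm76bj9ps0enx80mg9e
```

Parameter expansion avoids spawning sed/awk, but any equivalent extraction works.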

Step 2: Query the debug action

Run the internal action from the evalScores/ directory. Always use --prod to query the production database (where CI writes results):

```bash
npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'
```

This returns a JSON object with:

| Field | Contents |
| --- | --- |
| eval | Name, category, evalPath, status (pass/fail + failure reason), task text |
| run | Model name, provider, experiment name, run status |
| steps | Array of step results: filesystem, install, deploy, tsc, eslint, tests — each with pass/fail/skipped and failure reason |
| outputFiles | Map of file path -> file content from the model's generated output (unzipped) |
| evalSourceFiles | Map of file path -> file content from the eval source (answer dir, grader, TASK.txt, etc.) |
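
A quick way to surface the first failure is to run the JSON through jq. The fixture below is a tiny stand-in mirroring the shape described here (the per-step "name" field is an assumption for illustration); in practice, redirect the `npx convex run` output into debug.json instead:

```shell
# Inline fixture standing in for the real debug action output.
cat > debug.json <<'EOF'
{"steps":[
  {"name":"install","status":{"kind":"passed"}},
  {"name":"tsc","status":{"kind":"failed"},"failureReason":"TS2339: Property 'send' does not exist"},
  {"name":"tests","status":{"kind":"skipped"}}
]}
EOF

# First failed step and its failure reason.
jq -r '[.steps[] | select(.status.kind == "failed")][0]
       | "\(.name): \(.failureReason)"' debug.json
# → tsc: TS2339: Property 'send' does not exist
```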

Step 3: Analyze the failure

With the data returned, compare:

  1. Which step failed? — Check steps for the first entry with status.kind === "failed". The failureReason field has the error message.
  2. What did the model generate? — Look at outputFiles for the model's code.
  3. What was expected? — Look at evalSourceFiles for the answer directory and grader test files.
  4. What was the task? — Check eval.task for the TASK.txt content.
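
The model-output vs. expected-answer comparison in points 2 and 3 starts with knowing which files exist on each side. A hedged sketch, again using an inline fixture in place of the real action output (file names here are illustrative):

```shell
# Inline fixture: outputFiles and evalSourceFiles are path -> content maps.
cat > debug.json <<'EOF'
{"outputFiles":{"convex/schema.ts":"...","convex/messages.ts":"..."},
 "evalSourceFiles":{"answer/convex/schema.ts":"...","grader.test.ts":"...","TASK.txt":"..."}}
EOF

echo "model output:"; jq -r '.outputFiles | keys[]' debug.json
echo "eval source:";  jq -r '.evalSourceFiles | keys[]' debug.json
```

From there, matching paths (e.g. the answer directory's copy of a file the model also wrote) can be diffed content against content.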

Common failure patterns:

  • eslint fail — Check the failure reason for the specific lint rule violated. Compare the model output against the answer to spot the lint issue.
  • tsc fail — TypeScript compilation error. Check the failure reason for the specific type error.
  • convex dev fail — Schema or function definition issues that prevent Convex from deploying.
  • tests fail — The grader tests didn't pass. Compare outputFiles against evalSourceFiles (look for files like grader.test.ts or answer/) to understand what the tests expected.

Step 4: Classify and report findings

Classify the failure as one of:

  • MODEL_FAULT: The model genuinely got it wrong
  • OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
  • AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
  • KNOWN_GAP: A known limitation of this eval that affects all models (e.g. the Convex API returns fields the model can't predict without being told)

Summarize:

  1. The eval name, model, and experiment
  2. Which step failed and the exact error
  3. The classification and reasoning
  4. The relevant code from the model output that caused the failure
  5. What the correct code should look like (from the answer/eval source)
  6. Whether any action is recommended (config change, task clarification, etc.)

FAQ & Installation Steps


Frequently Asked Questions

What is analyze-eval?

analyze-eval is a skill that analyzes eval failures by extracting eval IDs from URLs, with support for Convex document IDs and visualizer URL patterns. It is aimed at debugging agents that need eval failure diagnosis.

How do I install analyze-eval?

Run the command: npx killer-skills add get-convex/convex-evals. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for analyze-eval?

Key use cases include debugging eval failures with specific IDs, extracting eval IDs from visualizer URLs, and diagnosing experiment run issues.

Which IDEs are compatible with analyze-eval?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for analyze-eval?

Requires Convex document ID and visualizer URL pattern support. Limited to specific URL formats.

How To Install

  1. Open your terminal

     Open the terminal or command line in your project directory.

  2. Run the install command

     Run: npx killer-skills add get-convex/convex-evals. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

     The skill is now active. Your AI agent can use analyze-eval immediately in the current project.

Related Skills

Looking for an alternative to analyze-eval or another community skill for your workflow? Related open-source skills include widget-generator, flags (by vercel), zustand (by lobehub), and data-fetching (by lobehub).