Benchmark Manager — AILANG evaluation benchmarks

v1.0.0
GitHub

About this Skill

Benchmark Manager is a skill that manages AILANG evaluation benchmarks, with features such as correct prompt integration, debugging workflows, and best practices for AI model evaluation. It is ideal for AI agents that need advanced benchmark management.

Features

Manages AILANG evaluation benchmarks with correct prompt integration
Provides debugging workflows using scripts like show_full_prompt.sh
Supports testing benchmarks with specific models like claude-haiku-4-5
Checks benchmark YAML for common issues
Integrates with ailang eval-suite for model evaluation
Offers best practices learned from real benchmark failures

Author: sunholo-data
Updated: 3/8/2026

Agent Capability Analysis

The Benchmark Manager skill by sunholo-data is an open-source community AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance. It is optimized for managing AILANG evaluation benchmarks and debugging workflows for AI models.

Ideal Agent Persona

Perfect for AI Agents needing advanced AILANG evaluation benchmark management with correct prompt integration and debugging workflows.

Core Value

Empowers agents to manage AILANG evaluation benchmarks with best practices learned from real benchmark failures, utilizing scripts like show_full_prompt.sh and ailang eval-suite for efficient debugging and testing with specific models like claude-haiku-4-5.

Capabilities Granted for Benchmark Manager

Debugging failing benchmarks with full prompt visibility
Testing benchmarks with specific AI models like claude-haiku-4-5
Analyzing benchmark YAML for common issues to improve evaluation suite efficiency

Prerequisites & Limits

  • Requires AILANG evaluation benchmarks setup
  • Dependent on specific model availability like claude-haiku-4-5 for testing

Benchmark Manager

Discover how to manage AILANG evaluation benchmarks with ease. Learn to debug workflows and integrate prompts correctly with the Benchmark Manager skill.

SKILL.md

Benchmark Manager

Manage AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures.

Quick Start

Debugging a failing benchmark:

```bash
# 1. Show the full prompt that models see
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

# 2. Test a benchmark with a specific model
ailang eval-suite --models claude-haiku-4-5 --benchmarks json_parse

# 3. Check benchmark YAML for common issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/json_parse.yml
```

When to Use This Skill

Invoke this skill when:

  • User asks to create a new benchmark
  • User asks to debug/fix a failing benchmark
  • User wants to understand why models generate wrong code
  • User asks about benchmark YAML format
  • Benchmarks show 0% pass rate despite language support

CRITICAL: prompt vs task_prompt

This is the most important concept for benchmark management.

The Problem (v0.4.8 Discovery)

Benchmarks have TWO different prompt fields with VERY different behavior:

| Field | Behavior | Use When |
|---|---|---|
| `prompt:` | REPLACES the teaching prompt entirely | Testing raw model capability (rare) |
| `task_prompt:` | APPENDS to the teaching prompt | Normal benchmarks (99% of cases) |

Why This Matters

```yaml
# BAD - Model never sees AILANG syntax!
prompt: |
  Write a program that prints "Hello"

# GOOD - Model sees teaching prompt + task
task_prompt: |
  Write a program that prints "Hello"
```

With prompt:, models generate Python/pseudo-code because they never learn AILANG syntax.

How Prompts Combine

From internal/eval_harness/spec.go (lines 91-93):

```go
fullPrompt := basePrompt // Teaching prompt from prompts/v0.4.x.md
if s.TaskPrompt != "" {
    fullPrompt = fullPrompt + "\n\n## Task\n\n" + s.TaskPrompt
}
```

The teaching prompt teaches AILANG syntax; task_prompt adds the specific task.
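The same combination can be sketched in shell (an illustration of the Go logic above with placeholder strings, not harness code):

```shell
# Illustration only: mirrors the Go concatenation above with placeholder prompts.
base_prompt="(teaching prompt from prompts/v0.4.x.md)"
task_prompt='Write a program that prints "Hello"'

full_prompt="$base_prompt"
if [ -n "$task_prompt" ]; then
  # task_prompt is appended under a "## Task" heading
  full_prompt="$full_prompt"$'\n\n## Task\n\n'"$task_prompt"
fi
printf '%s\n' "$full_prompt"
```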

Available Scripts

scripts/show_full_prompt.sh

Shows the complete prompt that models receive for a benchmark.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh <benchmark_id>

# Example:
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse
```

scripts/check_benchmark.sh

Validates a benchmark YAML file for common issues.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/<name>.yml
```

Checks for:

  • Using prompt: instead of task_prompt: (warning)
  • Missing required fields
  • Invalid capability names
  • Syntax errors in YAML
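As a rough illustration of the first check (a hypothetical sketch, not the actual check_benchmark.sh source), the prompt:-vs-task_prompt: warning can be little more than a grep:

```shell
# Hypothetical sketch of the prompt: check; not the real check_benchmark.sh.
cat > /tmp/demo_benchmark.yml <<'EOF'
id: demo
prompt: |
  Write a program that prints "Hello"
EOF

if grep -qE '^prompt:' /tmp/demo_benchmark.yml; then
  echo "WARNING: uses prompt: (replaces teaching prompt); use task_prompt: instead"
fi
```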

scripts/test_benchmark.sh

Runs a quick single-model test of a benchmark.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/test_benchmark.sh <benchmark_id> [model]

# Examples:
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse claude-haiku-4-5
```

Benchmark YAML Format

Required Fields

```yaml
id: my_benchmark              # Unique identifier (snake_case)
description: "Short description of what this tests"
languages: ["python", "ailang"]
entrypoint: "main"            # Function to call
caps: ["IO"]                  # Required capabilities
difficulty: "easy|medium|hard"
expected_gain: "low|medium|high"
task_prompt: |                # ALWAYS use task_prompt, not prompt!
  Write a program in <LANG> that:
  1. Does something
  2. Prints the result

  Output only the code, no explanations.
expected_stdout: |            # Exact expected output
  expected output here
```

Capability Names

Valid capabilities: IO, FS, Clock, Net

```yaml
# File I/O
caps: ["IO"]

# HTTP requests
caps: ["Net", "IO"]

# File system operations
caps: ["FS", "IO"]
```

Creating New Benchmarks

Step 1: Determine Requirements

  • What language feature/capability is being tested?
  • Can models solve this with just the teaching prompt?
  • What's the expected output?

Step 2: Write the Benchmark

```yaml
id: my_new_benchmark
description: "Test feature X capability"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "medium"
task_prompt: |
  Write a program in <LANG> that:
  1. Clear description of task
  2. Another step
  3. Print the result

  Output only the code, no explanations.
expected_stdout: |
  exact expected output
```

Step 3: Validate and Test

```bash
# Check for issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/my_new_benchmark.yml

# Test with cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_new_benchmark
```

Debugging Failing Benchmarks

Symptom: 0% Pass Rate Despite Language Support

Check 1: Is it using task_prompt:?

```bash
grep -E "^prompt:" benchmarks/failing_benchmark.yml
# If this returns a match, change to task_prompt:
```

Check 2: What prompt do models see?

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark
```

Check 3: Is the teaching prompt up to date?

```bash
# After editing prompts/v0.x.x.md, you MUST rebuild:
make quick-install
```

Symptom: Models Copy Template Instead of Solving Task

The teaching prompt includes a template structure. If models copy it verbatim:

  1. Make sure task is clearly different from examples in teaching prompt
  2. Check that task_prompt explicitly describes what to do
  3. Consider if the task description is ambiguous

Symptom: compile_error on Valid Syntax

Common AILANG-specific issues models get wrong:

| Wrong | Correct | Notes |
|---|---|---|
| `print(42)` | `print(show(42))` | print expects string |
| `a % b` | `mod_Int(a, b)` | No % operator |
| `def main()` | `export func main()` | Wrong keyword |
| `for x in xs` | `match xs { ... }` | No for loops |

If models consistently make these mistakes, the teaching prompt needs improvement (use prompt-manager skill).
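Putting the corrections above together, a minimal AILANG program would look roughly like this (a sketch assembled only from the table above; not verified against a compiler):

```ailang
export func main() {
  print(show(42))
}
```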

Common Mistakes

1. Using prompt: Instead of task_prompt:

```yaml
# WRONG - Models never see AILANG syntax
prompt: |
  Write code that...

# CORRECT - Teaching prompt + task
task_prompt: |
  Write code that...
```

2. Forgetting to Rebuild After Prompt Changes

```bash
# After editing prompts/v0.x.x.md:
make quick-install  # REQUIRED!
```

3. Putting Hints in Benchmarks

```yaml
# WRONG - Hints in benchmark
task_prompt: |
  Write code that prints 42.
  Hint: Use print(show(42)) in AILANG.

# CORRECT - No hints; if models fail, fix the teaching prompt
task_prompt: |
  Write code that prints 42.
```

If models need AILANG-specific hints, the teaching prompt is incomplete. Use the prompt-manager skill to fix it.

4. Testing Too Many Models at Once

```bash
# WRONG - Expensive and slow for debugging
ailang eval-suite --full --benchmarks my_test

# CORRECT - Use one cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_test
```

Resources

Reference Guide

See resources/reference.md for:

  • Complete list of valid benchmark fields
  • Capability reference
  • Example benchmarks

Related skills:

  • prompt-manager: When benchmark failures indicate teaching prompt issues
  • eval-analyzer: For analyzing results across many benchmarks
  • use-ailang: For writing correct AILANG code
  • devtools-prompt: For toolchain docs (debugging, tracing, eval workflows); see ailang devtools-prompt

Notes

  • Benchmarks live in benchmarks/ directory
  • Eval results go to eval_results/ directory
  • Teaching prompts are embedded in binary - rebuild after changes (ailang prompt for syntax, ailang devtools-prompt for toolchain)
  • Use <LANG> placeholder in task_prompt - it's replaced with "AILANG" or "Python"
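The <LANG> substitution can be pictured as a plain string replacement (illustration only, not the harness implementation):

```shell
# Illustration: the harness replaces <LANG> with the target language name.
task_prompt='Write a program in <LANG> that prints 42.'
echo "${task_prompt//<LANG>/AILANG}"
# -> Write a program in AILANG that prints 42.
```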

FAQ & Installation Steps

Frequently Asked Questions

What is Benchmark Manager?

Benchmark Manager is a skill that manages AILANG evaluation benchmarks, with features such as prompt integration, debugging workflows, and best practices for AI model evaluation. It is aimed at AI agents that need advanced benchmark management with correct prompt integration.

How do I install Benchmark Manager?

Run `npx killer-skills add sunholo-data/ailang/Benchmark Manager`. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for Benchmark Manager?

Key use cases include debugging failing benchmarks with full prompt visibility, testing benchmarks with specific AI models like claude-haiku-4-5, and analyzing benchmark YAML for common issues to improve evaluation-suite efficiency.

Which IDEs are compatible with Benchmark Manager?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for Benchmark Manager?

Requires AILANG evaluation benchmarks setup. Dependent on specific model availability like claude-haiku-4-5 for testing.

How To Install

  1. Open your terminal

     Open the terminal or command line in your project directory.

  2. Run the install command

     Run `npx killer-skills add sunholo-data/ailang/Benchmark Manager`. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

     The skill is now active. Your AI agent can use Benchmark Manager immediately in the current project.

Related Skills

Looking for an alternative to Benchmark Manager or another community skill for your workflow? Explore these related open-source skills.

  • widget-generator (f)
  • flags (vercel): a Next.js feature-management skill for adding or modifying framework feature flags
  • zustand (lobehub)
  • data-fetching (lobehub)