# Benchmark Manager

Manage AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures.
## Quick Start

**Debugging a failing benchmark:**

```bash
# 1. Show the full prompt that models see
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

# 2. Test a benchmark with a specific model
ailang eval-suite --models claude-haiku-4-5 --benchmarks json_parse

# 3. Check benchmark YAML for common issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/json_parse.yml
```
## When to Use This Skill

Invoke this skill when:

- User asks to create a new benchmark
- User asks to debug/fix a failing benchmark
- User wants to understand why models generate wrong code
- User asks about benchmark YAML format
- Benchmarks show 0% pass rate despite language support
## CRITICAL: `prompt:` vs `task_prompt:`

This is the most important concept in benchmark management.

### The Problem (v0.4.8 Discovery)

Benchmarks have TWO different prompt fields with VERY different behavior:

| Field | Behavior | Use When |
|-------|----------|----------|
| `prompt:` | REPLACES the teaching prompt entirely | Testing raw model capability (rare) |
| `task_prompt:` | APPENDS to the teaching prompt | Normal benchmarks (99% of cases) |
### Why This Matters

```yaml
# BAD - Model never sees AILANG syntax!
prompt: |
  Write a program that prints "Hello"

# GOOD - Model sees teaching prompt + task
task_prompt: |
  Write a program that prints "Hello"
```

With `prompt:`, models generate Python or pseudo-code because they never learn AILANG syntax.
### How Prompts Combine

From `internal/eval_harness/spec.go` (lines 91-93):

```go
fullPrompt := basePrompt // Teaching prompt from prompts/v0.4.x.md
if s.TaskPrompt != "" {
    fullPrompt = fullPrompt + "\n\n## Task\n\n" + s.TaskPrompt
}
```

The teaching prompt teaches AILANG syntax; `task_prompt:` adds the specific task.
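The replace-vs-append distinction can be made concrete with a small runnable sketch. This paraphrases the `spec.go` excerpt above; `effectivePrompt` is an illustrative name, not the harness API:

```go
package main

import "fmt"

// effectivePrompt models the behavior described above: a custom `prompt:`
// replaces the teaching prompt entirely, while `task_prompt:` is appended
// under a "## Task" heading. (Illustrative sketch, not the harness itself.)
func effectivePrompt(teaching, prompt, taskPrompt string) string {
	base := teaching
	if prompt != "" {
		base = prompt // replaces the teaching prompt entirely
	}
	if taskPrompt != "" {
		base = base + "\n\n## Task\n\n" + taskPrompt
	}
	return base
}

func main() {
	teaching := "AILANG syntax guide..."
	task := `Write a program that prints "Hello"`

	// BAD: using prompt: - the model never sees the teaching prompt
	fmt.Println(effectivePrompt(teaching, task, ""))
	fmt.Println("---")
	// GOOD: using task_prompt: - teaching prompt + task
	fmt.Println(effectivePrompt(teaching, "", task))
}
```

Running this shows that the `prompt:` case drops the syntax guide entirely, which is exactly why such benchmarks fail.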
## Available Scripts

### scripts/show_full_prompt.sh

Shows the complete prompt that models receive for a benchmark.

**Usage:**

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh <benchmark_id>

# Example:
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse
```
### scripts/check_benchmark.sh

Validates a benchmark YAML file for common issues.

**Usage:**

```bash
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/<name>.yml
```

Checks for:

- Using `prompt:` instead of `task_prompt:` (warning)
- Missing required fields
- Invalid capability names
- YAML syntax errors
### scripts/test_benchmark.sh

Runs a quick single-model test of a benchmark.

**Usage:**

```bash
.claude/skills/benchmark-manager/scripts/test_benchmark.sh <benchmark_id> [model]

# Examples:
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse claude-haiku-4-5
```
## Required Fields

```yaml
id: my_benchmark          # Unique identifier (snake_case)
description: "Short description of what this tests"
languages: ["python", "ailang"]
entrypoint: "main"        # Function to call
caps: ["IO"]              # Required capabilities
difficulty: "easy|medium|hard"
expected_gain: "low|medium|high"
task_prompt: |            # ALWAYS use task_prompt, not prompt!
  Write a program in <LANG> that:
  1. Does something
  2. Prints the result

  Output only the code, no explanations.
expected_stdout: |        # Exact expected output
  expected output here
```
## Capability Names

Valid capabilities: `IO`, `FS`, `Clock`, `Net`

```yaml
# Console I/O (printing)
caps: ["IO"]

# HTTP requests
caps: ["Net", "IO"]

# File system operations
caps: ["FS", "IO"]
```
## Creating New Benchmarks

### Step 1: Determine Requirements

- What language feature or capability is being tested?
- Can models solve this with just the teaching prompt?
- What's the expected output?
### Step 2: Write the Benchmark

```yaml
id: my_new_benchmark
description: "Test feature X capability"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "medium"
task_prompt: |
  Write a program in <LANG> that:
  1. Clear description of the task
  2. Another step
  3. Print the result

  Output only the code, no explanations.
expected_stdout: |
  exact expected output
```
### Step 3: Validate and Test

```bash
# Check for issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/my_new_benchmark.yml

# Test with a cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_new_benchmark
```
## Debugging Failing Benchmarks

### Symptom: 0% Pass Rate Despite Language Support

**Check 1: Is it using `task_prompt:`?**

```bash
grep -E "^prompt:" benchmarks/failing_benchmark.yml
# If this returns a match, change it to task_prompt:
```

**Check 2: What prompt do models see?**

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark
```

**Check 3: Is the teaching prompt up to date?**

```bash
# After editing prompts/v0.x.x.md, you MUST rebuild:
make quick-install
```
### Symptom: Models Copy Template Instead of Solving Task

The teaching prompt includes a template structure. If models copy it verbatim:

- Make sure the task is clearly different from the examples in the teaching prompt
- Check that `task_prompt:` explicitly describes what to do
- Consider whether the task description is ambiguous
### Symptom: compile_error on Valid Syntax

Common AILANG-specific issues models get wrong:

| Wrong | Correct | Notes |
|-------|---------|-------|
| `print(42)` | `print(show(42))` | `print` expects a string |
| `a % b` | `mod_Int(a, b)` | No `%` operator |
| `def main()` | `export func main()` | Wrong keyword |
| `for x in xs` | `match xs { ... }` | No for loops |

If models consistently make these mistakes, the teaching prompt needs improvement (use the prompt-manager skill).
## Common Mistakes

### 1. Using `prompt:` Instead of `task_prompt:`

```yaml
# WRONG - Models never see AILANG syntax
prompt: |
  Write code that...

# CORRECT - Teaching prompt + task
task_prompt: |
  Write code that...
```
### 2. Forgetting to Rebuild After Prompt Changes

```bash
# After editing prompts/v0.x.x.md:
make quick-install  # REQUIRED!
```
### 3. Putting Hints in Benchmarks

```yaml
# WRONG - Hints in the benchmark
task_prompt: |
  Write code that prints 42.
  Hint: Use print(show(42)) in AILANG.

# CORRECT - No hints; if models fail, fix the teaching prompt
task_prompt: |
  Write code that prints 42.
```

If models need AILANG-specific hints, the teaching prompt is incomplete. Use the prompt-manager skill to fix it.
### 4. Testing Too Many Models at Once

```bash
# WRONG - Expensive and slow for debugging
ailang eval-suite --full --benchmarks my_test

# CORRECT - Use one cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_test
```
## Resources

### Reference Guide

See `resources/reference.md` for:

- Complete list of valid benchmark fields
- Capability reference
- Example benchmarks

### Related Skills

- **prompt-manager**: When benchmark failures indicate teaching prompt issues
- **eval-analyzer**: For analyzing results across many benchmarks
- **use-ailang**: For writing correct AILANG code
- **devtools-prompt**: For toolchain docs (debugging, tracing, eval workflows); see `ailang devtools-prompt`
## Notes

- Benchmarks live in the `benchmarks/` directory
- Eval results go to the `eval_results/` directory
- Teaching prompts are embedded in the binary - rebuild after changes (`ailang prompt` for syntax, `ailang devtools-prompt` for toolchain)
- Use the `<LANG>` placeholder in `task_prompt:` - it's replaced with "AILANG" or "Python"
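The `<LANG>` substitution amounts to a simple string replacement per target language. A sketch of the assumed behavior (`renderTaskPrompt` is an illustrative name, not the harness's real API):

```go
package main

import (
	"fmt"
	"strings"
)

// renderTaskPrompt substitutes the <LANG> placeholder with the target
// language name ("AILANG" or "Python") before the prompt is sent to a model.
// Illustrative sketch of the assumed behavior, not the harness API.
func renderTaskPrompt(taskPrompt, language string) string {
	return strings.ReplaceAll(taskPrompt, "<LANG>", language)
}

func main() {
	tp := "Write a program in <LANG> that prints 42."
	fmt.Println(renderTaskPrompt(tp, "AILANG"))
	// prints: Write a program in AILANG that prints 42.
}
```

This is why a single `task_prompt:` serves both the `python` and `ailang` entries in `languages:`.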