# Benchmark Manager

Manage AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures.
## Quick Start

**Debugging a failing benchmark:**

```bash
# 1. Show the full prompt that models see
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

# 2. Test a benchmark with a specific model
ailang eval-suite --models claude-haiku-4-5 --benchmarks json_parse

# 3. Check benchmark YAML for common issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/json_parse.yml
```
## When to Use This Skill

Invoke this skill when:

- User asks to create a new benchmark
- User asks to debug/fix a failing benchmark
- User wants to understand why models generate wrong code
- User asks about benchmark YAML format
- Benchmarks show 0% pass rate despite language support
## CRITICAL: `prompt:` vs `task_prompt:`

This is the most important concept in benchmark management.

### The Problem (v0.4.8 Discovery)

Benchmarks have TWO different prompt fields with VERY different behavior:

| Field | Behavior | Use When |
|-------|----------|----------|
| `prompt:` | REPLACES the teaching prompt entirely | Testing raw model capability (rare) |
| `task_prompt:` | APPENDS to the teaching prompt | Normal benchmarks (99% of cases) |
### Why This Matters

```yaml
# BAD - Model never sees AILANG syntax!
prompt: |
  Write a program that prints "Hello"

# GOOD - Model sees teaching prompt + task
task_prompt: |
  Write a program that prints "Hello"
```

With `prompt:`, models generate Python or pseudo-code because they never learn AILANG syntax.
### How Prompts Combine

From `internal/eval_harness/spec.go` (lines 91-93):

```go
fullPrompt := basePrompt // Teaching prompt from prompts/v0.4.x.md
if s.TaskPrompt != "" {
    fullPrompt = fullPrompt + "\n\n## Task\n\n" + s.TaskPrompt
}
```

The teaching prompt teaches AILANG syntax; `task_prompt:` adds the specific task.
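The replace-vs-append distinction can be made concrete with a small runnable sketch. This paraphrases the `spec.go` excerpt above; `effectivePrompt` is an illustrative name, not the harness API:

```go
package main

import "fmt"

// effectivePrompt models the behavior described above: a custom `prompt:`
// replaces the teaching prompt entirely, while `task_prompt:` is appended
// under a "## Task" heading. (Illustrative sketch, not the harness itself.)
func effectivePrompt(teaching, prompt, taskPrompt string) string {
	base := teaching
	if prompt != "" {
		base = prompt // replaces the teaching prompt entirely
	}
	if taskPrompt != "" {
		base = base + "\n\n## Task\n\n" + taskPrompt
	}
	return base
}

func main() {
	teaching := "AILANG syntax guide..."
	task := `Write a program that prints "Hello"`

	// BAD: using prompt: - the model never sees the teaching prompt
	fmt.Println(effectivePrompt(teaching, task, ""))
	fmt.Println("---")
	// GOOD: using task_prompt: - teaching prompt + task
	fmt.Println(effectivePrompt(teaching, "", task))
}
```

Running this shows that the `prompt:` case drops the syntax guide entirely, which is exactly why such benchmarks fail.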
## Available Scripts

### scripts/show_full_prompt.sh

Shows the complete prompt that models receive for a benchmark.

**Usage:**

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh <benchmark_id>

# Example:
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse
```
### scripts/check_benchmark.sh

Validates a benchmark YAML file for common issues.

**Usage:**

```bash
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/<name>.yml
```

Checks for:

- Using `prompt:` instead of `task_prompt:` (warning)
- Missing required fields
- Invalid capability names
- YAML syntax errors
### scripts/test_benchmark.sh

Runs a quick single-model test of a benchmark.

**Usage:**

```bash
.claude/skills/benchmark-manager/scripts/test_benchmark.sh <benchmark_id> [model]

# Examples:
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse claude-haiku-4-5
```
## Required Fields

```yaml
id: my_benchmark          # Unique identifier (snake_case)
description: "Short description of what this tests"
languages: ["python", "ailang"]
entrypoint: "main"        # Function to call
caps: ["IO"]              # Required capabilities
difficulty: "easy|medium|hard"
expected_gain: "low|medium|high"
task_prompt: |            # ALWAYS use task_prompt, not prompt!
  Write a program in <LANG> that:
  1. Does something
  2. Prints the result

  Output only the code, no explanations.
expected_stdout: |        # Exact expected output
  expected output here
```
## Capability Names

Valid capabilities: `IO`, `FS`, `Clock`, `Net`

```yaml
# Console I/O (printing)
caps: ["IO"]

# HTTP requests
caps: ["Net", "IO"]

# File system operations
caps: ["FS", "IO"]
```
## Creating New Benchmarks

### Step 1: Determine Requirements

- What language feature or capability is being tested?
- Can models solve this with just the teaching prompt?
- What's the expected output?
### Step 2: Write the Benchmark

```yaml
id: my_new_benchmark
description: "Test feature X capability"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "medium"
task_prompt: |
  Write a program in <LANG> that:
  1. Clear description of the task
  2. Another step
  3. Print the result

  Output only the code, no explanations.
expected_stdout: |
  exact expected output
```
### Step 3: Validate and Test

```bash
# Check for issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/my_new_benchmark.yml

# Test with a cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_new_benchmark
```
## Debugging Failing Benchmarks

### Symptom: 0% Pass Rate Despite Language Support

**Check 1: Is it using `task_prompt:`?**

```bash
grep -E "^prompt:" benchmarks/failing_benchmark.yml
# If this returns a match, change it to task_prompt:
```

**Check 2: What prompt do models see?**

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark
```

**Check 3: Is the teaching prompt up to date?**

```bash
# After editing prompts/v0.x.x.md, you MUST rebuild:
make quick-install
```
### Symptom: Models Copy Template Instead of Solving Task

The teaching prompt includes a template structure. If models copy it verbatim:

- Make sure the task is clearly different from the examples in the teaching prompt
- Check that `task_prompt:` explicitly describes what to do
- Consider whether the task description is ambiguous
### Symptom: compile_error on Valid Syntax

Common AILANG-specific issues models get wrong:

| Wrong | Correct | Notes |
|-------|---------|-------|
| `print(42)` | `print(show(42))` | `print` expects a string |
| `a % b` | `mod_Int(a, b)` | No `%` operator |
| `def main()` | `export func main()` | Wrong keyword |
| `for x in xs` | `match xs { ... }` | No for loops |

If models consistently make these mistakes, the teaching prompt needs improvement (use the prompt-manager skill).
## Common Mistakes

### 1. Using `prompt:` Instead of `task_prompt:`

```yaml
# WRONG - Models never see AILANG syntax
prompt: |
  Write code that...

# CORRECT - Teaching prompt + task
task_prompt: |
  Write code that...
```
### 2. Forgetting to Rebuild After Prompt Changes

```bash
# After editing prompts/v0.x.x.md:
make quick-install  # REQUIRED!
```
### 3. Putting Hints in Benchmarks

```yaml
# WRONG - Hints in the benchmark
task_prompt: |
  Write code that prints 42.
  Hint: Use print(show(42)) in AILANG.

# CORRECT - No hints; if models fail, fix the teaching prompt
task_prompt: |
  Write code that prints 42.
```

If models need AILANG-specific hints, the teaching prompt is incomplete. Use the prompt-manager skill to fix it.
### 4. Testing Too Many Models at Once

```bash
# WRONG - Expensive and slow for debugging
ailang eval-suite --full --benchmarks my_test

# CORRECT - Use one cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_test
```
## Resources

### Reference Guide

See `resources/reference.md` for:

- Complete list of valid benchmark fields
- Capability reference
- Example benchmarks

### Related Skills

- **prompt-manager**: When benchmark failures indicate teaching prompt issues
- **eval-analyzer**: For analyzing results across many benchmarks
- **use-ailang**: For writing correct AILANG code
- **devtools-prompt**: For toolchain docs (debugging, tracing, eval workflows); see `ailang devtools-prompt`
## Notes

- Benchmarks live in the `benchmarks/` directory
- Eval results go to the `eval_results/` directory
- Teaching prompts are embedded in the binary - rebuild after changes (`ailang prompt` for syntax, `ailang devtools-prompt` for toolchain)
- Use the `<LANG>` placeholder in `task_prompt:` - it's replaced with "AILANG" or "Python"
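The `<LANG>` substitution amounts to a simple string replacement per target language. A sketch of the assumed behavior (`renderTaskPrompt` is an illustrative name, not the harness's real API):

```go
package main

import (
	"fmt"
	"strings"
)

// renderTaskPrompt substitutes the <LANG> placeholder with the target
// language name ("AILANG" or "Python") before the prompt is sent to a model.
// Illustrative sketch of the assumed behavior, not the harness API.
func renderTaskPrompt(taskPrompt, language string) string {
	return strings.ReplaceAll(taskPrompt, "<LANG>", language)
}

func main() {
	tp := "Write a program in <LANG> that prints 42."
	fmt.Println(renderTaskPrompt(tp, "AILANG"))
	// prints: Write a program in AILANG that prints 42.
}
```

This is why a single `task_prompt:` serves both the `python` and `ailang` entries in `languages:`.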