# Summarize Experiment

Generate a `summary.md` file capturing key metrics from a completed experiment. Think R's `summary()` for experiment results.
## Your Task

Create a lightweight summary of experiment results:

- Parse run status from `experiment_summary.yaml`
- Extract final training loss from SLURM stdout
- Extract accuracy from inspect-ai `.eval` files
- Generate `summary.md` in the experiment directory
- Log the process in `logs/summarize-experiment.log`
## Prerequisites

- `experiment_summary.yaml` exists
- At least some runs have completed (partial results are acceptable)
- `run-experiment` has been executed (or manual SLURM jobs were run)
- Conda environment activated: the `parse_eval_log.py` script requires inspect-ai. Activate the conda environment from `claude.local.md` before running extraction commands.
## Workflow

### 1. Locate Experiment

Find the experiment directory:

- If already in an experiment directory (it contains `experiment_summary.yaml`): use the current directory
- Otherwise: ask the user for the path
### 2. Parse Run Status

Read `experiment_summary.yaml` to identify runs.

From the `runs:` section:

- `name`: Run identifier
- `type`: `"fine-tuned"` or `"control"`
- `model`: Model name
- `parameters`: Dict of hyperparameters (empty for control runs)

From the `evaluation.matrix:` section:

- `run`: Run name
- `tasks`: List of evaluation task names
- `epochs`: List of epochs to evaluate (`null` for control runs)
Determine status by checking the filesystem:

- Fine-tuning: check for `{output_base}/ck-out-{run_name}/` and SLURM outputs
- Evaluation: check for `{run_dir}/eval/logs/*.eval` files
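The filesystem check above can be sketched as follows. This is a minimal sketch: the helper name `run_status` and the COMPLETED/PENDING labels are illustrative, while the directory layout follows the conventions described in this skill.

```python
import glob
import os


def run_status(run_name, output_base, run_dir):
    """Classify each stage as COMPLETED or PENDING by checking the filesystem.

    Fine-tuning is judged by SLURM outputs under {output_base}/ck-out-{run_name}/;
    evaluation by the presence of .eval files under {run_dir}/eval/logs/.
    """
    ft_dir = os.path.join(output_base, f"ck-out-{run_name}")
    finetune_done = bool(glob.glob(os.path.join(ft_dir, "slurm-*.out")))
    eval_done = bool(glob.glob(os.path.join(run_dir, "eval", "logs", "*.eval")))
    return {
        "fine_tuning": "COMPLETED" if finetune_done else "PENDING",
        "evaluation": "COMPLETED" if eval_done else "PENDING",
    }
```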
### 3. Extract Training Loss

For each COMPLETED fine-tuning run:

- Find the SLURM stdout in the output directory:
  - Parse the `Output` section of `experiment_summary.yaml` for `output_dir_base`
  - Look in `{output_dir_base}/ck-out-{run_name}/slurm-*.out`
  - If there are multiple files, use the most recent by modification time
- Extract the final loss using the regex `(\d+)\|(\d+)\|Loss: ([0-9.]+)`
  - The pattern matches `{epoch}|{step}|Loss: {value}`
  - Take the LAST match to get the final loss
  - The step number (group 2) of the last match is the total number of training steps
- Record: `run_name`, `final_loss`, `total_steps`, `epoch`, `step`

**Note:** Training SLURM outputs are in the output directory, NOT the run directory.

If the SLURM stdout is missing:

- Log a warning
- Record "N/A" for loss
- Continue with other runs
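The steps above can be sketched in Python. The regex is the one given in this skill; the helper name `final_loss` is illustrative.

```python
import glob
import os
import re

# Pattern from this skill: matches lines of the form {epoch}|{step}|Loss: {value}
LOSS_RE = re.compile(r"(\d+)\|(\d+)\|Loss: ([0-9.]+)")


def final_loss(output_dir):
    """Return (epoch, total_steps, loss) from the newest SLURM stdout, or None."""
    outs = glob.glob(os.path.join(output_dir, "slurm-*.out"))
    if not outs:
        return None  # caller logs a warning and records "N/A"
    newest = max(outs, key=os.path.getmtime)  # most recent by mtime
    with open(newest) as f:
        matches = LOSS_RE.findall(f.read())
    if not matches:
        return None
    epoch, step, loss = matches[-1]  # LAST match = final loss; step = total steps
    return int(epoch), int(step), float(loss)
```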
### 4. Extract Evaluation Accuracy

For each COMPLETED evaluation:

- Find the `.eval` files: `{run_dir}/eval/logs/*.eval`
- For each `.eval` file, run:

  ```bash
  python tools/inspect/parse_eval_log.py {path}
  ```

- Parse the JSON output for accuracy
- Map each file to its epoch using SLURM job names (see below)
- For binary tasks, also run `summary_binary.py` to get balanced accuracy and F1
- Record: `run_name`, `task`, `epoch`, `accuracy`, `balanced_accuracy`, `f1`, `samples`
Script output format:

```json
{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
```
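A sketch of driving the extraction loop, assuming the script path and JSON fields shown above; the helper names are illustrative, and errors are recorded as "ERROR" per the policy below.

```python
import glob
import json
import subprocess


def accuracy_from_json(stdout):
    """Parse parse_eval_log.py JSON output; return the accuracy or "ERROR"."""
    try:
        out = json.loads(stdout)
    except json.JSONDecodeError:
        return "ERROR"
    return out["accuracy"] if out.get("status") == "success" else "ERROR"


def collect_accuracies(run_dir):
    """Run the parser over every .eval file and record accuracy per file."""
    results = {}
    for path in sorted(glob.glob(f"{run_dir}/eval/logs/*.eval")):
        proc = subprocess.run(
            ["python", "tools/inspect/parse_eval_log.py", path],
            capture_output=True, text=True,
        )
        results[path] = accuracy_from_json(proc.stdout)
    return results
```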
#### Mapping Epochs via SLURM Job Names

The `.eval` files don't currently store epoch information directly. To reliably map each evaluation to its epoch:

- Find the SLURM output files in the eval directory: `{run_dir}/eval/slurm-*.out`
- Extract job IDs from the filenames (e.g., `slurm-2773062.out` → job ID `2773062`)
- Query job names via `sacct`:

  ```bash
  sacct -j {job_ids} --format=JobID,JobName%50
  ```

- Parse the epoch from the job name; scaffold-inspect names jobs like `eval-{task}-{run}-ep{N}`:
  - `eval-general_eval-lowlr-ep0` → epoch 0
  - `eval-general_eval-lowlr-ep9` → epoch 9
- Extract the accuracy from the SLURM output:

  ```bash
  grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out
  ```
Example workflow:

```bash
# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows the epoch in each job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2
```

This approach is reliable because:

- Job names are set by scaffold-inspect and include epoch info
- It works regardless of submission order or timing
- It survives job failures and resubmissions
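Parsing the `sacct` output into a job-ID-to-epoch map can be sketched as below. The `eval-{task}-{run}-ep{N}` naming is the scaffold-inspect convention described above; the helper name is illustrative.

```python
import re

# scaffold-inspect convention: job names end in -ep{N}
EPOCH_RE = re.compile(r"-ep(\d+)$")


def epochs_from_sacct(sacct_output):
    """Return {job_id: epoch} from `sacct --format=JobID,JobName%50` output.

    Header and separator lines are skipped automatically because their
    second column does not match the -ep{N} suffix.
    """
    mapping = {}
    for line in sacct_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        m = EPOCH_RE.search(parts[1])
        if m:
            mapping[parts[0]] = int(m.group(1))
    return mapping
```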
If extraction fails:

- The script returns `{"status": "error", "message": "..."}`
- Log the error
- Record "ERROR" for accuracy
- Continue with other evaluations
#### Computing Balanced Accuracy and F1 (Binary Classification)

For binary classification tasks (0/1 targets), use `summary_binary.py` to compute additional metrics:

```bash
python tools/inspect/summary_binary.py {path_to_eval_file} --json
```

JSON output format:

```json
{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.83,
  "f1": 0.82,
  "precision_1": 0.80,
  "recall_1": 0.84,
  "recall_0": 0.82,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}
```
Why these metrics matter for imbalanced data:
- Balanced Accuracy = (Recall_0 + Recall_1) / 2 — not inflated by majority class
- F1 Score = harmonic mean of precision and recall — penalizes class imbalance
Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
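The metric definitions above can be written out from confusion-matrix counts. This is a sketch assuming the `tp`/`tn`/`fp`/`fn` keys of the confusion matrix shown in the JSON output; the helper name is illustrative, not part of `summary_binary.py`.

```python
def binary_metrics(cm):
    """Compute balanced accuracy and F1 from confusion-matrix counts."""
    tp, tn, fp, fn = cm["tp"], cm["tn"], cm["fp"], cm["fn"]
    recall_1 = tp / (tp + fn) if tp + fn else 0.0     # sensitivity on class 1
    recall_0 = tn / (tn + fp) if tn + fp else 0.0     # sensitivity on class 0
    precision_1 = tp / (tp + fp) if tp + fp else 0.0
    # Balanced Accuracy = (Recall_0 + Recall_1) / 2
    balanced_accuracy = (recall_0 + recall_1) / 2
    # F1 = harmonic mean of precision and recall for class 1
    f1 = (2 * precision_1 * recall_1 / (precision_1 + recall_1)
          if precision_1 + recall_1 else 0.0)
    return {"balanced_accuracy": balanced_accuracy, "f1": f1}
```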
### 5. Generate summary.md

Create `{experiment_dir}/summary.md` with the following structure:

```markdown
# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment` (when available)

---
*Generated by summarize-experiment skill*
```
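Emitting the tables can be done with simple string formatting; a sketch for the Run Status table, with column names matching the template above and an illustrative helper name:

```python
def run_status_table(rows):
    """Render the Run Status markdown table.

    rows: list of (run, type, fine_tuning_status, evaluation_status) tuples.
    """
    lines = [
        "| Run | Type | Fine-tuning | Evaluation |",
        "|-----|------|-------------|------------|",
    ]
    lines += [f"| {r} | {t} | {ft} | {ev} |" for r, t, ft, ev in rows]
    return "\n".join(lines)
```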
### 6. Create Log

Document the process in `{experiment_dir}/logs/summarize-experiment.log`. See `logging.md` for action types and format.
## Error Handling

**If SLURM stdout is missing:**

- Log a warning with action type `EXTRACT_LOSS`
- Record "N/A" for loss in the summary
- Continue with other runs

**If a .eval file cannot be parsed:**

- Log the error with the file path
- Record "ERROR" for accuracy in the summary
- Continue with other evaluations

**If all runs failed:**

- Generate a summary noting all failures
- Include failure states in the "Incomplete Runs" section
- Suggest troubleshooting steps

**If results are partial:**

- Generate the summary with the available data
- Clearly indicate which runs are missing in the "Incomplete Runs" section
- Still identify the best performing run from the available data
## Idempotency

Running summarize-experiment multiple times overwrites `summary.md`. This is intentional:

- It allows re-running after fixing failed runs
- The summary always reflects the current state
Output Files
{experiment_dir}/
├── summary.md # Human-readable summary (new)
└── logs/
└── summarize-experiment.log # Process log (new)
## Relationship to Other Skills

- **After:** run-experiment (or manual execution)
- **Before:** analyze-experiment (when available)
- **Optional hook:** run-experiment can invoke this at completion
## Future Compatibility

When analyze-experiment is built, summarize-experiment can either:

- Remain as a quick summary option (text only, no plots)
- Be deprecated in favor of richer output
- Become a first stage that analyze-experiment builds upon