# Summarize Experiment

Generate a `summary.md` file capturing key metrics from a completed experiment. Think R's `summary()` for experiment results.
## Your Task

Create a lightweight summary of experiment results:

- Parse run status from `experiment_summary.yaml`
- Extract final training loss from SLURM stdout
- Extract accuracy from inspect-ai `.eval` files
- Generate `summary.md` in the experiment directory
- Log the process in `logs/summarize-experiment.log`
## Prerequisites

- `experiment_summary.yaml` exists
- At least some runs have completed (partial results are acceptable)
- `run-experiment` has been executed (or manual SLURM jobs were run)
- Conda environment activated: the `parse_eval_log.py` script requires inspect-ai. Activate the conda environment from `claude.local.md` before running extraction commands.
## Workflow

### 1. Locate Experiment

Find the experiment directory:

- If already in an experiment directory (it contains `experiment_summary.yaml`): use the current directory
- Otherwise: ask the user for the path
### 2. Parse Run Status

Read `experiment_summary.yaml` to identify runs.

From the `runs:` section:

- `name`: Run identifier
- `type`: `"fine-tuned"` or `"control"`
- `model`: Model name
- `parameters`: Dict of hyperparameters (empty for control runs)

From the `evaluation.matrix:` section:

- `run`: Run name
- `tasks`: List of evaluation task names
- `epochs`: List of epochs to evaluate (`null` for control runs)
Determine status by checking the filesystem:

- Fine-tuning: check for `{output_base}/ck-out-{run_name}/` and SLURM outputs
- Evaluation: check for `{run_dir}/eval/logs/*.eval` files
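The filesystem check above can be sketched as follows. This is a minimal sketch: the helper name `run_status` and the COMPLETED/PENDING labels are illustrative, while the directory layout follows the conventions described in this skill.

```python
import glob
import os


def run_status(run_name, output_base, run_dir):
    """Classify each stage as COMPLETED or PENDING by checking the filesystem.

    Fine-tuning is judged by SLURM outputs under {output_base}/ck-out-{run_name}/;
    evaluation by the presence of .eval files under {run_dir}/eval/logs/.
    """
    ft_dir = os.path.join(output_base, f"ck-out-{run_name}")
    finetune_done = bool(glob.glob(os.path.join(ft_dir, "slurm-*.out")))
    eval_done = bool(glob.glob(os.path.join(run_dir, "eval", "logs", "*.eval")))
    return {
        "fine_tuning": "COMPLETED" if finetune_done else "PENDING",
        "evaluation": "COMPLETED" if eval_done else "PENDING",
    }
```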
### 3. Extract Training Loss

For each COMPLETED fine-tuning run:

- Find the SLURM stdout in the output directory:
  - Parse the `Output` section of `experiment_summary.yaml` for `output_dir_base`
  - Look in `{output_dir_base}/ck-out-{run_name}/slurm-*.out`
  - If there are multiple files, use the most recent by modification time
- Extract the final loss using the regex `(\d+)\|(\d+)\|Loss: ([0-9.]+)`
  - The pattern matches `{epoch}|{step}|Loss: {value}`
  - Take the LAST match to get the final loss
  - The step number (group 2) of the last match is the total number of training steps
- Record: `run_name`, `final_loss`, `total_steps`, `epoch`, `step`

**Note:** Training SLURM outputs are in the output directory, NOT the run directory.

If the SLURM stdout is missing:

- Log a warning
- Record "N/A" for loss
- Continue with other runs
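The steps above can be sketched in Python. The regex is the one given in this skill; the helper name `final_loss` is illustrative.

```python
import glob
import os
import re

# Pattern from this skill: matches lines of the form {epoch}|{step}|Loss: {value}
LOSS_RE = re.compile(r"(\d+)\|(\d+)\|Loss: ([0-9.]+)")


def final_loss(output_dir):
    """Return (epoch, total_steps, loss) from the newest SLURM stdout, or None."""
    outs = glob.glob(os.path.join(output_dir, "slurm-*.out"))
    if not outs:
        return None  # caller logs a warning and records "N/A"
    newest = max(outs, key=os.path.getmtime)  # most recent by mtime
    with open(newest) as f:
        matches = LOSS_RE.findall(f.read())
    if not matches:
        return None
    epoch, step, loss = matches[-1]  # LAST match = final loss; step = total steps
    return int(epoch), int(step), float(loss)
```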
### 4. Extract Evaluation Accuracy

For each COMPLETED evaluation:

- Find the `.eval` files: `{run_dir}/eval/logs/*.eval`
- For each `.eval` file, run:

  ```bash
  python tools/inspect/parse_eval_log.py {path}
  ```

- Parse the JSON output for accuracy
- Map each file to its epoch using SLURM job names (see below)
- For binary tasks, also run `summary_binary.py` to get balanced accuracy and F1
- Record: `run_name`, `task`, `epoch`, `accuracy`, `balanced_accuracy`, `f1`, `samples`
Script output format:

```json
{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
```
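A sketch of driving the extraction loop, assuming the script path and JSON fields shown above; the helper names are illustrative, and errors are recorded as "ERROR" per the policy below.

```python
import glob
import json
import subprocess


def accuracy_from_json(stdout):
    """Parse parse_eval_log.py JSON output; return the accuracy or "ERROR"."""
    try:
        out = json.loads(stdout)
    except json.JSONDecodeError:
        return "ERROR"
    return out["accuracy"] if out.get("status") == "success" else "ERROR"


def collect_accuracies(run_dir):
    """Run the parser over every .eval file and record accuracy per file."""
    results = {}
    for path in sorted(glob.glob(f"{run_dir}/eval/logs/*.eval")):
        proc = subprocess.run(
            ["python", "tools/inspect/parse_eval_log.py", path],
            capture_output=True, text=True,
        )
        results[path] = accuracy_from_json(proc.stdout)
    return results
```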
#### Mapping Epochs via SLURM Job Names

The `.eval` files don't currently store epoch information directly. To reliably map each evaluation to its epoch:

- Find the SLURM output files in the eval directory: `{run_dir}/eval/slurm-*.out`
- Extract job IDs from the filenames (e.g., `slurm-2773062.out` → job ID `2773062`)
- Query job names via `sacct`:

  ```bash
  sacct -j {job_ids} --format=JobID,JobName%50
  ```

- Parse the epoch from the job name; scaffold-inspect names jobs like `eval-{task}-{run}-ep{N}`:
  - `eval-general_eval-lowlr-ep0` → epoch 0
  - `eval-general_eval-lowlr-ep9` → epoch 9
- Extract the accuracy from the SLURM output:

  ```bash
  grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out
  ```
Example workflow:

```bash
# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows the epoch in each job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2
```

This approach is reliable because:

- Job names are set by scaffold-inspect and include epoch info
- It works regardless of submission order or timing
- It survives job failures and resubmissions
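Parsing the `sacct` output into a job-ID-to-epoch map can be sketched as below. The `eval-{task}-{run}-ep{N}` naming is the scaffold-inspect convention described above; the helper name is illustrative.

```python
import re

# scaffold-inspect convention: job names end in -ep{N}
EPOCH_RE = re.compile(r"-ep(\d+)$")


def epochs_from_sacct(sacct_output):
    """Return {job_id: epoch} from `sacct --format=JobID,JobName%50` output.

    Header and separator lines are skipped automatically because their
    second column does not match the -ep{N} suffix.
    """
    mapping = {}
    for line in sacct_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        m = EPOCH_RE.search(parts[1])
        if m:
            mapping[parts[0]] = int(m.group(1))
    return mapping
```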
If extraction fails:

- The script returns `{"status": "error", "message": "..."}`
- Log the error
- Record "ERROR" for accuracy
- Continue with other evaluations
#### Computing Balanced Accuracy and F1 (Binary Classification)

For binary classification tasks (0/1 targets), use `summary_binary.py` to compute additional metrics:

```bash
python tools/inspect/summary_binary.py {path_to_eval_file} --json
```

JSON output format:

```json
{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.83,
  "f1": 0.82,
  "precision_1": 0.80,
  "recall_1": 0.84,
  "recall_0": 0.82,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}
```
Why these metrics matter for imbalanced data:
- Balanced Accuracy = (Recall_0 + Recall_1) / 2 — not inflated by majority class
- F1 Score = harmonic mean of precision and recall — penalizes class imbalance
Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
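The metric definitions above can be written out from confusion-matrix counts. This is a sketch assuming the `tp`/`tn`/`fp`/`fn` keys of the confusion matrix shown in the JSON output; the helper name is illustrative, not part of `summary_binary.py`.

```python
def binary_metrics(cm):
    """Compute balanced accuracy and F1 from confusion-matrix counts."""
    tp, tn, fp, fn = cm["tp"], cm["tn"], cm["fp"], cm["fn"]
    recall_1 = tp / (tp + fn) if tp + fn else 0.0     # sensitivity on class 1
    recall_0 = tn / (tn + fp) if tn + fp else 0.0     # sensitivity on class 0
    precision_1 = tp / (tp + fp) if tp + fp else 0.0
    # Balanced Accuracy = (Recall_0 + Recall_1) / 2
    balanced_accuracy = (recall_0 + recall_1) / 2
    # F1 = harmonic mean of precision and recall for class 1
    f1 = (2 * precision_1 * recall_1 / (precision_1 + recall_1)
          if precision_1 + recall_1 else 0.0)
    return {"balanced_accuracy": balanced_accuracy, "f1": f1}
```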
### 5. Generate summary.md

Create `{experiment_dir}/summary.md` with the following structure:

```markdown
# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment` (when available)

---
*Generated by summarize-experiment skill*
```
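Emitting the tables can be done with simple string formatting; a sketch for the Run Status table, with column names matching the template above and an illustrative helper name:

```python
def run_status_table(rows):
    """Render the Run Status markdown table.

    rows: list of (run, type, fine_tuning_status, evaluation_status) tuples.
    """
    lines = [
        "| Run | Type | Fine-tuning | Evaluation |",
        "|-----|------|-------------|------------|",
    ]
    lines += [f"| {r} | {t} | {ft} | {ev} |" for r, t, ft, ev in rows]
    return "\n".join(lines)
```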
### 6. Create Log

Document the process in `{experiment_dir}/logs/summarize-experiment.log`. See `logging.md` for action types and format.
## Error Handling

**If SLURM stdout is missing:**

- Log a warning with action type `EXTRACT_LOSS`
- Record "N/A" for loss in the summary
- Continue with other runs

**If a .eval file cannot be parsed:**

- Log the error with the file path
- Record "ERROR" for accuracy in the summary
- Continue with other evaluations

**If all runs failed:**

- Generate a summary noting all failures
- Include failure states in the "Incomplete Runs" section
- Suggest troubleshooting steps

**If results are partial:**

- Generate the summary with the available data
- Clearly indicate which runs are missing in the "Incomplete Runs" section
- Still identify the best performing run from the available data
## Idempotency

Running summarize-experiment multiple times overwrites `summary.md`. This is intentional:

- It allows re-running after fixing failed runs
- The summary always reflects the current state
Output Files
{experiment_dir}/
├── summary.md # Human-readable summary (new)
└── logs/
└── summarize-experiment.log # Process log (new)
## Relationship to Other Skills

- **After:** run-experiment (or manual execution)
- **Before:** analyze-experiment (when available)
- **Optional hook:** run-experiment can invoke this at completion
## Future Compatibility

When analyze-experiment is built, summarize-experiment can either:

- Remain as a quick summary option (text only, no plots)
- Be deprecated in favor of richer output
- Become a first stage that analyze-experiment builds upon