Fair-Forge Metric Creator
Create new metrics for the Fair-Forge AI evaluation library. This skill generates all required files following the established patterns.
Usage
/metric-creator [metric-name] [optional description]
Examples:
/metric-creator safety "Evaluate AI response safety and harmlessness"
/metric-creator coherence "Measure logical coherence in multi-turn conversations"
/metric-creator factuality
Files to Create
For a new metric called {MetricName}:
| File | Purpose |
|---|---|
| fair_forge/metrics/{metric_name}.py | Metric implementation |
| fair_forge/schemas/{metric_name}.py | Pydantic schema for results |
| tests/metrics/test_{metric_name}.py | Unit tests |
| tests/fixtures/mock_data.py | Add create_{metric_name}_dataset() |
| tests/fixtures/mock_retriever.py | Add {MetricName}DatasetRetriever |
| pyproject.toml | Add optional dependency group |
| examples/{metric_name}/jupyter/{metric_name}.ipynb | Example notebook |
| examples/{metric_name}/data/dataset.json | Sample dataset for examples |
For LLM-Judge Metrics (additional files)
| File | Purpose |
|---|---|
| fair_forge/llm/schemas.py | Add {MetricName}JudgeOutput schema |
| fair_forge/llm/prompts.py | Add {metric_name}_reasoning_system_prompt |
| fair_forge/llm/__init__.py | Export {MetricName}JudgeOutput |
| tests/llm/test_schemas.py | Add Test{MetricName}JudgeOutput tests |
Architecture Pattern
All metrics follow this pattern:
FairForge (base class)
└── YourMetric
├── __init__(): Initialize with retriever and config
├── batch(): Process each conversation batch
└── (optional) _process(): Override for custom aggregation
Data Flow
Retriever.load_dataset() -> list[Dataset]
↓
FairForge._process() iterates datasets
↓
YourMetric.batch() processes each conversation
↓
Results appended to self.metrics
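The flow can be traced with simplified Python stand-ins. These mimic, but are not, the real FairForge and Retriever classes (which live in fair_forge.core and take more arguments); the dict-based dataset is purely illustrative:

```python
# Simplified stand-ins to illustrate the data flow only; the real
# classes carry context, assistant_id, language, logging, etc.
class Retriever:
    def load_dataset(self):
        return [{"session_id": "s1", "conversation": [{"qa_id": "qa1"}]}]


class FairForge:
    def __init__(self, retriever):
        self.retriever = retriever()
        self.metrics = []

    def _process(self):
        # Iterate datasets and hand each conversation to batch()
        for ds in self.retriever.load_dataset():
            self.batch(ds["session_id"], ds["conversation"])

    @classmethod
    def run(cls, retriever):
        instance = cls(retriever)
        instance._process()
        return instance.metrics


class MyMetric(FairForge):
    def batch(self, session_id, batch):
        # Results are appended to self.metrics
        for interaction in batch:
            self.metrics.append({"qa_id": interaction["qa_id"], "score": 1.0})


results = MyMetric.run(Retriever)
```

Each run therefore produces one result entry per interaction, collected on the instance and returned by the run class method.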
Step-by-Step Workflow
1. Create the Schema
First, create the schema in fair_forge/schemas/{metric_name}.py:
```python
"""{{MetricName}} metric schemas."""

from .metrics import BaseMetric


class {{MetricName}}Metric(BaseMetric):
    """
    {{MetricName}} metric for evaluating {{description}}.

    Attributes:
        qa_id: Unique identifier for the Q&A interaction
        {{metric_name}}_score: Main evaluation score (0.0-1.0)
        {{metric_name}}_insight: Explanation of the evaluation
    """

    qa_id: str
    {{metric_name}}_score: float
    {{metric_name}}_insight: str
    # Add more metric-specific fields
```
2. Create the Metric Implementation
Create fair_forge/metrics/{metric_name}.py:
```python
"""{{MetricName}} metric for {{description}}."""

from fair_forge.core import FairForge, Retriever
from fair_forge.schemas import Batch
from fair_forge.schemas.{{metric_name}} import {{MetricName}}Metric


class {{MetricName}}(FairForge):
    """{{Description}}.

    Args:
        retriever: Retriever class for loading datasets
        **kwargs: Additional arguments passed to FairForge base class
    """

    def __init__(
        self,
        retriever: type[Retriever],
        # Add your parameters here, with defaults
        **kwargs,
    ):
        super().__init__(retriever, **kwargs)
        # Initialize your metric-specific attributes

        self.logger.info("--{{METRIC_NAME}} CONFIGURATION--")
        # Log configuration for debugging

    def batch(
        self,
        session_id: str,
        context: str,
        assistant_id: str,
        batch: list[Batch],
        language: str | None = "english",
    ):
        """Process a batch of conversations.

        Args:
            session_id: Unique session identifier
            context: Context information for the conversation
            assistant_id: ID of the assistant being evaluated
            batch: List of Q&A interactions to evaluate
            language: Language of the conversation
        """
        for interaction in batch:
            self.logger.debug(f"QA ID: {interaction.qa_id}")

            # Your evaluation logic here
            score = self._evaluate(interaction)

            metric = {{MetricName}}Metric(
                session_id=session_id,
                assistant_id=assistant_id,
                qa_id=interaction.qa_id,
                {{metric_name}}_score=score,
                {{metric_name}}_insight="Evaluation explanation",
            )

            self.metrics.append(metric)

    def _evaluate(self, interaction: Batch) -> float:
        """Evaluate a single interaction.

        Args:
            interaction: The Q&A interaction to evaluate

        Returns:
            Evaluation score between 0.0 and 1.0
        """
        # Implement your evaluation logic
        return 0.0
```
3. Update Module Exports
Add to fair_forge/metrics/__init__.py:
```python
# In the __all__ list:
__all__ = [
    # ... existing metrics
    "{{MetricName}}",
]

# In the module docstring, add the import example:
"""
    from fair_forge.metrics.{{metric_name}} import {{MetricName}}
"""
```
3b. Update pyproject.toml
Add the metric to the optional dependencies in pyproject.toml:
```toml
[project.optional-dependencies]
# Pick ONE of the two forms below -- TOML forbids duplicate keys.

# For LLM-based metrics (no extra dependencies; the user installs their LLM provider):
{{metric_name}} = []

# For data-based metrics with dependencies:
{{metric_name}} = [
    "numpy>=1.24.0",
    # Add required dependencies
]

# Also update the metrics group to include the new metric:
metrics = [
    "alquimia-fair-forge[context,conversational,bestof,agentic,regulatory,{{metric_name}},humanity,toxicity,bias]",
]
```
4. Create Test Fixtures
Add to tests/fixtures/mock_data.py:
```python
def create_{{metric_name}}_dataset() -> Dataset:
    """Create a dataset for {{MetricName}} metric testing."""
    return Dataset(
        session_id="{{metric_name}}_session_001",
        assistant_id="test_assistant",
        language="english",
        context="Test context for {{metric_name}} evaluation.",
        conversation=[
            Batch(
                qa_id="{{metric_name}}_qa_001",
                query="Test query",
                assistant="Test assistant response",
                ground_truth_assistant="Expected response",
            ),
            # Add more test interactions
        ],
    )
```
Add to tests/fixtures/mock_retriever.py:
```python
from tests.fixtures.mock_data import create_{{metric_name}}_dataset


class {{MetricName}}DatasetRetriever(Retriever):
    """Mock retriever for {{MetricName}} metric testing."""

    def load_dataset(self) -> list[Dataset]:
        """Return the {{metric_name}} testing dataset."""
        return [create_{{metric_name}}_dataset()]
```
5. Update conftest.py
Add to tests/conftest.py:
```python
# Import in the imports section:
from tests.fixtures.mock_data import create_{{metric_name}}_dataset
from tests.fixtures.mock_retriever import {{MetricName}}DatasetRetriever


# Add fixtures:
@pytest.fixture
def {{metric_name}}_dataset() -> Dataset:
    """Fixture providing a {{metric_name}} testing dataset."""
    return create_{{metric_name}}_dataset()


@pytest.fixture
def {{metric_name}}_dataset_retriever() -> type[{{MetricName}}DatasetRetriever]:
    """Fixture providing the {{MetricName}}DatasetRetriever class."""
    return {{MetricName}}DatasetRetriever
```
6. Create Tests
Create tests/metrics/test_{metric_name}.py:
```python
"""Unit tests for {{MetricName}} metric."""

from fair_forge.metrics.{{metric_name}} import {{MetricName}}
from fair_forge.schemas.{{metric_name}} import {{MetricName}}Metric


class Test{{MetricName}}Metric:
    """Test suite for {{MetricName}} metric."""

    def test_initialization(self, {{metric_name}}_dataset_retriever):
        """Test that {{MetricName}} metric initializes correctly."""
        metric = {{MetricName}}({{metric_name}}_dataset_retriever)
        assert metric is not None
        assert hasattr(metric, "metrics")
        assert metric.metrics == []

    def test_batch_processing(self, {{metric_name}}_dataset_retriever, {{metric_name}}_dataset):
        """Test batch processing of interactions."""
        metric = {{MetricName}}({{metric_name}}_dataset_retriever)

        dataset = {{metric_name}}_dataset
        metric.batch(
            session_id=dataset.session_id,
            context=dataset.context,
            assistant_id=dataset.assistant_id,
            batch=dataset.conversation,
            language=dataset.language,
        )

        assert len(metric.metrics) == len(dataset.conversation)

        for m in metric.metrics:
            assert isinstance(m, {{MetricName}}Metric)
            assert hasattr(m, "{{metric_name}}_score")

    def test_run_method(self, {{metric_name}}_dataset_retriever):
        """Test the run class method."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=False)

        assert isinstance(metrics, list)
        assert len(metrics) > 0

        for m in metrics:
            assert isinstance(m, {{MetricName}}Metric)

    def test_verbose_mode(self, {{metric_name}}_dataset_retriever):
        """Test that verbose mode works without errors."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=True)
        assert isinstance(metrics, list)

    def test_metric_attributes(self, {{metric_name}}_dataset_retriever):
        """Test that all expected attributes exist in {{MetricName}}Metric."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=False)

        assert len(metrics) > 0
        m = metrics[0]

        required_attributes = [
            "session_id",
            "assistant_id",
            "qa_id",
            "{{metric_name}}_score",
            "{{metric_name}}_insight",
        ]

        for attr in required_attributes:
            assert hasattr(m, attr), f"Missing attribute: {attr}"
```
Metric Categories
Simple Metrics (like Humanity)
- No external dependencies beyond base libraries
- Process each interaction independently
- Use lexicons or rule-based evaluation
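To illustrate the lexicon-based style, a _evaluate() for a simple metric might score an interaction like this (the lexicon and the scoring rule here are hypothetical, not taken from the library):

```python
# Hypothetical politeness lexicon; a real metric would ship a curated list.
POLITE_LEXICON = {"please", "thanks", "thank", "glad", "happy", "welcome"}


def lexicon_score(text: str) -> float:
    """Fraction of tokens found in the lexicon, clamped to [0.0, 1.0]."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in POLITE_LEXICON)
    return min(1.0, hits / len(tokens))
```

Inside batch(), such a function would be called on interaction.assistant and the result stored in the {{metric_name}}_score field.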
LLM-Judge Metrics (like Context, Conversational)
- Require a BaseChatModel parameter
- Use the Judge class from fair_forge.llm
- Need prompt templates in fair_forge/llm/prompts.py
Guardian-Based Metrics (like Bias)
- Require a Guardian class for evaluation
- Use statistical confidence intervals
- Need guardian implementations in fair_forge/guardians/
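For intuition about the confidence intervals involved, here is a standalone Wilson score interval for a guardian's detection rate. This is only a sketch of the statistical idea; the library's own FrequentistMode and BayesianMode (shown under Common Patterns) are the APIs actually used, and may employ a different estimator:

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))


low, high = wilson_interval(successes=8, trials=10)  # interval around 0.8
```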
Aggregation Metrics (like BestOf, Agentic)
- Override _process() instead of just batch()
- Compare multiple responses or assistants
- Return aggregated results
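The aggregation step a custom _process() performs can be reduced to a small helper. The data shape here is hypothetical (per-qa_id score maps keyed by assistant_id), purely to show the reduce-across-assistants idea:

```python
def pick_best(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """For each qa_id, pick the assistant_id with the highest score."""
    return {
        qa_id: max(per_assistant, key=per_assistant.get)
        for qa_id, per_assistant in scores.items()
    }


best = pick_best({"qa_001": {"assistant_a": 0.7, "assistant_b": 0.9}})
# best == {"qa_001": "assistant_b"}
```

In a real aggregation metric, batch() would collect the per-candidate scores and _process() would run a reduction like this before appending the aggregated results.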
Common Patterns
Using the Judge for LLM Evaluation
```python
from fair_forge.llm import Judge

judge = Judge(
    model=self.model,
    use_structured_output=self.use_structured_output,
    bos_json_clause=self.bos_json_clause,
    eos_json_clause=self.eos_json_clause,
)

reasoning, result = judge.check(
    system_prompt,
    user_query,
    data_dict,
    output_schema=YourOutputSchema,
)
```
Statistical Analysis
```python
from fair_forge.statistical import FrequentistMode, BayesianMode

# For frequentist statistics
mode = FrequentistMode()
rate = mode.rate_estimation(successes=k, trials=n)

# For Bayesian statistics
mode = BayesianMode(mc_samples=5000)
rate = mode.rate_estimation(successes=k, trials=n)
```
Logging Best Practices
```python
# Use self.logger for all logging
self.logger.info("Processing batch...")
self.logger.debug(f"QA ID: {interaction.qa_id}")
self.logger.warning("Optional field missing, using default")
```
7. Create Example Notebook
Create the example directory structure and files:
```bash
mkdir -p examples/{{metric_name}}/jupyter examples/{{metric_name}}/data
```
Create examples/{{metric_name}}/data/dataset.json with sample test data:
```json
[
  {
    "session_id": "{{metric_name}}_session_001",
    "assistant_id": "test_assistant",
    "language": "english",
    "context": "Sample context for {{metric_name}} evaluation",
    "conversation": [
      {
        "qa_id": "qa_001",
        "query": "Sample user query",
        "assistant": "Sample assistant response",
        "ground_truth_assistant": "Expected response"
      }
    ]
  }
]
```
Create examples/{{metric_name}}/jupyter/{{metric_name}}.ipynb with:
- Title & Introduction - Explain the metric and use cases
- Installation - !pip install "alquimia-fair-forge[{{metric_name}}]" langchain-groq -q
- Setup - Import modules and configure API keys
- Custom Retriever - Load the sample dataset
- Configuration - Any metric-specific parameters (e.g., regulations list)
- Run Metric - Execute and show results
- Analyze Results - Display scores and insights
- Export Results - Save to JSON for reporting
8. For LLM-Judge Metrics: Add Judge Output Schema
Add to fair_forge/llm/schemas.py:
```python
class {{MetricName}}JudgeOutput(BaseModel):
    """Structured output for {{metric_name}} evaluation."""

    {{metric_name}}_score: float = Field(
        ge=0, le=1, description="{{MetricName}} score (0-1)"
    )
    insight: str = Field(description="Insight about the evaluation")
    # Add metric-specific fields
```
Add to fair_forge/llm/__init__.py:
```python
from .schemas import (
    # ... existing exports
    {{MetricName}}JudgeOutput,
)

__all__ = [
    # ... existing exports
    "{{MetricName}}JudgeOutput",
]
```
Add prompt to fair_forge/llm/prompts.py:
```python
{{metric_name}}_reasoning_system_prompt = """
You are a {{MetricName}} Analyzer. Your role is to evaluate...

1. **Step 1:** ...
2. **Step 2:** ...

## Input Data:
{input_field}

## Assistant's Response:
{assistant_answer}
"""
```
Add tests to tests/llm/test_schemas.py:
```python
class Test{{MetricName}}JudgeOutput:
    """Tests for {{MetricName}}JudgeOutput schema."""

    def test_valid_output(self):
        output = {{MetricName}}JudgeOutput(
            {{metric_name}}_score=0.85,
            insight="Good evaluation",
        )
        assert output.{{metric_name}}_score == 0.85

    def test_score_bounds(self):
        with pytest.raises(ValidationError):
            {{MetricName}}JudgeOutput({{metric_name}}_score=1.5, insight="Test")
```
Verification Checklist
After creating all files, verify:
- All files listed in "Files to Create" exist
- {{MetricName}} is exported from fair_forge/metrics/__init__.py
- The optional dependency group is registered in pyproject.toml
- pytest tests/metrics/test_{{metric_name}}.py passes
- The example notebook runs end to end against examples/{{metric_name}}/data/dataset.json
Template Files
See the templates/ directory for ready-to-use boilerplate:
- metric.py.template - Basic metric implementation
- schema.py.template - Schema definition
- test.py.template - Test file structure