Fair-Forge Metric Creator
Create new metrics for the Fair-Forge AI evaluation library. This skill generates all required files following the established patterns.
Usage
/metric-creator [metric-name] [optional description]
Examples:
/metric-creator safety "Evaluate AI response safety and harmlessness"
/metric-creator coherence "Measure logical coherence in multi-turn conversations"
/metric-creator factuality
Files to Create
For a new metric called {MetricName}:
| File | Purpose |
|---|---|
| fair_forge/metrics/{metric_name}.py | Metric implementation |
| fair_forge/schemas/{metric_name}.py | Pydantic schema for results |
| tests/metrics/test_{metric_name}.py | Unit tests |
| tests/fixtures/mock_data.py | Add create_{metric_name}_dataset() |
| tests/fixtures/mock_retriever.py | Add {MetricName}DatasetRetriever |
| pyproject.toml | Add optional dependency group |
| examples/{metric_name}/jupyter/{metric_name}.ipynb | Example notebook |
| examples/{metric_name}/data/dataset.json | Sample dataset for examples |
For LLM-Judge Metrics (additional files)
| File | Purpose |
|---|---|
| fair_forge/llm/schemas.py | Add {MetricName}JudgeOutput schema |
| fair_forge/llm/prompts.py | Add {metric_name}_reasoning_system_prompt |
| fair_forge/llm/__init__.py | Export {MetricName}JudgeOutput |
| tests/llm/test_schemas.py | Add Test{MetricName}JudgeOutput tests |
Architecture Pattern
All metrics follow this pattern:
FairForge (base class)
└── YourMetric
├── __init__(): Initialize with retriever and config
├── batch(): Process each conversation batch
└── (optional) _process(): Override for custom aggregation
Data Flow
Retriever.load_dataset() -> list[Dataset]
↓
FairForge._process() iterates datasets
↓
YourMetric.batch() processes each conversation
↓
Results appended to self.metrics
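The flow can be traced with simplified Python stand-ins. These mimic, but are not, the real FairForge and Retriever classes (which live in fair_forge.core and take more arguments); the dict-based dataset is purely illustrative:

```python
# Simplified stand-ins to illustrate the data flow only; the real
# classes carry context, assistant_id, language, logging, etc.
class Retriever:
    def load_dataset(self):
        return [{"session_id": "s1", "conversation": [{"qa_id": "qa1"}]}]


class FairForge:
    def __init__(self, retriever):
        self.retriever = retriever()
        self.metrics = []

    def _process(self):
        # Iterate datasets and hand each conversation to batch()
        for ds in self.retriever.load_dataset():
            self.batch(ds["session_id"], ds["conversation"])

    @classmethod
    def run(cls, retriever):
        instance = cls(retriever)
        instance._process()
        return instance.metrics


class MyMetric(FairForge):
    def batch(self, session_id, batch):
        # Results are appended to self.metrics
        for interaction in batch:
            self.metrics.append({"qa_id": interaction["qa_id"], "score": 1.0})


results = MyMetric.run(Retriever)
```

Each run therefore produces one result entry per interaction, collected on the instance and returned by the run class method.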
Step-by-Step Workflow
1. Create the Schema
First, create the schema in fair_forge/schemas/{metric_name}.py:
```python
"""{{MetricName}} metric schemas."""

from .metrics import BaseMetric


class {{MetricName}}Metric(BaseMetric):
    """
    {{MetricName}} metric for evaluating {{description}}.

    Attributes:
        qa_id: Unique identifier for the Q&A interaction
        {{metric_name}}_score: Main evaluation score (0.0-1.0)
        {{metric_name}}_insight: Explanation of the evaluation
    """

    qa_id: str
    {{metric_name}}_score: float
    {{metric_name}}_insight: str
    # Add more metric-specific fields
```
2. Create the Metric Implementation
Create fair_forge/metrics/{metric_name}.py:
```python
"""{{MetricName}} metric for {{description}}."""

from fair_forge.core import FairForge, Retriever
from fair_forge.schemas import Batch
from fair_forge.schemas.{{metric_name}} import {{MetricName}}Metric


class {{MetricName}}(FairForge):
    """{{Description}}.

    Args:
        retriever: Retriever class for loading datasets
        **kwargs: Additional arguments passed to FairForge base class
    """

    def __init__(
        self,
        retriever: type[Retriever],
        # Add your parameters here, with defaults
        **kwargs,
    ):
        super().__init__(retriever, **kwargs)
        # Initialize your metric-specific attributes

        self.logger.info("--{{METRIC_NAME}} CONFIGURATION--")
        # Log configuration for debugging

    def batch(
        self,
        session_id: str,
        context: str,
        assistant_id: str,
        batch: list[Batch],
        language: str | None = "english",
    ):
        """Process a batch of conversations.

        Args:
            session_id: Unique session identifier
            context: Context information for the conversation
            assistant_id: ID of the assistant being evaluated
            batch: List of Q&A interactions to evaluate
            language: Language of the conversation
        """
        for interaction in batch:
            self.logger.debug(f"QA ID: {interaction.qa_id}")

            # Your evaluation logic here
            score = self._evaluate(interaction)

            metric = {{MetricName}}Metric(
                session_id=session_id,
                assistant_id=assistant_id,
                qa_id=interaction.qa_id,
                {{metric_name}}_score=score,
                {{metric_name}}_insight="Evaluation explanation",
            )

            self.metrics.append(metric)

    def _evaluate(self, interaction: Batch) -> float:
        """Evaluate a single interaction.

        Args:
            interaction: The Q&A interaction to evaluate

        Returns:
            Evaluation score between 0.0 and 1.0
        """
        # Implement your evaluation logic
        return 0.0
```
3. Update Module Exports
Add to fair_forge/metrics/__init__.py:
```python
# In the __all__ list:
__all__ = [
    # ... existing metrics
    "{{MetricName}}",
]

# In the module docstring, add the import example:
"""
    from fair_forge.metrics.{{metric_name}} import {{MetricName}}
"""
```
3b. Update pyproject.toml
Add the metric to the optional dependencies in pyproject.toml:
```toml
[project.optional-dependencies]
# Pick ONE of the two forms below -- TOML forbids duplicate keys.

# For LLM-based metrics (no extra dependencies; the user installs their LLM provider):
{{metric_name}} = []

# For data-based metrics with dependencies:
{{metric_name}} = [
    "numpy>=1.24.0",
    # Add required dependencies
]

# Also update the metrics group to include the new metric:
metrics = [
    "alquimia-fair-forge[context,conversational,bestof,agentic,regulatory,{{metric_name}},humanity,toxicity,bias]",
]
```
4. Create Test Fixtures
Add to tests/fixtures/mock_data.py:
```python
def create_{{metric_name}}_dataset() -> Dataset:
    """Create a dataset for {{MetricName}} metric testing."""
    return Dataset(
        session_id="{{metric_name}}_session_001",
        assistant_id="test_assistant",
        language="english",
        context="Test context for {{metric_name}} evaluation.",
        conversation=[
            Batch(
                qa_id="{{metric_name}}_qa_001",
                query="Test query",
                assistant="Test assistant response",
                ground_truth_assistant="Expected response",
            ),
            # Add more test interactions
        ],
    )
```
Add to tests/fixtures/mock_retriever.py:
```python
from tests.fixtures.mock_data import create_{{metric_name}}_dataset


class {{MetricName}}DatasetRetriever(Retriever):
    """Mock retriever for {{MetricName}} metric testing."""

    def load_dataset(self) -> list[Dataset]:
        """Return the {{metric_name}} testing dataset."""
        return [create_{{metric_name}}_dataset()]
```
5. Update conftest.py
Add to tests/conftest.py:
```python
# Import in the imports section:
from tests.fixtures.mock_data import create_{{metric_name}}_dataset
from tests.fixtures.mock_retriever import {{MetricName}}DatasetRetriever


# Add fixtures:
@pytest.fixture
def {{metric_name}}_dataset() -> Dataset:
    """Fixture providing a {{metric_name}} testing dataset."""
    return create_{{metric_name}}_dataset()


@pytest.fixture
def {{metric_name}}_dataset_retriever() -> type[{{MetricName}}DatasetRetriever]:
    """Fixture providing the {{MetricName}}DatasetRetriever class."""
    return {{MetricName}}DatasetRetriever
```
6. Create Tests
Create tests/metrics/test_{metric_name}.py:
```python
"""Unit tests for {{MetricName}} metric."""

from fair_forge.metrics.{{metric_name}} import {{MetricName}}
from fair_forge.schemas.{{metric_name}} import {{MetricName}}Metric


class Test{{MetricName}}Metric:
    """Test suite for {{MetricName}} metric."""

    def test_initialization(self, {{metric_name}}_dataset_retriever):
        """Test that {{MetricName}} metric initializes correctly."""
        metric = {{MetricName}}({{metric_name}}_dataset_retriever)
        assert metric is not None
        assert hasattr(metric, "metrics")
        assert metric.metrics == []

    def test_batch_processing(self, {{metric_name}}_dataset_retriever, {{metric_name}}_dataset):
        """Test batch processing of interactions."""
        metric = {{MetricName}}({{metric_name}}_dataset_retriever)

        dataset = {{metric_name}}_dataset
        metric.batch(
            session_id=dataset.session_id,
            context=dataset.context,
            assistant_id=dataset.assistant_id,
            batch=dataset.conversation,
            language=dataset.language,
        )

        assert len(metric.metrics) == len(dataset.conversation)

        for m in metric.metrics:
            assert isinstance(m, {{MetricName}}Metric)
            assert hasattr(m, "{{metric_name}}_score")

    def test_run_method(self, {{metric_name}}_dataset_retriever):
        """Test the run class method."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=False)

        assert isinstance(metrics, list)
        assert len(metrics) > 0

        for m in metrics:
            assert isinstance(m, {{MetricName}}Metric)

    def test_verbose_mode(self, {{metric_name}}_dataset_retriever):
        """Test that verbose mode works without errors."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=True)
        assert isinstance(metrics, list)

    def test_metric_attributes(self, {{metric_name}}_dataset_retriever):
        """Test that all expected attributes exist in {{MetricName}}Metric."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=False)

        assert len(metrics) > 0
        m = metrics[0]

        required_attributes = [
            "session_id",
            "assistant_id",
            "qa_id",
            "{{metric_name}}_score",
            "{{metric_name}}_insight",
        ]

        for attr in required_attributes:
            assert hasattr(m, attr), f"Missing attribute: {attr}"
```
Metric Categories
Simple Metrics (like Humanity)
- No external dependencies beyond base libraries
- Process each interaction independently
- Use lexicons or rule-based evaluation
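To illustrate the lexicon-based style, a _evaluate() for a simple metric might score an interaction like this (the lexicon and the scoring rule here are hypothetical, not taken from the library):

```python
# Hypothetical politeness lexicon; a real metric would ship a curated list.
POLITE_LEXICON = {"please", "thanks", "thank", "glad", "happy", "welcome"}


def lexicon_score(text: str) -> float:
    """Fraction of tokens found in the lexicon, clamped to [0.0, 1.0]."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in POLITE_LEXICON)
    return min(1.0, hits / len(tokens))
```

Inside batch(), such a function would be called on interaction.assistant and the result stored in the {{metric_name}}_score field.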
LLM-Judge Metrics (like Context, Conversational)
- Require a BaseChatModel parameter
- Use the Judge class from fair_forge.llm
- Need prompt templates in fair_forge/llm/prompts.py
Guardian-Based Metrics (like Bias)
- Require a Guardian class for evaluation
- Use statistical confidence intervals
- Need guardian implementations in fair_forge/guardians/
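For intuition about the confidence intervals involved, here is a standalone Wilson score interval for a guardian's detection rate. This is only a sketch of the statistical idea; the library's own FrequentistMode and BayesianMode (shown under Common Patterns) are the APIs actually used, and may employ a different estimator:

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))


low, high = wilson_interval(successes=8, trials=10)  # interval around 0.8
```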
Aggregation Metrics (like BestOf, Agentic)
- Override _process() instead of just batch()
- Compare multiple responses or assistants
- Return aggregated results
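The aggregation step a custom _process() performs can be reduced to a small helper. The data shape here is hypothetical (per-qa_id score maps keyed by assistant_id), purely to show the reduce-across-assistants idea:

```python
def pick_best(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """For each qa_id, pick the assistant_id with the highest score."""
    return {
        qa_id: max(per_assistant, key=per_assistant.get)
        for qa_id, per_assistant in scores.items()
    }


best = pick_best({"qa_001": {"assistant_a": 0.7, "assistant_b": 0.9}})
# best == {"qa_001": "assistant_b"}
```

In a real aggregation metric, batch() would collect the per-candidate scores and _process() would run a reduction like this before appending the aggregated results.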
Common Patterns
Using the Judge for LLM Evaluation
```python
from fair_forge.llm import Judge

judge = Judge(
    model=self.model,
    use_structured_output=self.use_structured_output,
    bos_json_clause=self.bos_json_clause,
    eos_json_clause=self.eos_json_clause,
)

reasoning, result = judge.check(
    system_prompt,
    user_query,
    data_dict,
    output_schema=YourOutputSchema,
)
```
Statistical Analysis
```python
from fair_forge.statistical import FrequentistMode, BayesianMode

# For frequentist statistics
mode = FrequentistMode()
rate = mode.rate_estimation(successes=k, trials=n)

# For Bayesian statistics
mode = BayesianMode(mc_samples=5000)
rate = mode.rate_estimation(successes=k, trials=n)
```
Logging Best Practices
```python
# Use self.logger for all logging
self.logger.info("Processing batch...")
self.logger.debug(f"QA ID: {interaction.qa_id}")
self.logger.warning("Optional field missing, using default")
```
7. Create Example Notebook
Create the example directory structure and files:
```bash
mkdir -p examples/{{metric_name}}/jupyter examples/{{metric_name}}/data
```
Create examples/{{metric_name}}/data/dataset.json with sample test data:
```json
[
  {
    "session_id": "{{metric_name}}_session_001",
    "assistant_id": "test_assistant",
    "language": "english",
    "context": "Sample context for {{metric_name}} evaluation",
    "conversation": [
      {
        "qa_id": "qa_001",
        "query": "Sample user query",
        "assistant": "Sample assistant response",
        "ground_truth_assistant": "Expected response"
      }
    ]
  }
]
```
Create examples/{{metric_name}}/jupyter/{{metric_name}}.ipynb with:
- Title & Introduction - Explain the metric and use cases
- Installation - !pip install "alquimia-fair-forge[{{metric_name}}]" langchain-groq -q
- Setup - Import modules and configure API keys
- Custom Retriever - Load the sample dataset
- Configuration - Any metric-specific parameters (e.g., regulations list)
- Run Metric - Execute and show results
- Analyze Results - Display scores and insights
- Export Results - Save to JSON for reporting
8. For LLM-Judge Metrics: Add Judge Output Schema
Add to fair_forge/llm/schemas.py:
```python
class {{MetricName}}JudgeOutput(BaseModel):
    """Structured output for {{metric_name}} evaluation."""

    {{metric_name}}_score: float = Field(
        ge=0, le=1, description="{{MetricName}} score (0-1)"
    )
    insight: str = Field(description="Insight about the evaluation")
    # Add metric-specific fields
```
Add to fair_forge/llm/__init__.py:
```python
from .schemas import (
    # ... existing exports
    {{MetricName}}JudgeOutput,
)

__all__ = [
    # ... existing exports
    "{{MetricName}}JudgeOutput",
]
```
Add prompt to fair_forge/llm/prompts.py:
```python
{{metric_name}}_reasoning_system_prompt = """
You are a {{MetricName}} Analyzer. Your role is to evaluate...

1. **Step 1:** ...
2. **Step 2:** ...

## Input Data:
{input_field}

## Assistant's Response:
{assistant_answer}
"""
```
Add tests to tests/llm/test_schemas.py:
```python
class Test{{MetricName}}JudgeOutput:
    """Tests for {{MetricName}}JudgeOutput schema."""

    def test_valid_output(self):
        output = {{MetricName}}JudgeOutput(
            {{metric_name}}_score=0.85,
            insight="Good evaluation",
        )
        assert output.{{metric_name}}_score == 0.85

    def test_score_bounds(self):
        with pytest.raises(ValidationError):
            {{MetricName}}JudgeOutput({{metric_name}}_score=1.5, insight="Test")
```
Verification Checklist
After creating all files, verify:
- All files listed in "Files to Create" exist
- {{MetricName}} is exported from fair_forge/metrics/__init__.py
- The optional dependency group is registered in pyproject.toml
- pytest tests/metrics/test_{{metric_name}}.py passes
- The example notebook runs end to end against examples/{{metric_name}}/data/dataset.json
Template Files
See the templates/ directory for ready-to-use boilerplate:
- metric.py.template - Basic metric implementation
- schema.py.template - Schema definition
- test.py.template - Test file structure