LLM Evaluation
Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.
When to Use This Skill
Apply this skill when:
- Testing individual prompts for correctness and formatting
- Validating RAG (Retrieval-Augmented Generation) pipeline quality
- Measuring hallucinations, bias, or toxicity in LLM outputs
- Comparing different models or prompt configurations (A/B testing)
- Running benchmark tests (MMLU, HumanEval) to assess model capabilities
- Setting up production monitoring for LLM applications
- Integrating LLM quality checks into CI/CD pipelines
Common triggers:
- "How do I test if my RAG system is working correctly?"
- "How can I measure hallucinations in LLM outputs?"
- "What metrics should I use to evaluate generation quality?"
- "How do I compare GPT-4 vs Claude for my use case?"
- "How do I detect bias in LLM responses?"
Evaluation Strategy Selection
Decision Framework: Which Evaluation Approach?
By Task Type:
| Task Type | Primary Approach | Metrics | Tools |
|---|---|---|---|
| Classification (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| Generation (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| Question Answering | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| RAG Systems | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| Code Generation | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| Multi-step Agents | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |
By Volume and Cost:
| Samples | Speed | Cost | Recommended Approach |
|---|---|---|---|
| 1,000+ | Immediate | $0 | Automated metrics (regex, JSON validation) |
| 100-1,000 | Minutes | $0.01-0.10 each | LLM-as-judge (GPT-4, Claude) |
| < 100 | Hours | $1-10 each | Human evaluation (pairwise comparison) |
Layered Approach (Recommended for Production):
- Layer 1: Automated metrics for all outputs (fast, cheap)
- Layer 2: LLM-as-judge for 10% sample (nuanced quality)
- Layer 3: Human review for 1% edge cases (validation)
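The layered routing above can be sketched as a deterministic sampler. This is an illustrative sketch, not a library API: the function name, sampling rates, and hash constant are all assumptions; in production you would wire the escalation layers to your actual judge and review queues.

```python
def assign_layers(response_id: int, judge_rate: float = 0.10, human_rate: float = 0.01) -> list[str]:
    """Route a response through the evaluation layers.

    Every response gets Layer 1 (automated metrics); a deterministic
    hash bucket escalates ~10% to LLM-as-judge and ~1% to human review.
    """
    layers = ["automated"]
    # Knuth-style multiplicative hash keeps the sample stable across reruns,
    # so the same response always lands in the same layer.
    bucket = (response_id * 2654435761 % 2**32) / 2**32
    if bucket < judge_rate:
        layers.append("llm_judge")
        if bucket < human_rate:
            layers.append("human_review")
    return layers
```

Because the human-review bucket is nested inside the judge bucket, Layer 3 always validates a subset of Layer 2's samples.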
Core Evaluation Patterns
Unit Evaluation (Individual Prompts)
Test single prompt-response pairs for correctness.
Methods:
- Exact Match: Response exactly matches expected output
- Regex Matching: Response follows expected pattern
- JSON Schema Validation: Structured output validation
- Keyword Presence: Required terms appear in response
- LLM-as-Judge: Binary pass/fail using evaluation prompt
Example Use Cases:
- Email classification (spam/not spam)
- Entity extraction (dates, names, locations)
- JSON output formatting validation
- Sentiment analysis (positive/negative/neutral)
Quick Start (Python):
```python
import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
```
For complete unit evaluation examples, see examples/python/unit_evaluation.py and examples/typescript/unit-evaluation.ts.
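The cheaper automated checks listed above (regex matching, JSON schema validation, keyword presence) need no API call at all. A minimal sketch — the function names are illustrative, not from any library:

```python
import json
import re

def check_regex(response: str, pattern: str) -> bool:
    """Does the full response match an expected pattern, e.g. a date format?"""
    return re.fullmatch(pattern, response.strip()) is not None

def check_json_keys(response: str, required: set[str]) -> bool:
    """Is the response valid JSON containing all required keys?"""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required <= parsed.keys()

def check_keywords(response: str, keywords: list[str]) -> bool:
    """Do all required terms appear (case-insensitive)?"""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)
```

These run in microseconds, which is why they make sense as the Layer 1 check applied to every output.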
RAG (Retrieval-Augmented Generation) Evaluation
Evaluate RAG systems using RAGAS framework metrics.
Critical Metrics (Priority Order):
1. Faithfulness (Target: > 0.8) - MOST CRITICAL
   - Measures: Is the answer grounded in retrieved context?
   - Prevents hallucinations
   - If failing: Adjust prompt to emphasize grounding, require citations
2. Answer Relevance (Target: > 0.7)
   - Measures: How well does the answer address the query?
   - If failing: Improve prompt instructions, add few-shot examples
3. Context Relevance (Target: > 0.7)
   - Measures: Are retrieved chunks relevant to the query?
   - If failing: Improve retrieval (better embeddings, hybrid search)
4. Context Precision (Target: > 0.5)
   - Measures: Are relevant chunks ranked higher than irrelevant?
   - If failing: Add re-ranking step to retrieval pipeline
5. Context Recall (Target: > 0.8)
   - Measures: Are all relevant chunks retrieved?
   - If failing: Increase retrieval count, improve chunking strategy
Quick Start (Python with RAGAS):
```python
from ragas import evaluate
# Note: metric names vary across ragas versions; context_relevancy was
# deprecated in later releases in favor of context precision/recall metrics.
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")
```
For comprehensive RAG evaluation patterns, see references/rag-evaluation.md and examples/python/ragas_example.py.
LLM-as-Judge Evaluation
Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.
When to Use:
- Generation quality assessment (summaries, creative writing)
- Nuanced evaluation criteria (tone, clarity, helpfulness)
- Custom rubrics for domain-specific tasks
- Medium-volume evaluation (100-1,000 samples)
Correlation with Human Judgment: 0.75-0.85 for well-designed rubrics
Best Practices:
- Use clear, specific rubrics (1-5 scale with detailed criteria)
- Include few-shot examples in evaluation prompt
- Average multiple evaluations to reduce variance
- Be aware of biases (position bias, verbosity bias, self-preference)
Quick Start (Python):
```python
import re
from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.

USER PROMPT: {prompt}
LLM RESPONSE: {response}

Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content
    # Parse with regex rather than fixed line positions: judge output is not
    # guaranteed to follow the template exactly (extra blank lines, preamble).
    score_match = re.search(r"Score:\s*(\d)", content)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", content)
    if not score_match:
        raise ValueError(f"Could not parse judge output: {content!r}")
    score = int(score_match.group(1))
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning
```
For detailed LLM-as-judge patterns and prompt templates, see references/llm-as-judge.md and examples/python/llm_as_judge.py.
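One practical way to apply the bias mitigations from the best-practices list: judge each pairwise comparison in both presentation orders and only count verdicts that agree, and average repeated scoring runs. A sketch with the judge calls themselves elided; function names are illustrative:

```python
def debias_pairwise(verdict_ab: str, verdict_ba: str) -> str:
    """Combine two pairwise-judge verdicts with swapped presentation order.

    verdict_ab: winner ("A" or "B") when response A was shown first.
    verdict_ba: winner when the order was swapped, reported in the ORIGINAL
                labels (so "A" still means response A).
    Inconsistent verdicts become a tie - position bias likely decided them.
    """
    if verdict_ab == verdict_ba:
        return verdict_ab
    return "tie"

def mean_score(scores: list[int]) -> float:
    """Average several judge runs on the same item to reduce variance."""
    return sum(scores) / len(scores)
```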
Safety and Alignment Evaluation
Measure hallucinations, bias, and toxicity in LLM outputs.
Hallucination Detection
Methods:
1. Faithfulness to Context (RAG):
   - Use RAGAS faithfulness metric
   - LLM checks if claims are supported by context
   - Score: Supported claims / Total claims
2. Factual Accuracy (Closed-Book):
   - LLM-as-judge with access to reliable sources
   - Fact-checking APIs (Google Fact Check)
   - Entity-level verification (dates, names, statistics)
3. Self-Consistency:
   - Generate multiple responses to same question
   - Measure agreement between responses
   - Low consistency suggests hallucination
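The self-consistency method reduces to measuring agreement among sampled answers. A minimal sketch — the normalization here is deliberately naive (lowercasing only); real answers usually need semantic matching:

```python
from collections import Counter

def self_consistency(responses: list[str]) -> float:
    """Fraction of responses that agree with the majority answer.

    Low scores (e.g. < 0.5) suggest the model is guessing - a
    hallucination signal for factual questions.
    """
    normalized = [r.strip().lower() for r in responses]
    _, majority_count = Counter(normalized).most_common(1)[0]
    return majority_count / len(normalized)
```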
Bias Evaluation
Types of Bias:
- Gender bias (stereotypical associations)
- Racial/ethnic bias (discriminatory outputs)
- Cultural bias (Western-centric assumptions)
- Age/disability bias (ableist or ageist language)
Evaluation Methods:
1. Stereotype Tests:
   - BBQ (Bias Benchmark for QA): 58,000 question-answer pairs
   - BOLD (Bias in Open-Ended Language Generation)
2. Counterfactual Evaluation:
   - Generate responses with demographic swaps
   - Example: "Dr. Smith (he/she) recommended..." → compare outputs
   - Measure consistency across variations
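A counterfactual harness is template expansion plus a consistency check over the model outputs. In this sketch the names are illustrative and the comparison is a crude exact match; in practice you would compare outputs with embeddings or an LLM judge:

```python
def demographic_variants(template: str, slot: str, values: list[str]) -> list[str]:
    """Expand a template like 'Dr. Smith ({pronoun}) recommended...'
    into one prompt per demographic value."""
    return [template.replace("{" + slot + "}", v) for v in values]

def output_consistency(outputs: list[str]) -> float:
    """Fraction of outputs identical to the first one.

    1.0 means no demographic swap changed the model's response."""
    return sum(o == outputs[0] for o in outputs) / len(outputs)
```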
Toxicity Detection
Tools:
- Perspective API (Google): Toxicity, threat, insult scores
- Detoxify (HuggingFace): Open-source toxicity classifier
- OpenAI Moderation API: Hate, harassment, violence detection
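Whichever classifier you choose, its per-category scores reduce to a thresholding decision. The sketch below pairs that pure logic with an OpenAI Moderation call; the API shape reflects the current SDK and may differ across versions, and the 0.5 threshold is an assumption to tune per application:

```python
def is_toxic(category_scores: dict[str, float], threshold: float = 0.5) -> bool:
    """Flag a response if any category score crosses the threshold."""
    return any(score >= threshold for score in category_scores.values())

def moderate(text: str) -> bool:
    """Check text with the OpenAI Moderation API (makes a network call)."""
    from openai import OpenAI
    client = OpenAI()
    result = client.moderations.create(input=text)
    return result.results[0].flagged
```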
For comprehensive safety evaluation patterns, see references/safety-evaluation.md.
Benchmark Testing
Assess model capabilities using standardized benchmarks.
Standard Benchmarks:
| Benchmark | Coverage | Format | Difficulty | Use Case |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities) | Multiple choice | High school - professional | General intelligence |
| HellaSwag | Sentence completion | Multiple choice | Common sense | Reasoning validation |
| GPQA | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| HumanEval | 164 Python problems | Code generation | Medium | Code capability |
| MATH | 12,500 competition problems | Math solving | High school competitions | Math reasoning |
Domain-Specific Benchmarks:
- Medical: MedQA (USMLE), PubMedQA
- Legal: LegalBench
- Finance: FinQA, ConvFinQA
When to Use Benchmarks:
- Comparing multiple models (GPT-4 vs Claude vs Llama)
- Model selection for specific domains
- Baseline capability assessment
- Academic research and publication
Quick Start (lm-evaluation-harness):
```bash
pip install lm-eval

# Evaluate GPT-4 on MMLU
lm_eval --model openai-chat --model_args model=gpt-4 --tasks mmlu --num_fewshot 5
```
For detailed benchmark testing patterns, see references/benchmarks.md and scripts/benchmark_runner.py.
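HumanEval scores are reported as pass@k. The standard unbiased estimator (introduced with the benchmark) gives the probability that at least one of k samples drawn from n generations, c of them correct, passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: samples that passed the unit tests
    k: samples the metric assumes you draw
    """
    if n - c < k:
        return 1.0  # fewer failures than draws, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over all problems in the benchmark gives the headline number (e.g. pass@1).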
Production Evaluation
Monitor and optimize LLM quality in production environments.
A/B Testing
Compare two LLM configurations:
- Variant A: GPT-4 (expensive, high quality)
- Variant B: Claude Sonnet (cheaper, fast)
Metrics:
- User satisfaction scores (thumbs up/down)
- Task completion rates
- Response time and latency
- Cost per successful interaction
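To decide whether one variant's task-completion rate is genuinely better rather than noise, a two-proportion z-test is a reasonable first pass. A pure-Python sketch; |z| > 1.96 corresponds to p < 0.05 two-sided:

```python
import math

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """z-statistic for the difference between two completion rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pool the rates under the null hypothesis that the variants are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

For small samples or many simultaneous metrics, prefer a proper statistics library (e.g. `scipy.stats`) with multiple-comparison corrections.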
Online Evaluation
Real-time quality monitoring:
- Response Quality: LLM-as-judge scoring every Nth response
- User Feedback: Explicit ratings, thumbs up/down
- Business Metrics: Conversion rates, support ticket resolution
- Cost Tracking: Tokens used, inference costs
Human-in-the-Loop
Sample-based human evaluation:
- Random Sampling: Evaluate 10% of responses
- Confidence-Based: Evaluate low-confidence outputs
- Error-Triggered: Flag suspicious responses for review
For production evaluation patterns and monitoring strategies, see references/production-evaluation.md.
Classification Task Evaluation
For tasks with discrete outputs (sentiment, intent, category).
Metrics:
- Accuracy: Correct predictions / Total predictions
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1 Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed breakdown of prediction errors
Quick Start (Python):
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```
For complete classification evaluation examples, see examples/python/classification_metrics.py.
Generation Task Evaluation
For open-ended text generation (summaries, creative writing, responses).
Automated Metrics (Use with Caution):
- BLEU: N-gram overlap with reference text (0-1 score)
- ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-L)
- METEOR: Semantic similarity with stemming
- BERTScore: Contextual embedding similarity (0-1 score)
Limitation: Automated metrics correlate weakly with human judgment for creative/subjective generation.
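For intuition about what these overlap metrics actually compute, ROUGE-1 recall is just unigram overlap with the reference. A hand-rolled sketch; use an established package (e.g. rouge-score) for real evaluations:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate,
    with repeated words counted up to their reference frequency."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0
```

The weakness is visible immediately: a fluent paraphrase with different wording scores near zero, which is exactly why these metrics mislead on creative generation.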
Recommended Approach:
- Automated metrics: Fast feedback for objective aspects (length, format)
- LLM-as-judge: Nuanced quality assessment (relevance, coherence, helpfulness)
- Human evaluation: Final validation for subjective criteria (preference, creativity)
For detailed generation evaluation patterns, see references/evaluation-types.md.
Quick Reference Tables
Evaluation Framework Selection
| If Task Is... | Use This Framework | Primary Metric |
|---|---|---|
| RAG system | RAGAS | Faithfulness > 0.8 |
| Classification | scikit-learn metrics | Accuracy, F1 |
| Generation quality | LLM-as-judge | Quality rubric (1-5) |
| Code generation | HumanEval | Pass@1, Test pass rate |
| Model comparison | Benchmark testing | MMLU, HellaSwag scores |
| Safety validation | Hallucination detection | Faithfulness, Fact-check |
| Production monitoring | Online evaluation | User feedback, Business KPIs |
Python Library Recommendations
| Library | Use Case | Installation |
|---|---|---|
| RAGAS | RAG evaluation | pip install ragas |
| DeepEval | General LLM evaluation, pytest integration | pip install deepeval |
| LangSmith | Production monitoring, A/B testing | pip install langsmith |
| lm-eval | Benchmark testing (MMLU, HumanEval) | pip install lm-eval |
| scikit-learn | Classification metrics | pip install scikit-learn |
Safety Evaluation Priority Matrix
| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
|---|---|---|---|---|
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual Accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/Fluency, 2. Content Policy |
| Code Generation | Medium | Low | Low | 1. Functional Correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False Positives/Negatives |
Detailed References
For comprehensive documentation on specific topics:
- Evaluation types (classification, generation, QA, code): references/evaluation-types.md
- RAG evaluation deep dive (RAGAS framework): references/rag-evaluation.md
- Safety evaluation (hallucination, bias, toxicity): references/safety-evaluation.md
- Benchmark testing (MMLU, HumanEval, domain benchmarks): references/benchmarks.md
- LLM-as-judge best practices and prompts: references/llm-as-judge.md
- Production evaluation (A/B testing, monitoring): references/production-evaluation.md
- All metrics definitions and formulas: references/metrics-reference.md
Working Examples
Python Examples:
- examples/python/unit_evaluation.py - Basic prompt testing with pytest
- examples/python/ragas_example.py - RAGAS RAG evaluation
- examples/python/deepeval_example.py - DeepEval framework usage
- examples/python/llm_as_judge.py - GPT-4 as evaluator
- examples/python/classification_metrics.py - Accuracy, precision, recall
- examples/python/benchmark_testing.py - HumanEval example
TypeScript Examples:
- examples/typescript/unit-evaluation.ts - Vitest + OpenAI
- examples/typescript/llm-as-judge.ts - GPT-4 evaluation
- examples/typescript/langsmith-integration.ts - Production monitoring
Executable Scripts
Run evaluations without loading code into context (token-free):
- scripts/run_ragas_eval.py - Run RAGAS evaluation on dataset
- scripts/compare_models.py - A/B test two models
- scripts/benchmark_runner.py - Run MMLU/HumanEval benchmarks
- scripts/hallucination_checker.py - Detect hallucinations in outputs
Example usage:
```bash
# Run RAGAS evaluation on custom dataset
python scripts/run_ragas_eval.py --dataset data/qa_dataset.json --output results.json

# Compare GPT-4 vs Claude on benchmark
python scripts/compare_models.py --model-a gpt-4 --model-b claude-3-opus --tasks mmlu,humaneval
```
Integration with Other Skills
Related Skills:
- building-ai-chat: Evaluate AI chat applications (this skill tests what that skill builds)
- prompt-engineering: Test prompt quality and effectiveness
- testing-strategies: Apply testing pyramid to LLM evaluation (unit → integration → E2E)
- observability: Production monitoring and alerting for LLM quality
- building-ci-pipelines: Integrate LLM evaluation into CI/CD
Workflow Integration:
1. Write prompt (use prompt-engineering skill)
2. Unit test prompt (use llm-evaluation skill)
3. Build AI feature (use building-ai-chat skill)
4. Integration test RAG pipeline (use llm-evaluation skill)
5. Deploy to production (use deploying-applications skill)
6. Monitor quality (use llm-evaluation + observability skills)
Common Pitfalls
1. Over-reliance on Automated Metrics for Generation
- BLEU/ROUGE correlate weakly with human judgment for creative text
- Solution: Layer LLM-as-judge or human evaluation
2. Ignoring Faithfulness in RAG Systems
- Hallucinations are the #1 RAG failure mode
- Solution: Prioritize faithfulness metric (target > 0.8)
3. No Production Monitoring
- Models can degrade over time, prompts can break with updates
- Solution: Set up continuous evaluation (LangSmith, custom monitoring)
4. Biased LLM-as-Judge Evaluation
- Evaluator LLMs have biases (position bias, verbosity bias)
- Solution: Average multiple evaluations, use diverse evaluation prompts
5. Insufficient Benchmark Coverage
- Single benchmark doesn't capture full model capability
- Solution: Use 3-5 benchmarks across different domains
6. Missing Safety Evaluation
- Production LLMs can generate harmful content
- Solution: Add toxicity, bias, and hallucination checks to evaluation pipeline