
Browse and install thousands of AI Agent skills in the Killer-Skills directory. Supports Claude Code, Windsurf, Cursor, and more.

22 available skills

typescript-sdk

comet-ml

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

17.8k
0
AI

hugging-face-evaluation

[ Official ]
huggingface

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.

8.2k
0
AI

agent-evaluation

oimiragieo

agent-evaluation is an LLM-as-judge evaluation framework that assesses the quality of AI-generated content using a weighted composite score and a structured verdict with evidence citations.

14
0
Developer

evaluation

mshraditya

Evaluation is the process of assessing agent systems, which requires approaches that account for dynamic decision-making and non-deterministic behavior.

0
0
Developer

e2e

langwatch

The platform for LLM evaluations and AI agent testing

2.8k
0
AI

deep-research

[ Featured ]
affaan-m

deep-research is a skill that uses the firecrawl and exa MCPs to synthesize findings from multiple sources, delivering comprehensive reports with source attribution.

105.8k
0
Developer

debug-stuck-eval

METR

Running UK AISI's Inspect in the Cloud

20
0
AI

huggingface-community-evals

[ Official ]
huggingface

huggingface-community-evals is a skill for running local evaluations of Hugging Face models using inspect-ai and lighteval.

10.0k
0
AI

api-rules

skysheng7

api-rules is a Python-based skill for evaluating different large language models (LLMs) via the OpenAI API and supporting libraries.

0
0
Developer

context-loader

miyataSUPER

Multi-LLM comparison and evaluation framework for coaching scenarios

0
0
Developer

prd-phase

wiggitywhitney

Research and evaluation framework for AI-powered telemetry instrumentation agents

0
0
Developer

sibyl-supervisor

Sibyl-Research-Team

sibyl-supervisor is a fully autonomous AI research system with self-evolution capabilities, designed to automate AI coding tasks on Claude Code.

150
0
Developer