typescript-sdk
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Browse and install thousands of AI Agent skills in the Killer-Skills directory. Supports Claude Code, Windsurf, Cursor, and more.
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from the Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
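A minimal sketch of what writing one such eval result into a card's model-index block can look like, using huggingface_hub's EvalResult and ModelCardData helpers; the repo id, dataset, metric, and score below are placeholders, and the skill's own workflow may differ.

```python
# Attach a single evaluation result to a model card's model-index metadata.
from huggingface_hub import EvalResult, ModelCard, ModelCardData

card_data = ModelCardData(
    model_name="my-org/my-model",      # hypothetical repo id
    eval_results=[
        EvalResult(
            task_type="text-generation",
            dataset_type="cais/mmlu",   # dataset id on the Hub
            dataset_name="MMLU",
            metric_type="accuracy",
            metric_value=0.671,         # placeholder score
        )
    ],
)

# Render a card whose YAML front matter carries the model-index block.
card = ModelCard.from_template(card_data)
print(card.data.to_yaml())
```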
agent-evaluation is an LLM-as-judge evaluation framework that assesses AI-generated content quality using a weighted composite score and a structured verdict with evidence citations.
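A minimal sketch of the weighted-composite idea: per-criterion judge scores in [0, 1] are combined by weight into one score and reported alongside cited evidence. The criterion names, weights, and pass threshold below are hypothetical, not the skill's actual rubric.

```python
from dataclasses import dataclass

@dataclass
class CriterionScore:
    name: str
    score: float      # judge score in [0, 1]
    weight: float
    evidence: str     # quoted span from the evaluated output

def composite(scores: list[CriterionScore]) -> float:
    # Weighted average of per-criterion scores.
    total_weight = sum(s.weight for s in scores)
    return sum(s.score * s.weight for s in scores) / total_weight

scores = [
    CriterionScore("accuracy", 0.9, 0.5, "cites the correct release version"),
    CriterionScore("completeness", 0.7, 0.3, "omits the rate-limit caveat"),
    CriterionScore("clarity", 0.8, 0.2, "one idea per sentence"),
]

verdict = {
    "composite_score": round(composite(scores), 3),
    "verdict": "pass" if composite(scores) >= 0.75 else "fail",   # hypothetical threshold
    "evidence": [s.evidence for s in scores],
}
print(verdict)
```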
Evaluation is the process of assessing agent systems; it requires approaches that account for dynamic decision-making and non-deterministic behavior.
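One common way to handle that non-determinism is to score repeated runs of the same task rather than a single trial; a minimal sketch, with run_agent and check_success as hypothetical stand-ins for an agent invocation and a task-specific correctness check.

```python
import random
import statistics

def run_agent(task: str) -> str:
    # Hypothetical stand-in for a non-deterministic agent run.
    return random.choice(["answer: 42", "answer: unsure"])

def check_success(task: str, transcript: str) -> bool:
    # Hypothetical task-specific correctness check.
    return "42" in transcript

def evaluate_task(task: str, trials: int = 5) -> float:
    # Pass rate over repeated trials instead of a single-run verdict.
    outcomes = [1.0 if check_success(task, run_agent(task)) else 0.0
                for _ in range(trials)]
    return statistics.mean(outcomes)

print(evaluate_task("What is 6 * 7?"))
```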
The platform for LLM evaluations and AI agent testing
deep-research is a skill that utilizes firecrawl and exa MCPs to synthesize findings from multiple sources, delivering comprehensive reports with source attribution.
Running UK AISI's Inspect in the Cloud
huggingface-community-evals is a skill for running local evaluations of Hugging Face models using inspect-ai and lighteval.
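A minimal sketch of a local inspect-ai task; the sample, scorer, and the hf/ model id are placeholders, and the skill's own tasks and lighteval configuration will differ.

```python
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def capital_cities():
    # A single-sample toy dataset; real tasks load full benchmark datasets.
    return Task(
        dataset=[Sample(input="What is the capital of France? Answer in one word.",
                        target="Paris")],
        solver=generate(),
        scorer=match(),
    )

if __name__ == "__main__":
    # The "hf/<repo-id>" provider runs the model locally via transformers;
    # the repo id here is a placeholder.
    eval(capital_cities(), model="hf/Qwen/Qwen2.5-0.5B-Instruct")
```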
api-rules is a Python-based skill for evaluating different Large Language Models (LLMs) using the OpenAI API and supporting libraries.
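A minimal sketch of an API-driven comparison with the OpenAI Python client: the same prompt is sent to two models and graded with a simple string check. The model ids and grading rule are placeholders, not the skill's actual rules.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "What is the capital of Australia? Answer with one word."
EXPECTED = "canberra"

for model in ["gpt-4o-mini", "gpt-4o"]:   # placeholder model ids
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    print(f"{model}: {'pass' if EXPECTED in answer else 'fail'} ({answer})")
```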
Multi-LLM comparison and evaluation framework for coaching scenarios
Research and evaluation framework for AI-powered telemetry instrumentation agents
sibyl-supervisor is a fully autonomous AI research system with self-evolution capabilities, designed to automate AI coding tasks on Claude Code.