hugging-face-datasets — official Hugging Face skill for Claude Code, Cursor, and Windsurf

Verified
v1.0.0
GitHub

About this Skill

Perfect for data-science agents that need advanced dataset management and SQL querying on the Hugging Face Hub. Create and manage datasets on the Hub: initialize repos, define configs and system prompts, stream row updates, and query or transform datasets with SQL. Designed to work alongside the HF MCP server for comprehensive dataset workflows.

huggingface
Updated: 3/5/2026

Agent Capability Analysis

The hugging-face-datasets skill by huggingface is an official, open-source AI agent skill for Claude Code and other IDE workflows that helps agents execute tasks with better context, repeatability, and domain-specific guidance.

Ideal Agent Persona

Perfect for Data Science Agents needing advanced dataset management and SQL querying capabilities on Hugging Face Hub.

Core Value

Empowers agents to create and manage datasets on Hugging Face Hub, supporting initializing repos, defining configs, streaming row updates, and SQL-based dataset querying and transformation, all while complementing the HF MCP server for comprehensive dataset workflows.

Capabilities Granted for hugging-face-datasets

Creating and configuring datasets for machine learning model training
Streaming row updates for real-time data reflection
Querying and transforming datasets using SQL for data analysis and visualization

Prerequisites & Limits

  • Requires Hugging Face Hub account and HF MCP server setup
  • Designed specifically for Hugging Face datasets and workflows

hugging-face-datasets

Install hugging-face-datasets with one-command setup. Works with Claude Code, Cursor, and Windsurf.

SKILL.md

Overview

This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.

Integration with HF MCP Server

  • Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
  • Use This Skill for: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting

Version

2.1.0

Dependencies

This skill uses PEP 723 scripts with inline dependency management. Scripts auto-install their requirements when run with: uv run scripts/script_name.py

  • uv (Python package manager)
  • Getting Started: see "Usage Instructions" below for PEP 723 usage
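For reference, a PEP 723 header is a comment block at the top of each script that declares its dependencies inline; a minimal sketch (the listed packages are illustrative, not necessarily the skill's actual requirements):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "duckdb",
#     "huggingface-hub",
# ]
# ///
"""When invoked via `uv run`, uv reads the comment block above, builds an
isolated environment with those dependencies, then executes the script."""


def main() -> None:
    print("dependencies are resolved by uv at run time")


if __name__ == "__main__":
    main()
```

Because the metadata lives in the file itself, no requirements.txt or manual virtualenv setup is needed; `uv run scripts/script_name.py` is the whole workflow.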

Core Capabilities

1. Dataset Lifecycle Management

  • Initialize: Create new dataset repositories with proper structure
  • Configure: Store detailed configuration including system prompts and metadata
  • Stream Updates: Add rows efficiently without downloading entire datasets

2. SQL-Based Dataset Querying (NEW)

Query any Hugging Face dataset using DuckDB SQL via scripts/sql_manager.py:

  • Direct Queries: Run SQL on datasets using the hf:// protocol
  • Schema Discovery: Describe dataset structure and column types
  • Data Sampling: Get random samples for exploration
  • Aggregations: Count, histogram, unique values analysis
  • Transformations: Filter, join, reshape data with SQL
  • Export & Push: Save results locally or push to new Hub repos

3. Multi-Format Dataset Support

Supports diverse dataset types through template system:

  • Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
  • Text Classification: Sentiment analysis, intent detection, topic classification
  • Question-Answering: Reading comprehension, factual QA, knowledge bases
  • Text Completion: Language modeling, code completion, creative writing
  • Tabular Data: Structured data for regression/classification tasks
  • Custom Formats: Flexible schema definition for specialized needs

4. Quality Assurance Features

  • JSON Validation: Ensures data integrity during uploads
  • Batch Processing: Efficient handling of large datasets
  • Error Recovery: Graceful handling of upload failures and conflicts
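As a rough illustration of the JSON validation step (the helper name and exact checks are hypothetical; the field names follow the qa template shown later):

```python
import json

REQUIRED_QA_FIELDS = {"question", "answer"}  # minimal fields for the qa template


def validate_rows(rows_json: str) -> list[dict]:
    """Hypothetical pre-upload check: parse the JSON and verify required fields."""
    rows = json.loads(rows_json)  # raises on malformed JSON, with parsing details
    if not isinstance(rows, list):
        raise ValueError("rows_json must be a JSON array of row objects")
    for i, row in enumerate(rows):
        missing = REQUIRED_QA_FIELDS - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing fields: {sorted(missing)}")
    return rows


rows = validate_rows('[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]')
print(len(rows))  # 1
```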

Usage Instructions

The skill includes two Python scripts that use PEP 723 inline dependency management:

All paths are relative to the directory containing this SKILL.md file. Scripts are run with: uv run scripts/script_name.py [arguments]

  • scripts/dataset_manager.py - Dataset creation and management
  • scripts/sql_manager.py - SQL-based dataset querying and transformation

Prerequisites

  • uv package manager installed
  • HF_TOKEN environment variable must be set with a Write-access token

SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The hf:// protocol provides direct access to any public dataset (or private with token).

Quick Start

```bash
# Query a dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
uv run scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with filter
uv run scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```

SQL Query Syntax

Use data as the table name in your SQL; it is replaced with the actual hf:// path:

```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
SELECT question, choices[answer] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```
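Conceptually, the substitution is just a rewrite of the data placeholder into the full hf:// Parquet path before the SQL reaches DuckDB; a rough sketch, not the script's actual implementation:

```python
import re


def resolve_sql(sql: str, dataset: str, config: str = "default", split: str = "train") -> str:
    """Rewrite the `data` placeholder into the dataset's auto-converted Parquet path."""
    path = f"hf://datasets/{dataset}@~parquet/{config}/{split}/*.parquet"
    return re.sub(r"\bdata\b", f"'{path}'", sql)


print(resolve_sql("SELECT * FROM data LIMIT 10", "cais/mmlu"))
# SELECT * FROM 'hf://datasets/cais/mmlu@~parquet/default/train/*.parquet' LIMIT 10
```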

Common Operations

1. Explore Dataset Structure

```bash
# Get schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in column
uv run scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```

2. Filter and Transform

```bash
# Complex filtering with SQL
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using transform command
uv run scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```

3. Create Subsets and Push to Hub

```bash
# Query and push to new dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
uv run scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```

4. Export to Local Files

```bash
# Export to Parquet
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```

5. Working with Dataset Configs/Splits

```bash
# Specify config (subset)
uv run scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify split
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```

6. Raw SQL with Full Paths

For complex queries or joining datasets:

```bash
uv run scripts/sql_manager.py raw --sql "
  SELECT a.*, b.*
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
  ON a.id = b.id
  LIMIT 100
"
```

Python API Usage

```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")

sql.close()
```

HF Path Format

DuckDB uses the hf:// protocol to access datasets:

hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet

Examples:

  • hf://datasets/cais/mmlu@~parquet/default/train/*.parquet
  • hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet

The @~parquet revision provides auto-converted Parquet files for any dataset format.

Useful DuckDB SQL Functions

```sql
-- String functions
LENGTH(column)                   -- String length
regexp_replace(col, '\n', '')    -- Regex replace
regexp_matches(col, 'pattern')   -- Regex match
LOWER(col), UPPER(col)           -- Case conversion

-- Array functions
choices[1]                       -- Array indexing (DuckDB lists are 1-based)
array_length(choices)            -- Array length
unnest(choices)                  -- Expand array to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                  -- Random sample
USING SAMPLE 10 (RESERVOIR, 42)  -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```

Dataset Creation (dataset_manager.py)

1. Discovery (Use HF MCP Server):

```python
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```

2. Creation (Use This Skill):

```bash
# Initialize new dataset
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with detailed system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```

3. Content Management (Use This Skill):

```bash
# Quick setup with any template
uv run scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```

Template-Based Data Structures

1. Chat Template (--template chat)

```json
{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
```

2. Classification Template (--template classification)

```json
{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
```

3. QA Template (--template qa)

```json
{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
```

4. Completion Template (--template completion)

```json
{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}
```

5. Tabular Template (--template tabular)

```json
{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
```

Advanced System Prompt Template

For high-quality training data generation:

```text
You are an AI assistant expert at using MCP tools effectively.

## MCP SERVER DEFINITIONS
[Define available servers and tools]

## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]

## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
```

Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage:

Available Example Sets:

  • training_examples.json - MCP tool usage examples (debugging, project setup, database analysis)
  • diverse_training_examples.json - Broader scenarios including:
    • Educational Chat - Explaining programming concepts, tutorials
    • Git Workflows - Feature branches, version control guidance
    • Code Analysis - Performance optimization, architecture review
    • Content Generation - Professional writing, creative brainstorming
    • Codebase Navigation - Legacy code exploration, systematic analysis
    • Conversational Support - Problem-solving, technical discussions

Using Different Example Sets:

```bash
# Add MCP-focused examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```
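If jq is not available, the same merge can be done with standard-library Python; a small sketch (file names as above):

```python
import json


def merge_example_sets(*paths: str) -> list[dict]:
    """Concatenate several JSON-array example files, like `jq -s '.[0] + .[1]'`."""
    merged: list[dict] = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    return merged


# merged = merge_example_sets(
#     "examples/training_examples.json",
#     "examples/diverse_training_examples.json",
# )
```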

Commands Reference

List Available Templates:

```bash
uv run scripts/dataset_manager.py list_templates
```

Quick Setup (Recommended):

```bash
uv run scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```

Manual Setup:

```bash
# Initialize repository
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```

View Dataset Statistics:

```bash
uv run scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```

Error Handling

  • Repository exists: Script will notify and continue with configuration
  • Invalid JSON: Clear error message with parsing details
  • Network issues: Automatic retry for transient failures
  • Token permissions: Validation before operations begin
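The retry behavior typically amounts to exponential backoff; a minimal sketch under assumed parameters (the error class, attempt count, and delays are illustrative, not taken from the scripts):

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure (e.g. a network timeout)."""


def with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
    """Call fn(), retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))


calls = {"n": 0}


def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated timeout")
    return "ok"


print(with_retries(flaky_upload))  # ok
```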

Combined Workflow Examples

Example 1: Create Training Subset from Existing Dataset

```bash
# 1. Explore the source dataset
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create subset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```

Example 2: Transform and Reshape Data

```bash
# Transform MMLU to QA format with correct answers extracted
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```

Example 3: Merge Multiple Dataset Splits

```bash
# Export multiple splits and combine
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```

Example 4: Quality Filtering

```bash
# Filter for high-quality examples
uv run scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```

Example 5: Create Custom Training Dataset

```bash
# 1. Query source data
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your pipeline (add answers, format, etc.)

# 3. Push processed data
uv run scripts/dataset_manager.py init --repo_id "username/nutrition-training"
uv run scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```

FAQ & Installation Steps


Frequently Asked Questions

What is hugging-face-datasets?

An official Hugging Face skill that lets agents create, manage, and SQL-query datasets on the Hugging Face Hub: initializing repos, defining configs and system prompts, streaming row updates, and transforming data with SQL. Designed to work alongside the HF MCP server.

How do I install hugging-face-datasets?

Run the command: npx killer-skills add huggingface/skills. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for hugging-face-datasets?

Key use cases include creating and configuring datasets for machine-learning model training, streaming row updates for real-time data reflection, and querying and transforming datasets with SQL for data analysis and visualization.

Which IDEs are compatible with hugging-face-datasets?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for hugging-face-datasets?

Requires Hugging Face Hub account and HF MCP server setup. Designed specifically for Hugging Face datasets and workflows.

How To Install

  1. Open your terminal

     Open the terminal or command line in your project directory.

  2. Run the install command

     Run: npx killer-skills add huggingface/skills. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

     The skill is now active. Your AI agent can use hugging-face-datasets immediately in the current project.

Related Skills

Looking for an alternative to hugging-face-datasets or another official skill for your workflow? Explore these related open-source skills.

  • flags (facebook): Use when you need to check feature flag states, compare channels, or debug why a feature behaves differently across release channels.
  • extract-errors (facebook): Use when adding new error messages to React, or seeing unknown error code warnings.
  • fix (facebook): Use when you have lint errors, formatting issues, or before committing code to ensure it passes CI.
  • flow (facebook): Use when you need to run Flow type checking, or when seeing Flow type errors in React code.