book-sft-pipeline: converting eBooks to SFT datasets for Claude Code, Cursor, and Windsurf

v2.0.0

About this Skill

book-sft-pipeline is a complete system for converting books into SFT datasets and training style-transfer models, with support for text segmentation pipelines for long-form content. It is ideal for NLP agents that need advanced text analysis and style-transfer capabilities for literary works.

Features

Converts raw ePub files to SFT datasets
Trains style-transfer models for author-voice replication
Supports text segmentation pipelines for long-form content
Prepares training data for Tinker or similar SFT platforms
Enables building fine-tuning datasets from literary works
Creates author-voice or style-transfer models


Agent Capability Analysis

The book-sft-pipeline skill by goodnight000 is an open-source community AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance. It is optimized for converting eBooks to SFT datasets.

Ideal Agent Persona

Perfect for NLP agents that need advanced text analysis and style-transfer capabilities for literary works.

Core Value

Empowers agents to convert raw ePub files into SFT datasets and train style-transfer models, enabling author-voice models and text segmentation pipelines for long-form content.

Capabilities Granted for book-sft-pipeline

Building fine-tuning datasets from literary works
Creating author-voice or style-transfer models
Preparing training data for Tinker or similar SFT platforms

Prerequisites & Limits

  • Requires raw ePub files as input
  • Limited to training small models (8B parameters or less)
  • Specifically designed for SFT datasets and style-transfer models

book-sft-pipeline

Install book-sft-pipeline, an AI agent skill for agent workflows and automation. It works with Claude Code, Cursor, and Windsurf with one-command setup.

SKILL.md
Readonly

Book SFT Pipeline

A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.

When to Activate

Activate this skill when:

  • Building fine-tuning datasets from literary works
  • Creating author-voice or style-transfer models
  • Preparing training data for Tinker or similar SFT platforms
  • Designing text segmentation pipelines for long-form content
  • Training small models (8B or less) on limited data

Core Concepts

The Three Pillars of Book SFT

1. Intelligent Segmentation. Text chunks must be semantically coherent: breaking mid-sentence teaches the model to produce fragmented output. Target 150-400 words per chunk, always at natural boundaries.

2. Diverse Instruction Generation. Use multiple prompt templates and system prompts to prevent overfitting; a single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

3. Style Over Content. The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.

Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR AGENT                           │
│  Coordinates pipeline phases, manages state, handles failures   │
└──────────────────────┬──────────────────────────────────────────┘
                       │
       ┌───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │  INSTRUCTION │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │ Chunks →     │ │ Pairs →      │
│              │ │ 150-400 words│ │ Prompts      │ │ JSONL        │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                       │
       ┌───────────────┴───────────────┐
       ▼                               ▼
┌──────────────┐               ┌──────────────┐
│   TRAINING   │               │  VALIDATION  │
│    AGENT     │               │    AGENT     │
│ LoRA on      │               │ AI detector  │
│ Tinker       │               │ Originality  │
└──────────────┘               └──────────────┘
```

Phase 1: Text Extraction

Critical Rules

  1. Always source ePub over PDF - OCR errors become learned patterns
  2. Use paragraph-level extraction - Extract from <p> tags to preserve breaks
  3. Remove front/back matter - Copyright and TOC pollute the dataset
```python
# Extract text from ePub paragraphs
from epub2 import EPub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = EPub(path)
    chapters = []
    for item in book.flow:
        html = book.get_chapter(item.id)
        soup = BeautifulSoup(html, 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(chapters)
```
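The extractor above keeps every chapter, so rule 3 (remove front/back matter) still needs enforcing on the chapters list before the final join. A minimal sketch, assuming boilerplate sections can be spotted by keyword; the patterns and word threshold here are illustrative, not part of the skill:

```python
import re

# Headings that usually mark front/back matter rather than story text.
# Illustrative patterns; tune them per book.
BOILERPLATE_PATTERNS = re.compile(
    r"\b(copyright|all rights reserved|table of contents|"
    r"acknowledg(e)?ments|about the author|index)\b",
    re.IGNORECASE,
)

def drop_front_back_matter(chapters, min_words=100):
    """Keep only chapters that look like narrative prose."""
    kept = []
    for chapter in chapters:
        head = chapter[:500]  # boilerplate markers tend to appear early
        if BOILERPLATE_PATTERNS.search(head):
            continue
        if len(chapter.split()) < min_words:  # stub pages, half-titles
            continue
        kept.append(chapter)
    return kept
```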

Phase 2: Intelligent Segmentation

Smaller Chunks + Overlap

Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650 words).

```python
def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0

    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words

    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks
```

Expected Results

For an 86,000-word book (sanity-checked in the sketch after this list):

  • Old method (250-650 words): ~150 chunks
  • New method (150-400 + overlap): ~300 chunks
  • With 2 variants per chunk: 600+ training examples
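As a quick sanity check on those counts, a minimal sketch; the averages are implied by the figures above, and the one-paragraph overlap slightly inflates the effective word total:

```python
book_words = 86_000

# Implied average realized chunk size under each method (overlap ignored):
print(book_words / 150)   # ~573 words/chunk with the 250-650 settings
print(book_words / 300)   # ~287 words/chunk with the 150-400 settings

# Two prompt variants per chunk:
print(2 * 300)            # 600 training examples
```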

Phase 3: Diverse Instruction Generation

The Key Insight

Using a single prompt template causes memorization. Diverse templates teach the underlying style.

```python
SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]
```

Instruction Generation

```python
INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.

Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash)
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
```
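`llm_call` is left abstract above. One possible implementation, sketched with the OpenAI Python SDK pointed at Gemini's OpenAI-compatible endpoint; the model name, endpoint, and temperature are assumptions, so check your provider's docs:

```python
from openai import OpenAI

# Assumption: Gemini via its OpenAI-compatible endpoint.
# Any chat-completions provider works the same way.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

def llm_call(prompt: str, model: str = "gemini-2.0-flash") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,  # low temperature: descriptions should stay factual
    )
    return response.choices[0].message.content.strip()
```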

Phase 4: Dataset Construction

Message Format

```json
{
  "messages": [
    {"role": "system", "content": "You are an expert creative writer..."},
    {"role": "user", "content": "Write in the style of Author: Scene description..."},
    {"role": "assistant", "content": "The actual book text from chunk..."}
  ]
}
```

Multiple Variants Per Chunk

```python
def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples
```
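A minimal writer sketch for the Tinker-compatible JSONL, which also reserves the 50-example test set called for in the Guidelines; the file names and seed are arbitrary:

```python
import json
import random

def write_dataset(examples, train_path="dataset.jsonl",
                  test_path="test.jsonl", test_size=50, seed=42):
    """Shuffle, hold out a test set, and write one JSON object per line."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    test, train = shuffled[:test_size], shuffled[test_size:]
    for path, rows in ((train_path, train), (test_path, test)):
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(train), len(test)
```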

Phase 5: LoRA Training on Tinker

Configuration

```python
CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                      # 352MB adapter
    "learning_rate": 5e-4,                # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}
```

Why Base Model?

Use base (pretrained) models, not instruction-tuned versions:

  • Base models are more malleable for new styles
  • Instruct models have patterns that resist overwriting
  • Style is a low-level pattern that base models capture better

Training Loop

```python
import tinker
from tinker import types

training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base",
    rank=32
)

for epoch in range(3):
    for batch in batches:
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")
```

Phase 6: Validation

Modern Scenario Test

Test with scenarios that couldn't exist in the original book:

```python
TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]
```

If the model applies style markers to modern scenarios, it learned style, not content.
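To make that judgment less subjective, surface statistics of the outputs can be compared against the source text. A minimal sketch; the two markers below are assumptions, chosen to suit authors with long, conjunction-heavy sentences:

```python
import re

def style_stats(text):
    """Crude style fingerprints: sentence length and conjunction rate."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    avg_len = len(words) / max(len(sentences), 1)
    # Polysyndeton ("and ... and ...") is a classic marker for some authors.
    and_rate = sum(w.lower() == "and" for w in words) / max(len(words), 1)
    return {"avg_sentence_len": avg_len, "and_rate": and_rate}

# Compare stats of model outputs on TEST_PROMPTS against stats of the book
# text. Similar numbers on modern scenarios suggest style, not content,
# was learned.
```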

Originality Verification

```bash
# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches
```
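grep only catches exact substrings; shorter lifted phrases are easier to flag with an n-gram overlap check. A minimal sketch, assuming the dataset format shown in Phase 4:

```python
import json

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output_text, dataset_path="dataset.jsonl", n=8):
    """Fraction of the output's 8-grams appearing verbatim in training data."""
    train_grams = set()
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            msgs = json.loads(line)["messages"]
            train_grams |= ngrams(msgs[-1]["content"], n)  # assistant turn
    out_grams = ngrams(output_text, n)
    if not out_grams:
        return 0.0
    return len(out_grams & train_grams) / len(out_grams)

# Near 0.0 is expected; a spike flags memorized passages.
```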

AI Detector Testing

Test outputs with GPTZero, Pangram, or ZeroGPT.

Known Issues and Solutions

Character Name Leakage

Symptom: Model uses original character names in new scenarios.
Cause: Limited name diversity from one book.
Solution: Train on multiple books or add synthetic examples.

Model Parrots Exact Phrases

Symptom: Outputs contain exact sentences from training data.
Cause: Too few prompt variations or too many epochs.
Solution: Use 15+ templates, limit to 3 epochs.

Fragmented Outputs

Symptom: Sentences feel incomplete.
Cause: Poor segmentation breaking mid-thought.
Solution: Always break at paragraph boundaries.

Guidelines

  1. Always source ePub over PDF - OCR errors become learned patterns
  2. Never break mid-sentence - Boundaries must be grammatically complete
  3. Use diverse prompts - 15+ templates, 5+ system prompts
  4. Use base models - Not instruct versions
  5. Use smaller chunks - 150-400 words for more examples
  6. Reserve test set - 50 examples minimum
  7. Test on modern scenarios - Proves style transfer vs memorization
  8. Verify originality - Grep training data for output phrases

Expected Results

| Metric | Value |
| --- | --- |
| Training examples | 500-1000 per book |
| Model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| Adapter size | ~350 MB |
| Training time | ~15 min |
| Loss reduction | 90%+ |
| Style transfer success | ~50% perfect |

Cost Estimate

| Component | Cost |
| --- | --- |
| LLM (instruction generation) | ~$0.50 |
| Tinker training (15 min) | ~$1.50 |
| Total | ~$2.00 |

Integration with Context Engineering Skills

This example applies several skills from the Agent Skills for Context Engineering collection:

project-development

The pipeline follows the staged, idempotent architecture pattern:

  • Acquire: Extract text from ePub
  • Prepare: Segment into training chunks
  • Process: Generate synthetic instructions
  • Parse: Build message format
  • Render: Output Tinker-compatible JSONL
  • Train: LoRA fine-tuning
  • Validate: Modern scenario testing

Each phase is resumable and produces intermediate artifacts for debugging.
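A minimal sketch of that resumability pattern, assuming phases that read and write text artifacts; the helper and paths are illustrative:

```python
from pathlib import Path

def run_phase(name, output_path, fn, *args):
    """Run a pipeline phase only if its artifact is missing (idempotent)."""
    out = Path(output_path)
    if out.exists():
        print(f"[skip] {name}: {out} already exists")
        return out.read_text(encoding="utf-8")
    result = fn(*args)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result, encoding="utf-8")
    return result

# Usage:
# text = run_phase("extract", "artifacts/book.txt", extract_epub, "book.epub")
```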

context-compression

Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity. Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.

The two-tier strategy mirrors context compression evaluation:

  • Tier 1: Fast, deterministic compression
  • Tier 2: LLM-assisted for edge cases

multi-agent-patterns

The pipeline uses the supervisor/orchestrator pattern:

  • Orchestrator coordinates phases and manages state
  • Specialized agents (Extraction, Segmentation, Instruction, Builder) have isolated contexts
  • Each agent receives only the information needed for its task

This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.

evaluation

Validation follows the end-state evaluation pattern:

  • Functional testing: Does output match expected style markers?
  • Originality verification: Is content genuinely generated?
  • External validation: AI detector scores

The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.

context-fundamentals

Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping. Diverse templates force attention across the style patterns themselves.

References

Internal references (related skills from Agent Skills for Context Engineering):

  • project-development - Pipeline architecture patterns
  • context-compression - Compression strategies
  • multi-agent-patterns - Agent coordination
  • evaluation - Evaluation frameworks
  • context-fundamentals - Attention and information density


Skill Metadata

Created: 2025-12-26
Last Updated: 2025-12-28
Author: Muratcan Koylan
Version: 2.0.0
Standalone: Yes (separate from main context-engineering collection)

FAQ & Installation Steps

These questions and steps mirror the structured data on this page for better search understanding.

Frequently Asked Questions

What is book-sft-pipeline?

book-sft-pipeline is a complete system for converting books into SFT datasets and training style-transfer models, with support for text segmentation pipelines for long-form content. It is ideal for NLP agents that need advanced text analysis and style-transfer capabilities for literary works.

How do I install book-sft-pipeline?

Run the command: npx killer-skills add goodnight000/KittyCourt. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for book-sft-pipeline?

Key use cases include: Building fine-tuning datasets from literary works, Creating author-voice or style-transfer models, Preparing training data for Tinker or similar SFT platforms.

Which IDEs are compatible with book-sft-pipeline?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for book-sft-pipeline?

Requires raw ePub files as input. Limited to training small models. Specifically designed for SFT datasets and style-transfer models.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add goodnight000/KittyCourt. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use book-sft-pipeline immediately in the current project.

Related Skills

Looking for an alternative to book-sft-pipeline or another community skill for your workflow? Explore these related open-source skills.

  • widget-generator (by f, f.k.a. Awesome ChatGPT Prompts)
  • flags (by vercel): a Next.js feature management skill that lets developers add or modify framework feature flags, streamlining React application development.
  • zustand (by lobehub)
  • data-fetching (by lobehub)