What is extracting-pdf-text?

extracting-pdf-text is an AI agent skill that provides tools and guidance for converting PDF content into text that large language models can consume. It offers a quick decision guide and Python scripts for different PDF types: simple text PDFs, PDFs with tables, and scanned documents.

How do I install extracting-pdf-text?

Run the command: npx killer-skills add miwtoo/credit-card-extraction. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for extracting-pdf-text?

Key use cases include extracting text from simple PDF documents with PyMuPDF, parsing tables from PDFs with pdfplumber, and converting scanned PDFs to text with pytesseract OCR.

Which IDEs are compatible with extracting-pdf-text?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for extracting-pdf-text?

The skill requires a Python environment and depends on compatible versions of its libraries (PyMuPDF, pdfplumber, pytesseract).

Extracting PDF Text for LLMs

Name: extracting-pdf-text
Author: miwtoo

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
| --- | --- | --- |
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | `pip install marker-pdf` |

Recommended Workflow

1. Try PyMuPDF first: it is the fastest option and handles most text-based PDFs well.
2. If tables are mangled, switch to pdfplumber.
3. If the PDF is scanned or image-based, use the Mistral OCR API (best accuracy) or local OCR (free but slower).
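
The workflow above can be sketched as a small decision function. This is only an illustration, not part of the skill: the function name, the `min_chars` threshold of 50, and the "more than half the pages are nearly empty" rule for spotting scanned PDFs are all assumptions.

```python
def choose_extractor(chars_per_page, tables_mangled=False,
                     api_key_available=False, min_chars=50):
    """Pick an extraction approach following the recommended workflow.

    chars_per_page: characters a text-layer extractor (e.g. PyMuPDF)
    returned for each page. min_chars is an assumed threshold below
    which a page is treated as having no usable text layer.
    """
    if not chars_per_page:
        raise ValueError("chars_per_page must contain at least one page")

    sparse_pages = sum(1 for n in chars_per_page if n < min_chars)

    # Mostly empty text layers suggest a scanned or image-based PDF.
    if sparse_pages > len(chars_per_page) / 2:
        return "mistral_ocr" if api_key_available else "local_ocr"

    # Text is present but tables came out wrong: switch tools.
    if tables_mangled:
        return "pdfplumber"

    return "pymupdf"
```

For example, `choose_extractor([1200, 900])` stays with PyMuPDF, while a document whose pages yield almost no characters falls through to one of the OCR options.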

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

bash
uv run scripts/extract_pymupdf.py input.pdf output.md

The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses pymupdf4llm which formats text for RAG systems.
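
For readers who want to call the library directly, the following is a minimal sketch of what such a script might do. It is not the contents of `scripts/extract_pymupdf.py`; the helper names `pdf_to_markdown` and `write_markdown` are hypothetical, and the only library call used is `pymupdf4llm.to_markdown`, which returns the document as a markdown string.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str) -> str:
    """Return the PDF's content as a markdown string via pymupdf4llm."""
    # Imported lazily so this module still loads where pymupdf4llm
    # is not installed.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path)


def write_markdown(markdown: str, out_path: str) -> int:
    """Trim leading/trailing blank space, write the file, return its length."""
    cleaned = markdown.strip() + "\n"
    Path(out_path).write_text(cleaned, encoding="utf-8")
    return len(cleaned)
```

A typical call would be `write_markdown(pdf_to_markdown("input.pdf"), "output.md")`.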

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

bash
uv run scripts/extract_pdfplumber.py input.pdf output.md

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
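
The table-to-markdown conversion can be sketched as follows. This is an assumed implementation, not the bundled script: `table_to_markdown` and `extract_tables_as_markdown` are hypothetical names, and the first row of each table is treated as its header, which may not hold for every PDF. pdfplumber's `page.extract_tables()` returns each table as a list of rows whose cells are strings or `None`.

```python
def table_to_markdown(rows):
    """Render one table (list of rows, cells may be None) as markdown."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Empty cells become "", newlines are flattened, pipes are escaped.
        cells = [(c or "").replace("\n", " ").replace("|", "\\|").strip()
                 for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows  # assumption: first row is the header
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path):
    """Return every table in the PDF as a markdown string."""
    import pdfplumber  # lazy import: only needed when a PDF is read

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(table_to_markdown(table))
    return tables
```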

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

bash
uv run scripts/extract_with_ocr.py input.pdf output.txt

Requires: pytesseract, pdf2image, and Tesseract installed (brew install tesseract on macOS).
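
A local OCR pass with these dependencies might look like the sketch below. It is an illustration rather than the bundled script: the function names, the 300 DPI default, and the page-marker format are assumptions. Note that pdf2image relies on the poppler utilities being installed in addition to Tesseract.

```python
def join_pages(page_texts):
    """Join per-page text with a visible page marker between pages."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- page {number} ---\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each page and run Tesseract on it."""
    # Lazy imports: both packages also need system binaries
    # (poppler for pdf2image, tesseract for pytesseract).
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(
        pytesseract.image_to_string(image, lang=lang) for image in images
    )
```

Keeping a page marker in the output is a design choice that lets later chunking steps retain page context; it can be dropped if plain text is preferred.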

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: ~1000 pages per dollar (very cost-effective)

bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md

Features:

Outputs clean markdown
Preserves document structure (headings, lists, tables)
Handles images, math equations, multilingual text
95%+ accuracy on complex documents

For detailed API options and other services, see references/api-services.md.
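
As a rough sketch of calling the service from Python, the example below sends a local PDF as a base64 data URL. Everything here is an assumption rather than the contents of `scripts/extract_mistral_ocr.py`: it presumes the `mistralai` client package, the `mistral-ocr-latest` model name, and the `document_url` request shape as published in Mistral's documentation at the time of writing, so check the current API reference before relying on it.

```python
import base64
import os


def pdf_data_url(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a base64 data URL."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


def mistral_ocr_to_markdown(pdf_path: str) -> str:
    """Run OCR on a local PDF and return the pages joined as markdown."""
    # Lazy import: assumes the `mistralai` package is installed.
    from mistralai import Mistral

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    with open(pdf_path, "rb") as fh:
        url = pdf_data_url(fh.read())

    # Model name and document shape are assumptions; see lead-in.
    response = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": url},
    )
    return "\n\n".join(page.markdown for page in response.pages)
```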

Output Format Recommendations

For LLM consumption, markdown is preferred:

Preserves semantic structure (headings become context boundaries)
Tables remain readable
Compatible with most RAG chunking strategies
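
To illustrate how headings can act as context boundaries, here is a minimal heading-based chunker. It is a sketch under simple assumptions (ATX-style `#` headings only, fenced code blocks left intact) and the function name is hypothetical; real RAG pipelines usually add size limits and overlap on top of this.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_by_headings(markdown):
    """Split markdown into chunks, starting a new one at every heading."""
    chunks = []
    title, lines = None, []

    def flush():
        body = "\n".join(lines).strip()
        if title is not None or body:
            chunks.append({"title": title, "text": body})

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside code fences are not headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as `title`, so the heading text can be prepended to the chunk when it is embedded or passed to a model.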

For detailed comparisons of local tools, see references/local-tools.md.
