vllm-ascend-model-adapter

v1.0.0

About this Skill

vllm-ascend-model-adapter is a skill that adapts Hugging Face or local models to run on vllm-ascend, supporting both existing and new model architectures. It is aimed at AI model agents that need seamless integration with vllm-ascend.

Features

Adapts Hugging Face models to run on vllm-ascend with minimal changes
Supports deterministic validation for reliable results
Enables single-commit delivery for a streamlined workflow
Compatible with both already-supported models and new architectures
Provides troubleshooting guidance through references/troubleshooting.md
Includes a feature-first checklist in references/multimodal-ep-aclgraph-lessons.md

By nv-action · Updated: 3/6/2026

Agent Capability Analysis

The vllm-ascend-model-adapter skill by nv-action is an open-source community AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance.

Ideal Agent Persona

Suited to AI model agents that need seamless integration with vllm-ascend for Hugging Face or local models.

Core Value

Empowers agents to adapt Hugging Face or local models to vllm-ascend with minimal changes, using deterministic validation and single-commit delivery for both already-supported and new architectures.

Capabilities Granted for vllm-ascend-model-adapter

Adapting Hugging Face models for vllm-ascend
Enabling deterministic validation for local models
Delivering new model architectures with single-commit efficiency

Prerequisites & Limits

  • Requires vllm-ascend compatibility
  • Limited to Hugging Face or local models

SKILL.md

vLLM Ascend Model Adapter

Overview

Adapt Hugging Face or local models to run on vllm-ascend with minimal changes, deterministic validation, and single-commit delivery. This skill is for both already-supported models and new architectures not yet registered in vLLM.

Read order

  1. Start with references/workflow-checklist.md.
  2. Read references/multimodal-ep-aclgraph-lessons.md (feature-first checklist).
  3. If startup/inference fails, read references/troubleshooting.md.
  4. If checkpoint is fp8-on-NPU, read references/fp8-on-npu-lessons.md.
  5. Before handoff, read references/deliverables.md.

Hard constraints

  • Never upgrade transformers.
  • Primary implementation roots are fixed by Dockerfile:
    • /vllm-workspace/vllm
    • /vllm-workspace/vllm-ascend
  • Start vllm serve from /workspace with a direct command by default.
  • Default API port is 8000 unless user explicitly asks otherwise.
  • Feature-first default: make a best effort to validate ACLGraph / EP / flashcomm1 / MTP / multimodal out of the box.
  • --enable-expert-parallel and flashcomm1 checks are MoE-only; for non-MoE models mark as not-applicable with evidence.
  • If any feature cannot be enabled, keep evidence and explain reason in final report.
  • Do not rely on PYTHONPATH=<modified-src>:$PYTHONPATH unless debugging fallback is strictly needed.
  • Keep code changes minimal and focused on the target model.
  • Final deliverable commit must be one single signed commit in the current working repo (git commit -sm ...).
  • Keep final docs in Chinese and compact.
  • Dummy-first is encouraged for speed, but dummy is NOT fully equivalent to real weights.
  • Never sign off adaptation using dummy-only evidence; real-weight gate is mandatory.

Execution playbook

1) Collect context

  • Confirm model path (default /models/<model-name>; if environment differs, confirm with user explicitly).
  • Confirm implementation roots (/vllm-workspace/vllm, /vllm-workspace/vllm-ascend).
  • Confirm delivery root (the current git repo where the final commit is expected).
  • Confirm runtime import path points to /vllm-workspace/* install.
  • Use default expected feature set: ACLGraph + EP + flashcomm1 + MTP + multimodal (if model has VL capability).
  • User requirements extend this baseline, not replace it.
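
The extend-not-replace rule for the expected feature set can be sketched as a small set merge. This is an illustrative helper, not part of the skill itself; the feature names mirror the baseline above.

```python
# Default baseline from step 1; "multimodal" is added only for VL-capable
# models. User-requested features extend this set, never replace it.
DEFAULT_FEATURES = {"aclgraph", "ep", "flashcomm1", "mtp"}

def expected_features(user_requested, has_vl_capability):
    """Return the full feature set this adaptation must attempt to validate."""
    features = set(DEFAULT_FEATURES)
    if has_vl_capability:
        features.add("multimodal")   # only expected when the model has VL capability
    features |= set(user_requested)  # user requirements extend the baseline
    return features
```

For example, a user asking only for prefix caching on a text-only model still gets the full ACLGraph/EP/flashcomm1/MTP baseline plus their request.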

2) Analyze model first

  • Inspect config.json, processor files, modeling files, tokenizer files.
  • Identify architecture class, attention variant, quantization type, and multimodal requirements.
  • Check state-dict key prefixes (and safetensors index) to infer mapping needs.
  • Decide whether support already exists in vllm/model_executor/models/registry.py.
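
The checks above can be sketched as a small summary function over the parsed config.json and the safetensors index's weight_map. The checkpoint layout in the example is illustrative, not a real model.

```python
from collections import Counter

def summarize_checkpoint(config: dict, weight_map: dict) -> dict:
    """Summarize what step 2 inspects: architecture class, quantization
    method, and top-level state-dict key prefixes (from the safetensors
    index) that drive weight-mapping decisions."""
    prefixes = Counter(key.split(".")[0] for key in weight_map)
    return {
        "architectures": config.get("architectures", []),
        "quant_method": (config.get("quantization_config") or {}).get("quant_method"),
        "key_prefixes": dict(prefixes),
    }

# Illustrative checkpoint layout (not a real model):
example = summarize_checkpoint(
    {"architectures": ["MyModelForCausalLM"],
     "quantization_config": {"quant_method": "fp8"}},
    {"model.embed_tokens.weight": "a.safetensors",
     "model.layers.0.self_attn.q_proj.weight": "a.safetensors",
     "lm_head.weight": "b.safetensors"},
)
```

The `architectures` entry is then checked against vllm/model_executor/models/registry.py to decide between reuse and native implementation.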

3) Choose adaptation strategy (new-model capable)

  • Reuse existing vLLM architecture if compatible.
  • If architecture is missing or incompatible, implement native support:
    • add model adapter under vllm/model_executor/models/;
    • add processor under vllm/transformers_utils/processors/ when needed;
    • register architecture in vllm/model_executor/models/registry.py;
    • implement explicit weight loading/remap rules (including fp8 scale pairing, KV/QK norm sharding, rope variants).
  • If remote code needs newer transformers symbols, do not upgrade dependency.
  • If unavoidable, copy required modeling files from sibling transformers source and keep scope explicit.
  • If failure is backend-specific (kernel/op/platform), patch minimal required code in /vllm-workspace/vllm-ascend.
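
An explicit, auditable remap table is what "implement explicit weight loading/remap rules" means in practice. The sketch below shows the fp8 scale-pairing case; the suffix spellings are illustrative, not vLLM's actual loader API.

```python
# fp8 checkpoints often store a scale tensor next to each quantized weight;
# the loader must pair the two names explicitly. Keeping the rules in one
# table makes every mapping decision auditable. Suffixes are illustrative.
SUFFIX_REMAP = {
    ".weight_scale_inv": ".weight_scale",
}

def remap_key(ckpt_key: str) -> str:
    """Map a checkpoint key to its vLLM-side name; unmatched keys pass
    through unchanged."""
    for old, new in SUFFIX_REMAP.items():
        if ckpt_key.endswith(old):
            return ckpt_key[: -len(old)] + new
    return ckpt_key
```

KV/QK norm sharding and rope variants would get their own rules in the same table rather than ad-hoc string surgery scattered through the loader.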

4) Implement minimal code changes (in implementation roots)

  • Touch only files required for this model adaptation.
  • Keep weight mapping explicit and auditable.
  • Avoid unrelated refactors.

5) Two-stage validation on Ascend (direct run)

Stage A: dummy-weight fast pass

  • Run from /workspace with --load-format dummy.
  • Goal: fast validate architecture path / operator path / API path.
  • Do not treat Application startup complete as pass by itself; request smoke is mandatory.
  • Require at least:
    • startup readiness (/v1/models 200),
    • one text request 200,
    • if VL model, one text+image request 200,
    • ACLGraph evidence where expected.
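
The text and text+image smoke requests above can be sketched as minimal OpenAI-compatible payloads (POSTed to /v1/chat/completions on port 8000, after GET /v1/models returns 200). The prompt strings and model name are placeholders.

```python
def text_request(model: str) -> dict:
    """Minimal text smoke request for Stage A."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Say hi in one word."}],
        "max_tokens": 8,
    }

def text_image_request(model: str, image_url: str) -> dict:
    """Minimal text+image smoke request for VL models, using the standard
    OpenAI-compatible content-parts shape."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": 32,
    }
```

Both requests must return HTTP 200; for VL models the image request is mandatory, not optional.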

Stage B: real-weight mandatory gate (must pass before sign-off)

  • Remove --load-format dummy and validate with real checkpoint.
  • Goal: validate real-only risks:
    • weight key mapping,
    • fp8/fp4 dequantization path,
    • KV/QK norm sharding with real tensor shapes,
    • load-time/runtime stability.
  • Require HTTP 200 and non-empty output before declaring success.
  • Do not pass Stage B on startup-only evidence.
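
The Stage B sign-off criterion (HTTP 200 plus non-empty generated text) can be written down as a single predicate. This is a sketch of the gate, not a real client; the response shape is the standard chat-completions body.

```python
def passes_real_weight_gate(status_code: int, response_body: dict) -> bool:
    """True only for HTTP 200 with a non-empty generated message.
    Startup-only evidence can never satisfy this check."""
    if status_code != 200:
        return False
    choices = response_body.get("choices", [])
    if not choices:
        return False
    text = choices[0].get("message", {}).get("content", "")
    return bool(text and text.strip())
```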

6) Validate inference and features

  • Send GET /v1/models first.
  • Send at least one OpenAI-compatible text request.
  • For multimodal models, require at least one text+image request.
  • Validate architecture registration and loader path with logs (no unresolved architecture, no fatal missing-key errors).
  • Try feature-first validation: EP + ACLGraph path first; eager path as fallback/isolation.
  • If startup succeeds but first request crashes (false-ready), treat as runtime failure and continue root-cause isolation.
  • For torch._dynamo + interpolate + NPU contiguous failures on VL paths, try TORCHDYNAMO_DISABLE=1 as diagnostic/stability fallback.
  • For multimodal processor API mismatch (for example skip_tensor_conversion signature mismatch), use text-only isolation (--limit-mm-per-prompt set image/video/audio to 0) to separate processor issues from core weight loading issues.
  • Capacity baseline by default (single machine): max-model-len=128k + max-num-seqs=16.
  • Then expand concurrency (e.g., 32/64) if requested or feasible.
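
Two rules from this playbook can be encoded in how the direct serve command is assembled: the 128k + bs16 capacity baseline and the MoE-only gating of --enable-expert-parallel. The helper below is illustrative; flag spellings follow vLLM's CLI, but treat the exact set as an assumption to verify against `vllm serve --help`.

```python
def build_serve_argv(model_path: str, is_moe: bool,
                     max_model_len: int = 131072, max_num_seqs: int = 16):
    """Assemble the direct `vllm serve` command run from /workspace."""
    argv = [
        "vllm", "serve", model_path,
        "--port", "8000",                       # default API port
        "--max-model-len", str(max_model_len),  # 128k capacity baseline
        "--max-num-seqs", str(max_num_seqs),    # bs16 capacity baseline
    ]
    if is_moe:
        # MoE-only: for non-MoE models this check is marked not-applicable.
        argv.append("--enable-expert-parallel")
    return argv
```

Expanding concurrency to 32/64 is then just a change of `max_num_seqs` after the baseline passes.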

7) Backport, generate artifacts, and commit in delivery repo

  • If implementation happened in /vllm-workspace/*, backport minimal final diff to current working repo.
  • Generate test config YAML at tests/e2e/models/configs/<ModelName>.yaml following the schema of existing configs (must include model_name, hardware, tasks with accuracy metrics, and num_fewshot). Use accuracy results from evaluation to populate metric values.
  • Generate tutorial markdown at docs/source/tutorials/models/<ModelName>.md following the standard template (Introduction, Supported Features, Environment Preparation with docker tabs, Deployment with serve script, Functional Verification with curl example, Accuracy Evaluation, Performance). Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, and accuracy table.
  • Update docs/source/tutorials/models/index.md to include the new tutorial.
  • Confirm test config YAML and tutorial doc are included in the staged files.
  • Commit code changes once (single signed commit).
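
A skeleton for the e2e test config can be rendered from the schema fields this step names (model_name, hardware, tasks with accuracy metrics, num_fewshot). The field layout below is a hypothetical sketch and must be checked against the existing configs in tests/e2e/models/configs/ before committing.

```python
def render_test_config(model_name: str, hardware: str, task: str,
                       metric: str, value: float, num_fewshot: int) -> str:
    """Render a minimal YAML test config with the four required schema keys.
    Metric values come from the actual accuracy evaluation."""
    return (
        f"model_name: {model_name}\n"
        f"hardware: {hardware}\n"
        "tasks:\n"
        f"  - name: {task}\n"
        "    metrics:\n"
        f"      - name: {metric}\n"
        f"        value: {value}\n"
        f"num_fewshot: {num_fewshot}\n"
    )
```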

8) Prepare handoff artifacts

  • Write comprehensive Chinese analysis report.
  • Write compact Chinese runbook for server startup and validation commands.
  • Include feature status matrix (supported / unsupported / checkpoint-missing / not-applicable).
  • Include dummy-vs-real validation matrix and explicit non-equivalence notes.
  • Include changed-file list, key logs, and final commit hash.
  • Post the SKILL.md content (or a link to it) as a comment on the originating GitHub issue to document the AI-assisted workflow.
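
The feature status matrix can be generated with its status vocabulary enforced; the four statuses are the ones this skill names, while the renderer and example rows are illustrative.

```python
ALLOWED_STATUSES = {"supported", "unsupported", "checkpoint-missing", "not-applicable"}

def feature_matrix(rows):
    """rows: list of (feature, status, evidence) tuples. Validates the
    status vocabulary and renders a compact markdown table for the report."""
    lines = ["| feature | status | evidence |", "| --- | --- | --- |"]
    for feature, status, evidence in rows:
        if status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {status}")
        lines.append(f"| {feature} | {status} | {evidence} |")
    return "\n".join(lines)
```

For a non-MoE model, the EP and flashcomm1 rows would carry `not-applicable` with the evidence pointing at the architecture analysis.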

Quality gate before final answer

  • Service starts successfully from /workspace with direct command.
  • OpenAI-compatible inference request succeeds (not startup-only).
  • Key feature set is attempted and reported: ACLGraph / EP / flashcomm1 / MTP / multimodal.
  • Capacity baseline (128k + bs16) result is reported, or explicit reason why not feasible.
  • Dummy stage evidence is present (if used), and real-weight stage evidence is present (mandatory).
  • Test config YAML exists at tests/e2e/models/configs/<ModelName>.yaml and follows the established schema (model_name, hardware, tasks, num_fewshot).
  • Tutorial doc exists at docs/source/tutorials/models/<ModelName>.md and follows the standard template (Introduction, Supported Features, Environment Preparation, Deployment, Functional Verification, Accuracy Evaluation, Performance).
  • Tutorial index at docs/source/tutorials/models/index.md includes the new model entry.
  • Exactly one signed commit contains all code changes in current working repo.
  • Final response includes commit hash, file paths, key commands, known limits, and failure reasons where applicable.

FAQ & Installation Steps


Frequently Asked Questions

What is vllm-ascend-model-adapter?

vllm-ascend-model-adapter is a skill that adapts Hugging Face or local models to run on vllm-ascend, supporting both existing and new model architectures.

How do I install vllm-ascend-model-adapter?

Run the command: npx killer-skills add nv-action/vllm-benchmarks. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for vllm-ascend-model-adapter?

Key use cases include: Adapting Hugging Face models for vllm-ascend, Enabling deterministic validation for local models, Delivering new model architectures with single-commit efficiency.

Which IDEs are compatible with vllm-ascend-model-adapter?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for vllm-ascend-model-adapter?

Requires vllm-ascend compatibility. Limited to Hugging Face or local models.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add nv-action/vllm-benchmarks. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use vllm-ascend-model-adapter immediately in the current project.
