KTransformers Documentation
KTransformers is a CPU-GPU heterogeneous computing project for large MoE model inference and LoRA fine-tuning. The documentation follows a task-first structure: install the right package path, choose inference or fine-tuning, run a model, then move into model-specific tutorials, technical background, hardware boundaries, and command references.
Current Public Surface
| Task | Public packages | Primary entry |
|---|---|---|
| Inference serving | kt-kernel sglang-kt | kt run or python -m sglang.launch_server with --kt-* arguments |
| LoRA SFT | ktransformers[sft] through LLaMA-Factory | LLaMA-Factory training YAML with use_kt: true and an Accelerate KT config |
Older local_chat.py, ktransformers/server/main.py, balance_serve, and kt_optimize_rule paths are historical unless a page explicitly marks them as revalidated.
Getting Started
- Installation - choose the package set for inference or fine-tuning.
- First inference server - start from
kt runor manual SGLang-KT launch. - First LoRA SFT run - start from LLaMA-Factory KT examples.
Inference
- Inference overview - serving paths and method selection.
- Launch a server - choose between registry-driven and manual launches.
- Sending requests - use cURL, Python requests, or the OpenAI client.
- OpenAI-compatible API - endpoint and client behavior for SGLang-KT.
- Popular model usage - where to start for DeepSeek, Kimi, MiniMax, Qwen, and GLM.
Fine-Tuning
- Fine-tuning overview - LoRA SFT as a first-class KTransformers workflow.
- LoRA SFT with LLaMA-Factory - current public SFT entry and config shape.
- SFT backends and precision -
AMXBF16,AMXINT8, andAMXINT4. - Weight preparation - BF16, INT8, INT4, and DeepSeek V3 FP8 source checkpoints.
- DeepSeek SFT and Qwen SFT - current model tutorial status.
Advanced Features
- Server arguments - KT-specific launch parameters and tuning rules.
- Precision and quantization -
BF16,FP8,RAWINT4,AMXINT4,AMXINT8,MXFP4, andLLAMAFILE. - Expert placement - GPU expert count, deferred experts, and dynamic updates.
- AMX backend - AMX architecture, weight conversion, and launch flow.
- Layerwise Prefill - long-context prefill acceleration principles and tuning strategy.
Supported Models and Platforms
- Support matrix - model, precision, backend, and validation status.
- Text generation models - model-family entry points.
- Model status policy - how support claims are written.
- Hardware platform status - CPU/GPU requirements and known platform boundaries.
Technical Work
- Technical work - system background separated from user task guides.
- Heterogeneous inference - CPU-GPU MoE execution direction.
- Local fine-tuning - local inference to local LoRA SFT.
- Talks and slides - GOSIM 2026 and public decks.
- GitHub docs migration map - what moves from GitHub docs and what stays historical.
Developer and Command Reference
- Runtime smoke checklist - what must be verified before upgrading a support claim.
- Benchmark and profiling - repeatable reporting expectations.
- CLI reference -
ktcommand surface. - Troubleshooting - common install, serving, and SFT failures.