Heterogeneous Inference
KTransformers targets workstation-scale inference for frontier Mixture-of-Experts (MoE) models by mapping different parts of the model to the hardware that can run them efficiently.
Core Ideas
| Area | Documentation home |
|---|---|
| SGLang-KT serving path | Inference Overview and Launch a Server |
| CPU expert backends | Precision and Quantization |
| GPU expert count and placement | Expert Placement |
| Long-context prefill strategy | Layerwise Prefill |
| AMX execution | AMX Backend |
Migration Boundary
Older GitHub tutorials may mention `local_chat.py`, `ktransformers/server/main.py`, or `balance_serve`. Treat those as historical implementation paths. Current public inference documentation should use `kt run` or `python -m sglang.launch_server` with `--kt-*` arguments.
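One way to enforce this boundary when reviewing documentation is to scan pages for the legacy names listed above. The following is a minimal sketch, not part of KTransformers: the `LEGACY_PATTERNS` table and the `find_legacy_references` helper are hypothetical names introduced here for illustration, and only the three legacy strings come from this page.

```python
import re

# Legacy entry points named in the migration note above.
# The dictionary and the helper below are illustrative, not a project API.
LEGACY_PATTERNS = {
    "local_chat.py": re.compile(r"\blocal_chat\.py\b"),
    "ktransformers/server/main.py": re.compile(r"ktransformers/server/main\.py"),
    "balance_serve": re.compile(r"\bbalance_serve\b"),
}


def find_legacy_references(text: str) -> list[tuple[int, str]]:
    """Return (line_number, legacy_name) pairs for each legacy mention in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in LEGACY_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

Running the helper over a documentation page returns the line numbers that still point readers at a historical path, so each one can be rewritten to use `kt run` or `python -m sglang.launch_server`.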
Performance claims belong in technical pages only when the exact model, method, hardware, command, and profiling method are recorded.
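The recording requirement can be expressed as a simple completeness check. This is a sketch under stated assumptions: the `PerformanceClaim` dataclass and the `missing_fields` helper are hypothetical, and the field names only mirror the five items listed in the sentence above.

```python
from dataclasses import dataclass, fields


@dataclass
class PerformanceClaim:
    """Metadata that must accompany a performance claim (illustrative only)."""
    model: str = ""
    method: str = ""
    hardware: str = ""
    command: str = ""
    profiling_method: str = ""


def missing_fields(claim: PerformanceClaim) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name).strip()]
```

A claim is ready for a technical page only when `missing_fields` returns an empty list; any name it returns identifies a detail that still needs to be recorded.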