Heterogeneous Inference

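KTransformers targets workstation-scale inference for frontier Mixture-of-Experts (MoE) models by mapping each part of the model to the hardware that runs it most efficiently, for example by keeping some experts on the GPU and running the rest on CPU expert backends.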

Core Ideas

Area	Documentation home
SGLang-KT serving path	Inference Overview and Launch a Server
CPU expert backends	Precision and Quantization
GPU expert count and placement	Expert Placement
Long-context prefill strategy	Layerwise Prefill
AMX execution	AMX Backend
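
To make the "GPU expert count and placement" row concrete, here is a minimal sketch of one possible placement rule: keep the most frequently activated experts on the GPU and leave the rest to the CPU backend. The frequency-based policy, the function name split_experts, and both parameters are assumptions made for illustration. They are not the KTransformers implementation or its documented policy; see Expert Placement for the actual behavior.

```python
def split_experts(
    activation_counts: dict[int, int], num_gpu_experts: int
) -> tuple[list[int], list[int]]:
    """Hypothetical rule: the hottest experts go to the GPU, the rest to the CPU.

    activation_counts maps an expert id to how often it was routed to in some
    profiling run (illustrative input, not a KTransformers data structure).
    """
    if num_gpu_experts < 0:
        raise ValueError("num_gpu_experts must be non-negative")

    # Rank by descending activation count; break ties by expert id so the
    # result is deterministic.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))

    gpu_experts = sorted(ranked[:num_gpu_experts])
    cpu_experts = sorted(ranked[num_gpu_experts:])
    return gpu_experts, cpu_experts
```

Raising num_gpu_experts moves more experts onto the GPU at the cost of GPU memory, which is the trade-off a placement setting has to balance.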

Migration Boundary

Older GitHub tutorials may mention local_chat.py, ktransformers/server/main.py, or balance_serve. Treat those as historical implementation paths. Current public inference documentation should use kt run or python -m sglang.launch_server with --kt-* arguments.
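
When reviewing older tutorials or scripts, a small check like the following can flag the historical entry points named above. This is a sketch only: the helper name find_legacy_references is hypothetical, and it does plain substring matching rather than parsing commands.

```python
# Historical entry points named in the Migration Boundary paragraph.
LEGACY_MARKERS = (
    "local_chat.py",
    "ktransformers/server/main.py",
    "balance_serve",
)


def find_legacy_references(text: str) -> list[str]:
    """Return the legacy markers found in text, in the order listed above."""
    return [marker for marker in LEGACY_MARKERS if marker in text]
```

An empty result means none of the three historical paths appear in the text. It does not confirm that the text uses the current kt run or sglang.launch_server entry points.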

Performance claims belong in technical pages only when the exact model, method, hardware, command, and profiling method are recorded.
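
One way to enforce this rule is to store each claim as a record and reject it while any field is blank. The sketch below assumes a hypothetical PerformanceClaim class; the field names mirror the five items listed above, with "method" read here as the inference method, since it is listed separately from the profiling method.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PerformanceClaim:
    """Hypothetical record of the context a performance claim needs."""

    model: str
    method: str  # assumed to mean the inference method
    hardware: str
    command: str
    profiling_method: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_publishable(self) -> bool:
        """A claim qualifies only when every field is filled in."""
        return not self.missing_fields()
```

A claim with missing_fields() returning a non-empty list would stay out of the technical pages until the gaps are filled.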