KTransformers

Heterogeneous Inference

KTransformers targets workstation-scale inference for frontier MoE models by mapping different parts of the model to the hardware that can run them efficiently.

Core Ideas

Area                           | Documentation home
------------------------------ | --------------------------------------
SGLang-KT serving path         | Inference Overview and Launch a Server
CPU expert backends            | Precision and Quantization
GPU expert count and placement | Expert Placement
Long-context prefill strategy  | Layerwise Prefill
AMX execution                  | AMX Backend

Migration Boundary

Older GitHub tutorials may mention local_chat.py, ktransformers/server/main.py, or balance_serve. Treat these as historical implementation paths, not as current entry points. Current public inference documentation should use kt run or python -m sglang.launch_server with --kt-* arguments.
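When reviewing an older tutorial, it can help to flag mentions of the historical entry points automatically. The following is a minimal sketch, not part of KTransformers: the function name find_legacy_references and the sample text are hypothetical, and only the three legacy names come from the paragraph above.

```python
# Legacy entry points named in the migration note above.
LEGACY_ENTRY_POINTS = (
    "local_chat.py",
    "ktransformers/server/main.py",
    "balance_serve",
)


def find_legacy_references(text: str) -> list[tuple[int, str]]:
    """Return (line_number, entry_point) pairs for each legacy mention.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name in LEGACY_ENTRY_POINTS:
            if name in line:
                hits.append((lineno, name))
    return hits


# Hypothetical tutorial excerpt used only to exercise the helper.
sample = (
    "Start the chat with local_chat.py.\n"
    "Newer pages use kt run instead.\n"
    "Make sure balance_serve is running.\n"
)

for lineno, name in find_legacy_references(sample):
    print(f"line {lineno}: mentions historical entry point {name}")
```

A plain substring match is used on purpose: it is easy to audit, and an occasional false positive is acceptable for a review aid.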

Performance claims belong in technical pages only when the exact model, method, hardware, command, and profiling method are recorded.
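One way to apply this rule is to treat each performance claim as a record whose fields must all be filled in before it is published. The sketch below is illustrative only: the PerformanceClaim class, the missing_fields helper, and the placeholder values are assumptions, while the five field names mirror the list in the sentence above.

```python
from dataclasses import dataclass, fields


@dataclass
class PerformanceClaim:
    """Hypothetical record; field names follow the list in the text."""
    model: str = ""
    method: str = ""
    hardware: str = ""
    command: str = ""
    profiling_method: str = ""


def missing_fields(claim: PerformanceClaim) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(claim)
        if not getattr(claim, f.name).strip()
    ]


# Placeholder values, not real measurements or configurations.
draft = PerformanceClaim(
    model="<model name>",
    hardware="<CPU and GPU description>",
)

gaps = missing_fields(draft)
if gaps:
    print("Claim is not ready to publish; missing:", ", ".join(gaps))
```

A claim with any empty field stays out of the technical pages until the gap is filled.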