Inference
The current KTransformers inference path runs expert computation on the CPU with kt-kernel and serves models with sglang-kt. Install both packages:
```bash
pip install kt-kernel sglang-kt
```
User Paths
| Path | Use when |
|---|---|
| `kt run <model>` | The model is in the KT built-in registry and you want model-specific defaults. |
| `python -m sglang.launch_server ... --kt-*` | You need explicit model paths, tensor parallelism, expert placement, or custom serving arguments. |
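The second path is just an ordinary command line, so it can be assembled programmatically when you script deployments. The sketch below builds one such argv list in Python. `build_launch_argv` and the model path `/models/example-moe` are hypothetical; `--model-path`, `--tp-size`, and `--port` are standard SGLang server options, and `--kt-method` is the flag described on this page. Other `--kt-*` options (expert placement, CPU threading, and so on) are deliberately left out because their exact names should be taken from the launch page for your model and release.

```python
import shlex
import sys


def build_launch_argv(model_path: str, kt_method: str, tp_size: int = 1,
                      port: int = 30000) -> list[str]:
    """Assemble argv for an SGLang-KT server launch (hypothetical helper)."""
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,   # explicit model location
        "--tp-size", str(tp_size),    # tensor parallelism across GPUs
        "--port", str(port),
        # Only --kt-method is named on this page; append any other --kt-*
        # options (expert placement, CPU threads, ...) from the model's docs.
        "--kt-method", kt_method,
    ]


argv = build_launch_argv("/models/example-moe", "AMXINT4", tp_size=2)
print(shlex.join(argv))
# subprocess.run(argv, check=True) would then start the server.
```

Returning a list rather than a single string lets you pass it straight to `subprocess.run` without shell quoting issues; `shlex.join` is used only to print a copy-pasteable command.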
For step-by-step launch and request examples, see the Launch a server and Sending requests pages.
Method Selection
Take the `--kt-method` value from the page for your specific model or from the Support Matrix. The methods currently public are BF16, FP8, FP8_PERCHANNEL, RAWINT4, GPTQ_INT4, AMXINT4, AMXINT8, MXFP4, and LLAMAFILE.
Method names are not interchangeable across model families: the same precision label can require different weight layouts, CPU ISA backends, attention backends, or package versions depending on the model.
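Because a method name being public does not make it valid for every model, a pre-launch check should compare against the model's own list, not just the global one. The sketch below is one way to express that in Python. `check_kt_method` is a hypothetical helper, and `model_page_methods` stands for a set you copy by hand from the model page or Support Matrix; nothing here reads those pages automatically, and the sets used in the usage lines are made up for illustration.

```python
# Public method names listed on this page. Membership here is necessary
# but not sufficient: the model's own page decides what actually works.
PUBLIC_KT_METHODS = frozenset({
    "BF16", "FP8", "FP8_PERCHANNEL", "RAWINT4", "GPTQ_INT4",
    "AMXINT4", "AMXINT8", "MXFP4", "LLAMAFILE",
})


def check_kt_method(method: str, model_page_methods: set[str]) -> str:
    """Return `method` if it is both public and listed for this model.

    `model_page_methods` is a hypothetical input copied from the model's
    page or the Support Matrix.
    """
    if method not in PUBLIC_KT_METHODS:
        raise ValueError(f"unknown --kt-method: {method!r}")
    if method not in model_page_methods:
        raise ValueError(
            f"{method!r} is a public method but is not listed for this "
            "model; choose one from the model page instead"
        )
    return method


# Made-up per-model list, for illustration only.
print(check_kt_method("FP8", {"FP8", "BF16"}))  # prints FP8
```

Names are compared exactly, without case folding, so a typo such as `amxint4` is rejected rather than silently mapped to a method the model may not support.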
Legacy Entry Points
Older tutorials that use `local_chat.py`, `ktransformers/server/main.py`, `balance_serve`, the old GGUF integrated-framework commands, or the old optimize-rule paths are historical; do not follow them unless they have been rewritten for SGLang-KT.