Server Arguments
This page summarizes the KT-specific arguments used with SGLang-KT. For general SGLang server arguments, treat the SGLang documentation as the source of truth. The key rule for KTransformers is that the KT arguments must match the model, weight format, CPU backend, and hardware layout.
Common Launch Shape
    python -m sglang.launch_server \
        --host 0.0.0.0 \
        --port 30000 \
        --model-path /path/to/model \
        --served-model-name my-model \
        --trust-remote-code \
        --tensor-parallel-size 1 \
        --kt-weight-path /path/to/kt-weights \
        --kt-method FP8 \
        --kt-cpuinfer 64 \
        --kt-threadpool-count 2 \
        --kt-num-gpu-experts 32
KT Argument Reference
| Argument | Role | Guidance |
|---|---|---|
| `--kt-method` | Selects KT expert backend / weight format | Use the exact method from the model page or support matrix. |
| `--kt-weight-path` | CPU-side expert weight path | May be original model weights, converted AMX weights, native FP8/INT4 weights, or GGUF weights, depending on the method. |
| `--kt-cpuinfer` | CPU inference worker count | Start from physical cores, not hyperthreads. |
| `--kt-threadpool-count` | Threadpool / NUMA grouping | Start from NUMA node count, then tune. |
| `--kt-num-gpu-experts` | Number of experts resident on GPU | Higher values may reduce latency but increase VRAM pressure. |
| `--kt-max-deferred-experts-per-token` | Deferred expert execution | Tune carefully; aggressive values can affect quality/latency tradeoffs. |
| `--kt-gpu-prefill-token-threshold` | Switch point for native FP8/RAWINT4 prefill behavior | Applies to native precision paths; use model-specific defaults first. |
| `--kt-enable-dynamic-expert-update` | Updates GPU expert placement from observed routing statistics | Requires runtime validation for the target model and workload. |
| `--kt-expert-placement-strategy` | Initial expert placement strategy | Use `uniform` as the conservative default unless profiling data exists. |
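
The starting points for `--kt-cpuinfer` and `--kt-threadpool-count` can be read off the machine topology. The sketch below is one possible way to do that on Linux, assuming `lscpu` from util-linux is available. The function names `count_physical_cores` and `count_numa_nodes` are hypothetical helpers, not part of SGLang or KTransformers. Both read `lscpu -p=CORE,SOCKET,NODE` output on stdin, where hyperthread siblings share the same core and socket pair and therefore collapse into one physical core.

```shell
# Hypothetical helpers (not part of SGLang-KT): derive starting values for
# --kt-cpuinfer and --kt-threadpool-count from `lscpu -p=CORE,SOCKET,NODE`.

# Count physical cores: hyperthread siblings share the same CORE,SOCKET pair,
# so counting unique pairs ignores hyperthreads.
count_physical_cores() {
  grep -v '^#' | cut -d, -f1,2 | sort -u | wc -l | tr -d ' '
}

# Count NUMA nodes: unique values of the third (NODE) column.
count_numa_nodes() {
  grep -v '^#' | cut -d, -f3 | sort -u | wc -l | tr -d ' '
}

# Usage on a Linux host (prints suggested starting values only):
#   topo="$(lscpu -p=CORE,SOCKET,NODE)"
#   echo "--kt-cpuinfer $(printf '%s\n' "$topo" | count_physical_cores)"
#   echo "--kt-threadpool-count $(printf '%s\n' "$topo" | count_numa_nodes)"
```

These numbers are only starting points; the guidance in the table still applies, and both values may need tuning for a given model and workload.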
Tuning Order
- Confirm `--kt-method` and `--kt-weight-path`.
- Confirm CPU features with `lscpu`.
- Set `--kt-cpuinfer` to physical cores and `--kt-threadpool-count` to NUMA domains.
- Start with conservative `--kt-num-gpu-experts`.
- Tune prefill, deferred experts, and dynamic updates only after baseline correctness is stable.
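
The CPU feature check in the list above can be scripted so that it fails loudly before a launch is attempted. The following is a minimal sketch, assuming Linux `lscpu` output with a `Flags:` line. `has_cpu_flags` is a hypothetical helper, and the flag names in the usage comment (`avx512f`, `amx_tile`) are examples only. Which flags a given `--kt-method` actually requires is an assumption to verify against the model page or support matrix.

```shell
# Hypothetical helper: succeed only if every named flag appears as a whole word
# in the "Flags:" line of `lscpu` output read from stdin.
has_cpu_flags() {
  # Keep everything after the first colon and squeeze whitespace to single spaces.
  flags="$(grep -i '^Flags' | cut -d: -f2- | tr -s '[:space:]' ' ')"
  for want in "$@"; do
    case " $flags " in
      *" $want "*) ;;  # flag present, check the next one
      *) echo "missing CPU flag: $want" >&2; return 1 ;;
    esac
  done
}

# Usage (flag names are illustrative, not a statement of KT requirements):
#   lscpu | has_cpu_flags avx512f amx_tile || echo "CPU backend may not be usable"
```

Matching on whole words matters here: a plain substring search for `avx` would also match `avx2` and `avx512f`.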
Do Not Generalize
A working command is not automatically reusable across model families. Treat a command as currently supported only when its exact support tuple has been recorded:
model family + checkpoint + KT method + CPU backend + GPU layout + package versions
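
One lightweight way to record the tuple is a single line per validated configuration. The sketch below is an illustration rather than an established SGLang-KT convention: `format_support_tuple`, the tab-separated layout, and the log file name in the usage comment are all hypothetical, and every value shown is a placeholder.

```shell
# Hypothetical helper: print one tab-separated support record. The field order
# mirrors the tuple above: model family, checkpoint, KT method, CPU backend,
# GPU layout, package versions.
format_support_tuple() {
  if [ "$#" -ne 6 ]; then
    echo "usage: format_support_tuple FAMILY CHECKPOINT KT_METHOD CPU_BACKEND GPU_LAYOUT VERSIONS" >&2
    return 2
  fi
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' "$@"
}

# Usage (placeholder values; the log file name is an assumption):
#   format_support_tuple "my-family" "/path/to/model" "FP8" "my-cpu-backend" \
#     "1 GPU, TP=1" "versions-from-pip-freeze" >> kt-support-log.tsv
```

Requiring all six fields makes an incomplete record fail immediately, which matches the rule that a command counts as supported only when the full tuple is known.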