Server Arguments

This page summarizes the KT-specific arguments used with SGLang-KT. For general SGLang server arguments, treat the SGLang documentation as the source of truth. In KTransformers, the KT arguments are tightly coupled to the model, weight format, CPU backend, and hardware layout.

Common Launch Shape

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model-path /path/to/model \
  --served-model-name my-model \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --kt-weight-path /path/to/kt-weights \
  --kt-method FP8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32
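
When the same launch shape is reused across experiments, it can help to keep the values in one place and generate the command from them. The following is a minimal sketch, not part of SGLang-KT: `build_launch_command` is a hypothetical helper, and it assumes every option maps to a `--dashed-name` flag exactly as in the command above. It only uses the flags and placeholder values already shown.

```python
import shlex

def build_launch_command(server_args, kt_args):
    """Flatten two option dicts into an argv list (hypothetical helper)."""
    argv = ["python", "-m", "sglang.launch_server"]
    for name, value in {**server_args, **kt_args}.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # Boolean switch such as --trust-remote-code takes no value.
            argv.append(flag)
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv

# Placeholder values copied from the launch shape above.
server_args = {
    "host": "0.0.0.0",
    "port": 30000,
    "model_path": "/path/to/model",
    "served_model_name": "my-model",
    "trust_remote_code": True,
    "tensor_parallel_size": 1,
}
kt_args = {
    "kt_weight_path": "/path/to/kt-weights",
    "kt_method": "FP8",
    "kt_cpuinfer": 64,
    "kt_threadpool_count": 2,
    "kt_num_gpu_experts": 32,
}

command = build_launch_command(server_args, kt_args)
print(shlex.join(command))  # one shell-quoted line, ready to paste
```

Keeping the general server options and the KT options in separate dicts mirrors the split on this page: the first group follows the SGLang documentation, the second follows the model page or support matrix.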

KT Argument Reference

Argument	Role	Guidance
`--kt-method`	Selects KT expert backend / weight format	Use the exact method from the model page or support matrix.
`--kt-weight-path`	CPU-side expert weight path	May be original model weights, converted AMX weights, native FP8/INT4 weights, or GGUF weights depending on method.
`--kt-cpuinfer`	CPU inference worker count	Start from physical cores, not hyperthreads.
`--kt-threadpool-count`	Threadpool / NUMA grouping	Start from NUMA node count, then tune.
`--kt-num-gpu-experts`	Number of experts resident on GPU	Higher values may reduce latency but increase VRAM pressure.
`--kt-max-deferred-experts-per-token`	Deferred expert execution	Tune carefully; aggressive values can affect quality/latency tradeoffs.
`--kt-gpu-prefill-token-threshold`	Switch point for native FP8/RAWINT4 prefill behavior	Applies to native precision paths; use model-specific defaults first.
`--kt-enable-dynamic-expert-update`	Updates GPU expert placement from observed routing statistics	Requires runtime validation for the target model and workload.
`--kt-expert-placement-strategy`	Initial expert placement strategy	Use `uniform` as the conservative default unless profiling data exists.

Tuning Order

1. Confirm `--kt-method` and `--kt-weight-path`.
2. Confirm the required CPU features with `lscpu`.
3. Set `--kt-cpuinfer` to the physical core count and `--kt-threadpool-count` to the number of NUMA domains.
4. Start with a conservative `--kt-num-gpu-experts` value.
5. Tune prefill, deferred experts, and dynamic updates only after baseline correctness is stable.
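
The CPU-related starting values can be read off `lscpu` output. The sketch below is one possible way to do that, under stated assumptions: `parse_lscpu` and `suggest_cpu_args` are hypothetical helpers, the field names (`Socket(s)`, `Core(s) per socket`, `NUMA node(s)`, `Flags`) are those printed by `lscpu` on x86 Linux, and `SAMPLE` is made-up output rather than a real machine.

```python
def parse_lscpu(text):
    """Turn `lscpu` output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def suggest_cpu_args(fields):
    """Derive starting values: physical cores and NUMA node count."""
    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    numa_nodes = int(fields["NUMA node(s)"])
    flags = set(fields.get("Flags", "").split())
    return {
        # Physical cores only; "Thread(s) per core" is deliberately ignored.
        "kt_cpuinfer": sockets * cores_per_socket,
        "kt_threadpool_count": numa_nodes,
        "has_amx": "amx_tile" in flags,
        "has_avx512f": "avx512f" in flags,
    }

# Made-up sample for illustration; real output has many more fields.
SAMPLE = """\
Architecture:          x86_64
CPU(s):                128
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          2
Flags:                 fpu sse2 avx512f amx_bf16 amx_tile amx_int8
"""

suggestion = suggest_cpu_args(parse_lscpu(SAMPLE))
print(suggestion)
```

On a real host, the text can come from `subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout`. The result is only a starting point for steps 2 and 3; the method named on the model page still decides which CPU features are actually required.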

Scope of a Launch Tuple

A working launch command represents one tested combination of model, weights, backend, hardware, and package versions. Before copying it to another model family or hardware layout, check the model page or support matrix, then rerun the server smoke test and the prefill/decode TPS sweeps.
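
For the TPS sweeps, the two numbers can be derived from a single timed streaming request. This is a sketch of one common convention, not a definition taken from KTransformers: it assumes prefill covers the time up to the first generated token and decode covers the remaining tokens. `prefill_decode_tps` is a hypothetical helper, and the timings are invented for illustration.

```python
def prefill_decode_tps(prompt_tokens, completion_tokens, ttft_s, total_s):
    """Return (prefill_tps, decode_tps) for one timed streaming request.

    Assumed convention: prefill ends when the first token arrives (ttft_s),
    and decode produces the remaining completion_tokens - 1 tokens.
    """
    if ttft_s <= 0 or total_s <= ttft_s:
        raise ValueError("expected 0 < ttft_s < total_s")
    prefill_tps = prompt_tokens / ttft_s
    decode_tps = (completion_tokens - 1) / (total_s - ttft_s)
    return prefill_tps, decode_tps

# Hypothetical timings, not measured on any real system.
prefill, decode = prefill_decode_tps(
    prompt_tokens=2048, completion_tokens=129, ttft_s=4.0, total_s=12.0
)
print(f"prefill {prefill:.1f} tok/s, decode {decode:.1f} tok/s")
```

Whatever convention is used, keeping it fixed across sweeps matters more than the exact formula, since the numbers are only comparable between launch tuples measured the same way.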

Dimension	Examples
Model and checkpoint	Model family, exact weight directory, revision
KT method / backend	`FP8`, `RAWINT4`, `AMXINT8`, `MXFP4`
CPU backend	AMX, worker count, NUMA/threadpool settings
GPU layout	GPU model, GPU count, `--kt-num-gpu-experts`
Package versions	`ktransformers`, `kt-kernel`, `sglang-kt`, `transformers-kt`
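
Recording these dimensions next to each benchmark result makes it easier to tell which tuple a number belongs to. The sketch below shows one possible record format. It assumes the installed distribution names match the package names in the table, which may not hold for every install method. `installed_versions` is a hypothetical helper, and the paths and GPU entry are placeholders.

```python
import json
from importlib.metadata import PackageNotFoundError, version

# Package names as listed in the table above (assumed distribution names).
PACKAGES = ["ktransformers", "kt-kernel", "sglang-kt", "transformers-kt"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

launch_tuple = {
    "model_path": "/path/to/model",            # placeholder
    "kt_weight_path": "/path/to/kt-weights",   # placeholder
    "kt_method": "FP8",
    "kt_cpuinfer": 64,
    "kt_threadpool_count": 2,
    "kt_num_gpu_experts": 32,
    "gpu": "fill in GPU model and count",      # placeholder
    "packages": installed_versions(PACKAGES),
}

print(json.dumps(launch_tuple, indent=2))
```

A `None` version means the package was not found under that name in the current environment, which is itself worth noting before comparing results.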