KTransformers

Server Arguments

This page summarizes the KT-specific arguments used with SGLang-KT. For general SGLang server arguments, treat the SGLang documentation as the source of truth. The key rule for KT arguments is that they must match the model, weight format, CPU backend, and hardware layout.

Common Launch Shape

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model-path /path/to/model \
  --served-model-name my-model \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --kt-weight-path /path/to/kt-weights \
  --kt-method FP8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32
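
When the same launch shape is reused across machines, it can help to assemble the command programmatically so that the KT values are explicit inputs rather than copy-pasted literals. The following is a minimal sketch, not part of SGLang or KTransformers: `build_launch_command` is a hypothetical helper, and it only emits the flags shown in the command above.

```python
import shlex


def build_launch_command(model_path, kt_weight_path, kt_method,
                         kt_cpuinfer, kt_threadpool_count, kt_num_gpu_experts,
                         host="0.0.0.0", port=30000, tensor_parallel_size=1,
                         served_model_name=None):
    """Hypothetical helper: return the launch command as an argv list."""
    argv = [
        "python", "-m", "sglang.launch_server",
        "--host", host,
        "--port", str(port),
        "--model-path", model_path,
    ]
    if served_model_name is not None:
        argv += ["--served-model-name", served_model_name]
    argv += [
        "--trust-remote-code",
        "--tensor-parallel-size", str(tensor_parallel_size),
        # KT-specific arguments: these must match the model, weight format,
        # CPU backend, and hardware layout.
        "--kt-weight-path", kt_weight_path,
        "--kt-method", kt_method,
        "--kt-cpuinfer", str(kt_cpuinfer),
        "--kt-threadpool-count", str(kt_threadpool_count),
        "--kt-num-gpu-experts", str(kt_num_gpu_experts),
    ]
    return argv


# Same values as the command above; paths are placeholders.
argv = build_launch_command(
    model_path="/path/to/model",
    kt_weight_path="/path/to/kt-weights",
    kt_method="FP8",
    kt_cpuinfer=64,
    kt_threadpool_count=2,
    kt_num_gpu_experts=32,
    served_model_name="my-model",
)
print(shlex.join(argv))  # shell-quoted, ready to paste or log
```

Keeping the command as an argv list avoids shell-quoting mistakes and makes it straightforward to log the exact arguments alongside a benchmark result.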

KT Argument Reference

| Argument | Role | Guidance |
| --- | --- | --- |
| --kt-method | Selects KT expert backend / weight format | Use the exact method from the model page or support matrix. |
| --kt-weight-path | CPU-side expert weight path | May be original model weights, converted AMX weights, native FP8/INT4 weights, or GGUF weights depending on method. |
| --kt-cpuinfer | CPU inference worker count | Start from physical cores, not hyperthreads. |
| --kt-threadpool-count | Threadpool / NUMA grouping | Start from NUMA node count, then tune. |
| --kt-num-gpu-experts | Number of experts resident on GPU | Higher values may reduce latency but increase VRAM pressure. |
| --kt-max-deferred-experts-per-token | Deferred expert execution | Tune carefully; aggressive values can affect quality/latency tradeoffs. |
| --kt-gpu-prefill-token-threshold | Switch point for native FP8/RAWINT4 prefill behavior | Applies to native precision paths; use model-specific defaults first. |
| --kt-enable-dynamic-expert-update | Updates GPU expert placement from observed routing statistics | Requires runtime validation for the target model and workload. |
| --kt-expert-placement-strategy | Initial expert placement strategy | Use uniform as the conservative default unless profiling data exists. |

Tuning Order

  1. Confirm --kt-method and --kt-weight-path.
  2. Confirm CPU features with lscpu.
  3. Set --kt-cpuinfer to physical cores and --kt-threadpool-count to NUMA domains.
  4. Start with a conservative --kt-num-gpu-experts value.
  5. Tune prefill, deferred experts, and dynamic updates only after baseline correctness is stable.
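
Steps 2 and 3 can be partly automated. The sketch below derives starting values for --kt-cpuinfer and --kt-threadpool-count from the parseable output of `lscpu -p=CPU,CORE,SOCKET,NODE`, counting unique (socket, core) pairs as physical cores and unique node IDs as NUMA domains. The function names are hypothetical, the sample text is invented for illustration, and the result is only a starting point for tuning.

```python
import subprocess


def read_lscpu():
    """Return the parseable topology listing from lscpu (Linux only)."""
    result = subprocess.run(
        ["lscpu", "-p=CPU,CORE,SOCKET,NODE"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_lscpu_topology(text):
    """Return (physical_cores, numa_nodes) from `lscpu -p=CPU,CORE,SOCKET,NODE` text."""
    cores = set()
    nodes = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        _cpu, core, socket, node = line.split(",")[:4]
        cores.add((socket, core))  # hyperthreads share a (socket, core) pair
        if node != "":
            nodes.add(node)  # node is empty when NUMA info is unavailable
    return len(cores), max(len(nodes), 1)


def suggest_kt_cpu_args(text):
    """Map the topology to starting values for the two KT CPU arguments."""
    physical_cores, numa_nodes = parse_lscpu_topology(text)
    return {
        "--kt-cpuinfer": physical_cores,
        "--kt-threadpool-count": numa_nodes,
    }


# Invented sample: 2 sockets, 2 cores each, 2 threads per core, 2 NUMA nodes.
SAMPLE = """\
# CPU,Core,Socket,Node
0,0,0,0
1,1,0,0
2,2,1,1
3,3,1,1
4,0,0,0
5,1,0,0
6,2,1,1
7,3,1,1
"""

print(suggest_kt_cpu_args(SAMPLE))
```

On a real host, pass `read_lscpu()` instead of `SAMPLE`. The sample has eight logical CPUs but only four physical cores, which is why the suggestion counts cores rather than lines.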

Do Not Generalize

A working command is not automatically reusable across model families. Treat a command as currently supported only when its exact support tuple has been recorded:

model family + checkpoint + KT method + CPU backend + GPU layout + package versions
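
One way to make this rule checkable is to store each validated tuple as an immutable record and compare candidates for exact equality. The sketch below is an illustration only: `SupportTuple`, `make_support_tuple`, and `is_recorded` are hypothetical names, and every field value is a placeholder rather than a real supported configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportTuple:
    """One validated configuration; all six fields must match exactly."""
    model_family: str
    checkpoint: str
    kt_method: str
    cpu_backend: str
    gpu_layout: str
    package_versions: tuple  # sorted (name, version) pairs, so it is hashable


def make_support_tuple(model_family, checkpoint, kt_method,
                       cpu_backend, gpu_layout, package_versions):
    """Build a record from plain values; package_versions is a dict."""
    return SupportTuple(
        model_family, checkpoint, kt_method, cpu_backend, gpu_layout,
        tuple(sorted(package_versions.items())),
    )


def is_recorded(candidate, recorded):
    """True only if the exact tuple is present in the recorded set."""
    return candidate in recorded


# Placeholder values, not a statement of actual support.
validated = make_support_tuple(
    "example-family", "example-checkpoint", "FP8",
    "example-cpu-backend", "1 GPU, TP=1", {"example-package": "0.0.0"},
)
recorded = {validated}

# Changing any single field, here the KT method, breaks the match.
other_method = make_support_tuple(
    "example-family", "example-checkpoint", "GGUF",
    "example-cpu-backend", "1 GPU, TP=1", {"example-package": "0.0.0"},
)
print(is_recorded(validated, recorded), is_recorded(other_method, recorded))
```

Exact equality is deliberately strict: a command validated for one tuple says nothing about a neighbouring one until that tuple is recorded too.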