KTransformers

Expert Placement

KTransformers serving offloads part of the MoE expert computation to the CPU while keeping a selected subset of experts on the GPU. Expert placement determines memory use, latency, and long-context behavior.

Main Controls

Control                                 Role
--kt-num-gpu-experts                    Number of experts placed on GPU.
--kt-expert-placement-strategy          Initial placement strategy such as uniform, frequency, front-loading, or random.
--kt-enable-dynamic-expert-update       Updates placement from runtime routing statistics.
--kt-max-deferred-experts-per-token     Allows deferred expert execution for pipelining.
--kt-gpu-prefill-token-threshold        Controls when native precision paths switch prefill behavior.
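
To make the four strategy names concrete, the sketch below shows one plausible reading of each. It is an illustration only, not the KTransformers implementation: the function choose_gpu_experts, the (layer, expert) indexing, and the treatment of the GPU expert count as a single total budget are all assumptions made for this example.

```python
import random
from typing import Dict, List, Optional, Tuple

ExpertId = Tuple[int, int]  # (layer index, expert index); hypothetical indexing


def choose_gpu_experts(
    strategy: str,
    num_layers: int,
    experts_per_layer: int,
    budget: int,
    activation_counts: Optional[Dict[ExpertId, int]] = None,
    seed: int = 0,
) -> List[ExpertId]:
    """Pick which experts go on GPU under an assumed total budget."""
    all_experts = [(l, e) for l in range(num_layers) for e in range(experts_per_layer)]
    budget = max(0, min(budget, len(all_experts)))

    if strategy == "uniform":
        # Assumed meaning: spread the budget evenly across layers;
        # any remainder goes to the earliest layers.
        base, extra = divmod(budget, num_layers)
        chosen: List[ExpertId] = []
        for layer in range(num_layers):
            take = base + (1 if layer < extra else 0)
            chosen.extend((layer, e) for e in range(take))
        return chosen

    if strategy == "front-loading":
        # Assumed meaning: fill the earliest layers completely first.
        return all_experts[:budget]

    if strategy == "frequency":
        # Assumed meaning: keep the most frequently routed experts on GPU.
        if activation_counts is None:
            raise ValueError("frequency placement needs activation statistics")
        ranked = sorted(all_experts, key=lambda x: -activation_counts.get(x, 0))
        return sorted(ranked[:budget])

    if strategy == "random":
        # Seeded so that a placement can be reproduced.
        return sorted(random.Random(seed).sample(all_experts, budget))

    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(choose_gpu_experts("uniform", num_layers=2, experts_per_layer=4, budget=4))
```

Under this reading, frequency is the only strategy that needs measured routing data, which is why the other three work without model-specific statistics.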

Conservative Defaults

  • Start with registry defaults when using kt run.
  • For manual launch, use uniform unless you have model-specific activation statistics.
  • Increase --kt-num-gpu-experts only after checking VRAM headroom.
  • Treat --kt-max-deferred-experts-per-token values above the conservative range as experimental until quality checks pass.
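
The VRAM headroom check can be reduced to simple arithmetic before raising --kt-num-gpu-experts. The sketch below is a rough estimate under stated assumptions: max_gpu_experts is a hypothetical helper, the per-expert size and the reserve for KV cache and activations are made-up example numbers, and the free-memory figure has to come from your own measurement (for example, torch.cuda.mem_get_info() in PyTorch returns free and total bytes).

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def max_gpu_experts(free_vram_bytes: int, bytes_per_expert: int, reserve_bytes: int) -> int:
    """Upper bound on extra experts that fit in free VRAM after a safety reserve.

    All three inputs are values you measure or estimate yourself; nothing here
    is read from KTransformers.
    """
    if bytes_per_expert <= 0:
        raise ValueError("bytes_per_expert must be positive")
    usable = free_vram_bytes - reserve_bytes
    if usable <= 0:
        return 0
    return usable // bytes_per_expert


if __name__ == "__main__":
    # Hypothetical numbers: 8 GiB free, 2 GiB kept back for KV cache and
    # activations, 44 MiB per expert at the chosen weight precision.
    print(max_gpu_experts(8 * GIB, 44 * MIB, 2 * GIB))  # 139
```

Treat the result as a ceiling rather than a target, since long prompts and higher concurrency consume the reserve faster than a short test run suggests.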

Dynamic Updates

Dynamic expert update can help when activation patterns are skewed, but it is workload-sensitive. A support claim should include:

  • model and checkpoint
  • prompt length range
  • batch/concurrency shape
  • GPU expert count
  • whether dynamic update was enabled
  • observed quality and latency behavior
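
One way to keep such claims comparable is to record the items above in a fixed structure. The sketch below is only a suggestion: DynamicUpdateReport and its field names are hypothetical and are not part of KTransformers, and the values in the example are placeholders rather than measured results.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DynamicUpdateReport:
    """Hypothetical record mirroring the checklist above."""

    model: str
    checkpoint: str
    prompt_len_min: int
    prompt_len_max: int
    batch_size: int
    concurrency: int
    num_gpu_experts: int
    dynamic_update_enabled: bool
    quality_notes: str
    latency_notes: str

    def missing_fields(self) -> List[str]:
        # A field counts as missing when it was left as an empty string or None.
        return [name for name, value in asdict(self).items() if value in ("", None)]


if __name__ == "__main__":
    report = DynamicUpdateReport(
        model="example-moe-model",         # placeholder
        checkpoint="example-checkpoint",   # placeholder
        prompt_len_min=512,
        prompt_len_max=8192,
        batch_size=1,
        concurrency=4,
        num_gpu_experts=8,
        dynamic_update_enabled=True,
        quality_notes="",                  # fill in after running quality checks
        latency_notes="",                  # fill in after measuring latency
    )
    print(json.dumps(asdict(report), indent=2))
    print("still missing:", report.missing_fields())
```

Running the same record once with dynamic update enabled and once with it disabled, with every other field held fixed, makes it easier to attribute a latency or quality change to the update itself.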