Expert Placement
KTransformers serving moves part of MoE expert computation to CPU while keeping selected experts on GPU. Placement determines memory use, latency, and long-context behavior.
Main Controls
| Control | Role |
|---|---|
| `--kt-num-gpu-experts` | Number of experts placed on GPU. |
| `--kt-expert-placement-strategy` | Initial placement strategy such as uniform, frequency, front-loading, or random. |
| `--kt-enable-dynamic-expert-update` | Updates placement from runtime routing statistics. |
| `--kt-max-deferred-experts-per-token` | Allows deferred expert execution for pipelining. |
| `--kt-gpu-prefill-token-threshold` | Controls when native precision paths switch prefill behavior. |
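
For a manual launch, it can help to assemble these flags in one place and validate them before starting the server. The following is a minimal sketch, not part of KTransformers: `PlacementConfig` and `to_args` are hypothetical names, the strategy spellings are taken from the table above, and the code assumes that `--kt-enable-dynamic-expert-update` is a switch that takes no value. Check the launcher's help output before relying on either assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Strategy names as listed in the table; the exact spellings the CLI
# accepts are an assumption.
KNOWN_STRATEGIES = ("uniform", "frequency", "front-loading", "random")


@dataclass
class PlacementConfig:  # hypothetical helper, not a KTransformers API
    num_gpu_experts: int
    strategy: str = "uniform"
    dynamic_update: bool = False
    max_deferred_experts_per_token: Optional[int] = None
    gpu_prefill_token_threshold: Optional[int] = None

    def to_args(self) -> List[str]:
        """Return the placement flags as a list of CLI tokens."""
        if self.num_gpu_experts < 0:
            raise ValueError("num_gpu_experts must be non-negative")
        if self.strategy not in KNOWN_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        args = [
            "--kt-num-gpu-experts", str(self.num_gpu_experts),
            "--kt-expert-placement-strategy", self.strategy,
        ]
        # Assumption: this flag is a boolean switch that takes no value.
        if self.dynamic_update:
            args.append("--kt-enable-dynamic-expert-update")
        # Optional controls are emitted only when explicitly set.
        if self.max_deferred_experts_per_token is not None:
            args += ["--kt-max-deferred-experts-per-token",
                     str(self.max_deferred_experts_per_token)]
        if self.gpu_prefill_token_threshold is not None:
            args += ["--kt-gpu-prefill-token-threshold",
                     str(self.gpu_prefill_token_threshold)]
        return args


# The value 8 is a placeholder, not a recommendation.
print(" ".join(PlacementConfig(num_gpu_experts=8).to_args()))
```

The resulting token list can be appended to whatever launch command is already in use. Leaving the optional fields unset keeps the launcher's own defaults in effect.
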
Conservative Defaults
- Start with registry defaults when using `kt run`.
- For manual launch, use `uniform` unless you have model-specific activation statistics.
- Increase `--kt-num-gpu-experts` only after checking VRAM headroom.
- Treat deferred expert values above the conservative range as experimental until quality checks pass.
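
The VRAM headroom check can be approximated with simple arithmetic before raising the GPU expert count. This is a rough sketch under stated assumptions: the function name and all parameters are hypothetical, the per-expert weight size must be measured or computed for your checkpoint and quantization, and the code assumes the expert count applies to each MoE layer (pass `num_moe_layers=1` if the count is global in your setup).

```python
def max_gpu_experts_for_headroom(
    free_vram_bytes: int,
    bytes_per_expert: int,
    num_moe_layers: int,
    reserve_bytes: int,
) -> int:
    """Upper bound on GPU experts that fit in the available VRAM headroom.

    reserve_bytes is memory deliberately left free for KV cache growth,
    activations, and fragmentation; choosing it is workload-specific.
    """
    if bytes_per_expert <= 0 or num_moe_layers <= 0:
        raise ValueError("bytes_per_expert and num_moe_layers must be positive")
    usable = max(0, free_vram_bytes - reserve_bytes)
    # Assumption: raising the count by one adds one expert in every MoE layer.
    cost_per_increment = bytes_per_expert * num_moe_layers
    return usable // cost_per_increment


GiB, MiB = 2**30, 2**20
# Placeholder numbers for illustration only, not measurements.
print(max_gpu_experts_for_headroom(10 * GiB, 64 * MiB, 40, 2 * GiB))  # → 3
```

The result is only an upper bound on weight storage. Long-context runs enlarge the KV cache, so the reserve should be sized against the longest prompts you expect to serve.
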
Dynamic Updates
Dynamic expert update can help when activation patterns are skewed, but it is workload-sensitive. A support claim should include:
- model and checkpoint
- prompt length range
- batch/concurrency shape
- GPU expert count
- whether dynamic update was enabled
- observed quality and latency behavior
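
One way to keep such claims comparable is to record the items above in a fixed structure and refuse incomplete entries. The sketch below is illustrative only: `PlacementReport` and its field names are hypothetical, and the values in the usage line are placeholders rather than real results.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class PlacementReport:  # hypothetical record, one field per checklist item
    model: str
    checkpoint: str
    prompt_len_range: Tuple[int, int]  # (min_tokens, max_tokens)
    batch_shape: str                   # free-form batch/concurrency description
    num_gpu_experts: int
    dynamic_update: bool
    quality_note: str
    latency_note: str

    def missing_fields(self) -> List[str]:
        """Names of text fields that were left empty."""
        return [
            f.name for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


# Placeholder values; the empty notes show how an incomplete claim is flagged.
report = PlacementReport(
    model="example-moe-model",
    checkpoint="example-checkpoint",
    prompt_len_range=(512, 32768),
    batch_shape="batch 1, concurrency 4",
    num_gpu_experts=8,
    dynamic_update=True,
    quality_note="",
    latency_note="",
)
print(report.missing_fields())  # → ['quality_note', 'latency_note']
```
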