Expert Placement

KTransformers serving moves part of MoE expert computation to CPU while keeping selected experts on GPU. Placement determines memory use, latency, and long-context behavior.

Main Controls

Control	Role
`--kt-num-gpu-experts`	Number of experts placed on GPU.
`--kt-expert-placement-strategy`	Initial placement strategy such as `uniform`, `frequency`, `front-loading`, or `random`.
`--kt-enable-dynamic-expert-update`	Updates placement from runtime routing statistics.
`--kt-max-deferred-experts-per-token`	Allows deferred expert execution for pipelining.
`--kt-gpu-prefill-token-threshold`	Controls when native precision paths switch prefill behavior.
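
As a rough illustration of how these controls combine, the sketch below assembles the flags from the table into an argument list. The values are placeholders, not recommendations. The helper `build_placement_args` is hypothetical, and treating `--kt-enable-dynamic-expert-update` as a bare switch with no value is an assumption to confirm against your launcher's help output.

```python
import shlex

# Placeholder values for illustration only; they are not tuned recommendations.
placement = {
    "--kt-num-gpu-experts": 8,
    "--kt-expert-placement-strategy": "uniform",
    "--kt-max-deferred-experts-per-token": 2,
}


def build_placement_args(options, enable_dynamic_update=False):
    """Flatten placement options into a list of command-line arguments."""
    args = []
    for flag, value in options.items():
        args.extend([flag, str(value)])
    if enable_dynamic_update:
        # Assumption: this flag is a bare switch that takes no value.
        args.append("--kt-enable-dynamic-expert-update")
    return args


placement_args = build_placement_args(placement)
# Print the fragment so it can be appended to whatever launch command you use.
print(shlex.join(placement_args))
```

Keeping the options in one dictionary makes it easy to log the exact placement settings alongside each run.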

Conservative Defaults

Start with registry defaults when using `kt run`.
For manual launch, use `uniform` unless you have model-specific activation statistics.
Increase `--kt-num-gpu-experts` only after checking VRAM headroom.
Treat `--kt-max-deferred-experts-per-token` values above the conservative range as experimental until quality checks pass.
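
The VRAM headroom check can be approximated with simple arithmetic before raising the GPU expert count. The sketch below is an estimate under stated assumptions: `max_gpu_experts` is a hypothetical helper, and the per-expert size and safety reserve are numbers you would measure for your own model and workload. Whether `--kt-num-gpu-experts` counts experts per layer or in total should be confirmed against the KTransformers documentation before applying the result.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_gpu_experts(free_vram_bytes, bytes_per_expert, reserve_bytes):
    """Return how many experts fit after holding back a safety reserve."""
    if bytes_per_expert <= 0:
        raise ValueError("bytes_per_expert must be positive")
    usable = free_vram_bytes - reserve_bytes
    # Clamp at zero when the reserve already exceeds the free memory.
    return max(0, usable // bytes_per_expert)


# Hypothetical numbers: 10 GiB free, 4 GiB kept in reserve for KV cache
# and activations, and 96 MiB of weights per expert.
budget = max_gpu_experts(10 * GIB, 96 * MIB, 4 * GIB)
print(budget)
```

The reserve matters most for long-context workloads, where memory outside the expert weights grows with prompt length.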

Dynamic Updates

Dynamic expert update can help when activation patterns are skewed, but it is workload-sensitive. Complete experiment records usually include:

model and checkpoint
prompt length range
batch/concurrency shape
GPU expert count
whether dynamic update was enabled
observed quality and latency behavior
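
One way to keep such records consistent is a small structured type with one field per item in the list above. This is a minimal sketch: the class name `PlacementExperiment`, its field names, and the example values are all hypothetical and not part of KTransformers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlacementExperiment:
    """One record per run, mirroring the fields listed above."""
    model: str
    checkpoint: str
    prompt_len_min: int
    prompt_len_max: int
    batch_size: int
    concurrency: int
    num_gpu_experts: int
    dynamic_update: bool
    quality_notes: str
    latency_notes: str

    def to_json_line(self):
        # Sorted keys keep records easy to diff across runs.
        return json.dumps(asdict(self), sort_keys=True)


# Example values are invented for illustration.
record = PlacementExperiment(
    model="example-moe",
    checkpoint="example-checkpoint",
    prompt_len_min=512,
    prompt_len_max=32768,
    batch_size=1,
    concurrency=4,
    num_gpu_experts=8,
    dynamic_update=False,
    quality_notes="no regression on spot checks",
    latency_notes="baseline run",
)
line = record.to_json_line()
print(line)
```

Appending one JSON line per run to a log file makes it straightforward to compare runs with and without dynamic update.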
