KTransformers

Layerwise Prefill

Layerwise prefill is a long-context acceleration strategy for the prefill stage in KT + SGLang:

  • Shorter sequences: use hybrid CPU+GPU prefill
  • Longer sequences: switch to layerwise GPU prefill

Its goal is not to speed up every case; rather, it shifts long-prefill bottlenecks away from CPU expert compute toward a more scalable GPU compute path.

When to use it

Layerwise prefill is most useful when:

  • Your workload is dominated by long inputs (long context, long-document processing, batched prefill) and prefill throughput is a key KPI
  • You are on a native-precision style path (aligned CPU/GPU weight semantics)
  • Your GPU has enough headroom for additional prefill working memory

Supported methods:

  • RAWINT4: commonly used with Kimi-K2-Thinking
  • FP8: commonly used with MiniMax-M2/M2.1, Qwen3-235B-A22B
  • FP8_PERCHANNEL: commonly used with GLM-4.7
  • BF16: commonly used with GLM-4.7, Qwen3-235B-A22B

Principle and bottlenecks

Layerwise prefill is a token-length-based switch controlled by --kt-gpu-prefill-token-threshold:

  • Prefill token count ≤ threshold: hybrid path
  • Prefill token count > threshold: layerwise prefill path
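The switch above can be sketched as a small helper. The flag name is real; the function name and its internals are illustrative, not KTransformers source code:

```python
def select_prefill_mode(num_prefill_tokens: int, threshold: int) -> str:
    """Illustrative sketch of the --kt-gpu-prefill-token-threshold switch."""
    if num_prefill_tokens <= threshold:
        return "hybrid"      # shorter prefill: hybrid CPU+GPU path
    return "layerwise"       # longer prefill: layerwise GPU path
```

With a threshold of 2048, a 1024-token prompt stays on the hybrid path, while a 4096-token prompt activates layerwise prefill.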

The key difference is bottleneck location:

  • Non-layerwise (hybrid): bottleneck is usually CPU expert compute
  • Layerwise: bottleneck usually shifts to weight movement, especially CPU→GPU transfer over PCIe

Why long prefills benefit more:

  • Larger token batches increase GPU math utilization
  • CPU-side expert cost becomes less dominant
  • Throughput scaling is usually more stable on long contexts

Trade-off: higher VRAM usage and stronger sensitivity to PCIe bandwidth.

Layerwise execution path in detail

After switching to layerwise mode, execution can be viewed as a chunk-by-chunk, layer-by-layer pipeline:

  1. Mode switch by threshold
  • If the prefill token count exceeds --kt-gpu-prefill-token-threshold, layerwise prefill is activated.
  2. Chunking the prefill input
  • Prefill is split by --chunked-prefill-size.
  • chunk count ≈ ceil(total prefill tokens / chunked-prefill-size)
  3. For each chunk and each MoE layer, prepare the full layer's working weights
  • The layer's CPU-side working weights must be moved to the GPU (primarily over PCIe).
  • This is the key cost difference vs. hybrid prefill.
  4. Per-expert three-stage pipeline inside each layer
  • Stage A: weight format conversion (written into pinned CPU memory slots)
  • Stage B: PCIe transfer (from pinned CPU memory slots to GPU memory)
  • Stage C: Marlin repack postprocess
  • These stages are pipelined at expert granularity to reduce serialized waiting.
  5. DDIO + double-buffering optimization
  • Without this, format-conversion writes and PCIe reads both consume DRAM write/read bandwidth.
  • With a tightly adjacent per-expert pipeline, expert-sized chunks often stay hot in LLC/L3; with DDIO, writes tend to land in the LLC and subsequent PCIe reads can be served from the LLC.
  • Net effect: much lower DRAM bandwidth pressure, with the bottleneck concentrating more on PCIe and GPU compute.
  6. Run this layer's MoE compute on the GPU for the current chunk
  • Continue layer by layer until the chunk finishes, then move to the next chunk and repeat.
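The control flow above can be sketched as follows. This is a toy model of the loop structure only: the function names are hypothetical, and the three stages, which are pipelined at expert granularity in the real system, are shown sequentially for clarity:

```python
import math

def chunk_count(total_tokens: int, chunked_prefill_size: int) -> int:
    # chunk count ≈ ceil(total prefill tokens / chunked-prefill-size)
    return math.ceil(total_tokens / chunked_prefill_size)

def layerwise_prefill(chunks, layers, experts_per_layer):
    """Toy event trace of the chunk-by-chunk, layer-by-layer pipeline."""
    events = []
    for chunk in chunks:
        for layer in layers:
            for expert in range(experts_per_layer):
                events.append(("convert", layer, expert))    # Stage A: into pinned CPU slot
                events.append(("pcie_copy", layer, expert))  # Stage B: pinned slot -> GPU
                events.append(("repack", layer, expert))     # Stage C: Marlin repack
            events.append(("moe_compute", layer, chunk))     # run this layer's MoE on GPU
    return events
```

For example, a 40,000-token prefill with --chunked-prefill-size 16384 is processed as 3 chunks, each of which walks every MoE layer through the full weight-movement workflow.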

Representative optimization results

Using MiniMax-M2.1 (FP8) as an example (PCIe Gen5 platform):

  • 1 GPU (RTX 5090): prefill throughput can reach up to 1172 tokens/s
  • 2 GPUs (RTX 5090): prefill throughput can reach up to 2879 tokens/s
  • 4 GPUs (RTX 5090): prefill throughput can reach up to 4045 tokens/s

For long contexts, the layerwise prefill throughput ceiling is strongly tied to total PCIe bandwidth.

  • Higher PCIe generation (e.g., Gen5 vs Gen4) usually means a higher upper bound
  • More GPUs usually increase aggregate PCIe bandwidth, which raises the prefill throughput ceiling

Therefore, evaluate model/parameters together with PCIe generation and GPU count (aggregate bandwidth).

Extra VRAM overhead

Layerwise prefill increases VRAM usage mainly from:

  1. One full MoE layer expert-weight working set temporarily resident on GPU
  2. Prefill temporary buffers (roughly linear with --chunked-prefill-size)
  3. Additional workspace/intermediate buffers

A useful approximation:

  • extra VRAM ≈ one full MoE-layer weight working set + temporary buffers roughly linear in chunked-prefill-size

The actual value is model-dependent and can be several GB (often around 3.6 GB to 9 GB+ in practical reports).
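The approximation above can be written as a back-of-envelope estimator. Everything here is an assumption for illustration: the function name, the per-expert and per-token sizes are hypothetical placeholders you would fill in from your model's actual weight format:

```python
def extra_vram_bytes(experts_per_layer: int, bytes_per_expert: int,
                     chunked_prefill_size: int, bytes_per_token_buffer: int) -> int:
    """Rough estimate: one full MoE layer's expert working set
    plus temporary buffers roughly linear in chunk size."""
    layer_working_set = experts_per_layer * bytes_per_expert
    temp_buffers = chunked_prefill_size * bytes_per_token_buffer
    return layer_working_set + temp_buffers
```

This is only a lower-bound sketch; workspace and intermediate buffers (item 3 above) add on top of it.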

Key parameters and tuning

1) --kt-gpu-prefill-token-threshold

This parameter is best understood as the switching equilibrium point between two paths:

  • The ideal threshold is near the sequence length L where both paths take similar time, i.e., where T_hybrid(L) ≈ T_layerwise(L)

That equilibrium depends mainly on:

  • CPU kernel performance (hybrid path)
  • Aggregate PCIe bandwidth (generation × links × GPU count; layerwise path)

So there is no universal best threshold. Tune on real hardware/workload. In practice it is often in the few-thousand-token range.
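A toy cost model makes the equilibrium concrete. Assume (hypothetically) that hybrid prefill scales with CPU expert throughput, T_hybrid(L) = L / cpu_rate, while layerwise prefill pays a fixed weight-movement cost plus GPU compute, T_layerwise(L) = move_cost + L / gpu_rate. Setting the two equal gives a closed-form crossover point:

```python
def equilibrium_threshold(cpu_tokens_per_s: float,
                          gpu_tokens_per_s: float,
                          weight_move_seconds: float) -> float:
    """Toy model, not measured behavior:
      T_hybrid(L)    = L / cpu_tokens_per_s
      T_layerwise(L) = weight_move_seconds + L / gpu_tokens_per_s
    Solving T_hybrid(L) = T_layerwise(L):
      L* = weight_move_seconds / (1/cpu_rate - 1/gpu_rate)
    """
    inv_diff = 1.0 / cpu_tokens_per_s - 1.0 / gpu_tokens_per_s
    if inv_diff <= 0:
        return float("inf")  # hybrid never loses in this model
    return weight_move_seconds / inv_diff
```

With made-up numbers (CPU path at 500 tok/s, GPU path at 4000 tok/s, 7 s of per-prefill weight movement), the crossover lands at 4000 tokens, which is consistent with the few-thousand-token range seen in practice. Real tuning should still be done empirically on your hardware.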

2) --chunked-prefill-size

This is critical for layerwise prefill.

  • Prefill is processed in chunks
  • Each chunk triggers a full layer-level weight-movement workflow (the working set enters the GPU over PCIe)

So the trade-off is direct:

  • Too small chunk size: more chunks, more repeated full-layer weight movement cycles, larger PCIe overhead share, lower throughput
  • Too large chunk size: higher temporary VRAM demand, higher OOM risk

Practical guidance:

  • Increase --chunked-prefill-size as much as possible without OOM
  • Co-tune with --max-total-tokens and --kt-num-gpu-experts
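The trade-off can be quantified with a toy model (function name and cost parameters are hypothetical): each chunk repeats the full-layer weight movement, so the PCIe share of total time shrinks as chunks get larger:

```python
import math

def pcie_overhead_share(total_tokens: int, chunk_size: int,
                        weight_move_seconds_per_chunk: float,
                        compute_seconds_per_token: float) -> float:
    """Fraction of total time spent on repeated weight movement (toy model)."""
    n_chunks = math.ceil(total_tokens / chunk_size)
    move_time = n_chunks * weight_move_seconds_per_chunk
    compute_time = total_tokens * compute_seconds_per_token
    return move_time / (move_time + compute_time)
```

For a fixed prefill length, quadrupling the chunk size quarters the number of weight-movement cycles, so the PCIe overhead share drops; the cost is the larger per-chunk temporary VRAM footprint described above.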

Launch example

python -m sglang.launch_server \
  --model /path/to/model \
  --trust-remote-code \
  --kt-method FP8 \
  --kt-weight-path /path/to/model \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --chunked-prefill-size 16384 \
  --kt-gpu-prefill-token-threshold 2048

Troubleshooting

No obvious gains on short prompts

Short requests often remain in hybrid mode or do not reach the region where layerwise advantage is visible.

Throughput improved but OOM happens more often

Expected behavior. Layerwise prefill needs extra VRAM for full-layer working set and larger prefill temporary buffers.

You can tune these parameters:

  • Reduce --chunked-prefill-size to lower temporary prefill VRAM usage
  • Reduce --max-total-tokens to lower KV-cache VRAM usage
  • Reduce --kt-num-gpu-experts to lower expert-weight VRAM usage