# Layerwise Prefill
Layerwise prefill is a long-context acceleration strategy for the prefill stage in KT + SGLang:
- Shorter sequences: use hybrid CPU+GPU prefill
- Longer sequences: switch to layerwise GPU prefill
Its goal is not to speed up every case, but to shift long-prefill bottlenecks away from CPU expert compute toward a more scalable GPU compute path.
- When to use it
- Principle and bottlenecks
- Layerwise execution path in detail
- Representative optimization results
- Extra VRAM overhead
- Key parameters and tuning
- Launch example
- Troubleshooting
## When to use it
Layerwise prefill is most useful when:
- Your workload is dominated by long inputs (long context, long-document processing, batched prefill) and prefill throughput is a key KPI
- You are on a native-precision style path (aligned CPU/GPU weight semantics)
- Your GPU has enough headroom for additional prefill working memory
Supported methods:
- `RAWINT4`: commonly used with Kimi-K2-Thinking
- `FP8`: commonly used with MiniMax-M2/M2.1, Qwen3-235B-A22B
- `FP8_PERCHANNEL`: commonly used with GLM-4.7
- `BF16`: commonly used with GLM-4.7, Qwen3-235B-A22B
## Principle and bottlenecks
Layerwise prefill is a token-length-based switch controlled by `--kt-gpu-prefill-token-threshold`:
- Prefill token count ≤ threshold: hybrid path
- Prefill token count > threshold: layerwise prefill path
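The switching rule can be sketched in a few lines; the function name is illustrative, not the actual SGLang implementation:

```python
def choose_prefill_path(num_prefill_tokens: int, threshold: int) -> str:
    """Pick the prefill path the way --kt-gpu-prefill-token-threshold does:
    at or below the threshold, stay on the hybrid CPU+GPU path; above it,
    switch to layerwise GPU prefill."""
    return "hybrid" if num_prefill_tokens <= threshold else "layerwise"
```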
The key difference is bottleneck location:
- Non-layerwise (hybrid): bottleneck is usually CPU expert compute
- Layerwise: bottleneck usually shifts to weight movement, especially CPU→GPU transfer over PCIe
Why long prefills benefit more:
- Larger token batches increase GPU math utilization
- CPU-side expert cost becomes less dominant
- Throughput scaling is usually more stable on long contexts
Trade-off: higher VRAM usage and stronger sensitivity to PCIe bandwidth.
## Layerwise execution path in detail
After switching to layerwise mode, execution can be viewed as a chunk-by-chunk, layer-by-layer pipeline:
1. **Mode switch by threshold**
   - If the prefill token count exceeds `--kt-gpu-prefill-token-threshold`, layerwise prefill is activated.
2. **Chunking the prefill input**
   - Prefill is split by `--chunked-prefill-size`.
   - Chunk count is approximately `ceil(total prefill tokens / chunked-prefill-size)`.
3. **For each chunk and each MoE layer, prepare the full layer's working weights**
   - The layer's CPU-side working weights must be moved to the GPU (primarily over PCIe).
   - This is the key cost difference vs. hybrid prefill.
4. **Per-expert three-stage pipeline inside each layer**
   - Stage A: weight format conversion (written into pinned CPU memory slots)
   - Stage B: PCIe transfer (from pinned CPU memory slots to GPU memory)
   - Stage C: Marlin repack postprocess
   - These stages are pipelined at expert granularity to reduce serialized waiting.
5. **DDIO + double-buffering optimization**
   - Without this, format-conversion writes and PCIe reads both consume DRAM write/read bandwidth.
   - With a tightly adjacent per-expert pipeline, expert-sized chunks often stay hot in LLC/L3; with DDIO, writes tend to land in the LLC and subsequent PCIe reads can be served from the LLC.
   - Net effect: much lower DRAM bandwidth pressure, with the bottleneck concentrating on PCIe and GPU compute.
6. **Run this layer's MoE compute on GPU for the current chunk**
   - Continue layer by layer until the chunk finishes.
   - Then move to the next chunk and repeat.
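The chunk-by-chunk, layer-by-layer structure above can be sketched as follows; the function and trace format are illustrative, not the actual implementation:

```python
import math

def layerwise_prefill(total_tokens, chunked_prefill_size, num_moe_layers):
    """Outer structure of layerwise prefill: split the prefill into
    ceil(total / chunk_size) chunks, and for each chunk stream every MoE
    layer's expert weights to the GPU, run that layer's MoE compute, and
    continue. The next chunk repeats the whole weight-movement cycle."""
    num_chunks = math.ceil(total_tokens / chunked_prefill_size)
    trace = []
    for chunk in range(num_chunks):
        for layer in range(num_moe_layers):
            trace.append(("move_weights", chunk, layer))  # CPU -> GPU over PCIe
            trace.append(("moe_compute", chunk, layer))   # GPU MoE for this chunk
    return trace
```

Note how total weight traffic scales with the chunk count: halving `chunked_prefill_size` doubles the number of full weight-movement cycles, which is exactly the trade-off discussed under `--chunked-prefill-size` below.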
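The per-expert three-stage pipeline can be sketched with CPU threads standing in for the real stages; the stage functions here are placeholders for the weight-format conversion, the PCIe copy, and the Marlin repack kernel:

```python
from concurrent.futures import ThreadPoolExecutor

def convert(expert):          # Stage A: write into a pinned CPU memory slot
    return f"pinned[{expert}]"

def transfer(slot):           # Stage B: PCIe copy, pinned slot -> GPU buffer
    return slot.replace("pinned", "gpu")

def repack(gpu_buf):          # Stage C: Marlin repack postprocess
    return gpu_buf + ":repacked"

def pipelined_layer(experts):
    """Overlap the stages at expert granularity: while expert i's transfer
    is in flight on the pool, expert i+1's conversion already runs inline,
    so conversions and PCIe copies are not fully serialized."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        in_flight = [pool.submit(transfer, convert(e)) for e in experts]
        return [repack(f.result()) for f in in_flight]
```

The double-buffering mentioned above corresponds to reusing a small set of pinned slots so that stage-A writes for the next expert can proceed while the previous slot is still being read by the DMA engine.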
## Representative optimization results
Using MiniMax-M2.1 (FP8) as an example (PCIe Gen5 platform):
- 1 GPU (RTX 5090): prefill throughput can reach up to 1172 tokens/s
- 2 GPUs (RTX 5090): prefill throughput can reach up to 2879 tokens/s
- 4 GPUs (RTX 5090): prefill throughput can reach up to 4045 tokens/s
For long contexts, the layerwise prefill throughput ceiling is strongly tied to total PCIe bandwidth.
- Higher PCIe generation (e.g., Gen5 vs Gen4) usually means a higher upper bound
- More GPUs usually increase aggregate PCIe bandwidth, which raises the prefill throughput ceiling
Therefore, evaluate model/parameters together with PCIe generation and GPU count (aggregate bandwidth).
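A back-of-envelope ceiling follows from weight movement alone: every chunk re-streams the full MoE expert weights over PCIe, so the transfer time lower-bounds prefill time. All numbers below are illustrative, not measurements, and the bound ignores compute overlap and any experts already resident on GPU:

```python
import math

def prefill_throughput_ceiling(total_tokens, chunked_prefill_size,
                               moe_weight_bytes, pcie_bw_bytes_per_s):
    """Upper bound on prefill tokens/s implied purely by PCIe weight traffic:
    each chunk pays one full MoE-weight transfer, so
    time >= chunks * weights / bandwidth."""
    chunks = math.ceil(total_tokens / chunked_prefill_size)
    transfer_time_s = chunks * moe_weight_bytes / pcie_bw_bytes_per_s
    return total_tokens / transfer_time_s

# Hypothetical inputs: ~200 GB of FP8 expert weights, 64 GB/s effective
# aggregate PCIe bandwidth, a 64K-token prefill in 16K-token chunks.
ceiling = prefill_throughput_ceiling(65536, 16384, 200e9, 64e9)
```

Doubling the GPU count roughly doubles `pcie_bw_bytes_per_s` in this model, which matches the near-linear scaling of the reported numbers.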
## Extra VRAM overhead
Layerwise prefill increases VRAM usage mainly from:
- One full MoE layer expert-weight working set temporarily resident on GPU
- Prefill temporary buffers (roughly linear with `--chunked-prefill-size`)
- Additional workspace/intermediate buffers

A useful approximation:

`extra VRAM ≈ one full MoE-layer weight working set + temporary buffers roughly linear in chunked-prefill-size`
The actual value is model-dependent and can be several GB (often around 3.6 GB to 9 GB+ in practical reports).
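The approximation can be made concrete with a small helper; every input below is a made-up placeholder, so substitute sizes for your own model:

```python
def extra_vram_bytes(layer_weight_bytes, chunked_prefill_size,
                     buffer_bytes_per_token, workspace_bytes):
    """extra VRAM ~= one full MoE-layer weight working set
                   + temporary buffers linear in chunked-prefill-size
                   + fixed workspace/intermediate buffers."""
    return (layer_weight_bytes
            + chunked_prefill_size * buffer_bytes_per_token
            + workspace_bytes)

# Hypothetical example: a 3 GB layer working set, 16384-token chunks at
# 64 KiB/token of temporaries, and 0.5 GB workspace -> roughly 4.6 GB,
# inside the several-GB range described above.
estimate = extra_vram_bytes(3e9, 16384, 64 * 1024, 0.5e9)
```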
## Key parameters and tuning
### 1) `--kt-gpu-prefill-token-threshold`
This parameter is best understood as the switching equilibrium point between two paths:
- The ideal threshold is near the sequence length `L` where both paths take similar time: the best region is where `T_hybrid(L) ≈ T_layerwise(L)`
That equilibrium depends mainly on:
- CPU kernel performance (hybrid path)
- Aggregate PCIe bandwidth (generation × links × GPU count; layerwise path)
So there is no universal best threshold. Tune on real hardware/workload. In practice it is often in the few-thousand-token range.
### 2) `--chunked-prefill-size`
This is critical for layerwise prefill.
- Prefill is processed in chunks
- Each chunk triggers a full layer-level weight movement workflow (entering GPU working set through PCIe)
So the trade-off is direct:
- Too small chunk size: more chunks, more repeated full-layer weight movement cycles, larger PCIe overhead share, lower throughput
- Too large chunk size: higher temporary VRAM demand, higher OOM risk
Practical guidance:
- Increase `--chunked-prefill-size` as much as possible without OOM
- Co-tune with `--max-total-tokens` and `--kt-num-gpu-experts`
## Launch example
```shell
python -m sglang.launch_server \
  --model /path/to/model \
  --trust-remote-code \
  --kt-method FP8 \
  --kt-weight-path /path/to/model \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --chunked-prefill-size 16384 \
  --kt-gpu-prefill-token-threshold 2048
```
## Troubleshooting
### No obvious gains on short prompts
Short requests often remain in hybrid mode or do not reach the region where layerwise advantage is visible.
### Throughput improved but OOM happens more often

This is expected behavior: layerwise prefill needs extra VRAM for the full-layer working set and larger prefill temporary buffers.
You can tune these parameters:
- Reduce `--chunked-prefill-size` to lower temporary prefill VRAM usage
- Reduce `--max-total-tokens` to lower KV-cache VRAM usage
- Reduce `--kt-num-gpu-experts` to lower expert-weight VRAM usage