Long Context Deployment
This tutorial shows how to deploy DeepSeek V4 Flash on 4×RTX 5090 with a 1,048,576 (1M) token context window. The recipe relies on KTransformers' MXFP4 hybrid quantization and CPU/GPU heterogeneous inference: 10 routed experts stay on GPU, the rest live on CPU.
- Prerequisites
- Step 1: Set environment variables
- Step 2: Launch the sglang server
- Step 3: Wait for READY and verify
- Step 4: Issue long-context requests
- Parameter walkthrough (why these values)
- Tuning: context length vs. KV cache capacity
- Known limitations
Prerequisites
Hardware
| Component | Recommended |
|---|---|
| GPU | 4 × NVIDIA RTX 5090 32 GB (Blackwell, SM_120) |
| CPU | Dual-socket 64-core (AVX-512) |
| Memory | 256 GB DDR5 |
| Storage | NVMe SSD (for fast weight loading) |
| Interconnect | PCIe 5.0 |
Software
| Component | Version | Notes |
|---|---|---|
| CUDA | 12.8 | Minimum for RTX 5090 |
| Python | 3.10 | |
| ktransformers | 0.6.2.post3+ | Includes MXFP4 support |
| sglang | kt-sglang submodule | Includes V4 Flash adaptation |
| transformers | 4.57.1 | 5.x has compatibility issues |
Model weights: DeepSeek V4 Flash (MXFP4 + FP8 mixed format), placed on local NVMe at e.g. /path/to/DeepSeek-V4-Flash.
Step 1: Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export FLASHINFER_CUDA_ARCH_LIST=12.0a
export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export SGLANG_DSV4_MODE=2604
export SGLANG_DSV4_2604_SUBMODE=2604B
Why these matter:
- FLASHINFER_CUDA_ARCH_LIST=12.0a / TORCH_CUDA_ARCH_LIST="12.0+PTX": required on SM_120 so that flashinfer JIT does not falsely report "sm75 or higher" when compiling fp4 kernels.
- SGLANG_DSV4_MODE=2604 + SGLANG_DSV4_2604_SUBMODE=2604B: required for V4 Flash; omitting them triggers a swiglu_limit assertion in mxfp4_deepseek.py.
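If the launch is wrapped in a script, it can fail fast when one of these variables is missing or mistyped. The following is a minimal sketch: `env_problems` and `REQUIRED_ENV` are hypothetical names, not part of sglang or KTransformers, and the expected values are simply the ones listed in this step.

```python
import os

# Values copied from Step 1 of this tutorial; adjust them if your setup differs.
REQUIRED_ENV = {
    "FLASHINFER_CUDA_ARCH_LIST": "12.0a",
    "TORCH_CUDA_ARCH_LIST": "12.0+PTX",
    "SGLANG_DSV4_MODE": "2604",
    "SGLANG_DSV4_2604_SUBMODE": "2604B",
}


def env_problems(env=None):
    """Return human-readable problems; an empty list means the env looks right."""
    env = os.environ if env is None else env
    problems = []
    for name, expected in REQUIRED_ENV.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected!r})")
        elif actual != expected:
            problems.append(f"{name}={actual!r}, expected {expected!r}")
    return problems
```

Calling `env_problems()` with no argument inspects the current process environment, so it should run in the same shell session that will start the server.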
Step 2: Launch the sglang server
numactl --interleave=all python -m sglang.launch_server \
--host 0.0.0.0 --port 40001 \
--model /path/to/DeepSeek-V4-Flash \
--kt-weight-path /path/to/DeepSeek-V4-Flash \
--kt-method MXFP4 \
--kt-num-gpu-experts 10 \
--kt-cpuinfer 60 \
--kt-threadpool-count 2 \
--kt-gpu-prefill-token-threshold 2048 \
--kt-enable-dynamic-expert-update \
--tensor-parallel-size 4 \
--context-length 1048576 \
--attention-backend flashinfer \
--mem-fraction-static 0.7 \
--chunked-prefill-size 4096 \
--max-prefill-tokens 4096 \
--max-running-requests 2 \
--watchdog-timeout 1200 \
--disable-shared-experts-fusion \
--trust-remote-code \
--cuda-graph-bs 1 \
--cuda-graph-max-bs 1 \
--disable-radix-cache \
--skip-server-warmup
See the parameter walkthrough for the rationale behind each value.
Step 3: Wait for READY and verify
Startup performs weight loading + MXFP4 swizzling + CUDA Graph capture, totaling about 80 seconds. Once the server is ready, you'll see:
The server is fired up and ready to roll!
In the startup log, check these key indicators to confirm resources are allocated as expected:
max_total_num_tokens=964864 # KV cache capacity (must be >= prompt + output tokens of your longest request)
chunked_prefill_size=4096 # Prefill chunk size
available_gpu_mem=5.34 GB # GPU memory left after CUDA Graph + KV pool
Capture cuda graph end. Time elapsed: 5.45 s # CUDA Graph captured successfully
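The capacity check can be automated by reading max_total_num_tokens out of the startup log. This is a sketch that assumes the log contains the field exactly as shown above (`max_total_num_tokens=<integer>`); the helper names are hypothetical.

```python
import re

_MAX_TOKENS_RE = re.compile(r"max_total_num_tokens=(\d+)")


def kv_capacity_from_log(log_text):
    """Return the KV pool capacity in tokens, or None if the field is absent."""
    match = _MAX_TOKENS_RE.search(log_text)
    return int(match.group(1)) if match else None


def fits_in_kv_pool(log_text, prompt_tokens, max_new_tokens):
    """True if prompt plus generated tokens fit in the reported KV pool."""
    capacity = kv_capacity_from_log(log_text)
    if capacity is None:
        raise ValueError("max_total_num_tokens not found in log")
    return prompt_tokens + max_new_tokens <= capacity
```

With the value shown above (964864), a 600K-token prompt plus 1024 output tokens fits, while a request that uses the full 1,048,576-token window does not.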
Health check:
curl -s http://localhost:40001/health
# HTTP 200 OK
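For scripted deployments, the same health check can be polled until the server answers. This is a minimal sketch using only the Python standard library; the 600-second deadline and 5-second interval are arbitrary assumptions, chosen to be generous relative to the roughly 80-second startup.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if nothing is listening."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but not with a 2xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def wait_until_ready(url, deadline_s=600.0, interval_s=5.0,
                     get_status=http_status, sleep=time.sleep):
    """Poll url until it returns 200; True on success, False on deadline."""
    waited = 0.0
    while waited <= deadline_s:
        if get_status(url) == 200:
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

A typical call would be `wait_until_ready("http://localhost:40001/health")`. The `get_status` and `sleep` parameters exist only so the loop can be exercised without a running server.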
Step 4: Issue long-context requests
The server is compatible with the OpenAI Chat Completions API.
cURL — short request:
curl http://localhost:40001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Hello, please introduce yourself"}],
"max_tokens": 256,
"temperature": 0.7
}'
Python — long-form request:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:40001/v1", api_key="none")
with open("very_long_document.txt", encoding="utf-8") as f:
    long_text = f.read()  # e.g. 600K tokens of long-form text
resp = client.chat.completions.create(
    model="DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": long_text + "\n\nPlease summarize the above."}],
    max_tokens=1024,
    temperature=0.7,
    timeout=3600,  # must be generous; see below
)
print(resp.choices[0].message.content)
Long-request caveats
Prefill time grows with context length, and somewhat faster than linearly, because prefill throughput drops as the context fills (see Known limitations). A 600K-token request spends roughly 20–25 minutes in prefill, so:
- The HTTP client timeout must be ≥ 1800 s; timeout=3600 is recommended.
- When invoking remotely over SSH, run the client script under setsid nohup: an SSH disconnect would otherwise SIGHUP the client, and once the client dies the server detects the dropped peer and aborts the in-flight request.
- The client machine must stay awake and online: sleep or a network drop has the same effect as an SSH disconnect.
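To pick a timeout for other prompt lengths, prefill time can be estimated from the two throughput figures quoted under Known limitations (about 590 tok/s at the start, about 320 tok/s at 600K tokens). The sketch below assumes throughput falls linearly between those two points, which is a modeling assumption of this example rather than a measured curve; extrapolating past 600K tokens is unverified.

```python
import math

# Figures quoted in this tutorial; the linear decay between them is an
# assumption made for this sketch only.
PEAK_TOK_S = 590.0
TOK_S_AT_REF = 320.0
REF_TOKENS = 600_000


def estimate_prefill_seconds(prompt_tokens):
    """Integrate dt = dn / throughput(n), with throughput linear in n."""
    slope = (PEAK_TOK_S - TOK_S_AT_REF) / REF_TOKENS  # tok/s lost per token
    end_rate = PEAK_TOK_S - slope * prompt_tokens
    if end_rate <= 0:
        raise ValueError("prompt too long for the linear model")
    return math.log(PEAK_TOK_S / end_rate) / slope


def suggested_timeout(prompt_tokens, safety_factor=1.5, floor_s=1800):
    """Estimated prefill time with headroom, never below floor_s seconds."""
    estimate = estimate_prefill_seconds(prompt_tokens) * safety_factor
    return max(floor_s, math.ceil(estimate))
```

Under this model a 600K-token prompt comes out at roughly 22 to 23 minutes of prefill, in line with the range above. The estimate ignores decode time, so the result is a lower bound on the total request duration.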
Parameter walkthrough (why these values)
Context / KV cache
| Parameter | Value | Purpose |
|---|---|---|
| --context-length | 1048576 | 1M token context window (hard upper bound) |
| --mem-fraction-static | 0.7 | Caps weights + CUDA Graph + KV pool. 0.7 leaves ~5 GB workspace headroom |
| --disable-radix-cache | - | Required for long context: prefix-cache memory management misbehaves at this scale |
| --chunked-prefill-size | 4096 | Prefill in 4096-token chunks; avoids one-shot workspace blowups |
| --max-prefill-tokens | 4096 | Caps tokens processed per forward; pairs with chunked prefill |
MoE routing (CPU/GPU split)
| Parameter | Value | Purpose |
|---|---|---|
| --kt-method | MXFP4 | Routed experts use MXFP4 (I8 + ue8m0) |
| --kt-num-gpu-experts | 10 | 10 stay on GPU; remaining routed experts run on CPU |
| --kt-cpuinfer | 60 | CPU inference threads (×2 threadpool = 120 threads) |
| --kt-gpu-prefill-token-threshold | 2048 | When per-forward token count exceeds this, GPU MoE prefill fires |
| --kt-enable-dynamic-expert-update | - | Promote/demote GPU-resident experts based on runtime usage |
--kt-gpu-prefill-token-threshold compares against the per-forward token count after chunking, not the prompt's total length. With a chunk size of 4096, setting the threshold to 2048 ensures that every full chunk takes the GPU MoE path; only a final partial chunk of 2048 tokens or fewer can miss it.
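The interaction between chunk size and threshold can be illustrated with a few lines of arithmetic. This sketch assumes the prompt is split into consecutive chunks of exactly --chunked-prefill-size tokens plus one remainder, and that "exceeds" means strictly greater than; the real scheduler may group tokens differently, and the function names are hypothetical.

```python
def chunk_sizes(prompt_tokens, chunk_size=4096):
    """Split a prompt into per-forward token counts under chunked prefill."""
    full, rest = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def gpu_moe_chunks(prompt_tokens, chunk_size=4096, threshold=2048):
    """Pair each chunk size with whether it exceeds the GPU prefill threshold."""
    return [(n, n > threshold) for n in chunk_sizes(prompt_tokens, chunk_size)]
```

For a 10,000-token prompt this gives two 4096-token chunks above the threshold and a 1808-token remainder below it. Raising the threshold to 4096 or more would leave no chunk above it under the same assumptions.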
Parallelism and concurrency
| Parameter | Value | Purpose |
|---|---|---|
| --tensor-parallel-size | 4 | 4-way tensor parallelism across the 4 GPUs |
| --max-running-requests | 2 | At most 2 concurrent in-flight requests |
| --cuda-graph-bs | 1 | Capture CUDA Graph for batch=1 only |
| --cuda-graph-max-bs | 1 | Same purpose: no CUDA Graph is captured for batch sizes above 1 |
Stability
| Parameter | Value | Purpose |
|---|---|---|
| --watchdog-timeout | 1200 | Long prefill survival; the default watchdog is too short |
| --disable-shared-experts-fusion | - | V4 Flash currently requires this fusion to be off |
| --skip-server-warmup | - | Skip sglang's own warmup (V4 path has compatibility issues with it) |
| numactl --interleave=all | - | Prevent CPU inference threads from thrashing cache across NUMA nodes |
Tuning: context length vs. KV cache capacity
--mem-fraction-static controls the KV pool size (and thus how many tokens it can hold). Numbers below are for 4×RTX 5090:
| Target context | mem-fraction-static | KV pool capacity | Workspace left | Recommendation |
|---|---|---|---|---|
| ≤600K tokens | 0.6 | ~759k tokens | ~8.3 GB | Profiling / development |
| 600K – 900K | 0.7 | ~965k tokens | ~5.3 GB | Recommended for production |
| Approaching 1M | 0.75+ | ~1.1M tokens | <4 GB | Tight memory, edge cases only |
Tuning tips:
- On a single GPU / TP=1, values above 0.7 risk OOM: workspace shrinks so much that a 2048-chunk prefill's intermediate 512 MiB buffer alone can exhaust GPU memory.
- If you don't need the full 1M context, lower --context-length before lowering --mem-fraction-static: the former immediately frees KV, the latter shifts memory back into workspace.
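The tuning table can be encoded as a small lookup so that a deployment script picks the setting from the target context. The numbers below are copied from the table for 4×RTX 5090; treating "Approaching 1M" as an upper bound of 1,048,576 tokens and "0.75+" as 0.75 are assumptions of this sketch.

```python
# (max target context, mem-fraction-static, approx. KV pool capacity in tokens)
TUNING_BANDS = [
    (600_000, 0.6, 759_000),
    (900_000, 0.7, 965_000),
    (1_048_576, 0.75, 1_100_000),
]


def pick_mem_fraction(target_context_tokens):
    """Return (mem_fraction_static, kv_pool_tokens) for the first matching band."""
    for max_context, fraction, kv_pool in TUNING_BANDS:
        if target_context_tokens <= max_context:
            return fraction, kv_pool
    raise ValueError("target context exceeds the tabulated range")
```

The bands follow the table's recommendations rather than the raw pool capacity, so a 700K-token target maps to 0.7 even though the 0.6 pool (~759k tokens) would nominally hold it.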
Known limitations
- Prefill throughput decays with context length: from a peak of ~590 tok/s down to ~320 tok/s at 600K tokens (~47% decay) — an inherent property of attention.
- Client-side timeout for long requests: 600K prefill takes ~23 minutes; HTTP timeout must be ≥1800 s.
- CUDA Graph only at batch=1: when two requests decode in parallel, the runtime falls back to eager mode and decode throughput drops.
- A client disconnect aborts the request: remote invocations must use setsid nohup, and the client machine must not sleep.
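When the client is itself started from a Python script, a similar detachment can be approximated with the standard library. This sketch starts the client in its own session, which is what setsid does, and redirects its output to a log file; it is a rough analogue of setsid nohup rather than an exact equivalent, it works on POSIX systems only, and `launch_detached` is a hypothetical helper name.

```python
import subprocess


def launch_detached(argv, log_path):
    """Start argv in its own session, with output appended to log_path."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,   # no terminal input to lose
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,     # child calls setsid(); POSIX only
        )
    finally:
        log.close()  # the child keeps its own copy of the descriptor
```

For example, `launch_detached(["python", "client.py"], "client.log")` would start a hypothetical client.py whose output lands in client.log. This does not help if the client machine itself sleeps or loses its network connection.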