Long Context Deployment

This tutorial shows how to deploy DeepSeek V4 Flash on 4×RTX 5090 with a 1,048,576 (1M) token context window. The recipe relies on KTransformers' MXFP4 hybrid quantization and CPU/GPU heterogeneous inference: 10 routed experts stay on GPU, the rest live on CPU.

Prerequisites
Step 1: Set environment variables
Step 2: Launch the sglang server
Step 3: Wait for READY and verify
Step 4: Issue long-context requests
Parameter walkthrough (why these values)
Tuning: context length vs. KV cache capacity
Known limitations

Prerequisites

Hardware

Component	Recommended
GPU	4 × NVIDIA RTX 5090 32 GB (Blackwell, SM_120)
CPU	Dual-socket 64-core (AVX-512)
Memory	256 GB DDR5
Storage	NVMe SSD (for fast weight loading)
Interconnect	PCIe 5.0

Software

Component	Version	Notes
CUDA	12.8	Minimum for RTX 5090
Python	3.10
ktransformers	0.6.2.post3+	Includes MXFP4 support
sglang	kt-sglang submodule	Includes V4 Flash adaptation
transformers	4.57.1	5.x has compatibility issues

Model weights: DeepSeek V4 Flash (MXFP4 + FP8 mixed format), placed on local NVMe at e.g. /path/to/DeepSeek-V4-Flash.

Step 1: Set environment variables

export CUDA_VISIBLE_DEVICES=0,1,2,3
export FLASHINFER_CUDA_ARCH_LIST=12.0a
export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export SGLANG_DSV4_MODE=2604
export SGLANG_DSV4_2604_SUBMODE=2604B

Why these matter:

FLASHINFER_CUDA_ARCH_LIST=12.0a / TORCH_CUDA_ARCH_LIST="12.0+PTX": required on SM_120 so flashinfer JIT does not falsely report "sm75 or higher" when compiling fp4 kernels.
SGLANG_DSV4_MODE=2604 + SGLANG_DSV4_2604_SUBMODE=2604B: required for V4 Flash; missing them triggers a swiglu_limit assertion in mxfp4_deepseek.py.
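
Because a missing variable only surfaces after the server has started loading, a small preflight check can save a restart. The sketch below is illustrative: the helper name `missing_or_wrong` is hypothetical, and the expected values are simply the ones exported in this step.

```python
import os

# Values copied from the export lines in Step 1.
EXPECTED_ENV = {
    "CUDA_VISIBLE_DEVICES": "0,1,2,3",
    "FLASHINFER_CUDA_ARCH_LIST": "12.0a",
    "TORCH_CUDA_ARCH_LIST": "12.0+PTX",
    "SGLANG_DSV4_MODE": "2604",
    "SGLANG_DSV4_2604_SUBMODE": "2604B",
}


def missing_or_wrong(env=None):
    """Return {name: (expected, actual)} for every variable that differs.

    `actual` is None when the variable is not set at all.
    """
    env = os.environ if env is None else env
    return {
        name: (want, env.get(name))
        for name, want in EXPECTED_ENV.items()
        if env.get(name) != want
    }
```

Calling `missing_or_wrong()` with no argument inspects the current process environment; an empty dict means every variable matches the recipe.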

Step 2: Launch the sglang server

numactl --interleave=all python -m sglang.launch_server \
  --host 0.0.0.0 --port 40001 \
  --model /path/to/DeepSeek-V4-Flash \
  --kt-weight-path /path/to/DeepSeek-V4-Flash \
  --kt-method MXFP4 \
  --kt-num-gpu-experts 10 \
  --kt-cpuinfer 60 \
  --kt-threadpool-count 2 \
  --kt-gpu-prefill-token-threshold 2048 \
  --kt-enable-dynamic-expert-update \
  --tensor-parallel-size 4 \
  --context-length 1048576 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.7 \
  --chunked-prefill-size 4096 \
  --max-prefill-tokens 4096 \
  --max-running-requests 2 \
  --watchdog-timeout 1200 \
  --disable-shared-experts-fusion \
  --trust-remote-code \
  --cuda-graph-bs 1 \
  --cuda-graph-max-bs 1 \
  --disable-radix-cache \
  --skip-server-warmup

See the parameter walkthrough for the rationale behind each value.

Step 3: Wait for READY and verify

Startup performs weight loading + MXFP4 swizzling + CUDA Graph capture, totaling about 80 seconds. Once the server is ready, you'll see:

The server is fired up and ready to roll!

In the startup log, check these key indicators to confirm resources are allocated as expected:

max_total_num_tokens=964864       # KV cache capacity in tokens (must be >= your longest request; at 0.7 it is below the full 1,048,576 window)
chunked_prefill_size=4096         # Prefill chunk size
available_gpu_mem=5.34 GB         # GPU memory left after CUDA Graph + KV pool
Capture cuda graph end. Time elapsed: 5.45 s   # CUDA Graph captured successfully
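
The KV capacity check can be automated against a captured startup log. This is a minimal sketch that assumes the log contains the `max_total_num_tokens=<int>` form shown above; both function names are hypothetical.

```python
import re


def kv_capacity_from_log(log_text):
    """Return max_total_num_tokens from the startup log, or None if absent."""
    match = re.search(r"max_total_num_tokens=(\d+)", log_text)
    return int(match.group(1)) if match else None


def fits_in_kv_pool(log_text, prompt_tokens, max_new_tokens):
    """True if the prompt plus the generated tokens fit in the reported KV pool."""
    capacity = kv_capacity_from_log(log_text)
    if capacity is None:
        raise ValueError("max_total_num_tokens not found in log")
    return prompt_tokens + max_new_tokens <= capacity
```

With the log excerpt above, a 600K-token prompt with `max_tokens=1024` fits, while a prompt that fills the whole 1,048,576-token window does not.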

Health check:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:40001/health
# 200
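
When the launch is scripted, the same health check can be polled until the server answers. The following is a sketch using only the standard library; the function names, the 300-second deadline, and the 5-second interval are assumptions, not values from the recipe.

```python
import time
import urllib.request


def health_ok(url="http://localhost:40001/health", timeout_s=5.0):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and non-2xx responses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False


def wait_until_ready(probe=health_ok, deadline_s=300.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it returns True or `deadline_s` seconds have passed."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running server.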

Step 4: Issue long-context requests

The server is compatible with the OpenAI Chat Completions API.

cURL — short request:

curl http://localhost:40001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Hello, please introduce yourself"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Python — long-form request:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:40001/v1", api_key="none")

with open("very_long_document.txt") as f:
    long_text = f.read()  # e.g. 600K tokens of long-form text

resp = client.chat.completions.create(
    model="DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": long_text + "\n\nPlease summarize the above."}],
    max_tokens=1024,
    temperature=0.7,
    timeout=3600,                     # ← must be generous; see below
)
print(resp.choices[0].message.content)

Long-request caveats

Prefill time grows faster than linearly with context length, because prefill throughput drops as the context fills (see Known limitations). A 600K token request spends roughly 20–25 minutes in prefill, so:

Set the HTTP client timeout to at least 1800 s; timeout=3600 is recommended.
When invoking remotely over SSH, run the client script under setsid nohup: an SSH disconnect would otherwise SIGHUP the client, and once the client dies the server detects the dropped peer and aborts the in-flight request.
The client machine must stay awake and online — sleep or network drop has the same effect as SSH disconnect.
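
To size the timeout for other prompt lengths, the throughput figures quoted under Known limitations (about 590 tok/s at the start, about 320 tok/s at 600K tokens) can be turned into a rough estimate. The sketch below assumes throughput falls linearly between those two points, which is a modelling assumption rather than a measured curve; the function names and the 1.5× safety factor are hypothetical.

```python
import math


def estimate_prefill_seconds(prompt_tokens, peak_tps=590.0,
                             tps_at_ref=320.0, ref_tokens=600_000):
    """Integrate dt = dn / rate(n), with rate(n) falling linearly from
    peak_tps at n=0 to tps_at_ref at n=ref_tokens (assumed model)."""
    slope = (peak_tps - tps_at_ref) / ref_tokens  # tok/s lost per context token
    if slope == 0:
        return prompt_tokens / peak_tps
    end_rate = peak_tps - slope * prompt_tokens
    if end_rate <= 0:
        raise ValueError("prompt too long for the linear-decay model")
    # Closed form of the integral: (1 / slope) * ln(peak / end_rate)
    return math.log(peak_tps / end_rate) / slope


def suggested_timeout(prompt_tokens, safety_factor=1.5, floor_s=1800):
    """Padded estimate, never below the 1800 s minimum stated above."""
    padded = safety_factor * estimate_prefill_seconds(prompt_tokens)
    return max(floor_s, math.ceil(padded))
```

For a 600K-token prompt this model gives about 1,360 s of prefill, in line with the 20–25 minute range. Estimates beyond 600K extrapolate the assumed line and should be treated with more caution.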

Parameter walkthrough (why these values)

Context / KV cache

Parameter	Value	Purpose
`--context-length`	1048576	1M token context window (hard upper bound)
`--mem-fraction-static`	0.7	Caps weights + CUDA Graph + KV pool. 0.7 leaves ~5 GB workspace headroom
`--disable-radix-cache`	-	Required for long context — prefix-cache memory management misbehaves at this scale
`--chunked-prefill-size`	4096	Prefill in 4096-token chunks; avoids one-shot workspace blowups
`--max-prefill-tokens`	4096	Caps tokens processed per forward; pairs with chunked prefill

MoE routing (CPU/GPU split)

Parameter	Value	Purpose
`--kt-method`	MXFP4	Routed experts use MXFP4 (I8 + ue8m0)
`--kt-num-gpu-experts`	10	10 stay on GPU; remaining routed experts run on CPU
`--kt-cpuinfer`	60	CPU inference threads (×2 threadpool = 120 threads)
`--kt-gpu-prefill-token-threshold`	2048	When per-forward token count exceeds this, GPU MoE prefill fires
`--kt-enable-dynamic-expert-update`	-	Promote/demote GPU-resident experts based on runtime usage

--kt-gpu-prefill-token-threshold compares against the per-forward token count after chunking, not the prompt's total length. With a chunk size of 4096 and a threshold of 2048, every full chunk takes the GPU MoE path; only a final partial chunk of 2048 tokens or fewer falls below the threshold.
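
The interaction between chunk size and threshold can be checked with a few lines of arithmetic. This is an illustrative sketch, not the scheduler's actual code: the function names are hypothetical, and "exceeds" is read as strictly greater than the threshold.

```python
def split_into_chunks(prompt_tokens, chunk_size=4096):
    """Token count of each prefill forward pass for one prompt."""
    full, rest = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def takes_gpu_moe_path(chunk_tokens, threshold=2048):
    """Apply the stated rule: GPU MoE prefill fires above the threshold."""
    return chunk_tokens > threshold


chunks = split_into_chunks(10_000)
print(chunks)                                    # [4096, 4096, 1808]
print([takes_gpu_moe_path(c) for c in chunks])   # [True, True, False]
```

Under this reading, the comparison is made per chunk, so a 10,000-token prompt is never compared as a whole, and only its short trailing chunk stays at or below the threshold.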

Parallelism and concurrency

Parameter	Value	Purpose
`--tensor-parallel-size`	4	4-way tensor parallelism across the 4 GPUs
`--max-running-requests`	2	At most 2 concurrent in-flight requests
`--cuda-graph-bs`	1	Capture CUDA Graph for batch=1 only
`--cuda-graph-max-bs`	1	Caps the largest captured batch size at 1

Stability

Parameter	Value	Purpose
`--watchdog-timeout`	1200	Long prefill survival — the default watchdog is too short
`--disable-shared-experts-fusion`	-	V4 Flash currently requires this fusion to be off
`--skip-server-warmup`	-	Skip sglang's own warmup (V4 path has compatibility issues with it)
`numactl --interleave=all`	-	Interleaves the process's memory allocations across all NUMA nodes, so CPU inference threads on either socket do not all pull from one node's memory

Tuning: context length vs. KV cache capacity

--mem-fraction-static controls the KV pool size (and thus how many tokens it can hold). Numbers below are for 4×RTX 5090:

Target context	mem-fraction-static	KV pool capacity	Workspace left	Recommendation
≤600K tokens	0.6	~759k tokens	~8.3 GB	Profiling / development
600K – 900K	0.7	~965k tokens	~5.3 GB	Recommended for production
Approaching 1M	0.75+	~1.1M tokens	<4 GB	Tight memory, edge cases only
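
The table can be encoded as a small lookup for launch scripts. This sketch assumes "600K" and "900K" mean 600,000 and 900,000 tokens, uses 0.75 for the "0.75+" row, and copies the approximate KV pool figures from the table; the names `TUNING_ROWS` and `recommend_mem_fraction` are hypothetical.

```python
# (upper bound of target context, mem-fraction-static, approx. KV pool tokens)
# Figures are the 4×RTX 5090 numbers from the table above.
TUNING_ROWS = [
    (600_000, 0.6, 759_000),
    (900_000, 0.7, 965_000),
    (1_048_576, 0.75, 1_100_000),
]


def recommend_mem_fraction(target_context_tokens):
    """Return (mem_fraction_static, approx_kv_tokens) for a target context."""
    for upper, fraction, kv_tokens in TUNING_ROWS:
        if target_context_tokens <= upper:
            return fraction, kv_tokens
    raise ValueError("target exceeds the 1,048,576-token context window")
```

For example, `recommend_mem_fraction(800_000)` selects the 0.7 row. The figures apply only to the hardware in this tutorial and should be re-measured on any other setup.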

Tuning tips:

On a single GPU / TP=1, values above 0.7 risk OOM: workspace shrinks so much that a 2048-chunk prefill's intermediate 512 MiB buffer alone can exhaust GPU memory.
If you don't need the full 1M context, lower --context-length before lowering --mem-fraction-static: the former immediately frees KV, the latter shifts memory back into workspace.

Known limitations

Prefill throughput decays with context length: from a peak of ~590 tok/s down to ~320 tok/s at 600K tokens (~47% decay) — an inherent property of attention.
Client-side timeout for long requests: 600K prefill takes ~23 minutes; HTTP timeout must be ≥1800 s.
CUDA Graph only at batch=1: when two requests decode in parallel, the runtime falls back to eager mode and decode throughput drops.
A client disconnect aborts the request: remote invocations must use setsid nohup, and the client machine must not sleep.
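
If the client is started from a Python wrapper rather than a shell, a similar detachment can be sketched with the standard library: `subprocess.Popen(..., start_new_session=True)` makes the child call `setsid()`, so it has no controlling terminal to receive a hangup from. This is a POSIX-only sketch; the helper name `launch_detached` and the script name in the usage comment are hypothetical.

```python
import subprocess
import sys


def launch_detached(argv, log_path):
    """Start argv in its own session, with output appended to log_path."""
    with open(log_path, "ab") as log:
        # The child inherits the log file descriptor, so closing our copy
        # when the with-block exits does not affect its output.
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX only: the child calls setsid()
        )


# Example (hypothetical script name):
# launch_detached([sys.executable, "summarize_client.py"], "client.log")
```

The wrapper can exit right after the call returns. The client keeps running and writes its result to the log file, provided the machine it runs on stays awake and online.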