KTransformers

Long Context Deployment

This tutorial shows how to deploy DeepSeek V4 Flash on 4×RTX 5090 with a 1,048,576 (1M) token context window. The recipe relies on KTransformers' MXFP4 hybrid quantization and CPU/GPU heterogeneous inference: 10 routed experts stay on GPU, the rest live on CPU.

Prerequisites

Hardware

| Component | Recommended |
| --- | --- |
| GPU | 4 × NVIDIA RTX 5090 32 GB (Blackwell, SM_120) |
| CPU | Dual-socket 64-core (AVX-512) |
| Memory | 256 GB DDR5 |
| Storage | NVMe SSD (for fast weight loading) |
| Interconnect | PCIe 5.0 |

Software

| Component | Version | Notes |
| --- | --- | --- |
| CUDA | 12.8 | Minimum for RTX 5090 |
| Python | 3.10 | |
| ktransformers | 0.6.2.post3+ | Includes MXFP4 support |
| sglang | kt-sglang submodule | Includes V4 Flash adaptation |
| transformers | 4.57.1 | 5.x has compatibility issues |

Model weights: DeepSeek V4 Flash (MXFP4 + FP8 mixed format), placed on local NVMe at e.g. /path/to/DeepSeek-V4-Flash.

Step 1: Set environment variables

export CUDA_VISIBLE_DEVICES=0,1,2,3
export FLASHINFER_CUDA_ARCH_LIST=12.0a
export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export SGLANG_DSV4_MODE=2604
export SGLANG_DSV4_2604_SUBMODE=2604B

Why these matter:

  • FLASHINFER_CUDA_ARCH_LIST=12.0a / TORCH_CUDA_ARCH_LIST="12.0+PTX": required on SM_120 so flashinfer JIT does not falsely report "sm75 or higher" when compiling fp4 kernels.
  • SGLANG_DSV4_MODE=2604 + SGLANG_DSV4_2604_SUBMODE=2604B: required for V4 Flash; missing them triggers a swiglu_limit assertion in mxfp4_deepseek.py.
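Since a missing variable only surfaces later as a JIT error or an assertion, it can help to check the environment before launching. The following is a minimal sketch, assuming the exact values from Step 1; `find_env_problems` is a hypothetical helper name, not part of ktransformers or sglang.

```python
import os

# Values copied from Step 1. CUDA_VISIBLE_DEVICES may legitimately differ
# on machines whose four target GPUs have other indices.
REQUIRED_ENV = {
    "CUDA_VISIBLE_DEVICES": "0,1,2,3",
    "FLASHINFER_CUDA_ARCH_LIST": "12.0a",
    "TORCH_CUDA_ARCH_LIST": "12.0+PTX",
    "SGLANG_DSV4_MODE": "2604",
    "SGLANG_DSV4_2604_SUBMODE": "2604B",
}


def find_env_problems(env=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []
    for name, expected in REQUIRED_ENV.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected!r})")
        elif actual != expected:
            problems.append(f"{name}={actual!r}, expected {expected!r}")
    return problems
```

Calling `find_env_problems()` with no argument inspects the current process environment, so it can run at the top of a launcher script and abort early if the list is non-empty.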

Step 2: Launch the sglang server

numactl --interleave=all python -m sglang.launch_server \
  --host 0.0.0.0 --port 40001 \
  --model /path/to/DeepSeek-V4-Flash \
  --kt-weight-path /path/to/DeepSeek-V4-Flash \
  --kt-method MXFP4 \
  --kt-num-gpu-experts 10 \
  --kt-cpuinfer 60 \
  --kt-threadpool-count 2 \
  --kt-gpu-prefill-token-threshold 2048 \
  --kt-enable-dynamic-expert-update \
  --tensor-parallel-size 4 \
  --context-length 1048576 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.7 \
  --chunked-prefill-size 4096 \
  --max-prefill-tokens 4096 \
  --max-running-requests 2 \
  --watchdog-timeout 1200 \
  --disable-shared-experts-fusion \
  --trust-remote-code \
  --cuda-graph-bs 1 \
  --cuda-graph-max-bs 1 \
  --disable-radix-cache \
  --skip-server-warmup

See the parameter walkthrough for the rationale behind each value.

Step 3: Wait for READY and verify

Startup performs weight loading + MXFP4 swizzling + CUDA Graph capture, totaling about 80 seconds. Once the server is ready, you'll see:

The server is fired up and ready to roll!

In the startup log, check these key indicators to confirm resources are allocated as expected:

max_total_num_tokens=964864       # KV cache capacity (must be >= your longest request; at 0.7 it is below the 1M --context-length)
chunked_prefill_size=4096         # Prefill chunk size
available_gpu_mem=5.34 GB         # GPU memory left after CUDA Graph + KV pool
Capture cuda graph end. Time elapsed: 5.45 s   # CUDA Graph captured successfully
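These indicators can also be checked programmatically. The sketch below assumes the log contains the fields in the textual form shown above (the real log may place several of them on one line, which the regular expressions tolerate). The function names are hypothetical, and the fit check assumes a single request needs prompt plus generated tokens to fit in the KV pool.

```python
import re

# Sample text mirroring the indicators shown above.
SAMPLE_LOG = """\
max_total_num_tokens=964864
chunked_prefill_size=4096
available_gpu_mem=5.34 GB
Capture cuda graph end. Time elapsed: 5.45 s
"""

# Field name -> (pattern with one capture group, type to convert to).
PATTERNS = {
    "max_total_num_tokens": (r"max_total_num_tokens=(\d+)", int),
    "chunked_prefill_size": (r"chunked_prefill_size=(\d+)", int),
    "available_gpu_mem_gb": (r"available_gpu_mem=([\d.]+) GB", float),
    "cuda_graph_seconds": (r"Capture cuda graph end\. Time elapsed: ([\d.]+) s", float),
}


def parse_startup_log(text):
    """Extract the startup indicators; a missing field is reported as None."""
    found = {}
    for key, (pattern, cast) in PATTERNS.items():
        match = re.search(pattern, text)
        found[key] = cast(match.group(1)) if match else None
    return found


def request_fits(indicators, prompt_tokens, max_new_tokens):
    """True if prompt plus generated tokens fit in the reported KV pool."""
    capacity = indicators["max_total_num_tokens"]
    return capacity is not None and prompt_tokens + max_new_tokens <= capacity
```

With the sample values, a 600K-token prompt fits in the pool, while a prompt of 1,000,000 tokens does not, even though it is under `--context-length`.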

Health check:

curl -s http://localhost:40001/health
# HTTP 200 OK
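Because startup takes a while, a script that launches the server usually has to poll the health endpoint rather than query it once. Here is a minimal standard-library sketch; the function names, the 300-second deadline, and the 5-second interval are assumptions for illustration, not values from sglang.

```python
import time
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_until_healthy(url, probe=http_probe, deadline_s=300.0, interval_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until the probe succeeds or `deadline_s` seconds have passed."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_healthy("http://localhost:40001/health")`. The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running server.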

Step 4: Issue long-context requests

The server is compatible with the OpenAI Chat Completions API.

cURL — short request:

curl http://localhost:40001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Hello, please introduce yourself"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Python — long-form request:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:40001/v1", api_key="none")

with open("very_long_document.txt") as f:
    long_text = f.read()  # e.g. 600K tokens of long-form text

resp = client.chat.completions.create(
    model="DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": long_text + "\n\nPlease summarize the above."}],
    max_tokens=1024,
    temperature=0.7,
    timeout=3600,                     # ← must be generous; see below
)
print(resp.choices[0].message.content)

Long-request caveats

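Prefill time grows at least linearly with context length, and somewhat faster in practice because prefill throughput drops as the context fills (see Known limitations). A 600K token request spends roughly 20–25 minutes in prefill, so: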

  1. HTTP client timeout must be ≥ 1800 s; timeout=3600 is recommended.
  2. When invoking remotely over SSH, run the client script under setsid nohup: an SSH disconnect would otherwise SIGHUP the client, and once the client dies the server detects the dropped peer and aborts the in-flight request.
  3. The client machine must stay awake and online — sleep or network drop has the same effect as SSH disconnect.
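To pick a timeout for other prompt sizes, the two throughput figures quoted in this tutorial (about 590 tok/s at the start, about 320 tok/s at 600K tokens) can be turned into a rough estimate. The sketch below assumes throughput falls linearly with the number of tokens already processed, which is a modelling assumption rather than a measured curve. Under that model, prefill time is the integral of 1/throughput over the prompt. The function names and the safety factor are hypothetical.

```python
import math

PEAK_TOK_S = 590.0     # peak prefill throughput quoted in this tutorial
REF_TOKENS = 600_000   # context length of the second quoted measurement
REF_TOK_S = 320.0      # prefill throughput quoted at REF_TOKENS


def estimate_prefill_seconds(prompt_tokens):
    """Integrate 1/rate(n) for rate(n) = PEAK_TOK_S - slope * n (assumed model)."""
    slope = (PEAK_TOK_S - REF_TOK_S) / REF_TOKENS  # tok/s lost per context token
    end_rate = PEAK_TOK_S - slope * prompt_tokens
    if end_rate <= 0:
        raise ValueError("prompt is outside the range this linear model can describe")
    return math.log(PEAK_TOK_S / end_rate) / slope


def suggest_timeout_seconds(prompt_tokens, safety_factor=2.0, floor_s=1800):
    """Estimated prefill time with headroom, never below the 1800 s floor."""
    estimate = estimate_prefill_seconds(prompt_tokens) * safety_factor
    return max(floor_s, math.ceil(estimate))
```

For a 600K-token prompt the model gives roughly 22 to 23 minutes, in line with the figure above. Beyond 600K tokens the estimate is an extrapolation and should be treated with more caution.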

Parameter walkthrough (why these values)

Context / KV cache

| Parameter | Value | Purpose |
| --- | --- | --- |
| --context-length | 1048576 | 1M token context window (hard upper bound) |
| --mem-fraction-static | 0.7 | Caps weights + CUDA Graph + KV pool. 0.7 leaves ~5 GB workspace headroom |
| --disable-radix-cache | - | Required for long context: prefix-cache memory management misbehaves at this scale |
| --chunked-prefill-size | 4096 | Prefill in 4096-token chunks; avoids one-shot workspace blowups |
| --max-prefill-tokens | 4096 | Caps tokens processed per forward; pairs with chunked prefill |

MoE routing (CPU/GPU split)

| Parameter | Value | Purpose |
| --- | --- | --- |
| --kt-method | MXFP4 | Routed experts use MXFP4 (I8 + ue8m0) |
| --kt-num-gpu-experts | 10 | 10 stay on GPU; remaining routed experts run on CPU |
| --kt-cpuinfer | 60 | CPU inference threads (×2 threadpool = 120 threads) |
| --kt-gpu-prefill-token-threshold | 2048 | When per-forward token count exceeds this, GPU MoE prefill fires |
| --kt-enable-dynamic-expert-update | - | Promote/demote GPU-resident experts based on runtime usage |

--kt-gpu-prefill-token-threshold compares against the per-forward token count after chunking, not the prompt's total length. With a chunk size of 4096, a threshold of 2048 puts every full chunk on the GPU MoE path; only a final partial chunk of 2048 tokens or fewer stays below the threshold.
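The per-chunk comparison can be illustrated with a few lines of arithmetic. This is a simplified sketch that assumes a single prompt is split into consecutive fixed-size chunks; the real scheduler may batch differently, and both function names are hypothetical.

```python
def chunk_sizes(prompt_tokens, chunk_size=4096):
    """Sizes of the per-forward chunks a prompt is split into (simplified model)."""
    full, rest = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def takes_gpu_moe_path(chunk_tokens, threshold=2048):
    # Per the table above, GPU MoE prefill fires when the count exceeds the threshold.
    return chunk_tokens > threshold


sizes = chunk_sizes(10_000)
print(sizes)                                   # [4096, 4096, 1808]
print([takes_gpu_moe_path(s) for s in sizes])  # [True, True, False]
```

In this example the two full chunks exceed the threshold, and the 1808-token remainder does not.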

Parallelism and concurrency

| Parameter | Value | Purpose |
| --- | --- | --- |
| --tensor-parallel-size | 4 | 4-way tensor parallelism across the 4 GPUs |
| --max-running-requests | 2 | At most 2 concurrent in-flight requests |
| --cuda-graph-bs | 1 | Capture CUDA Graph for batch=1 only |
| --cuda-graph-max-bs | 1 | Caps CUDA Graph capture at batch=1 |

Stability

| Parameter | Value | Purpose |
| --- | --- | --- |
| --watchdog-timeout | 1200 | Keeps long prefills alive; the default watchdog is too short |
| --disable-shared-experts-fusion | - | V4 Flash currently requires this fusion to be off |
| --skip-server-warmup | - | Skip sglang's own warmup (V4 path has compatibility issues with it) |
| numactl --interleave=all | - | Interleaves memory across NUMA nodes so CPU inference threads on both sockets are not bottlenecked on one node's memory |

Tuning: context length vs. KV cache capacity

--mem-fraction-static controls the KV pool size (and thus how many tokens it can hold). Numbers below are for 4×RTX 5090:

| Target context | mem-fraction-static | KV pool capacity | Workspace left | Recommendation |
| --- | --- | --- | --- | --- |
| ≤600K tokens | 0.6 | ~759k tokens | ~8.3 GB | Profiling / development |
| 600K–900K tokens | 0.7 | ~965k tokens | ~5.3 GB | Recommended for production |
| Approaching 1M | 0.75+ | ~1.1M tokens | <4 GB | Tight memory, edge cases only |

Tuning tips:

  • On a single GPU / TP=1, values above 0.7 risk OOM: workspace shrinks so much that a 2048-chunk prefill's intermediate 512 MiB buffer alone can exhaust GPU memory.
  • If you don't need the full 1M context, lower --context-length before lowering --mem-fraction-static: the former immediately frees KV, the latter shifts memory back into workspace.
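The table above can be encoded as a small lookup for deployment scripts. This sketch copies the three rows as-is and therefore applies to 4×RTX 5090 only. The upper bound of the last tier is taken to be the 1,048,576-token `--context-length`, which is an assumption, and `pick_mem_fraction` is a hypothetical helper.

```python
# (largest target context in tokens, mem-fraction-static, approx. KV pool capacity)
# Rows copied from the tuning table; capacities are the approximate values quoted there.
PROFILES = [
    (600_000, 0.60, 759_000),
    (900_000, 0.70, 965_000),
    (1_048_576, 0.75, 1_100_000),  # "Approaching 1M": tight memory, edge cases only
]


def pick_mem_fraction(target_context_tokens):
    """Return the table's mem-fraction-static for the given target context."""
    for max_target, fraction, kv_capacity in PROFILES:
        if target_context_tokens <= max_target:
            # Sanity check: the quoted KV pool must be able to hold the target.
            assert target_context_tokens <= kv_capacity
            return fraction
    raise ValueError("target context exceeds every tier listed in the table")
```

For example, a 750K-token target maps to 0.7, the production tier.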

Known limitations

  1. Prefill throughput decays with context length: from a peak of ~590 tok/s down to ~320 tok/s at 600K tokens (~47% decay) — an inherent property of attention.
  2. Client-side timeout for long requests: 600K prefill takes ~23 minutes; HTTP timeout must be ≥1800 s.
  3. CUDA Graph only at batch=1: when two requests decode in parallel, the runtime falls back to eager mode and decode throughput drops.
  4. A client disconnect aborts the request: remote invocations must use setsid nohup, and the client machine must not sleep.
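When the client is started from another Python process, a similar detachment to `setsid nohup` can be approximated with the standard library. The sketch below starts the client in a new session with its output redirected to a log file. It is POSIX-oriented, `launch_detached` is a hypothetical helper, and it does not replace keeping the client machine awake and online.

```python
import subprocess
import sys


def launch_detached(argv, log_path):
    """Start `argv` in a new session, appending stdout and stderr to `log_path`."""
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,    # no terminal input to lose on disconnect
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,      # POSIX: the child calls setsid() before exec
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent can close.
        log_file.close()
```

A call such as `launch_detached([sys.executable, "long_request.py"], "long_request.log")` would start a client script in the background, where `long_request.py` is a placeholder for a file containing the Python request from Step 4.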