GOSIM 2026

Workstation Heterogeneous Inference for Frontier MoE Models

From Local Inference to Local Finetune

Same hardware. If you can run it, you can tune it. Weiyu Xie (谢威宇) — Tsinghua University · Approaching AI

Three Inference Regimes

Positioning diagram for llama.cpp, KTransformers, and vLLM / SGLang inference targets

Key Idea

The Model Is Already Heterogeneous

Attention, KV cache, shared experts, routed experts, prefill, and decode do not want the same hardware behavior.

LLM and MoE model components with different compute and memory characteristics
The Core Thesis

Match model heterogeneity
to hardware heterogeneity.

GPU for hot compute. CPU/DRAM for sparse capacity. Runtime scheduling to make them work as one system.

Architecture

Map Work to the Right Device

Split the model by arithmetic intensity, not by a simple "GPU full, CPU fallback" rule.

KTransformers Runtime maps MoE attention and shared experts to GPU hot compute and routed experts to CPU DRAM sparse capacity
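A minimal sketch of that placement rule in Python; the component names and the CUDA/CPU assignments below are illustrative, not the actual KTransformers configuration.

```python
# Illustrative placement table: dense, high-arithmetic-intensity components
# go to the GPU; huge but sparsely activated routed experts stay in CPU DRAM.
PLACEMENT = {
    "attention":      "cuda",  # dense GEMMs, hot on every token
    "kv_cache":       "cuda",  # latency-critical reads on every decode step
    "shared_experts": "cuda",  # activated for every token
    "routed_experts": "cpu",   # large capacity, sparse activation
}

def device_for(component: str) -> str:
    """Return the target device for a model component (sketch only)."""
    return PLACEMENT.get(component, "cuda")
```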

One Workstation,
Two Resources

KTransformers maps hot compute to GPU VRAM and sparse capacity to CPU DRAM on one local workstation

Architecture

NUMA-aware Tensor Parallel

Once CPU memory becomes part of the inference system, memory locality matters as much as kernel speed.

KTransformers: place expert weight slices in the local memory of each NUMA node (placement sketch below).

  • Memory access stays local to each NUMA node
  • Avoids expensive cross-NUMA traffic
  • Uses AMX / VNNI / AVX paths for CPU expert compute
NUMA-aware tensor parallel places expert shards in local CPU memory close to each NUMA node
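A minimal, Linux-only sketch of the placement idea, relying on Linux's first-touch page policy; cpus_of_node and place_shard_on_node are hypothetical helpers, and the real runtime does this inside its CPU kernels rather than in Python.

```python
import os
import threading

import numpy as np

def cpus_of_node(node: int) -> set[int]:
    """Read the CPU list of a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus: set[int] = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def place_shard_on_node(shard: np.ndarray, node: int) -> np.ndarray:
    """Copy an expert weight shard so its pages land in `node`'s local memory.

    Under Linux first-touch, a page is physically allocated on the NUMA node
    of the CPU that first writes it, so the copying thread is pinned to the
    target node before touching the new buffer.
    """
    result = {}

    def worker() -> None:
        os.sched_setaffinity(0, cpus_of_node(node))  # pin this thread to the node
        local = np.empty_like(shard)                 # pages reserved, not yet touched
        np.copyto(local, shard)                      # first touch -> local allocation
        result["shard"] = local

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["shard"]

# Usage sketch: split routed-expert weights across two NUMA nodes.
# shards = np.array_split(expert_weights, 2)
# local = [place_shard_on_node(s, node) for node, s in enumerate(shards)]
```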

Architecture

Expert Deferral

The hard part is not putting experts on CPU. The hard part is keeping CPU and GPU busy at the same time.

Expert Deferral: defer non-critical experts so attention and expert compute overlap across layers (see the overlap sketch below).

  • CPU processes routed experts while GPU runs attention
  • GPU moves ahead instead of waiting for every expert immediately
  • Overlap turns heterogeneous hardware into one pipeline
Expert deferral scheduling diagram
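A toy sketch of the overlap pattern, with gpu_attention and cpu_routed_experts as hypothetical stand-ins for real kernels; the actual deferral policy (which experts may be deferred, and for how long) is decided by the KTransformers scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def gpu_attention(x: torch.Tensor, layer: int) -> torch.Tensor:
    return x  # placeholder for the dense attention kernel on the GPU

def cpu_routed_experts(x: torch.Tensor, layer: int) -> torch.Tensor:
    return x  # placeholder for the sparse routed-expert kernel on the CPU

def forward_with_deferral(x: torch.Tensor, num_layers: int) -> torch.Tensor:
    """Overlap GPU attention with deferred CPU expert compute across layers.

    Layer l's routed experts run on a CPU worker while the GPU already
    executes layer l+1's attention; the deferred expert output is folded
    back into the residual stream one step later instead of blocking the GPU.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    deferred = None  # future holding the previous layer's expert output
    for layer in range(num_layers):
        x = gpu_attention(x, layer)                     # GPU moves ahead
        if deferred is not None:
            x = x + deferred.result().to(x.device)      # previous layer's experts arrive
        deferred = pool.submit(cpu_routed_experts, x.detach().to("cpu"), layer)
    if deferred is not None:
        x = x + deferred.result().to(x.device)          # drain the last deferral
    pool.shutdown()
    return x
```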

Long Context

Layer-wise Prefill

Prefill is a different workload from decode. For 16K–64K contexts, CPU expert compute can become the bottleneck.

Layer-wise Prefill: transfer weights layer by layer to GPU, then use optimized GPU kernels for the long-context burst (streaming sketch below).

  • Multi-CUDA-stream overlap saturates PCIe 5.0 bandwidth
  • Weights are stored once and converted between CPU and GPU formats on the fly
  • 7–9× speedup at 16K–64K context
Layer-wise prefill architecture
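A sketch of the double-buffered upload pattern, assuming a CUDA device, pinned host weight tensors in cpu_layers, and a hypothetical run_layer_on_gpu wrapper; the real pipeline uses multiple streams and converts weight formats during the copy.

```python
import torch

def layerwise_prefill(hidden, cpu_layers, run_layer_on_gpu):
    """Stream layer weights to the GPU one layer ahead of the compute.

    While the default stream computes layer i, a separate copy stream uploads
    layer i+1's weights over PCIe from pinned host memory, so transfer and
    compute overlap during the long-context prefill burst.
    """
    copy_stream = torch.cuda.Stream()
    ready = [torch.cuda.Event() for _ in cpu_layers]

    def upload(i: int) -> torch.Tensor:
        with torch.cuda.stream(copy_stream):
            w = cpu_layers[i].to("cuda", non_blocking=True)  # async H2D copy
            ready[i].record(copy_stream)
        return w

    gpu_weights = upload(0)  # prime the pipeline
    for i in range(len(cpu_layers)):
        nxt = upload(i + 1) if i + 1 < len(cpu_layers) else None
        torch.cuda.current_stream().wait_event(ready[i])  # wait for layer i's copy
        hidden = run_layer_on_gpu(hidden, gpu_weights)     # compute layer i
        gpu_weights = nxt
    return hidden
```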

Optimization

Dynamic Expert Placement

Not all experts are equally hot. Expert activation shows stable hot/cold patterns inside a session.

Dynamic update: observe actual activations during prefill and adjust GPU expert placement on the fly (selection sketch below).

  • Hot experts stay close to GPU compute
  • Cold experts remain in CPU memory
  • 10–30% decode speedup from better placement
Expert activation heatmap
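A minimal selection sketch, assuming the router's top-k expert ids from prefill are available as a tensor; the expert count and GPU slot budget in the usage note are made-up numbers.

```python
import torch

def select_hot_experts(topk_ids: torch.Tensor, num_experts: int, gpu_slots: int):
    """Pick the hottest experts to keep near GPU compute after prefill.

    `topk_ids` holds the expert ids the router chose for every prefill token
    (shape [tokens, top_k]). The most frequently activated experts are placed
    on the GPU; the rest stay in CPU DRAM.
    """
    counts = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    hot = torch.topk(counts, k=gpu_slots).indices
    cold_mask = torch.ones(num_experts, dtype=torch.bool)
    cold_mask[hot] = False
    cold = cold_mask.nonzero(as_tuple=True)[0]
    return hot, cold

# Usage sketch: 256 routed experts, VRAM headroom for 32 of them.
# hot, cold = select_hot_experts(prefill_topk_ids, num_experts=256, gpu_slots=32)
```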

Capability

Expanding the Workstation Sweet Spot

KTransformers features are not random additions. They expand the same heterogeneous workstation lane.

KTransformers expands the workstation sweet spot from quantized inference to native precision, consumer hardware, new formats, and Windows

Results

Performance in the Workstation Lane

vs llama.cpp on the same hardware

>4.5×
Prefill speedup
+30%
Decode throughput
  • FP8 native precision, no int4 compromise
  • Long-context prefill: 7–9× faster at 16K–64K
  • The point is not pure CPU or pure GPU. The point is the workstation as a system.
Benchmark comparison vs llama.cpp

Ecosystem

Officially Recommended

Leading open-source model teams recommend KTransformers in their official READMEs and deployment guides.

Kimi K2.5 recommends KTransformers
Kimi K2.5
GLM-5.1 recommends KTransformers for local deployment
GLM-5.1
Qwen3.5 recommends KTransformers
Qwen3.5 Series
MiniMax-M2.5 recommends KTransformers
MiniMax-M2.5

Ecosystem

Joined PyTorch Ecosystem

KTransformers was accepted into the official PyTorch Ecosystem, bringing heterogeneous inference into the mainstream AI infrastructure conversation.

  • Recognized for advancing CPU-GPU heterogeneous execution
  • Built around PyTorch-native tensor workflows
  • Aligned with serving stacks such as SGLang and HF Transformers
KTransformers joins PyTorch ecosystem

Community

Website & Leaderboard

KTransformers website
Performance leaderboard

Community-driven benchmarks at kvcache.ai: submit hardware configurations, compare results, and reproduce workstation-lane numbers.

Local Inference
to Local Finetune

Same local workstation runs inference, creates LoRA adapters, and serves the tuned model locally

KT-SFT

Same-Hardware SFT

Inference gives access. Fine-tuning gives ownership.

If a workstation can run a model with KTransformers, that same workstation should be able to tune LoRA adapters for it.

  • Use the same model files, the same hardware, and the same heterogeneous runtime
  • Train a small adapter locally, then serve it on the same local inference path
  • Bring LoRA SFT out of lab-specialist workflows and into the hands of workstation users (LoRA sketch below)
hardware for inference + SFT
LoRA
personalized adapters, served locally
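A generic LoRA adapter sketch in PyTorch, to make the "frozen base, small trainable adapter" idea concrete; this is the standard low-rank formulation, not the KT-SFT implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen Linear layer plus a trainable low-rank update (generic LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the big weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base path plus the small trainable low-rank path.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Usage sketch: wrap a projection and train only the adapter parameters.
# layer.q_proj = LoRALinear(layer.q_proj, rank=8)
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```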

Use Cases

Where Local Finetune Matters

The point is not just cheaper training. It is letting teams build private, specialized, local models without sending data away.

Private data

Healthcare, finance, legal, industrial, and internal company data can stay on local or on-premise machines.

Domain behavior

Teams can tune models for their terminology, tools, workflows, policies, and preferred response style.

Personal agents

Local assistants can adapt to an individual user's files, habits, language, and long-running tasks.

Fast iteration

Run, evaluate, tune, and serve on the same box. No cluster queue and no separate training environment.

Model diversity

One foundation model can become many specialized local models across teams, communities, and domains.

Lower barrier

Fine-tuning becomes a workstation capability, not only a lab or datacenter capability.

Technology

One Technical Idea: Co-compute

HEFT is the systems work that makes same-hardware SFT practical. The key idea is simple:

Do not use CPU as a passive memory warehouse. Use CPU and GPU as co-compute devices.

  • GPU runs attention, shared paths, and a LoRA path placed in residual VRAM
  • CPU computes routed experts where the weights already live
  • The runtime handles layouts, routing skew, and scheduling behind the scenes

Layout-aware memory

Keep tensors in layouts that match each stage and device, instead of repeatedly repacking them.

Skew-tolerant scheduling

Break routed-expert work into small tasks so hot and cold experts do not stall the CPU path (scheduling sketch below).

LoRA Experts

Use residual GPU VRAM for a shared LoRA path that improves convergence and fills GPU idle bubbles.
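A toy sketch of the skew-tolerant idea: per-expert token groups are split into small chunks so a hot expert becomes many short CPU tasks rather than one long one. The thread pool, chunk size, and the unweighted sum (router gate weights omitted) are simplifications.

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def run_routed_experts(x, topk_ids, expert_weights, max_workers=8, chunk_tokens=64):
    """Dispatch routed-expert work to CPU threads as many small tasks.

    `x` is [tokens, d_in], `topk_ids` is [tokens, top_k], and
    `expert_weights[e]` is a hypothetical [d_out, d_in] weight for expert e.
    Splitting each expert's token set into chunks keeps idle threads fed
    even when routing is heavily skewed toward a few hot experts.
    """
    out = torch.zeros(x.shape[0], expert_weights[0].shape[0], dtype=x.dtype)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for e, w in enumerate(expert_weights):
            token_idx = (topk_ids == e).any(dim=-1).nonzero(as_tuple=True)[0]
            for chunk in token_idx.split(chunk_tokens):  # small, stealable tasks
                futures.append(pool.submit(lambda idx=chunk, w=w: (idx, x[idx] @ w.T)))
        for f in futures:
            idx, y = f.result()
            out[idx] += y  # accumulate serially in the caller's thread
    return out
```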

Why It Matters

Local fine-tuning turns
model access into model ownership.

More people tuning models locally means more model diversity: different teams, domains, languages, tools, and workflows.

Make every developer able to run, tune, and modify large models locally.

github.com/kvcache-ai/ktransformers  ·  kvcache.ai
