Workstation Heterogeneous Inference for Frontier MoE Models
From Local Inference to Local Fine-Tuning
Attention, KV cache, shared experts, routed experts, prefill, and decode do not want the same hardware behavior.
GPU for hot compute. CPU/DRAM for sparse capacity. Runtime scheduling to make them work as one system.
Split the model by arithmetic intensity, not by a simple "GPU full, CPU fallback" rule.
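A minimal sketch of that split, assuming an illustrative choose_device() helper and module names; the actual placement rules in KTransformers are richer than this.

```python
# Sketch: dense, high-arithmetic-intensity modules go to GPU;
# sparse, capacity-bound routed experts stay in CPU/DRAM.
# Module names and the placement dict are illustrative, not KTransformers' API.
import torch

def choose_device(module_name: str) -> torch.device:
    # Attention, shared experts, norms, embeddings: dense compute, small weights -> GPU.
    hot = ("attn", "shared_expert", "norm", "embed", "lm_head")
    if any(tag in module_name for tag in hot):
        return torch.device("cuda")
    # Routed experts: huge aggregate weights, only a few activated per token -> CPU/DRAM.
    return torch.device("cpu")

placement = {name: choose_device(name) for name in [
    "layer0.attn", "layer0.shared_expert", "layer0.routed_experts", "lm_head",
]}
print(placement)
```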
Once CPU memory becomes part of the inference system, memory locality matters as much as kernel speed.
KTransformers: place expert weight slices in the local memory of each NUMA node.
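A minimal sketch of NUMA-aware execution, assuming a Linux two-socket box with 64 cores; the core map, expert striping, and worker pool are illustrative, and KTransformers does this inside its CPU kernels rather than in Python.

```python
# Sketch: keep each expert shard in the DRAM of one NUMA node and run its
# compute on cores local to that node, so weight reads stay in local memory
# instead of crossing the socket interconnect. Core ranges are assumed.
import os
from concurrent.futures import ThreadPoolExecutor

NUMA_CORES = {0: range(0, 32), 1: range(32, 64)}   # assumed 2-socket layout
EXPERT_TO_NODE = {e: e % 2 for e in range(64)}      # experts striped across nodes

def fake_expert_compute(expert_id: int) -> str:
    return f"expert {expert_id} done"               # stand-in for the real kernel

def run_expert(expert_id: int) -> str:
    node = EXPERT_TO_NODE[expert_id]
    # Pin this worker to the cores of the node that owns the expert's weights.
    os.sched_setaffinity(0, NUMA_CORES[node])
    return fake_expert_compute(expert_id)

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_expert, e) for e in (3, 10, 17)]
    print([f.result() for f in futures])
```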
The hard part is not putting experts on CPU. The hard part is keeping CPU and GPU busy at the same time.
Expert Deferral: defer non-critical experts so attention and expert compute overlap across layers.
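A minimal sketch of the overlap, with stand-in functions for attention and the CPU expert path; the criterion for which experts count as non-critical is not modeled here.

```python
# Sketch of Expert Deferral: kick off the CPU routed-expert work for layer i,
# run the next layer's GPU attention while it is in flight, then fold the
# deferred expert output back in. Function bodies are stand-ins.
import torch
from concurrent.futures import ThreadPoolExecutor

def cpu_routed_experts(hidden_cpu: torch.Tensor) -> torch.Tensor:
    return hidden_cpu * 0.1          # stand-in for the CPU expert kernels

def gpu_attention(hidden: torch.Tensor) -> torch.Tensor:
    return hidden + 1.0              # stand-in for GPU attention + shared experts

hidden = torch.randn(4, 8)
with ThreadPoolExecutor(max_workers=1) as pool:
    deferred = None
    for layer in range(3):
        if deferred is not None:
            hidden = hidden + deferred.result()               # apply last layer's deferred experts
        deferred = pool.submit(cpu_routed_experts, hidden.clone())  # CPU work in flight
        hidden = gpu_attention(hidden)                        # GPU stays busy meanwhile
    hidden = hidden + deferred.result()
```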
Prefill is a different workload from decode. For 16K–64K contexts, CPU expert compute can become the bottleneck.
Layer-wise Prefill: transfer weights layer by layer to GPU, then use optimized GPU kernels for the long-context burst.
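A minimal sketch of layer-wise prefill, assuming a CUDA device and illustrative shapes; the real version uses fused GPU kernels and overlapped prefetch rather than this plain loop.

```python
# Sketch: stream each layer's expert weights to GPU, run the long-context
# prefill for that layer with GPU kernels, then free the VRAM before the
# next layer. Shapes and the forward stub are illustrative.
import torch

def layer_forward_gpu(weights_gpu: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
    return hidden @ weights_gpu                                 # stand-in for fused kernels

if torch.cuda.is_available():
    device = torch.device("cuda")
    hidden = torch.randn(16384, 1024, device=device)            # long-context activations
    cpu_layers = [torch.randn(1024, 1024).pin_memory() for _ in range(4)]
    for w_cpu in cpu_layers:
        w_gpu = w_cpu.to(device, non_blocking=True)              # layer-by-layer transfer
        hidden = layer_forward_gpu(w_gpu, hidden)
        del w_gpu                                                # return VRAM to the pool
```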
Not all experts are equally hot. Expert activation shows stable hot/cold patterns inside a session.
Dynamic update: observe actual activations during prefill and adjust GPU expert placement on the fly.
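A minimal sketch of the dynamic update, assuming a hypothetical routing callback and a GPU budget of 8 experts; the numbers are illustrative.

```python
# Sketch: count expert activations seen during prefill and keep the hottest
# ones resident on GPU for decode. Counter and budget are illustrative.
from collections import Counter

activation_counts = Counter()

def record_routing(expert_ids):
    # Called with the router's chosen expert ids for each token during prefill.
    activation_counts.update(expert_ids)

def update_gpu_residency(gpu_budget: int = 8):
    hot = [e for e, _ in activation_counts.most_common(gpu_budget)]
    return set(hot)   # experts to (re)load into VRAM; the rest stay in DRAM

record_routing([3, 3, 17, 42, 3, 17])
print(update_gpu_residency())   # {3, 17, 42}
```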
KTransformers features are not random additions. They expand the same heterogeneous workstation lane.
Benchmark: KTransformers vs. llama.cpp on the same hardware
Leading open-source model teams recommend KTransformers in their official READMEs and deployment guides.
KTransformers was accepted into the PyTorch official ecosystem, bringing heterogeneous inference into the mainstream AI infrastructure conversation.
Community-driven benchmarks at kvcache.ai: submit hardware configurations, compare results, and reproduce the workstation lane.
Inference gives access. Fine-tuning gives ownership.
If a workstation can run a model with KTransformers, that same workstation should be able to tune LoRA adapters for it.
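A minimal sketch of what that tuning step can look like, using Hugging Face PEFT; the target module names and ranks are assumptions, not the exact HEFT recipe.

```python
# Sketch: a LoRA adapter config of the kind a workstation run would use.
# Target modules, rank, and alpha are illustrative assumptions.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # low-rank dimension; small enough for residual VRAM
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention proj names
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base model served by the same box
```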
The point is not just cheaper training. It is letting teams build private, specialized, local models without sending data away.
Healthcare, finance, legal, industrial, and internal company data can stay on local or on-premise machines.
Teams can tune models for their terminology, tools, workflows, policies, and preferred response style.
Local assistants can adapt to an individual user's files, habits, language, and long-running tasks.
Run, evaluate, tune, and serve on the same box. No cluster queue and no separate training environment.
One foundation model can become many specialized local models across teams, communities, and domains.
Fine-tuning becomes a workstation capability, not only a lab or datacenter capability.
HEFT is the system work that makes same-hardware SFT practical. The key ideas are simple:
Do not use CPU as a passive memory warehouse. Use CPU and GPU as co-compute devices.
Keep tensors in layouts that match each stage and device, instead of repeatedly repacking them.
Break routed-expert work into small tasks so hot and cold experts do not stall the CPU path (see the sketch after this list).
Use residual GPU VRAM for a shared LoRA path that improves convergence and fills otherwise idle GPU bubbles.
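A minimal sketch for the micro-task point above, with an illustrative block size and stand-in expert GEMMs; the real CPU path schedules these tasks in native kernels, not Python threads.

```python
# Sketch: split each routed expert's GEMM into small row-block tasks and feed
# them to a shared worker pool, so one hot expert's large batch does not
# serialize behind a cold expert.
import torch
from concurrent.futures import ThreadPoolExecutor

BLOCK = 64  # tokens per micro-task (illustrative)

def expert_block(weight: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    return tokens @ weight       # stand-in for the real expert kernel

def run_routed_experts(expert_weights, expert_inputs):
    # expert_inputs: {expert_id: tensor of tokens routed to that expert}
    tasks, outputs = [], {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        for eid, tokens in expert_inputs.items():
            for start in range(0, tokens.shape[0], BLOCK):
                chunk = tokens[start:start + BLOCK]
                tasks.append((eid, pool.submit(expert_block, expert_weights[eid], chunk)))
        for eid, fut in tasks:
            outputs.setdefault(eid, []).append(fut.result())
    return {eid: torch.cat(parts) for eid, parts in outputs.items()}

weights = {e: torch.randn(256, 256) for e in range(4)}
inputs = {0: torch.randn(512, 256), 3: torch.randn(32, 256)}  # one hot, one cold expert
outs = run_routed_experts(weights, inputs)
```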
More people tuning models locally means more model diversity: different teams, domains, languages, tools, and workflows.
Make every developer able to run, tune, and modify large models locally.
github.com/kvcache-ai/ktransformers · kvcache.ai