KTransformers

AMX Backend

The KT-Kernel AMX backend is designed for Intel CPUs with AMX support. It accelerates CPU-side MoE expert computation and works with SGLang to enable CPU-GPU heterogeneous inference.

Who should use the AMX backend

Use the AMX backend when:

  • Your CPU supports amx_bf16 / amx_int8 / amx_tile
  • You want higher CPU MoE throughput than generic AVX-only paths
  • You are deploying hybrid inference with SGLang and want CPU experts to handle cold paths efficiently

Supported methods and selection guidance:

  • BF16: Full-precision path for scenarios where quality is the top priority
  • AMXINT8: Custom quantized weight format with higher throughput and a potentially slight quality drop
  • AMXINT4: Custom quantized weight format with the highest throughput, but some models may show a more noticeable quality drop

Note: AMXINT8 / AMXINT4 use the KT-Kernel custom quantized weight format (conversion required).
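To make the trade-off concrete, the sketch below shows generic per-group symmetric int8 quantization and dequantization in NumPy. This is only an illustration of the idea behind a quantized CPU weight format; the actual KT-Kernel AMXINT8/AMXINT4 layouts, group sizes, and scaling schemes are custom and differ from this.

```python
import numpy as np

def quantize_int8_per_group(w: np.ndarray, group_size: int = 64):
    """Symmetric per-group int8 quantization (illustrative only)."""
    w = w.reshape(-1, group_size).astype(np.float32)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)       # avoid div-by-zero groups
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(128, 64).astype(np.float32)
q, s = quantize_int8_per_group(w)
w_hat = dequantize(q, s).reshape(w.shape)
max_err = np.abs(w - w_hat).max()                  # bounded by scale / 2
```

The rounding error per weight is bounded by half of its group's scale, which is why int8 typically costs little quality; int4 halves the representable levels again, which is where larger drops can appear.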

AMX architecture summary

Compared with traditional vector-based kernels, AMX uses tile-based matrix compute and can deliver 8× the theoretical throughput of AVX-512 instructions, making it well suited for compute-intensive workloads. KT-Kernel builds on this with several MoE-oriented optimizations:

1) Tiling-aware memory layout

  • Expert weights are pre-rearranged during loading into tile-friendly sub-matrices
  • Sub-matrices are aligned to 64-byte boundaries
  • Layout follows compute access order to improve cache locality and prefetch efficiency

This reduces data-movement overhead and avoids expensive runtime transpose/reorder.
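A minimal NumPy sketch of such repacking, assuming 16-row × 64-byte AMX tiles (the real KT-Kernel layout also interleaves quantization metadata and follows its own block order):

```python
import numpy as np

TILE_ROWS, TILE_BYTES = 16, 64  # an AMX tile holds up to 16 rows x 64 bytes

def pack_tiles_int8(b: np.ndarray) -> np.ndarray:
    """Rearrange a (K, N) int8 operand into contiguous 16x64 sub-tiles,
    stored in the order the kernel will consume them (illustrative)."""
    k, n = b.shape
    assert k % TILE_ROWS == 0 and n % TILE_BYTES == 0
    tiles = (b.reshape(k // TILE_ROWS, TILE_ROWS, n // TILE_BYTES, TILE_BYTES)
              .transpose(2, 0, 1, 3))        # column blocks become outermost
    return np.ascontiguousarray(tiles)       # each 16x64 tile is now contiguous

b = np.random.randint(-128, 128, size=(64, 128), dtype=np.int8)
packed = pack_tiles_int8(b)                  # shape (2, 4, 16, 64)
```

After packing, each tile load is one contiguous 1 KiB read in exactly the order the inner loop consumes it, which is what makes hardware prefetch effective and removes runtime transposes.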

2) Cache-friendly compute pipeline

  • Work is partitioned column-wise across threads
  • Per-thread blocks are sized for cache residency (L2/L3 aware)
  • Tile registers are used for accumulation to reduce intermediate memory traffic

This design minimizes DRAM round-trips and improves sustained throughput.
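The partitioning idea can be sketched as follows; the 1 MiB per-thread cache budget and bf16 element size here are assumptions for illustration, not KT-Kernel's actual heuristics.

```python
def column_partitions(n_cols, k_rows, n_threads,
                      elem_bytes=2, cache_budget=1 << 20):
    """Split columns across threads, then cap each thread's working block so
    one (k_rows x block) operand slice stays cache-resident (illustrative)."""
    per_thread = (n_cols + n_threads - 1) // n_threads
    max_block = max(1, cache_budget // (k_rows * elem_bytes))
    block = min(per_thread, max_block)
    parts = []
    for t in range(n_threads):
        lo = t * per_thread
        hi = min(n_cols, lo + per_thread)
        if lo >= hi:
            break
        # each thread walks its column range in cache-sized blocks
        parts.append([(c, min(hi, c + block)) for c in range(lo, hi, block)])
    return parts

# e.g. a 7168x4096 expert matrix over 8 threads
parts = column_partitions(n_cols=4096, k_rows=7168, n_threads=8)
```

Because each block is sized to fit the cache budget, a thread reuses the resident slice across all rows of its input before moving on, rather than streaming the whole matrix from DRAM repeatedly.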

3) Dynamic AMX/AVX512 path selection

  • AMX is efficient for larger matrix workloads (typical prefill)
  • Lower arithmetic-intensity scenarios (typical decode) may benefit from lighter AVX512 kernels

KT-Kernel can switch compute paths according to runtime workload characteristics, balancing throughput and latency.
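A toy version of such a dispatch decision, keyed on arithmetic intensity (FLOPs per byte moved); the threshold and the heuristic itself are assumptions for illustration, since KT-Kernel's real selection logic is internal.

```python
def arithmetic_intensity(m, n, k, elem_bytes=2):
    """FLOPs per byte for an (m x k) @ (k x n) GEMM, bf16 operands."""
    flops = 2 * m * n * k
    bytes_moved = elem_bytes * (m * k + k * n + m * n)
    return flops / bytes_moved

def select_kernel(m, n, k, threshold=8.0):
    """Hypothetical dispatch: AMX pays off only above some intensity."""
    return "amx" if arithmetic_intensity(m, n, k) >= threshold else "avx512"

# decode: one token per expert -> memory-bound, lighter kernel wins
# prefill: hundreds of tokens per expert -> compute-bound, AMX wins
decode_choice = select_kernel(1, 4096, 4096)
prefill_choice = select_kernel(512, 4096, 4096)
```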

4) MoE task fusion and dynamic scheduling

  • Expert GEMM tasks are fused to reduce scheduling overhead
  • Fine-grained sub-tasks are dynamically balanced across threads
  • Task stealing is used to mitigate expert-activation skew during prefill

This is important for stable performance under real routing imbalance.
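The scheduling idea can be sketched with a shared-counter worker pool, a simplification of task stealing: threads that finish their skewed (hot-expert) share early keep pulling sub-tasks instead of idling. KT-Kernel's actual thread pool is its own C++ implementation.

```python
import threading

def run_expert_tasks(tasks, n_workers=4):
    """Run fine-grained sub-tasks with dynamic load balancing (illustrative)."""
    next_idx = 0
    lock = threading.Lock()
    done = [None] * len(tasks)

    def worker():
        nonlocal next_idx
        while True:
            with lock:                 # claim the next unclaimed sub-task
                i = next_idx
                next_idx += 1
            if i >= len(tasks):
                return
            done[i] = tasks[i]()       # run one fused expert sub-task

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

results = run_expert_tasks([(lambda i=i: i * i) for i in range(100)])
```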

Representative optimization results

On a single 4th Gen Xeon (Sapphire Rapids) CPU, the MoE-specialized AMX backend can provide substantial gains:

  • BF16 operator throughput above 20 TFLOPS
  • Int8/Int4 operator throughput above 37 TOPS

For end-to-end inference, a dual-socket 4th Gen Xeon system paired with a single RTX 4090 can achieve over 500 tokens/s of DeepSeek-V3 prefill throughput.

These results show that a MoE-optimized AMX kernel can deliver higher throughput on lower-cost CPUs, significantly lowering the deployment barrier for large models.

System requirements

CPU and platform

  • Intel Sapphire Rapids (Xeon 4th Gen) or newer
  • Linux x86-64
  • Python 3.10/3.11/3.12

Check AMX capability

lscpu | grep -i amx

Expected flags include:

amx_bf16 amx_int8 amx_tile

If AMX flags are missing:

  • Check CPU generation
  • Enable AMX-related options in BIOS
  • Ensure OS/kernel supports AMX state management
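The same check can be scripted. This sketch parses the flags line of /proc/cpuinfo (which is where lscpu gets its flags from); it is a convenience helper, not part of KT-Kernel.

```python
def amx_flags(cpuinfo_text: str) -> set:
    """Return the AMX feature flags present in /proc/cpuinfo content."""
    wanted = {"amx_bf16", "amx_int8", "amx_tile"}
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            found |= wanted & set(line.split(":", 1)[1].split())
    return found

# On a real machine:
# with open("/proc/cpuinfo") as f:
#     print(amx_flags(f.read()) or "no AMX support detected")
```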

Prepare CPU weights for AMX

CPU-side weight requirements depend on method:

BF16: no conversion required

BF16 does not require convert_cpu_weights.py. Use the original BF16 model directory directly as --kt-weight-path (typically the same directory as --model).

AMXINT8 / AMXINT4: conversion required

These methods use the KT-Kernel custom quantized format, so CPU-side expert weights must first be converted to the AMX-friendly layout.

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/cpu-weights-int8 \
  --quant-method int8

Common options:

  • --input-type: fp8, fp16, or bf16
  • --quant-method: int8 or int4

Use the converted output directory as --kt-weight-path.

Launch with SGLang

The AMX backend runs through the SGLang startup flow.

python -m sglang.launch_server \
  --model /path/to/model \
  --trust-remote-code \
  --kt-method BF16 \
  --kt-weight-path /path/to/model \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32

Key AMX-related parameters:

  • --kt-method: BF16, AMXINT8, or AMXINT4
  • --kt-weight-path:
    • BF16: original BF16 model directory
    • AMXINT8/AMXINT4: converted CPU weight directory

For full installation and end-to-end examples, see KT-Kernel Installation.

Troubleshooting

Illegal instruction or AMX kernel not used

  • Verify AMX flags with lscpu
  • Confirm AMX is enabled in BIOS
  • Rebuild/reinstall KT-Kernel if environment changed

Slow startup during AMX conversion

Weight conversion is expected to take time for large MoE models. This is a one-time preprocessing step per weight set.

No throughput gains in decode-heavy workloads

Decode typically has lower arithmetic intensity than prefill. This is expected: KT-Kernel may rely more on non-AMX paths in those phases to achieve better latency.