AMX Backend
The KT-Kernel AMX backend is designed for Intel CPUs with AMX support. It accelerates CPU-side MoE expert computation and works with SGLang to enable CPU-GPU heterogeneous inference.
- Who should use AMX backend
- AMX architecture summary
- Representative optimization results
- System requirements
- Prepare CPU weights for AMX
- Launch with SGLang
- Troubleshooting
Who should use AMX backend
Use AMX backend when:
- Your CPU supports amx-bf16 / amx-int8 / amx-tile
- You want higher CPU MoE throughput than generic AVX-only paths
- You are deploying hybrid inference with SGLang and want CPU experts to handle cold paths efficiently
Supported methods and selection guidance:
- BF16: Full-precision path for scenarios where quality is the top priority
- AMXINT8: Custom quantized weight format with higher throughput and a potentially slight quality drop
- AMXINT4: Custom quantized weight format with the highest throughput, but some models may show a more noticeable quality drop
Note: AMXINT8 / AMXINT4 use KT-Kernel custom quantized weight format (conversion required).
AMX architecture summary
Compared with traditional vector-based kernels, AMX uses tile-based matrix compute and can deliver 8× the theoretical throughput of AVX-512 instructions, making it well suited for compute-intensive workloads; KT-Kernel builds on this with several MoE-oriented optimizations:
1) Tiling-aware memory layout
- Expert weights are pre-rearranged during loading into tile-friendly sub-matrices
- Sub-matrices are aligned to 64-byte boundaries
- Layout follows compute access order to improve cache locality and prefetch efficiency
This reduces data-movement overhead and avoids expensive runtime transpose/reorder.
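As an illustration, the repacking idea can be sketched in a few lines of Python. The `repack_tiles` helper and the tile shape are hypothetical, chosen only to show the principle; KT-Kernel's actual layout code and tile dimensions may differ:

```python
def repack_tiles(w, tile_rows=16, tile_cols=64):
    """Rearrange a row-major matrix into tile-major order.

    Hypothetical sketch of a tiling-aware layout: tile_rows x tile_cols
    sub-matrices are stored back-to-back in the order the kernel visits
    them, so each tile load reads one contiguous, alignable chunk instead
    of gathering strided rows at runtime.
    """
    rows, cols = len(w), len(w[0])
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    packed = []
    for r0 in range(0, rows, tile_rows):          # walk tiles in compute order
        for c0 in range(0, cols, tile_cols):
            for r in range(r0, r0 + tile_rows):   # copy one tile contiguously
                packed.extend(w[r][c0:c0 + tile_cols])
    return packed
```

Because the repack happens once at load time, the per-token compute loop only ever streams contiguous memory, which is what enables the cache-locality and prefetch benefits described above.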
2) Cache-friendly compute pipeline
- Work is partitioned column-wise across threads
- Per-thread blocks are sized for cache residency (L2/L3 aware)
- Tile registers are used for accumulation to reduce intermediate memory traffic
This design minimizes DRAM round-trips and improves sustained throughput.
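A toy sketch of the cache-sizing idea follows. The `pick_block_cols` heuristic, the L2 budget, and the reserve fraction are illustrative assumptions, not KT-Kernel's real tuning logic:

```python
def pick_block_cols(k_rows, elem_bytes, l2_bytes=2 * 1024 * 1024, reserve=0.5):
    """How many weight columns a per-thread panel can keep L2-resident.

    Hypothetical heuristic: reserve half of a 2 MiB L2 slice for the
    weight panel (the rest for activations and tiles), then size the
    column block so k_rows * block_cols * elem_bytes fits that budget.
    """
    budget = int(l2_bytes * reserve)
    return max(1, budget // (k_rows * elem_bytes))
```

With a hidden dimension of 7168 and BF16 (2-byte) weights, this budget yields a block of a few dozen columns per thread, small enough to stay cache-resident across the accumulation loop.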
3) Dynamic AMX/AVX512 path selection
- AMX is efficient for larger matrix workloads (typical prefill)
- Lower arithmetic-intensity scenarios (typical decode) may benefit from lighter AVX512 kernels
KT-Kernel can switch compute paths according to runtime workload characteristics, balancing throughput and latency.
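The selection idea can be sketched with a simple arithmetic-intensity estimate. The threshold value and function names here are illustrative assumptions, not KT-Kernel's actual dispatch logic:

```python
def arithmetic_intensity(m, n, k, elem_bytes=2):
    """FLOPs per byte moved for an m x k by k x n GEMM (BF16 by default)."""
    flops = 2 * m * n * k
    bytes_moved = elem_bytes * (m * k + k * n + m * n)
    return flops / bytes_moved

def choose_path(m, n, k, threshold=8.0):
    """Hypothetical dispatch: batched prefill GEMMs are compute-bound and
    favor AMX tiles; single-token decode GEMMs are memory-bound and favor
    lighter AVX-512 kernels."""
    return "amx" if arithmetic_intensity(m, n, k) >= threshold else "avx512"
```

A decode step (m = 1) lands around intensity 1 and takes the AVX-512 path, while a 512-token prefill batch is hundreds of times more compute-dense and takes the AMX path.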
4) MoE task fusion and dynamic scheduling
- Expert GEMM tasks are fused to reduce scheduling overhead
- Fine-grained sub-tasks are dynamically balanced across threads
- Task stealing is used to mitigate expert-activation skew during prefill
This is important for stable performance under real routing imbalance.
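The scheduling idea can be sketched as a sequential simulation. The per-worker queues and steal-from-the-longest-queue policy below are illustrative assumptions, not KT-Kernel's implementation:

```python
from collections import deque

def run_with_stealing(task_lists, n_workers):
    """Simulate dynamic balancing of expert sub-tasks across workers.

    Each worker owns a deque of sub-tasks; a worker whose queue is empty
    steals from the back of the longest remaining queue. This mimics how
    task stealing absorbs expert-activation skew: a few hot experts do
    not leave most threads idle.
    """
    queues = [deque(ts) for ts in task_lists]
    done = [[] for _ in range(n_workers)]
    while any(queues):
        for w in range(n_workers):
            if queues[w]:
                done[w].append(queues[w].popleft())      # own work first
            else:
                victim = max(range(n_workers), key=lambda v: len(queues[v]))
                if queues[victim]:
                    done[w].append(queues[victim].pop()) # steal from the back
    return done
```

Even when all six sub-tasks start on one hot expert's queue, the simulation ends with every worker having processed an equal share.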
Representative optimization results
On a single 4th Gen Xeon CPU, the MoE-specialized AMX backend can provide substantial gains:
- BF16 operator throughput above 20 TFLOPS
- Int8/Int4 operator throughput above 37 TOPS
For end-to-end inference, a dual-socket 4th Gen Xeon setup with a single RTX 4090 can achieve over 500 tokens/s DeepSeek-V3 prefill throughput.
These results show that a MoE-optimized AMX kernel can deliver higher throughput on lower-cost CPUs, significantly lowering the deployment barrier for large models.
System requirements
CPU and platform
- Intel Sapphire Rapids (Xeon 4th Gen) or newer
- Linux x86-64
- Python 3.10/3.11/3.12
Check AMX capability
lscpu | grep -i amx
Expected flags include:
amx-bf16 amx-int8 amx-tile
If AMX flags are missing:
- Check CPU generation
- Enable AMX-related options in BIOS
- Ensure OS/kernel supports AMX state management
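For scripting this check, a small hypothetical helper is sketched below. Note that flag spellings vary: this page lists hyphenated names, while /proc/cpuinfo typically reports underscored ones (amx_bf16, amx_int8, amx_tile), so the sketch accepts both:

```python
def has_amx(cpu_flags: str) -> bool:
    """Return True if a CPU flags string advertises the three AMX features.

    cpu_flags is the whitespace-separated flags field, e.g. the 'flags'
    line from /proc/cpuinfo. Hyphen and underscore spellings are
    normalized before checking.
    """
    flags = {f.replace("-", "_") for f in cpu_flags.split()}
    return {"amx_bf16", "amx_int8", "amx_tile"} <= flags
```

On Linux this could be fed the contents of /proc/cpuinfo to fail fast before launching a server on an unsupported host.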
Prepare CPU weights for AMX
CPU-side weight requirements depend on method:
BF16: no conversion required
BF16 does not require convert_cpu_weights.py. Use the original BF16 model directory directly as --kt-weight-path (typically the same directory as --model).
AMXINT8 / AMXINT4: conversion required
These methods use KT-Kernel custom quantized format, so CPU-side expert weights must be converted to AMX-friendly format first.
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/cpu-weights-int8 \
--quant-method int8
Common options:
- --input-type: fp8, fp16, or bf16
- --quant-method: int8 or int4
Use the converted output directory as --kt-weight-path.
Launch with SGLang
The AMX backend runs through the standard SGLang startup flow.
python -m sglang.launch_server \
--model /path/to/model \
--trust-remote-code \
--kt-method BF16 \
--kt-weight-path /path/to/model \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 32
Key AMX-related parameters:
- --kt-method: BF16, AMXINT8, or AMXINT4
- --kt-weight-path: for BF16, the original BF16 model directory; for AMXINT8/AMXINT4, the converted CPU weight directory
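Combining these parameters, a hypothetical AMXINT8 launch might look like the following; the paths are placeholders, and --kt-weight-path must point at the directory produced by convert_cpu_weights.py rather than the original model:

```shell
# Hypothetical AMXINT8 launch; paths are placeholders.
python -m sglang.launch_server \
  --model /path/to/model \
  --trust-remote-code \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/cpu-weights-int8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32
```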
For full installation and end-to-end examples, see KT-Kernel Installation.
Troubleshooting
Illegal instruction or AMX kernel not used
- Verify AMX flags with lscpu
- Confirm AMX is enabled in BIOS
- Rebuild/reinstall KT-Kernel if environment changed
Slow startup during AMX conversion
Weight conversion is expected to take time for large MoE models. This is a one-time preprocessing step per weight set.
No throughput gains in decode-heavy workloads
Decode can have lower arithmetic intensity than prefill. This is normal; KT-Kernel may rely more on non-AMX paths in such phases for better latency.