AMX Backend
The KT-Kernel AMX backend is designed for Intel CPUs with AMX support. It accelerates CPU-side MoE expert computation and works with SGLang to enable CPU-GPU heterogeneous inference.
- Who should use AMX backend
- AMX architecture summary
- Representative optimization results
- System requirements
- Prepare CPU weights for AMX
- Launch with SGLang
- Troubleshooting
Who should use AMX backend
Use AMX backend when:
- Your CPU supports amx-bf16 / amx-int8 / amx-tile
- You want higher CPU MoE throughput than generic AVX-only paths
- You are deploying hybrid inference with SGLang and want CPU experts to handle cold paths efficiently
Supported methods and selection guidance:
- BF16: Full-precision path for scenarios where quality is the top priority
- AMXINT8: Custom quantized weight format with higher throughput and a potentially slight quality drop
- AMXINT4: Custom quantized weight format with the highest throughput, but some models may show a more noticeable quality drop
Note: AMXINT8 / AMXINT4 use KT-Kernel custom quantized weight format (conversion required).
AMX architecture summary
Compared with traditional vector-based kernels, AMX uses tile-based matrix compute and can deliver 8× the theoretical throughput of AVX-512 instructions, making it well suited for compute-intensive workloads; KT-Kernel builds on this with several MoE-oriented optimizations:
1) Tiling-aware memory layout
- Expert weights are pre-rearranged during loading into tile-friendly sub-matrices
- Sub-matrices are aligned to 64-byte boundaries
- Layout follows compute access order to improve cache locality and prefetch efficiency
This reduces data-movement overhead and avoids expensive runtime transpose/reorder.
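As an illustration, the repacking idea can be sketched in a few lines of Python. The `repack_tiles` helper and the tile shape are hypothetical, chosen only to show the principle; KT-Kernel's actual layout code and tile dimensions may differ:

```python
def repack_tiles(w, tile_rows=16, tile_cols=64):
    """Rearrange a row-major matrix into tile-major order.

    Hypothetical sketch of a tiling-aware layout: tile_rows x tile_cols
    sub-matrices are stored back-to-back in the order the kernel visits
    them, so each tile load reads one contiguous, alignable chunk instead
    of gathering strided rows at runtime.
    """
    rows, cols = len(w), len(w[0])
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    packed = []
    for r0 in range(0, rows, tile_rows):          # walk tiles in compute order
        for c0 in range(0, cols, tile_cols):
            for r in range(r0, r0 + tile_rows):   # copy one tile contiguously
                packed.extend(w[r][c0:c0 + tile_cols])
    return packed
```

Because the repack happens once at load time, the per-token compute loop only ever streams contiguous memory, which is what enables the cache-locality and prefetch benefits described above.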
2) Cache-friendly compute pipeline
- Work is partitioned column-wise across threads
- Per-thread blocks are sized for cache residency (L2/L3 aware)
- Tile registers are used for accumulation to reduce intermediate memory traffic
This design minimizes DRAM round-trips and improves sustained throughput.
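A toy sketch of the cache-sizing idea follows. The `pick_block_cols` heuristic, the L2 budget, and the reserve fraction are illustrative assumptions, not KT-Kernel's real tuning logic:

```python
def pick_block_cols(k_rows, elem_bytes, l2_bytes=2 * 1024 * 1024, reserve=0.5):
    """How many weight columns a per-thread panel can keep L2-resident.

    Hypothetical heuristic: reserve half of a 2 MiB L2 slice for the
    weight panel (the rest for activations and tiles), then size the
    column block so k_rows * block_cols * elem_bytes fits that budget.
    """
    budget = int(l2_bytes * reserve)
    return max(1, budget // (k_rows * elem_bytes))
```

With a hidden dimension of 7168 and BF16 (2-byte) weights, this budget yields a block of a few dozen columns per thread, small enough to stay cache-resident across the accumulation loop.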
3) Dynamic AMX/AVX512 path selection
- AMX is efficient for larger matrix workloads (typical prefill)
- Lower arithmetic-intensity scenarios (typical decode) may benefit from lighter AVX512 kernels
KT-Kernel can switch compute paths according to runtime workload characteristics, balancing throughput and latency.
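The selection idea can be sketched with a simple arithmetic-intensity estimate. The threshold value and function names here are illustrative assumptions, not KT-Kernel's actual dispatch logic:

```python
def arithmetic_intensity(m, n, k, elem_bytes=2):
    """FLOPs per byte moved for an m x k by k x n GEMM (BF16 by default)."""
    flops = 2 * m * n * k
    bytes_moved = elem_bytes * (m * k + k * n + m * n)
    return flops / bytes_moved

def choose_path(m, n, k, threshold=8.0):
    """Hypothetical dispatch: batched prefill GEMMs are compute-bound and
    favor AMX tiles; single-token decode GEMMs are memory-bound and favor
    lighter AVX-512 kernels."""
    return "amx" if arithmetic_intensity(m, n, k) >= threshold else "avx512"
```

A decode step (m = 1) lands around intensity 1 and takes the AVX-512 path, while a 512-token prefill batch is hundreds of times more compute-dense and takes the AMX path.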
4) MoE task fusion and dynamic scheduling
- Expert GEMM tasks are fused to reduce scheduling overhead
- Fine-grained sub-tasks are dynamically balanced across threads
- Task stealing is used to mitigate expert-activation skew during prefill
This is important for stable performance under real routing imbalance.
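The scheduling idea can be sketched as a sequential simulation. The per-worker queues and steal-from-the-longest-queue policy below are illustrative assumptions, not KT-Kernel's implementation:

```python
from collections import deque

def run_with_stealing(task_lists, n_workers):
    """Simulate dynamic balancing of expert sub-tasks across workers.

    Each worker owns a deque of sub-tasks; a worker whose queue is empty
    steals from the back of the longest remaining queue. This mimics how
    task stealing absorbs expert-activation skew: a few hot experts do
    not leave most threads idle.
    """
    queues = [deque(ts) for ts in task_lists]
    done = [[] for _ in range(n_workers)]
    while any(queues):
        for w in range(n_workers):
            if queues[w]:
                done[w].append(queues[w].popleft())      # own work first
            else:
                victim = max(range(n_workers), key=lambda v: len(queues[v]))
                if queues[victim]:
                    done[w].append(queues[victim].pop()) # steal from the back
    return done
```

Even when all six sub-tasks start on one hot expert's queue, the simulation ends with every worker having processed an equal share.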
Representative optimization results
On a single 4th Gen Xeon CPU, the MoE-specialized AMX backend can provide substantial gains:
- BF16 operator throughput above 20 TFLOPS
- Int8/Int4 operator throughput above 37 TOPS
For end-to-end inference, a dual-socket 4th Gen Xeon setup with a single RTX 4090 can achieve over 500 tokens/s DeepSeek-V3 prefill throughput.
These results show that a MoE-optimized AMX kernel can deliver higher throughput on lower-cost CPUs, significantly lowering the deployment barrier for large models.
System requirements
CPU and platform
- Intel Sapphire Rapids (Xeon 4th Gen) or newer
- Linux x86-64
- Python 3.10/3.11/3.12
Check AMX capability
lscpu | grep -i amx
Expected flags include:
amx-bf16 amx-int8 amx-tile
If AMX flags are missing:
- Check CPU generation
- Enable AMX-related options in BIOS
- Ensure OS/kernel supports AMX state management
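For scripting this check, a small hypothetical helper is sketched below. Note that flag spellings vary: this page lists hyphenated names, while /proc/cpuinfo typically reports underscored ones (amx_bf16, amx_int8, amx_tile), so the sketch accepts both:

```python
def has_amx(cpu_flags: str) -> bool:
    """Return True if a CPU flags string advertises the three AMX features.

    cpu_flags is the whitespace-separated flags field, e.g. the 'flags'
    line from /proc/cpuinfo. Hyphen and underscore spellings are
    normalized before checking.
    """
    flags = {f.replace("-", "_") for f in cpu_flags.split()}
    return {"amx_bf16", "amx_int8", "amx_tile"} <= flags
```

On Linux this could be fed the contents of /proc/cpuinfo to fail fast before launching a server on an unsupported host.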
Prepare CPU weights for AMX
CPU-side weight requirements depend on method:
BF16: no conversion required
BF16 does not require convert_cpu_weights.py. Use the original BF16 model directory directly as --kt-weight-path (typically the same directory as --model).
AMXINT8 / AMXINT4: conversion required
These methods use KT-Kernel custom quantized format, so CPU-side expert weights must be converted to AMX-friendly format first.
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/cpu-weights-int8 \
--quant-method int8
Common options:
- --input-type: fp8, fp16, or bf16
- --quant-method: int8 or int4
Use the converted output directory as --kt-weight-path.
Launch with SGLang
The AMX backend runs through the standard SGLang startup flow.
python -m sglang.launch_server \
--model /path/to/model \
--trust-remote-code \
--kt-method BF16 \
--kt-weight-path /path/to/model \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 32
Key AMX-related parameters:
- --kt-method: BF16, AMXINT8, or AMXINT4
- --kt-weight-path: for BF16, the original BF16 model directory; for AMXINT8/AMXINT4, the converted CPU weight directory
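Combining these parameters, a hypothetical AMXINT8 launch might look like the following; the paths are placeholders, and --kt-weight-path must point at the directory produced by convert_cpu_weights.py rather than the original model:

```shell
# Hypothetical AMXINT8 launch; paths are placeholders.
python -m sglang.launch_server \
  --model /path/to/model \
  --trust-remote-code \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/cpu-weights-int8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32
```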
For full installation and end-to-end examples, see KT-Kernel Installation.
Troubleshooting
Illegal instruction or AMX kernel not used
- Verify AMX flags with lscpu
- Confirm AMX is enabled in BIOS
- Rebuild/reinstall KT-Kernel if environment changed
Slow startup during AMX conversion
Weight conversion is expected to take time for large MoE models. This is a one-time preprocessing step per weight set.
No throughput gains in decode-heavy workloads
Decode can have lower arithmetic intensity than prefill. This is normal; KT-Kernel may rely more on non-AMX paths in such phases for better latency.