KTransformers

Benchmark and Profiling

Performance claims must be reproducible. Do not report a tokens/s number on its own: always state the runtime tuple (model, KT method, hardware, package versions, launch command) and the workload shape that produced it.

Required Metadata

  • model and checkpoint
  • KT method and weight path type
  • CPU SKU, physical core count, NUMA count
  • GPU SKU/count and VRAM
  • package versions
  • launch command
  • request shape: input tokens, output tokens, concurrency, batch behavior
  • whether prefill, decode, or end-to-end throughput is measured
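
One way to keep this metadata attached to every result is to store it as a structured record and check it for empty fields before publishing. The sketch below is illustrative only: the class name `BenchmarkMetadata`, its field names, and the `missing_fields` helper are hypothetical and not part of KTransformers, and every value in the example is a placeholder rather than a measurement.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class BenchmarkMetadata:
    """Hypothetical record mirroring the required metadata list."""
    model: str                 # model and checkpoint
    kt_method: str             # KT method
    weight_path_type: str      # weight path type
    cpu_sku: str
    physical_cores: int
    numa_nodes: int
    gpu_sku: str
    gpu_count: int
    vram_gb: float
    package_versions: dict     # package name -> version string
    launch_command: str
    input_tokens: int
    output_tokens: int
    concurrency: int
    batch_behavior: str
    measured_phase: str        # "prefill", "decode", or "end-to-end"

    def missing_fields(self):
        """Return the names of fields left empty (None, "", or {})."""
        return [
            f.name for f in fields(self)
            if getattr(self, f.name) in (None, "", {})
        ]


# Placeholder values only; the launch command is deliberately left empty.
record = BenchmarkMetadata(
    model="<model and checkpoint>",
    kt_method="<KT method>",
    weight_path_type="<weight path type>",
    cpu_sku="<CPU SKU>",
    physical_cores=32,
    numa_nodes=2,
    gpu_sku="<GPU SKU>",
    gpu_count=1,
    vram_gb=24.0,
    package_versions={"<package>": "<version>"},
    launch_command="",
    input_tokens=1024,
    output_tokens=128,
    concurrency=1,
    batch_behavior="<batch behavior>",
    measured_phase="decode",
)

print(record.missing_fields())                 # ['launch_command']
print(json.dumps(asdict(record), indent=2))    # store next to the result
```

Serializing the record next to each reported number means a reader can tell which run produced it without relying on surrounding prose.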

Metrics

Report metrics separately:

Metric               Meaning
Prefill tokens/s     Prompt processing throughput.
Decode tokens/s      Generation throughput after prefill.
End-to-end latency   User-visible latency for a request shape.
Peak memory          CPU RAM and GPU VRAM under the tested load.
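
To show why these metrics should not be collapsed into one number, here is a minimal sketch that derives the first three from per-request timestamps. It rests on two assumptions that are not stated in the text above: the time to the first output token is treated as the prefill time, and the first output token is counted as part of prefill, so decode covers the remaining `output_tokens - 1` tokens. The function name `throughput_metrics` and its parameters are hypothetical. Peak memory has to be sampled separately and is outside this sketch.

```python
def throughput_metrics(input_tokens, output_tokens,
                       t_start, t_first_token, t_end):
    """Split one request into prefill, decode, and end-to-end figures.

    All timestamps are in seconds on the same monotonic clock.
    Assumption: prefill ends when the first output token arrives.
    """
    prefill_s = t_first_token - t_start
    decode_s = t_end - t_first_token
    if prefill_s <= 0 or decode_s <= 0:
        raise ValueError("timestamps must be strictly increasing")
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to time decode")
    return {
        "prefill_tokens_per_s": input_tokens / prefill_s,
        # The first output token is attributed to prefill (assumption).
        "decode_tokens_per_s": (output_tokens - 1) / decode_s,
        "end_to_end_latency_s": t_end - t_start,
    }


# Synthetic numbers chosen for easy arithmetic, not measured results.
metrics = throughput_metrics(
    input_tokens=1000, output_tokens=101,
    t_start=0.0, t_first_token=2.0, t_end=12.0,
)
print(metrics)
```

With these synthetic inputs, prefill works out to 500 tokens/s and decode to 10 tokens/s. A single blended figure of 1101 tokens over 12 seconds would hide that gap.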

Comparison Rule

When comparing against another runtime, align:

  • model checkpoint
  • quantization / precision
  • input and output length
  • concurrency
  • hardware
  • server arguments

If any field differs, label the result as a directional observation rather than a strict benchmark.
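
The rule can be applied mechanically before a comparison is published. The following is a minimal sketch, assuming each run is described by a plain dictionary. The key names in `ALIGNMENT_FIELDS` and the function `classify_comparison` are hypothetical. Exact equality is a simplification: server arguments for two different runtimes rarely match literally, so in practice that field would hold a normalized description of the equivalent settings.

```python
# Hypothetical keys, one per item in the alignment list.
ALIGNMENT_FIELDS = (
    "model_checkpoint",
    "quantization",
    "input_tokens",
    "output_tokens",
    "concurrency",
    "hardware",
    "server_arguments",
)


def classify_comparison(run_a, run_b):
    """Return (label, differing_fields) for two run descriptions.

    A field that is missing from either run counts as differing,
    because an unreported setting cannot be shown to be aligned.
    """
    differing = [
        key for key in ALIGNMENT_FIELDS
        if run_a.get(key) is None
        or run_b.get(key) is None
        or run_a[key] != run_b[key]
    ]
    label = "directional observation" if differing else "strict benchmark"
    return label, differing


# Placeholder run descriptions; only concurrency differs.
baseline = {
    "model_checkpoint": "<checkpoint>",
    "quantization": "<precision>",
    "input_tokens": 1024,
    "output_tokens": 128,
    "concurrency": 1,
    "hardware": "<CPU and GPU description>",
    "server_arguments": "<normalized arguments>",
}
candidate = dict(baseline, concurrency=4)

print(classify_comparison(baseline, baseline))
# ('strict benchmark', [])
print(classify_comparison(baseline, candidate))
# ('directional observation', ['concurrency'])
```

Returning the list of differing fields along with the label lets the report state exactly why a result is only directional.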