Benchmark and Profiling
Performance claims must be reproducible. Do not report a single tokens/s number without the runtime tuple and workload shape.
Required Metadata
- model and checkpoint
- KT method and weight path type
- CPU SKU, physical core count, NUMA count
- GPU SKU/count and VRAM
- package versions
- launch command
- request shape: input tokens, output tokens, concurrency, batch behavior
- whether prefill, decode, or end-to-end throughput is measured
Metrics
Report metrics separately:
| Metric | Meaning |
|---|---|
| Prefill tokens/s | Prompt processing throughput. |
| Decode tokens/s | Generation throughput after prefill. |
| End-to-end latency | User-visible latency for a request shape. |
| Peak memory | CPU RAM and GPU VRAM under the tested load. |
Comparison Rule
When comparing against another runtime, align:
- model checkpoint
- quantization / precision
- input and output length
- concurrency
- hardware
- server arguments
If any field differs, label the result as a directional observation rather than a strict benchmark.