Sending Requests
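Once the KTransformers inference service is running, regular chat completion requests go through SGLang's OpenAI-compatible HTTP API.
The examples below assume the service listens on port 30000 and the served model name is my-model.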
cURL
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "user", "content": "Explain CPU-GPU heterogeneous MoE inference in one paragraph."}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
Python Requests
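import requests

# Send a non-streaming chat completion request to the local server.
response = requests.post(
    "http://127.0.0.1:30000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [
            {"role": "user", "content": "Give me a short deployment checklist."}
        ],
        "temperature": 0,
        "max_tokens": 128,
    },
)
response.raise_for_status()  # fail loudly on HTTP errors instead of printing an error body
print(response.json())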
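The `print(response.json())` call above dumps the entire response body. A minimal sketch of pulling out only the reply text follows, assuming the body uses the OpenAI chat completion layout (`choices[0].message.content`). `extract_reply` is a hypothetical helper name used for illustration, not part of any library.

```python
from typing import Any


def extract_reply(body: dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion body.

    Assumes the `choices[0].message.content` layout. Raises ValueError when
    the body has no choices (for example, when it is an error payload).
    """
    choices = body.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {body!r}")
    # `content` may be null in some responses; normalize that to "".
    return choices[0].get("message", {}).get("content") or ""
```

With the request above, this would be called as `print(extract_reply(response.json()))`.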
OpenAI Python Client
from openai import OpenAI

# The client requires an api_key value; "EMPTY" is a placeholder.
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "List three KTransformers tuning parameters."}
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
Streaming
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Stream a short answer."}],
    stream=True,
)
for chunk in stream:
    # Some chunks may carry no choices; skip them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()  # finish the line once the stream ends
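Streaming can also be consumed without the OpenAI client. The sketch below assumes the server emits the usual OpenAI-style server-sent events, that is, `data: {json}` lines terminated by `data: [DONE]`. That framing is an assumption about the server's output, and `iter_sse_deltas` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator, Union


def iter_sse_deltas(lines: Iterable[Union[str, bytes]]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines (assumed format)."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta
```

With `requests`, this would be fed from `response.iter_lines()` after posting the same JSON payload with `"stream": True` in the body and `stream=True` on the `requests.post` call.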
Built-in API Documentation
While the service is running, the auto-generated API documentation is available at:
http://127.0.0.1:30000/docs
http://127.0.0.1:30000/redoc
http://127.0.0.1:30000/openapi.json
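The `openapi.json` document is machine-readable, so the available routes can be listed from a script. This is a minimal sketch that assumes the file follows the standard OpenAPI layout, with a top-level `paths` object keyed by route and then by HTTP method. `list_routes` is a hypothetical helper name.

```python
from typing import Any

# HTTP method keys that an OpenAPI path item may contain.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_routes(spec: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs from an OpenAPI spec, sorted by path."""
    routes = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method in sorted(item):
            # Path items can also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                routes.append((method.upper(), path))
    return routes
```

For example, `list_routes(requests.get("http://127.0.0.1:30000/openapi.json").json())` would print the routes the running server exposes.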