Sending Requests
Once a KTransformers inference server is running, send chat completion requests through SGLang's OpenAI-compatible HTTP API.
The examples below assume the server is listening on port 30000 and the served model name is my-model.
cURL
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "user", "content": "Explain CPU-GPU heterogeneous MoE inference in one paragraph."}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
Python Requests
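import requests

# POST a chat completion request to the local server.
response = requests.post(
    "http://127.0.0.1:30000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [
            {"role": "user", "content": "Give me a short deployment checklist."}
        ],
        "temperature": 0,
        "max_tokens": 128,
    },
)
# Raise an exception on HTTP error statuses instead of printing an error body.
response.raise_for_status()
print(response.json())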
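The request above prints the whole JSON body. As a minimal sketch, assuming the server returns the standard OpenAI chat completion shape (a `choices` list whose entries carry a `message` with `content`), the assistant text can be pulled out like this. The helper name `extract_reply` is hypothetical and not part of KTransformers or SGLang.

```python
from typing import Any


def extract_reply(body: dict[str, Any]) -> str:
    """Return the assistant text from a chat completion response body.

    Assumes the OpenAI-style layout: body["choices"][0]["message"]["content"].
    """
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # content may be null in some responses; normalize that to an empty string.
    content = choices[0]["message"].get("content")
    return content or ""
```

With the `response` object from the example above, this would be called as `extract_reply(response.json())`.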
OpenAI Python Client
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "List three KTransformers tuning knobs."}
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
Streaming
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Stream a short answer."}],
    stream=True,
)
# Print each text fragment as it arrives; some chunks carry no content.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
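The loop above prints fragments as they arrive. If the full reply is also needed afterwards, one option is to accumulate the fragments. The sketch below is an illustration under assumptions, not part of the server or client API: `collect_stream` is a hypothetical helper, and it assumes each chunk exposes `choices[0].delta.content` as in the streaming example. It also skips chunks with an empty `choices` list, which some OpenAI-compatible servers may send.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Join the text fragments of a streamed chat completion into one string."""
    parts: list[str] = []
    for chunk in chunks:
        # Defensive check: skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        # delta is None or empty for chunks without text (e.g. role-only chunks).
        if delta:
            parts.append(delta)
    return "".join(parts)
```

Passing the `stream` object from the example above, as in `collect_stream(stream)`, would return the complete answer once the stream ends.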
Server Docs
When the server is running, check the generated API documentation at:
http://127.0.0.1:30000/docs
http://127.0.0.1:30000/redoc
http://127.0.0.1:30000/openapi.json
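The openapi.json document can also be read programmatically, for example to see which routes a given server build exposes. The following is a rough sketch that assumes the document follows the usual OpenAPI layout with a top-level `paths` object. The function names `fetch_openapi` and `list_routes` are hypothetical helpers.

```python
import json
from typing import Any
from urllib.request import urlopen

# HTTP method keys that may appear inside an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def fetch_openapi(base_url: str = "http://127.0.0.1:30000") -> dict[str, Any]:
    """Download and parse the schema from a running server."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as resp:
        return json.load(resp)


def list_routes(spec: dict[str, Any]) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found in an OpenAPI document."""
    routes = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes)
```

With the server running, `list_routes(fetch_openapi())` would return the method and path of every documented endpoint.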