04 - OpenAI-Compatible API Service
A comprehensive guide to vLLM's OpenAI-compatible API and seamless integration with the OpenAI ecosystem.
4.1 API Service Overview
vLLM ships with an OpenAI-compatible HTTP server that can act as a drop-in replacement for the OpenAI API. Existing OpenAI client code can switch to vLLM simply by changing base_url, as shown below.
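A minimal sketch of that switch (it assumes a local vLLM instance on port 8000 serving a model named qwen-7b, the setup used throughout this chapter):

```python
from openai import OpenAI

# Point the standard OpenAI SDK at the local vLLM server;
# only base_url (and the dummy api_key) differ from the OpenAI setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen-7b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```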
4.1.1 Architecture Overview

```
Client code
    │
    ▼
OpenAI Python SDK / cURL / any HTTP client
    │
    ▼  (HTTP request)
┌───────────────────────────────────┐
│  vLLM OpenAI-Compatible Server    │
│                                   │
│  /v1/chat/completions ──┐         │
│  /v1/completions      ──┤         │
│  /v1/embeddings       ──┼──→ vLLM Engine
│  /v1/models           ──┤         │
│  /health              ──┘         │
└───────────────────────────────────┘
```
4.1.2 Supported API Endpoints

| Endpoint | Method | Function | OpenAI-compatible |
|---|---|---|---|
| /v1/chat/completions | POST | Chat-style conversation | ✅ |
| /v1/completions | POST | Text completion | ✅ |
| /v1/embeddings | POST | Text embeddings | ✅ |
| /v1/models | GET | Model list | ✅ |
| /health | GET | Health check | - |
| /tokenize | POST | Tokenization | vLLM-specific |
| /detokenize | POST | Detokenization | vLLM-specific |
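The two vLLM-specific endpoints are handy for token counting. A minimal sketch of calling /tokenize (the exact response fields, such as count and tokens, may vary across vLLM versions):

```python
import requests

# Count tokens for a prompt using vLLM's /tokenize endpoint
resp = requests.post(
    "http://localhost:8000/tokenize",
    json={"model": "qwen-7b", "prompt": "What is deep learning?"},
)
data = resp.json()
print(data["count"])       # number of tokens in the prompt
print(data["tokens"][:10]) # first 10 token IDs
```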
4.2 Chat Completions API
4.2.1 Basic Request

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is deep learning?"}
    ],
    "max_tokens": 300,
    "temperature": 0.7
  }'
```
4.2.2 Response Format

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "qwen-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Deep learning is a subfield of machine learning..."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 156,
    "total_tokens": 181
  }
}
```
4.2.3 Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | - | Model name (required) |
| messages | array | - | Message list (required) |
| max_tokens | integer | null | Maximum number of generated tokens |
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | 1.0 | Nucleus sampling |
| n | integer | 1 | Number of candidate completions |
| stream | boolean | false | Whether to stream the output |
| stop | string/array | null | Stop sequences |
| presence_penalty | float | 0 | Presence penalty (-2 to 2) |
| frequency_penalty | float | 0 | Frequency penalty (-2 to 2) |
| logprobs | boolean | false | Whether to return log probabilities |
| top_logprobs | integer | null | Return top-N log probabilities |
| tools | array | null | Tool definitions (function calling) |
| tool_choice | string/object | auto | Tool selection strategy |
| response_format | object | null | Response format (JSON mode) |
| seed | integer | null | Random seed (for reproducibility) |
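A sketch exercising several of these parameters together (the values are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Two candidates, reproducible sampling, explicit stop sequence
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Name one sorting algorithm."}],
    n=2,              # return two candidate completions
    temperature=0.8,
    top_p=0.95,
    seed=42,          # same seed + same params -> reproducible output
    stop=["\n\n"],    # stop at the first blank line
    max_tokens=64,
)
for choice in response.choices:
    print(f"[{choice.index}] {choice.message.content}")
```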
4.2.4 Multi-Turn Conversation

```python
# multi_turn.py
"""Multi-turn conversation example."""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Maintain the conversation history across turns
conversation = [
    {"role": "system", "content": "You are a Python programming expert."}
]

def chat(user_input: str) -> str:
    conversation.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="qwen-7b",
        messages=conversation,
        max_tokens=500,
        temperature=0.7,
    )
    assistant_msg = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg

# Multi-turn dialogue
print(chat("How do I read a CSV file in Python?"))
print(chat("What optimizations help when the file is very large?"))
print(chat("Can you give a complete code example?"))
```
4.2.5 Function Calling
Note: the server must be started with tool calling enabled for tool calls to be returned (in recent vLLM versions, e.g. --enable-auto-tool-choice --tool-call-parser hermes for Qwen models).

```python
# function_calling.py
"""Function calling example."""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Beijing'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "user", "content": "What's the weather like in Beijing today?"}
    ],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message)
```
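If the model decides to call the tool, message.tool_calls carries the function name and JSON-encoded arguments; the caller executes the function and sends the result back in a tool message. A sketch of that second round trip, continuing the example above (get_weather_impl is a hypothetical local implementation):

```python
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)

    # Hypothetical local implementation of the tool
    def get_weather_impl(city: str, unit: str = "celsius") -> str:
        return f"{city}: sunny, 25 degrees ({unit})"

    result = get_weather_impl(**args)

    # Feed the tool result back so the model can produce the final answer
    followup = client.chat.completions.create(
        model="qwen-7b",
        messages=[
            {"role": "user", "content": "What's the weather like in Beijing today?"},
            message,  # the assistant message containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)
```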
4.2.6 JSON Mode
```python
# json_mode.py
"""Structured JSON output."""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "system", "content": "You are a JSON output assistant. Always reply in JSON format."},
        {"role": "user", "content": "List 3 common sorting algorithms and their time complexity"},
    ],
    response_format={"type": "json_object"},
    max_tokens=500,
    temperature=0.3,
)
print(response.choices[0].message.content)
# Prints a legal JSON string
```
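The returned string can be parsed directly with json.loads. Beyond JSON mode, vLLM can also constrain the output to an exact JSON Schema through its guided-decoding extension (a sketch; guided_json via extra_body is a vLLM-specific extension, not part of the OpenAI spec):

```python
import json

data = json.loads(response.choices[0].message.content)
print(type(data))  # dict

# Optional: enforce an exact schema with vLLM guided decoding
schema = {
    "type": "object",
    "properties": {
        "algorithms": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "complexity": {"type": "string"},
                },
                "required": ["name", "complexity"],
            },
        }
    },
    "required": ["algorithms"],
}
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "List 3 common sorting algorithms and their time complexity"}],
    extra_body={"guided_json": schema},  # vLLM-specific parameter
    max_tokens=500,
)
```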
4.3 Completions API
4.3.1 Basic Request

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-7b",
    "prompt": "The meaning of life is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
4.3.2 Python Client

```python
# completions_example.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Single request
response = client.completions.create(
    model="qwen-7b",
    prompt="def quicksort(arr):\n",
    max_tokens=200,
    temperature=0.2,
    stop=["\n\ndef"],  # stop when the next function definition starts
)
print(response.choices[0].text)

# Batch request
response = client.completions.create(
    model="qwen-7b",
    prompt=["How to learn Python?", "How to learn Rust?"],
    max_tokens=100,
    temperature=0.5,
)
for choice in response.choices:
    print(f"[{choice.index}]: {choice.text[:100]}")
```
4.3.3 Choosing Between Chat and Completions

| Dimension | Chat Completions | Completions |
|---|---|---|
| Suitable models | Chat/Instruct models | Base models |
| Input format | messages array | Plain text string |
| Chat template | Applied automatically | Not applicable |
| System prompt | ✅ Native support | Must be concatenated manually |
| Recommendation | Most scenarios | Completion/continuation scenarios |
4.4 Streaming Output
4.4.1 Basic Streaming

```bash
# Streaming request with cURL
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-7b",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "max_tokens": 200,
    "stream": true
  }'
```
4.4.2 Streaming Response Format

```
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":"Spring"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":" breeze"},"logprobs":null,"finish_reason":null}]}

...

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop"}]}

data: [DONE]
```
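The stream is plain Server-Sent Events: each line is a data: prefix followed by a JSON chunk, terminated by data: [DONE]. A minimal sketch that consumes the stream without the SDK, using the requests library:

```python
import json
import requests

# Consume the SSE stream directly over HTTP
with requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen-7b",
        "messages": [{"role": "user", "content": "Write a poem"}],
        "max_tokens": 200,
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```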
4.4.3 Python Streaming Client

```python
# stream_example.py
"""Complete streaming output example."""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def stream_chat(prompt: str):
    """Stream a chat completion token by token."""
    stream = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0.7,
        stream=True,
        stream_options={"include_usage": True},  # include usage statistics
    )
    for chunk in stream:
        if chunk.choices:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)
            # Report why generation stopped
            if chunk.choices[0].finish_reason:
                print(f"\n[finish reason: {chunk.choices[0].finish_reason}]")
        # With include_usage, the final chunk carries the usage stats
        if chunk.usage:
            print(f"[usage: prompt={chunk.usage.prompt_tokens}, "
                  f"completion={chunk.usage.completion_tokens}]")

stream_chat("Explain the concept of quantum entanglement")
```
4.4.4 Async Streaming Client

```python
# async_stream.py
"""Async streaming output."""
import asyncio
from openai import AsyncOpenAI

async def stream_chat_async(prompt: str):
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )
    stream = await client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

# Run several streaming requests concurrently
async def main():
    tasks = [
        stream_chat_async("What is machine learning?"),
        stream_chat_async("What is deep learning?"),
        stream_chat_async("What is reinforcement learning?"),
    ]
    await asyncio.gather(*tasks)

asyncio.run(main())
```
4.5 Embeddings API
4.5.1 Generating Embeddings

```python
# embeddings.py
"""Text embedding example."""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Load an embedding model when starting the server, e.g.:
# vllm serve BAAI/bge-base-zh-v1.5 --task embedding --served-model-name bge-base-zh
response = client.embeddings.create(
    model="bge-base-zh",
    input=["What is artificial intelligence?", "Basic concepts of machine learning"],
)
for i, embedding in enumerate(response.data):
    print(f"Text {i}: dim={len(embedding.embedding)}, "
          f"first 5 values={embedding.embedding[:5]}")
```
4.6 Handling Concurrent Requests
4.6.1 Multi-Threaded Concurrency

```python
# concurrent_requests.py
"""Concurrent request example."""
import time
import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def send_request(prompt: str) -> dict:
    """Send a single request and record its latency."""
    start = time.time()
    response = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7,
    )
    elapsed = time.time() - start
    return {
        "prompt": prompt[:30],
        "response": response.choices[0].message.content[:50],
        "tokens": response.usage.completion_tokens,
        "time": elapsed,
    }

# Fire 20 requests concurrently
prompts = [f"Explain concept {i} in one sentence" for i in range(20)]
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(send_request, prompts))
total_time = time.time() - start

# Statistics
total_tokens = sum(r["tokens"] for r in results)
print(f"Total time: {total_time:.2f}s")
print(f"Total tokens: {total_tokens}")
print(f"Throughput: {total_tokens / total_time:.1f} tokens/s")
```
4.6.2 Async Concurrency

```python
# async_concurrent.py
"""Async concurrent requests."""
import asyncio
import time
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )

    async def send_request(prompt: str):
        response = await client.chat.completions.create(
            model="qwen-7b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        return response.usage.completion_tokens

    prompts = [f"Explain concept {i} in one sentence" for i in range(50)]
    start = time.time()
    tasks = [send_request(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    total_time = time.time() - start
    print(f"50 concurrent requests completed in {total_time:.2f}s")
    print(f"Total tokens: {sum(results)}")
    print(f"Throughput: {sum(results) / total_time:.1f} tokens/s")

asyncio.run(main())
```
4.7 API Server Configuration
4.7.1 Launch Command Explained

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name qwen-7b \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --seed 42 \
  --enable-prefix-caching \
  --disable-log-requests \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192
```
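Once the server is up, a quick sanity check is to hit /health and /v1/models and confirm the served model name took effect; a minimal sketch:

```python
import requests

# Liveness check: /health returns HTTP 200 once the engine is up
print(requests.get("http://localhost:8000/health").status_code)

# /v1/models should list the name set via --served-model-name
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])  # expected: ['qwen-7b']
```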
4.7.2 Serving Multiple Models

A single vLLM instance can load only one base model (optionally alongside multiple LoRA adapters). To serve multiple models, start multiple instances:

```bash
# Instance 1: general-purpose model
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8000 \
  --served-model-name qwen-7b

# Instance 2: code model
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
  --port 8001 \
  --served-model-name qwen-coder

# Instance 3: math model
vllm serve Qwen/Qwen2.5-Math-7B-Instruct \
  --port 8002 \
  --served-model-name qwen-math
```
4.7.3 Nginx Reverse Proxy

```nginx
# /etc/nginx/conf.d/vllm.conf
upstream vllm_cluster {
    server 127.0.0.1:8000 weight=1;
    server 127.0.0.1:8001 weight=1;
    server 127.0.0.1:8002 weight=1;
}

server {
    listen 80;
    server_name llm.example.com;

    location /v1/ {
        proxy_pass http://vllm_cluster/v1/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming output
        proxy_buffering off;
        proxy_cache off;

        # Timeouts (LLM generation can be slow)
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
```

Note that round-robin balancing like this only makes sense when every upstream serves the same model (i.e., replicas). With the three different models from 4.7.2, requests must instead be routed by model name: a request for qwen-coder that lands on the qwen-7b instance would return 404.
4.8 Security Configuration
4.8.1 API Key Authentication

```bash
# Set an API key at startup
# vllm serve model --api-key YOUR_SECRET_KEY

# Request with the API key
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```

```python
# Python client with the API key
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="YOUR_SECRET_KEY",
)
```
4.8.2 CORS Configuration

```bash
# Allow cross-origin access (development environments only)
vllm serve model --allowed-origins '["*"]'
```
4.9 Error Handling
4.9.1 Common HTTP Status Codes

| Status code | Meaning | Common cause |
|---|---|---|
| 200 | Success | Normal operation |
| 400 | Bad request | Malformed parameters |
| 401 | Unauthorized | Wrong API key |
| 404 | Not found | Model name mismatch |
| 422 | Validation error | Illegal parameter values |
| 500 | Server error | Internal exception |
| 503 | Service unavailable | Queue full |
4.9.2 Error Handling Best Practices

```python
# error_handling.py
"""API error handling example."""
import time
from openai import OpenAI, APIError, RateLimitError, APITimeoutError

# The SDK already retries some transient errors (max_retries);
# the loop below adds application-level retries with backoff on top.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",
    timeout=60.0,
    max_retries=3,
)

def call_with_retry(prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model="qwen-7b",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = 2 ** attempt  # exponential backoff
            print(f"Rate limited, retrying in {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            print(f"Request timed out, retry {attempt + 1}...")
        except APIError as e:
            print(f"API error: {e}")
            if attempt == max_attempts - 1:
                raise
    raise RuntimeError("Exceeded maximum retry attempts")
```
4.10 Business Scenarios
Scenario 1: API Gateway Integration

```
Frontend app → API gateway → vLLM serving cluster
                   ↓
   auth / rate limiting / logging / routing
```
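A minimal sketch of such a gateway (a hypothetical FastAPI service that only shows the authenticate-then-forward pattern; real gateways add rate limiting, logging, and routing):

```python
# gateway.py -- hypothetical minimal gateway in front of vLLM
import httpx
from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
VLLM_URL = "http://localhost:8000"  # assumed backend vLLM instance
VALID_KEYS = {"client-key-1"}       # replace with a real key store

@app.post("/v1/chat/completions")
async def proxy(request: Request, authorization: str = Header(default="")):
    # Authentication: reject unknown API keys
    if authorization.removeprefix("Bearer ") not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    body = await request.json()
    # Forward to the vLLM backend (non-streaming for simplicity)
    async with httpx.AsyncClient(timeout=300.0) as client:
        resp = await client.post(f"{VLLM_URL}/v1/chat/completions", json=body)
    return resp.json()
```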
Scenario 2: Multi-Backend Routing

```python
# Route requests to different model backends by request type
from openai import OpenAI

def call_vllm(base_url: str, prompt: str, model: str = "qwen-7b") -> str:
    client = OpenAI(base_url=base_url, api_key="none")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def route_request(prompt: str, model_type: str) -> str:
    if model_type == "code":
        return call_vllm("http://coder:8000/v1", prompt, model="qwen-coder")
    elif model_type == "chat":
        return call_vllm("http://chat:8000/v1", prompt)
    return call_vllm("http://general:8000/v1", prompt)
```
4.11 Caveats

- Model name consistency: the model field in a request must match --served-model-name, otherwise the server returns 404.
- Streaming timeouts: use generous timeouts for streaming requests, since generation can run for tens of seconds.
- Concurrency limits: vLLM's concurrency is governed by max_num_seqs; requests beyond that limit wait in a queue.
- Context length: the total token count of a request (prompt + completion) must not exceed max-model-len, as the sketch below demonstrates.
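A sketch of that context-length guard: measure the prompt with /tokenize and cap max_tokens against the model's context window (vLLM reports max_model_len in the /v1/models metadata; field names may vary across versions):

```python
import requests

BASE = "http://localhost:8000"
prompt = "Explain the concept of quantum entanglement"

# Measure the prompt with vLLM's /tokenize endpoint
tok = requests.post(f"{BASE}/tokenize",
                    json={"model": "qwen-7b", "prompt": prompt}).json()

# Read the context window from the model metadata
max_len = requests.get(f"{BASE}/v1/models").json()["data"][0]["max_model_len"]

# Cap max_tokens so prompt + completion stays within max-model-len
budget = max(0, max_len - tok["count"])
print(f"prompt tokens={tok['count']}, completion budget={budget}")
```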
4.12 Further Reading
Previous chapter: 03 - Quick Start | Next chapter: 05 - Core Architecture Explained