# club-3090 LLM Serving

Skill by ara.so — Daily 2026 Skills collection.
Community recipes for serving modern LLMs on RTX 3090 (24 GB) hardware. Supports vLLM, llama.cpp, and SGLang engines with validated Docker Compose configs exposing an OpenAI-compatible API on `localhost:8020`. Currently ships Qwen3.6-27B configs for 1× and 2× cards.

## Engine Decision Matrix

| Need | Engine | Why |
|---|---|---|
| Max throughput (code/chat) | vLLM dual | 89–127 TPS, MTP n=3, vision, tools |
| Full 262K context, no crashes | llama.cpp single | No prefill cliffs, stable tool-use |
| 4 concurrent streams @ 262K | vLLM dual turbo | Stream isolation, full feature stack |
| Single card, moderate ctx | vLLM default | ~89 TPS, easiest setup |
SGLang is currently blocked on Qwen3.6-27B — see `models/qwen3.6-27b/sglang/README.md`.

## Prerequisites

- 1× or 2× NVIDIA RTX 3090 (24 GB each)
- Linux (Ubuntu 22.04+ recommended)
- Docker + NVIDIA Container Toolkit
- NVIDIA driver 580.x+
- ~30 GB free disk per model

## Installation & Setup

1. Clone the repo
```bash
git clone https://github.com/noonghunna/club-3090.git
cd club-3090
```

2. Download and verify a model

```bash
# Downloads model weights, verifies SHA, clones Genesis patches
bash scripts/setup.sh qwen3.6-27b
```

3. Launch (interactive wizard)

```bash
bash scripts/launch.sh
```

Wizard prompts: engine → card count → workload → boots compose → verifies.

4. Launch (non-interactive)

undefinedSingle card, chat-optimized
单卡,对话优化配置
bash scripts/launch.sh --variant vllm/default
bash scripts/launch.sh --variant vllm/default
Dual card, 262K context + vision
双卡,262K上下文 + 视觉能力
bash scripts/launch.sh --variant vllm/dual
bash scripts/launch.sh --variant vllm/dual
Single card, 262K context, no prefill cliffs
单卡,262K上下文,无预填充瓶颈
bash scripts/launch.sh --variant llamacpp/default
bash scripts/launch.sh --variant llamacpp/default
List all available variants
列出所有可用配置变体
bash scripts/switch.sh --list
---bash scripts/switch.sh --list
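
Whichever launch path you use, the server is ready once `/v1/models` answers on port 8020; model load can take a while after the containers come up. A minimal readiness poll you can run before pointing clients at it (not part of the repo's scripts; plain `requests` against the documented endpoint):

```python
import time
import requests

def wait_for_server(base_url: str = "http://localhost:8020/v1", timeout_s: int = 600) -> None:
    """Poll /v1/models until the engine has loaded the weights and answers requests."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/models", timeout=5)
            if resp.ok:
                print("Server ready, models:", [m["id"] for m in resp.json()["data"]])
                return
        except (requests.ConnectionError, requests.Timeout):
            pass  # container still starting / model still loading
        time.sleep(5)
    raise TimeoutError("Server did not become ready in time")

wait_for_server()
```
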
---

## Key Scripts

| Script | Purpose |
|---|---|
| `scripts/setup.sh` | Preflight checks, model download, SHA verify, Genesis patch clone |
| `scripts/launch.sh` | Interactive or direct variant boot; calls switch.sh + verify-full.sh |
| `scripts/switch.sh` | Stateless switcher — tears down old compose, brings up new one |
| `scripts/health.sh` | Live health probe: KV %, MTP accept-length, recent TPS, errors |
| | Quick smoke test (engine-aware via env vars) |
| `scripts/verify-full.sh` | 8-check functional test (~1–2 min) |
| `scripts/verify-stress.sh` | Boundary stress test: 262K ladder + tool prefill OOM (~5–10 min) |
| `scripts/bench.sh` | Canonical TPS benchmark (3 warm + 5 measured runs) |
### Common script usage

```bash
# Switch variants without the wizard
bash scripts/switch.sh vllm/long-vision
bash scripts/switch.sh vllm/dual
bash scripts/switch.sh llamacpp/default

# Check runtime health
bash scripts/health.sh
# Output: KV cache %, MTP accept-length rate, recent TPS, error log tail

# Run canonical benchmark
bash scripts/bench.sh
# Runs narrative + code prompts, prints per-run TPS + averages

# Full functional verification after a switch
bash scripts/verify-full.sh

# Stress test (run before relying on long-context)
bash scripts/verify-stress.sh
```
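
As a rough cross-check of the numbers `scripts/bench.sh` prints, you can time a single completion over the API and divide the reported completion tokens by wall-clock time. This is only a client-side estimate (it includes prefill and HTTP overhead), not the repo's canonical benchmark:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="ignored")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Write a 300-word story about a GPU."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens  # usage block is part of the standard response schema
print(f"{tokens} tokens in {elapsed:.1f}s ~= {tokens / elapsed:.1f} TPS (rough, includes prefill + HTTP)")
```
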
---

## Variant Names Reference

```
vllm/default       Single-card, chat-optimized (recommended first start)
vllm/dual          Dual-card, 262K ctx, vision, tools, MTP n=3
vllm/long-vision   Dual-card, long-context + vision workloads
vllm/turbo         Dual-card, 4 concurrent streams @ 262K
llamacpp/default   Single-card, full 262K, no prefill cliffs
llamacpp/65k       Single-card, 65K ctx (faster, more VRAM headroom)
llamacpp/dual      Dual-card llama.cpp recipe
```

## API Usage (OpenAI-compatible, port 8020)

The server exposes a standard OpenAI-compatible API. Use the `openai` Python SDK pointed at `localhost:8020`.

### Python — openai SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8020/v1",
    api_key="ignored",  # local server, no auth needed
)

# Basic chat
response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Explain KV cache in one paragraph."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

### Python — streaming

```python
stream = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Write a Python quicksort."}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

### Python — raw requests (no SDK dependency)

```python
import requests

payload = {
    "model": "qwen3.6-27b-autoround",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 200,
    "temperature": 0.7,
}
resp = requests.post(
    "http://localhost:8020/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

### Python — tool calling

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for recent information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                },
                "required": ["query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "What's the latest news on CUDA 13?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)
msg = response.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
```
print(f"参数:{call.function.arguments}")Python — long context (262K, use with llamacpp/default or vllm/dual)
Python — 长上下文(262K,搭配llamacpp/default或vllm/dual使用)
```python
# Load a large document
with open("large_codebase.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[
        {"role": "user", "content": f"Summarize the architecture:\n\n{document}"},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
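
Whether a document actually fits depends on the tokenizer, but a crude size check before sending a 262K-scale prompt can save a failed request. A rough sketch continuing from `document` above (the 4-characters-per-token ratio is an assumption, not the model's real tokenizer):

```python
def rough_token_estimate(text: str) -> int:
    # Assumption: ~4 characters per token for English-ish text; the real count
    # depends on the model's tokenizer.
    return len(text) // 4

estimate = rough_token_estimate(document)
print(f"~{estimate} tokens")
if estimate > 262_144 - 2_048:  # leave headroom for the reply
    raise ValueError("Document likely exceeds the 262K window; split it or summarize in chunks")
```
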
### TypeScript / Node

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8020/v1",
  apiKey: "ignored",
});

async function chat(prompt: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "qwen3.6-27b-autoround",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 512,
  });
  return response.choices[0].message.content ?? "";
}

// Streaming in Node
async function streamChat(prompt: string): Promise<void> {
  const stream = await client.chat.completions.create({
    model: "qwen3.6-27b-autoround",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 1024,
    stream: true,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
  console.log();
}
```
### curl — quick sanity check

```bash
curl -sf http://localhost:8020/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b-autoround",
    "messages": [{"role": "user", "content": "Capital of France?"}],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'
```

### curl — list available models

```bash
curl -sf http://localhost:8020/v1/models | jq '.data[].id'
```

## Docker Compose Structure

Configs live under `models/qwen3.6-27b/vllm/compose/`. Example structure of a single-card compose:

```yaml
# models/qwen3.6-27b/vllm/compose/default.yml (representative structure)
services:
  vllm:
    image: vllm/vllm-openai:v0.20.1rc1.dev16+g7a1eb8ac2
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
    ports:
      - "8020:8000"
    volumes:
      - ${MODEL_PATH}:/models/qwen3.6-27b
      - ${PATCH_PATH}:/patches
    command: >
      --model /models/qwen3.6-27b
      --served-model-name qwen3.6-27b-autoround
      --tensor-parallel-size 1
      --max-model-len 65536
      --kv-cache-dtype fp8
      --speculative-model /models/qwen3.6-27b/mtp_head
      --num-speculative-tokens 3
      --port 8000
```
For dual-card, `tensor-parallel-size 2` and `NVIDIA_VISIBLE_DEVICES=0,1` are set, and `max-model-len` extends to 262144.
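
If you want to see exactly what a variant will run before switching to it, the compose files are plain YAML and can be inspected programmatically. A small sketch using PyYAML, with the service name and file path taken from the representative structure above:

```python
import yaml

# Path and service name follow the representative compose shown above.
with open("models/qwen3.6-27b/vllm/compose/default.yml") as f:
    compose = yaml.safe_load(f)

svc = compose["services"]["vllm"]
print("image:  ", svc["image"])
print("devices:", [e for e in svc["environment"] if "VISIBLE_DEVICES" in e])
print("ports:  ", svc["ports"])
print("args:   ", svc["command"])
```
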
---

## Connecting External Clients

### Open WebUI

```
API Base URL: http://localhost:8020/v1
API Key: (leave blank or type anything)
Model: qwen3.6-27b-autoround
```

### Cline / Cursor / Copilot-compatible tools

```json
{
  "openai.baseURL": "http://localhost:8020/v1",
  "openai.apiKey": "local",
  "openai.model": "qwen3.6-27b-autoround"
}
```

### LiteLLM proxy passthrough

```yaml
# litellm_config.yaml
model_list:
  - model_name: qwen3.6-27b
    litellm_params:
      model: openai/qwen3.6-27b-autoround
      api_base: http://localhost:8020/v1
      api_key: ignored
```
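
With the proxy running against this config, clients talk to LiteLLM instead of port 8020 directly. A minimal sketch, assuming LiteLLM's usual default proxy port of 4000 (adjust to however you run the proxy):

```python
from openai import OpenAI

# Assumption: LiteLLM proxy on its usual default port 4000; adjust to your deployment.
proxy = OpenAI(base_url="http://localhost:4000", api_key="anything")

resp = proxy.chat.completions.create(
    model="qwen3.6-27b",  # the model_name declared in litellm_config.yaml
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
```
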
---

## Repo Layout Quick Reference

```
club-3090/
├── scripts/              Shared model-aware scripts (setup, launch, bench, health)
├── models/
│   └── qwen3.6-27b/
│       ├── vllm/
│       │   ├── compose/  Docker Compose files (all variants)
│       │   └── patches/  tolist_cudagraph, Marlin pad, Genesis pointer
│       ├── llama-cpp/
│       │   └── recipes/  Single-card 65K / 262K-max / dual recipes
│       └── sglang/       Blocked — watch list only
└── docs/
    ├── SINGLE_CARD.md    1× 3090 workload → config guide
    ├── DUAL_CARD.md      2× 3090 workload → config guide
    ├── HARDWARE.md       PCIe vs NVLink, power draw, card compatibility
    ├── GLOSSARY.md       TPS / KV / MTP / TP / prefill cliff definitions
    ├── CLIFFS.md         Prefill cliff root causes and fix landscape
    ├── COMPARISONS.md    Self-host vs cloud cost crossover analysis
    ├── UPSTREAM.md       Tracked upstream issues and PRs
    └── engines/          Per-engine deep dives (vLLM / llama.cpp / SGLang)
```

## Troubleshooting

### Server won't start — CUDA/driver error

```bash
# Check driver version (need 580.x+)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

# Check GPU visibility
nvidia-smi -L
```

### Out of VRAM / OOM on prefill

```bash
# Check current KV cache usage
bash scripts/health.sh

# Switch to a config with smaller max-model-len
bash scripts/switch.sh llamacpp/65k      # 65K ctx, more headroom
bash scripts/switch.sh llamacpp/default  # 262K but manages prefill correctly
```

### Prefill cliff (vLLM hangs or errors on large prompts)

This is a known DeltaNet architecture issue on Qwen3.6-27B with vLLM. The llama.cpp route avoids it entirely:

```bash
bash scripts/switch.sh llamacpp/default
```

Stress-test it:

```bash
bash scripts/verify-stress.sh
```

For vLLM workarounds, see `models/qwen3.6-27b/INTERNALS.md` and `docs/CLIFFS.md`.

### MTP / speculative decoding not accepting tokens

```bash
bash scripts/health.sh
# Look for "MTP AL:" (accept-length) — should be > 1.0
# If AL ~= 1.0, speculative head may not be loaded correctly

# Check that Genesis patches were applied:
bash scripts/setup.sh qwen3.6-27b  # re-runs patch verification
```

### Tool call returns 25K+ tokens and hangs

Known failure mode on vLLM with very large tool responses. Use llama.cpp:
```bash
bash scripts/switch.sh llamacpp/default
# llama.cpp handles 25K-token tool returns cleanly (stress-tested)
```
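
If you need to stay on vLLM, one client-side mitigation (not something the repo provides) is to cap how much of a large tool result you feed back into the next request:

```python
MAX_TOOL_CHARS = 20_000  # rough cap; tune so tool prefill stays well under the context limit

def clip_tool_output(text: str, limit: int = MAX_TOOL_CHARS) -> str:
    # Keep the head of the output and flag the truncation explicitly for the model.
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n\n[truncated {len(text) - limit} characters]"
```
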
### Switching variants leaves old container running

```bash
# switch.sh handles this, but if you ran docker compose manually:
docker compose -f models/qwen3.6-27b/vllm/compose/default.yml down
bash scripts/switch.sh vllm/dual
```

### Check what variant is currently running

```bash
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"
```

## Performance Reference (Qwen3.6-27B)

| Config | Cards | TPS (narrative) | TPS (code) | Max ctx | Notes |
|---|---|---|---|---|---|
| `vllm/default` | 1× | ~89 | ~89 | 65K | Recommended starting point |
| `vllm/dual` | 2× | ~89 | ~127 | 262K | DFlash on code workloads |
| `vllm/turbo` | 2× | — | — | 262K | 4 concurrent streams |
| `llamacpp/default` | 1× | ~21 | ~21 | 262K | No cliffs, stable tool-use |

Benchmark substrate: vLLM nightly `0.20.1rc1.dev16+g7a1eb8ac2` + Genesis v7.65 dev, llama.cpp `0d0764dfd`, RTX 3090 sm_86 PCIe @ 230 W. Full per-run numbers in `models/qwen3.6-27b/CHANGELOG.md`.

## Adding a New Model

```bash
# The repo structure is model-agnostic.
# New models follow the same pattern under models/<name>/:
mkdir -p models/glm-4.6/{vllm/compose,vllm/patches,llama-cpp/recipes,sglang}

# Add README.md, INTERNALS.md, CHANGELOG.md following qwen3.6-27b/ as template
# setup.sh and launch.sh are model-aware — add the model slug to their dispatch
bash scripts/setup.sh glm-4.6  # once scripts updated
```

---

## Key Links

- docs/SINGLE_CARD.md — 1× 3090 workload → config → quick start
- docs/DUAL_CARD.md — 2× 3090 workload → config → quick start
- docs/HARDWARE.md — 4090/A6000 compatibility, NVLink notes
- docs/GLOSSARY.md — TPS, KV cache, MTP, TP, prefill cliff explained
- docs/COMPARISONS.md — self-host vs cloud cost crossover
- docs/CLIFFS.md — prefill cliff deep dive
- docs/UPSTREAM.md — upstream issues/PRs being tracked
- CONTRIBUTING.md — adding numbers, bug repros, new variants