# club-3090 LLM Serving

Skill by ara.so — Daily 2026 Skills collection.
Community recipes for serving modern LLMs on RTX 3090 (24 GB) hardware. Supports vLLM, llama.cpp, and SGLang engines with validated Docker Compose configs exposing an OpenAI-compatible API on `localhost:8020`. Currently ships Qwen3.6-27B configs for 1× and 2× cards.

## Engine Decision Matrix

| Need | Engine | Why |
|---|---|---|
| Max throughput (code/chat) | vLLM dual | 89–127 TPS, MTP n=3, vision, tools |
| Full 262K context, no crashes | llama.cpp single | No prefill cliffs, stable tool-use |
| 4 concurrent streams @ 262K | vLLM dual turbo | Stream isolation, full feature stack |
| Single card, moderate ctx | vLLM default | ~89 TPS, easiest setup |
SGLang is currently blocked on Qwen3.6-27B — see `models/qwen3.6-27b/sglang/README.md`.

## Prerequisites

- 1× or 2× NVIDIA RTX 3090 (24 GB each)
- Linux (Ubuntu 22.04+ recommended)
- Docker + NVIDIA Container Toolkit
- NVIDIA driver 580.x+
- ~30 GB free disk per model

## Installation & Setup

1. Clone the repo
```bash
git clone https://github.com/noonghunna/club-3090.git
cd club-3090
```

2. Download and verify a model

```bash
# Downloads model weights, verifies SHA, clones Genesis patches
bash scripts/setup.sh qwen3.6-27b
```

3. Launch (interactive wizard)

```bash
bash scripts/launch.sh
```

Wizard prompts: engine → card count → workload → boots compose → verifies.

4. Launch (non-interactive)

undefinedSingle card, chat-optimized
单卡,对话优化配置
bash scripts/launch.sh --variant vllm/default
bash scripts/launch.sh --variant vllm/default
Dual card, 262K context + vision
双卡,262K上下文 + 视觉能力
bash scripts/launch.sh --variant vllm/dual
bash scripts/launch.sh --variant vllm/dual
Single card, 262K context, no prefill cliffs
单卡,262K上下文,无预填充瓶颈
bash scripts/launch.sh --variant llamacpp/default
bash scripts/launch.sh --variant llamacpp/default
List all available variants
列出所有可用配置变体
bash scripts/switch.sh --list
---bash scripts/switch.sh --list
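
Whichever launch path you use, the server is ready once `/v1/models` answers on port 8020; model load can take a while after the containers come up. A minimal readiness poll you can run before pointing clients at it (not part of the repo's scripts; plain `requests` against the documented endpoint):

```python
import time
import requests

def wait_for_server(base_url: str = "http://localhost:8020/v1", timeout_s: int = 600) -> None:
    """Poll /v1/models until the engine has loaded the weights and answers requests."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/models", timeout=5)
            if resp.ok:
                print("Server ready, models:", [m["id"] for m in resp.json()["data"]])
                return
        except (requests.ConnectionError, requests.Timeout):
            pass  # container still starting / model still loading
        time.sleep(5)
    raise TimeoutError("Server did not become ready in time")

wait_for_server()
```
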
---

## Key Scripts

| Script | Purpose |
|---|---|
| `scripts/setup.sh` | Preflight checks, model download, SHA verify, Genesis patch clone |
| `scripts/launch.sh` | Interactive or direct variant boot; calls switch.sh + verify-full.sh |
| `scripts/switch.sh` | Stateless switcher — tears down old compose, brings up new one |
| `scripts/health.sh` | Live health probe: KV %, MTP accept-length, recent TPS, errors |
| | Quick smoke test (engine-aware via env vars) |
| `scripts/verify-full.sh` | 8-check functional test (~1–2 min) |
| `scripts/verify-stress.sh` | Boundary stress test: 262K ladder + tool prefill OOM (~5–10 min) |
| `scripts/bench.sh` | Canonical TPS benchmark (3 warm + 5 measured runs) |
### Common script usage

```bash
# Switch variants without the wizard
bash scripts/switch.sh vllm/long-vision
bash scripts/switch.sh vllm/dual
bash scripts/switch.sh llamacpp/default

# Check runtime health
bash scripts/health.sh
# Output: KV cache %, MTP accept-length rate, recent TPS, error log tail

# Run canonical benchmark
bash scripts/bench.sh
# Runs narrative + code prompts, prints per-run TPS + averages

# Full functional verification after a switch
bash scripts/verify-full.sh

# Stress test (run before relying on long-context)
bash scripts/verify-stress.sh
```
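
As a rough cross-check of the numbers `scripts/bench.sh` prints, you can time a single completion over the API and divide the reported completion tokens by wall-clock time. This is only a client-side estimate (it includes prefill and HTTP overhead), not the repo's canonical benchmark:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="ignored")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Write a 300-word story about a GPU."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens  # usage block is part of the standard response schema
print(f"{tokens} tokens in {elapsed:.1f}s ~= {tokens / elapsed:.1f} TPS (rough, includes prefill + HTTP)")
```
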
---

## Variant Names Reference

```
vllm/default       Single-card, chat-optimized (recommended first start)
vllm/dual          Dual-card, 262K ctx, vision, tools, MTP n=3
vllm/long-vision   Dual-card, long-context + vision workloads
vllm/turbo         Dual-card, 4 concurrent streams @ 262K
llamacpp/default   Single-card, full 262K, no prefill cliffs
llamacpp/65k       Single-card, 65K ctx (faster, more VRAM headroom)
llamacpp/dual      Dual-card llama.cpp recipe
```

## API Usage (OpenAI-compatible, port 8020)

The server exposes a standard OpenAI-compatible API. Use the `openai` Python SDK pointed at `localhost:8020`.

### Python — openai SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8020/v1",
    api_key="ignored",  # local server, no auth needed
)

# Basic chat
response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Explain KV cache in one paragraph."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

### Python — streaming

```python
stream = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Write a Python quicksort."}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

### Python — raw requests (no SDK dependency)

```python
import requests

payload = {
    "model": "qwen3.6-27b-autoround",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 200,
    "temperature": 0.7,
}
resp = requests.post(
    "http://localhost:8020/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

### Python — tool calling

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for recent information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                },
                "required": ["query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "What's the latest news on CUDA 13?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)
msg = response.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
```
print(f"参数:{call.function.arguments}")Python — long context (262K, use with llamacpp/default or vllm/dual)
Python — 长上下文(262K,搭配llamacpp/default或vllm/dual使用)
```python
# Load a large document
with open("large_codebase.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[
        {"role": "user", "content": f"Summarize the architecture:\n\n{document}"},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
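
Whether a document actually fits depends on the tokenizer, but a crude size check before sending a 262K-scale prompt can save a failed request. A rough sketch continuing from `document` above (the 4-characters-per-token ratio is an assumption, not the model's real tokenizer):

```python
def rough_token_estimate(text: str) -> int:
    # Assumption: ~4 characters per token for English-ish text; the real count
    # depends on the model's tokenizer.
    return len(text) // 4

estimate = rough_token_estimate(document)
print(f"~{estimate} tokens")
if estimate > 262_144 - 2_048:  # leave headroom for the reply
    raise ValueError("Document likely exceeds the 262K window; split it or summarize in chunks")
```
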
### TypeScript / Node

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8020/v1",
  apiKey: "ignored",
});

async function chat(prompt: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "qwen3.6-27b-autoround",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 512,
  });
  return response.choices[0].message.content ?? "";
}

// Streaming in Node
async function streamChat(prompt: string): Promise<void> {
  const stream = await client.chat.completions.create({
    model: "qwen3.6-27b-autoround",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 1024,
    stream: true,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
  console.log();
}
```
### curl — quick sanity check

```bash
curl -sf http://localhost:8020/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b-autoround",
    "messages": [{"role": "user", "content": "Capital of France?"}],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'
```

### curl — list available models

```bash
curl -sf http://localhost:8020/v1/models | jq '.data[].id'
```

## Docker Compose Structure

Configs live under `models/qwen3.6-27b/vllm/compose/`. Example structure of a single-card compose:

```yaml
# models/qwen3.6-27b/vllm/compose/default.yml (representative structure)
services:
  vllm:
    image: vllm/vllm-openai:v0.20.1rc1.dev16+g7a1eb8ac2
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
    ports:
      - "8020:8000"
    volumes:
      - ${MODEL_PATH}:/models/qwen3.6-27b
      - ${PATCH_PATH}:/patches
    command: >
      --model /models/qwen3.6-27b
      --served-model-name qwen3.6-27b-autoround
      --tensor-parallel-size 1
      --max-model-len 65536
      --kv-cache-dtype fp8
      --speculative-model /models/qwen3.6-27b/mtp_head
      --num-speculative-tokens 3
      --port 8000
```
For dual-card, `tensor-parallel-size 2` and `NVIDIA_VISIBLE_DEVICES=0,1` are set, and `max-model-len` extends to 262144.
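
If you want to see exactly what a variant will run before switching to it, the compose files are plain YAML and can be inspected programmatically. A small sketch using PyYAML, with the service name and file path taken from the representative structure above:

```python
import yaml

# Path and service name follow the representative compose shown above.
with open("models/qwen3.6-27b/vllm/compose/default.yml") as f:
    compose = yaml.safe_load(f)

svc = compose["services"]["vllm"]
print("image:  ", svc["image"])
print("devices:", [e for e in svc["environment"] if "VISIBLE_DEVICES" in e])
print("ports:  ", svc["ports"])
print("args:   ", svc["command"])
```
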
---

## Connecting External Clients

### Open WebUI

```
API Base URL: http://localhost:8020/v1
API Key: (leave blank or type anything)
Model: qwen3.6-27b-autoround
```

### Cline / Cursor / Copilot-compatible tools

```json
{
  "openai.baseURL": "http://localhost:8020/v1",
  "openai.apiKey": "local",
  "openai.model": "qwen3.6-27b-autoround"
}
```

### LiteLLM proxy passthrough

```yaml
# litellm_config.yaml
model_list:
  - model_name: qwen3.6-27b
    litellm_params:
      model: openai/qwen3.6-27b-autoround
      api_base: http://localhost:8020/v1
      api_key: ignored
```
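
With the proxy running against this config, clients talk to LiteLLM instead of port 8020 directly. A minimal sketch, assuming LiteLLM's usual default proxy port of 4000 (adjust to however you run the proxy):

```python
from openai import OpenAI

# Assumption: LiteLLM proxy on its usual default port 4000; adjust to your deployment.
proxy = OpenAI(base_url="http://localhost:4000", api_key="anything")

resp = proxy.chat.completions.create(
    model="qwen3.6-27b",  # the model_name declared in litellm_config.yaml
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
```
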
---

## Repo Layout Quick Reference

```
club-3090/
├── scripts/              Shared model-aware scripts (setup, launch, bench, health)
├── models/
│   └── qwen3.6-27b/
│       ├── vllm/
│       │   ├── compose/  Docker Compose files (all variants)
│       │   └── patches/  tolist_cudagraph, Marlin pad, Genesis pointer
│       ├── llama-cpp/
│       │   └── recipes/  Single-card 65K / 262K-max / dual recipes
│       └── sglang/       Blocked — watch list only
└── docs/
    ├── SINGLE_CARD.md    1× 3090 workload → config guide
    ├── DUAL_CARD.md      2× 3090 workload → config guide
    ├── HARDWARE.md       PCIe vs NVLink, power draw, card compatibility
    ├── GLOSSARY.md       TPS / KV / MTP / TP / prefill cliff definitions
    ├── CLIFFS.md         Prefill cliff root causes and fix landscape
    ├── COMPARISONS.md    Self-host vs cloud cost crossover analysis
    ├── UPSTREAM.md       Tracked upstream issues and PRs
    └── engines/          Per-engine deep dives (vLLM / llama.cpp / SGLang)
```

## Troubleshooting

### Server won't start — CUDA/driver error

```bash
# Check driver version (need 580.x+)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

# Check GPU visibility
nvidia-smi -L
```

### Out of VRAM / OOM on prefill

```bash
# Check current KV cache usage
bash scripts/health.sh

# Switch to a config with smaller max-model-len
bash scripts/switch.sh llamacpp/65k      # 65K ctx, more headroom
bash scripts/switch.sh llamacpp/default  # 262K but manages prefill correctly
```

### Prefill cliff (vLLM hangs or errors on large prompts)

This is a known DeltaNet architecture issue on Qwen3.6-27B with vLLM. The llama.cpp route avoids it entirely:

```bash
bash scripts/switch.sh llamacpp/default
```

Stress-test it:

```bash
bash scripts/verify-stress.sh
```

For vLLM workarounds, see `models/qwen3.6-27b/INTERNALS.md` and `docs/CLIFFS.md`.

### MTP / speculative decoding not accepting tokens

```bash
bash scripts/health.sh
# Look for "MTP AL:" (accept-length) — should be > 1.0
# If AL ~= 1.0, speculative head may not be loaded correctly

# Check that Genesis patches were applied:
bash scripts/setup.sh qwen3.6-27b  # re-runs patch verification
```

### Tool call returns 25K+ tokens and hangs

Known failure mode on vLLM with very large tool responses. Use llama.cpp:
```bash
bash scripts/switch.sh llamacpp/default
# llama.cpp handles 25K-token tool returns cleanly (stress-tested)
```
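
If you need to stay on vLLM, one client-side mitigation (not something the repo provides) is to cap how much of a large tool result you feed back into the next request:

```python
MAX_TOOL_CHARS = 20_000  # rough cap; tune so tool prefill stays well under the context limit

def clip_tool_output(text: str, limit: int = MAX_TOOL_CHARS) -> str:
    # Keep the head of the output and flag the truncation explicitly for the model.
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n\n[truncated {len(text) - limit} characters]"
```
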
### Switching variants leaves old container running

```bash
# switch.sh handles this, but if you ran docker compose manually:
docker compose -f models/qwen3.6-27b/vllm/compose/default.yml down
bash scripts/switch.sh vllm/dual
```

### Check what variant is currently running

```bash
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"
```

## Performance Reference (Qwen3.6-27B)

| Config | Cards | TPS (narrative) | TPS (code) | Max ctx | Notes |
|---|---|---|---|---|---|
| `vllm/default` | 1× | ~89 | ~89 | 65K | Recommended starting point |
| `vllm/dual` | 2× | ~89 | ~127 | 262K | DFlash on code workloads |
| `vllm/turbo` | 2× | — | — | 262K | 4 concurrent streams |
| `llamacpp/default` | 1× | ~21 | ~21 | 262K | No cliffs, stable tool-use |

Benchmark substrate: vLLM nightly `0.20.1rc1.dev16+g7a1eb8ac2` + Genesis v7.65 dev, llama.cpp `0d0764dfd`, RTX 3090 sm_86 PCIe @ 230 W. Full per-run numbers in `models/qwen3.6-27b/CHANGELOG.md`.

## Adding a New Model

```bash
# The repo structure is model-agnostic.
# New models follow the same pattern under models/<name>/:
mkdir -p models/glm-4.6/{vllm/compose,vllm/patches,llama-cpp/recipes,sglang}

# Add README.md, INTERNALS.md, CHANGELOG.md following qwen3.6-27b/ as template
# setup.sh and launch.sh are model-aware — add the model slug to their dispatch
bash scripts/setup.sh glm-4.6  # once scripts updated
```

---

## Key Links

- docs/SINGLE_CARD.md — 1× 3090 workload → config → quick start
- docs/DUAL_CARD.md — 2× 3090 workload → config → quick start
- docs/HARDWARE.md — 4090/A6000 compatibility, NVLink notes
- docs/GLOSSARY.md — TPS, KV cache, MTP, TP, prefill cliff explained
- docs/COMPARISONS.md — self-host vs cloud cost crossover
- docs/CLIFFS.md — prefill cliff deep dive
- docs/UPSTREAM.md — upstream issues/PRs being tracked
- CONTRIBUTING.md — adding numbers, bug repros, new variants