club-3090 LLM Serving

Skill by ara.so — Daily 2026 Skills collection.
Community recipes for serving modern LLMs on RTX 3090 (24 GB) hardware. Supports vLLM, llama.cpp, and SGLang engines with validated Docker Compose configs exposing an OpenAI-compatible API on `localhost:8020`. Currently ships Qwen3.6-27B configs for 1× and 2× cards.

Engine Decision Matrix

| Need | Engine | Why |
| --- | --- | --- |
| Max throughput (code/chat) | vLLM dual | 89–127 TPS, MTP n=3, vision, tools |
| Full 262K context, no crashes | llama.cpp single | No prefill cliffs, stable tool-use |
| 4 concurrent streams @ 262K | vLLM dual turbo | Stream isolation, full feature stack |
| Single card, moderate ctx | vLLM default | ~89 TPS, easiest setup |

SGLang is currently blocked on Qwen3.6-27B — see `models/qwen3.6-27b/sglang/README.md`.

Prerequisites

- 1× or 2× NVIDIA RTX 3090 (24 GB each)
- Linux (Ubuntu 22.04+ recommended)
- Docker + NVIDIA Container Toolkit
- NVIDIA driver 580.x+
- ~30 GB free disk per model

Installation & Setup

1. Clone the repo

```bash
git clone https://github.com/noonghunna/club-3090.git
cd club-3090
```

2. Download and verify a model

```bash
# Downloads model weights, verifies SHA, clones Genesis patches
bash scripts/setup.sh qwen3.6-27b
```

3. Launch (interactive wizard)

```bash
bash scripts/launch.sh
```

The wizard prompts for engine → card count → workload, then boots the compose stack and verifies it.

4. Launch (non-interactive)

```bash
# Single card, chat-optimized
bash scripts/launch.sh --variant vllm/default

# Dual card, 262K context + vision
bash scripts/launch.sh --variant vllm/dual

# Single card, 262K context, no prefill cliffs
bash scripts/launch.sh --variant llamacpp/default

# List all available variants
bash scripts/switch.sh --list
```

---

Key Scripts

| Script | Purpose |
| --- | --- |
| `scripts/setup.sh <model>` | Preflight checks, model download, SHA verify, Genesis patch clone |
| `scripts/launch.sh [--variant X]` | Interactive or direct variant boot; calls switch.sh + verify-full.sh |
| `scripts/switch.sh <variant>` | Stateless switcher — tears down old compose, brings up new one |
| `scripts/health.sh` | Live health probe: KV %, MTP accept-length, recent TPS, errors |
| `scripts/verify.sh` | Quick smoke test (engine-aware via env vars) |
| `scripts/verify-full.sh` | 8-check functional test (~1–2 min) |
| `scripts/verify-stress.sh` | Boundary stress test: 262K ladder + tool prefill OOM (~5–10 min) |
| `scripts/bench.sh` | Canonical TPS benchmark (3 warm + 5 measured runs) |

Common script usage

```bash
# Switch variants without the wizard
bash scripts/switch.sh vllm/long-vision
bash scripts/switch.sh vllm/dual
bash scripts/switch.sh llamacpp/default

# Check runtime health
bash scripts/health.sh
# Output: KV cache %, MTP accept-length rate, recent TPS, error log tail

# Run canonical benchmark
bash scripts/bench.sh
# Runs narrative + code prompts, prints per-run TPS + averages

# Full functional verification after a switch
bash scripts/verify-full.sh

# Stress test (run before relying on long-context)
bash scripts/verify-stress.sh
```
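scripts/health.sh wraps all of this. If you want to poll the engine directly instead (vLLM variants only), the stock vllm-openai image also serves Prometheus metrics on the same mapped port. A minimal sketch, assuming vLLM's default metric names, which can change between releases:

```python
import requests

# Scrape the Prometheus endpoint the vLLM OpenAI server exposes on its API port.
# Metric names below are vLLM defaults and may differ across versions.
metrics = requests.get("http://localhost:8020/metrics", timeout=10).text

for line in metrics.splitlines():
    if line.startswith(("vllm:gpu_cache_usage_perc", "vllm:num_requests_running")):
        print(line)
```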

---

Variant Names Reference

```
vllm/default          Single-card, chat-optimized (recommended first start)
vllm/dual             Dual-card, 262K ctx, vision, tools, MTP n=3
vllm/long-vision      Dual-card, long-context + vision workloads
vllm/turbo            Dual-card, 4 concurrent streams @ 262K
llamacpp/default      Single-card, full 262K, no prefill cliffs
llamacpp/65k          Single-card, 65K ctx (faster, more VRAM headroom)
llamacpp/dual         Dual-card llama.cpp recipe
```

API Usage (OpenAI-compatible, port 8020)

The server exposes a standard OpenAI-compatible API. Use the `openai` Python SDK pointed at `localhost:8020`.

Python — openai SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8020/v1",
    api_key="ignored",  # local server, no auth needed
)
```

Basic chat

```python
response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Explain KV cache in one paragraph."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

Python — streaming

```python
stream = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Write a Python quicksort."}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

Python — raw requests (no SDK dependency)

```python
import requests

payload = {
    "model": "qwen3.6-27b-autoround",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 200,
    "temperature": 0.7,
}

resp = requests.post(
    "http://localhost:8020/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Python — tool calling

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for recent information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                },
                "required": ["query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "What's the latest news on CUDA 13?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)

msg = response.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
```

Python — long context (262K, use with llamacpp/default or vllm/dual)

```python
# Load a large document
with open("large_codebase.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[
        {"role": "user", "content": f"Summarize the architecture:\n\n{document}"},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
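
A 262K window is easy to overrun with a raw source tree. A rough budget check before sending helps; the 4-characters-per-token ratio below is only a heuristic, not the model's real tokenizer:

```python
MAX_CTX = 262_144            # context window of the 262K variants
RESERVED_FOR_OUTPUT = 1_024  # keep room for max_tokens

# Crude estimate: roughly 4 characters per token for mixed English and code.
approx_tokens = len(document) // 4
if approx_tokens > MAX_CTX - RESERVED_FOR_OUTPUT:
    raise ValueError(f"Document is ~{approx_tokens} tokens; trim or split it across requests.")
```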

TypeScript / Node

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8020/v1",
  apiKey: "ignored",
});

async function chat(prompt: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "qwen3.6-27b-autoround",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 512,
  });
  return response.choices[0].message.content ?? "";
}

// Streaming in Node
async function streamChat(prompt: string): Promise<void> {
  const stream = await client.chat.completions.create({
    model: "qwen3.6-27b-autoround",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 1024,
    stream: true,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
  console.log();
}
```

curl — quick sanity check

```bash
curl -sf http://localhost:8020/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b-autoround",
    "messages": [{"role": "user", "content": "Capital of France?"}],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'
```

curl — list available models

```bash
curl -sf http://localhost:8020/v1/models | jq '.data[].id'
```

Docker Compose Structure

Configs live under `models/qwen3.6-27b/vllm/compose/`. Example structure of a single-card compose:

models/qwen3.6-27b/vllm/compose/default.yml (representative structure)

```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.20.1rc1.dev16+g7a1eb8ac2
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
    ports:
      - "8020:8000"
    volumes:
      - ${MODEL_PATH}:/models/qwen3.6-27b
      - ${PATCH_PATH}:/patches
    command: >
      --model /models/qwen3.6-27b
      --served-model-name qwen3.6-27b-autoround
      --tensor-parallel-size 1
      --max-model-len 65536
      --kv-cache-dtype fp8
      --speculative-model /models/qwen3.6-27b/mtp_head
      --num-speculative-tokens 3
      --port 8000
```

For dual-card, `tensor-parallel-size 2` and `NVIDIA_VISIBLE_DEVICES=0,1` are set, and `max-model-len` extends to 262144.

---

Connecting External Clients

Open WebUI

```
API Base URL: http://localhost:8020/v1
API Key:      (leave blank or type anything)
Model:        qwen3.6-27b-autoround
```

Cline / Cursor / Copilot-compatible tools

```json
{
  "openai.baseURL": "http://localhost:8020/v1",
  "openai.apiKey": "local",
  "openai.model": "qwen3.6-27b-autoround"
}
```

LiteLLM proxy passthrough

litellm_config.yaml

```yaml
model_list:
  - model_name: qwen3.6-27b
    litellm_params:
      model: openai/qwen3.6-27b-autoround
      api_base: http://localhost:8020/v1
      api_key: ignored
```
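
Once the proxy is running (for example `litellm --config litellm_config.yaml`), clients talk to LiteLLM rather than port 8020 directly. A minimal sketch, assuming LiteLLM's default proxy port of 4000:

```python
from openai import OpenAI

# Point the client at the LiteLLM proxy; it forwards to localhost:8020 per the config above.
proxy = OpenAI(base_url="http://localhost:4000/v1", api_key="ignored")

response = proxy.chat.completions.create(
    model="qwen3.6-27b",  # the model_name registered in litellm_config.yaml
    messages=[{"role": "user", "content": "Ping through the proxy."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```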

---

Repo Layout Quick Reference

```
club-3090/
├── scripts/               Shared model-aware scripts (setup, launch, bench, health)
├── models/
│   └── qwen3.6-27b/
│       ├── vllm/
│       │   ├── compose/   Docker Compose files (all variants)
│       │   └── patches/   tolist_cudagraph, Marlin pad, Genesis pointer
│       ├── llama-cpp/
│       │   └── recipes/   Single-card 65K / 262K-max / dual recipes
│       └── sglang/        Blocked — watch list only
└── docs/
    ├── SINGLE_CARD.md     1× 3090 workload → config guide
    ├── DUAL_CARD.md       2× 3090 workload → config guide
    ├── HARDWARE.md        PCIe vs NVLink, power draw, card compatibility
    ├── GLOSSARY.md        TPS / KV / MTP / TP / prefill cliff definitions
    ├── CLIFFS.md          Prefill cliff root causes and fix landscape
    ├── COMPARISONS.md     Self-host vs cloud cost crossover analysis
    ├── UPSTREAM.md        Tracked upstream issues and PRs
    └── engines/           Per-engine deep dives (vLLM / llama.cpp / SGLang)
```

Troubleshooting

Server won't start — CUDA/driver error

```bash
# Check driver version (need 580.x+)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

# Check GPU visibility
nvidia-smi -L
```

Out of VRAM / OOM on prefill

```bash
# Check current KV cache usage
bash scripts/health.sh

# Switch to a config with smaller max-model-len
bash scripts/switch.sh llamacpp/65k      # 65K ctx, more headroom
bash scripts/switch.sh llamacpp/default  # 262K but manages prefill correctly
```

Prefill cliff (vLLM hangs or errors on large prompts)

This is a known DeltaNet architecture issue on Qwen3.6-27B with vLLM. The llama.cpp route avoids it entirely:

```bash
bash scripts/switch.sh llamacpp/default
```

Stress-test it:

```bash
bash scripts/verify-stress.sh
```

For vLLM workarounds, see `models/qwen3.6-27b/INTERNALS.md` and `docs/CLIFFS.md`.

MTP / speculative decoding not accepting tokens

```bash
bash scripts/health.sh
# Look for "MTP AL:" (accept-length) — should be > 1.0
# If AL ~= 1.0, speculative head may not be loaded correctly

# Check that Genesis patches were applied:
bash scripts/setup.sh qwen3.6-27b   # re-runs patch verification
```

Tool call returns 25K+ tokens and hangs

Known failure mode on vLLM with very large tool responses. Use llama.cpp:

```bash
bash scripts/switch.sh llamacpp/default
```

llama.cpp handles 25K-token tool returns cleanly (stress-tested)


Switching variants leaves old container running

```bash
# switch.sh handles this, but if you ran docker compose manually:
docker compose -f models/qwen3.6-27b/vllm/compose/default.yml down
bash scripts/switch.sh vllm/dual
```

Check what variant is currently running

```bash
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"
```

Performance Reference (Qwen3.6-27B)

| Config | Cards | TPS (narrative) | TPS (code) | Max ctx | Notes |
| --- | --- | --- | --- | --- | --- |
| `vllm/default` | 1 | ~89 | ~89 | 65K | Recommended starting point |
| `vllm/dual` | 2 | ~89 | ~127 | 262K | DFlash on code workloads |
| `vllm/turbo` | 2 | | | 262K | 4 concurrent streams |
| `llamacpp/default` | 1 | ~21 | ~21 | 262K | No cliffs, stable tool-use |

Benchmark substrate: vLLM nightly `0.20.1rc1.dev16+g7a1eb8ac2` + Genesis v7.65 dev, llama.cpp `0d0764dfd`, RTX 3090 sm_86 PCIe @ 230 W. Full per-run numbers in `models/qwen3.6-27b/CHANGELOG.md`.
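
scripts/bench.sh is the canonical benchmark and produced the numbers above. For a quick sanity check against whatever variant is currently running, a rough sketch like this is enough; it divides completion tokens by wall time, so prefill is included and the result will read slightly low:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="ignored")

start = time.perf_counter()
response = client.chat.completions.create(
    model="qwen3.6-27b-autoround",
    messages=[{"role": "user", "content": "Write a short story about a GPU."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

# usage.completion_tokens is reported by the server; wall time includes prefill.
tps = response.usage.completion_tokens / elapsed
print(f"{response.usage.completion_tokens} tokens in {elapsed:.1f}s -> ~{tps:.1f} TPS")
```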

Adding a New Model

The repo structure is model-agnostic. New models follow the same pattern under `models/<name>/`:

```bash
mkdir -p models/glm-4.6/{vllm/compose,vllm/patches,llama-cpp/recipes,sglang}

# Add README.md, INTERNALS.md, CHANGELOG.md following qwen3.6-27b/ as template
# setup.sh and launch.sh are model-aware — add the model slug to their dispatch
bash scripts/setup.sh glm-4.6   # once scripts updated
```

---

Key Links

- docs/SINGLE_CARD.md — 1× 3090 workload → config → quick start
- docs/DUAL_CARD.md — 2× 3090 workload → config → quick start
- docs/HARDWARE.md — 4090/A6000 compatibility, NVLink notes
- docs/GLOSSARY.md — TPS, KV cache, MTP, TP, prefill cliff explained
- docs/COMPARISONS.md — self-host vs cloud cost crossover
- docs/CLIFFS.md — prefill cliff deep dive
- docs/UPSTREAM.md — upstream issues/PRs being tracked
- CONTRIBUTING.md — adding numbers, bug repros, new variants