llm-architect
LLM Architect
Purpose
Provides expert large language model system architecture for designing, deploying, and optimizing LLM applications at scale. Specializes in model selection, RAG (Retrieval Augmented Generation) pipelines, fine-tuning strategies, serving infrastructure, cost optimization, and safety guardrails for production LLM systems.
When to Use
- Designing end-to-end LLM systems from requirements to production
- Selecting models and serving infrastructure for specific use cases
- Implementing RAG (Retrieval Augmented Generation) pipelines
- Optimizing LLM costs while maintaining quality thresholds
- Building safety guardrails and compliance mechanisms
- Planning fine-tuning vs RAG vs prompt engineering strategies
- Scaling LLM inference for high-throughput applications
Quick Start
Invoke this skill when:
- Designing end-to-end LLM systems from requirements to production
- Selecting models and serving infrastructure for specific use cases
- Implementing RAG (Retrieval Augmented Generation) pipelines
- Optimizing LLM costs while maintaining quality thresholds
- Building safety guardrails and compliance mechanisms
Do NOT invoke when:
- Simple API integration exists (use backend-developer instead)
- Only prompt engineering needed without architecture decisions
- Training foundation models from scratch (almost always wrong approach)
- Generic ML tasks unrelated to language models (use ml-engineer)
Decision Framework
Model Selection Quick Guide
| Requirement | Recommended Approach |
|---|---|
| Latency <100ms | Small fine-tuned model (7B quantized) |
| Latency <2s, budget unlimited | Claude 3 Opus / GPT-4 |
| Latency <2s, domain-specific | Claude 3 Sonnet fine-tuned |
| Latency <2s, cost-sensitive | Claude 3 Haiku |
| Batch/async acceptable | Batch API, cheapest tier |
RAG vs Fine-Tuning Decision Tree
Need to customize LLM behavior?
│
├─ Need domain-specific knowledge?
│ ├─ Knowledge changes frequently?
│ │ └─ RAG (Retrieval Augmented Generation)
│ └─ Knowledge is static?
│ └─ Fine-tuning OR RAG (test both)
│
├─ Need specific output format/style?
│ ├─ Can describe in prompt?
│ │ └─ Prompt engineering (try first)
│ └─ Format too complex for prompt?
│ └─ Fine-tuning
│
└─ Need latency <100ms?
└─ Fine-tuned small model (7B-13B)
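When the tree lands on the RAG branch, the core loop is: retrieve the most relevant snippet, then prepend it to the prompt. A minimal sketch of that loop, using toy bag-of-words cosine similarity as a stand-in for a real embedding model (the corpus and query are illustrative):

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "Refund requests are processed within 14 days.",
    "The API rate limit is 100 requests per minute.",
]
context = retrieve("what is the api rate limit", docs)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: what is the api rate limit?"
```

A production pipeline swaps the toy scorer for an embedding model plus a vector store, but the retrieve-then-stuff shape stays the same.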
Architecture Pattern
[Client] → [API Gateway + Rate Limiting]
↓
[Request Router]
(Route by intent/complexity)
↓
┌────────┴────────┐
↓ ↓
[Fast Model] [Powerful Model]
(Haiku/Small) (Sonnet/Large)
↓ ↓
[Cache Layer] ← [Response Aggregator]
↓
[Logging & Monitoring]
↓
[Response to Client]
Core Workflow: Design LLM System
1. Requirements Gathering
Ask these questions:
- Latency: What's the P95 response time requirement?
- Scale: Expected requests/day and growth trajectory?
- Accuracy: What's the minimum acceptable quality? (measurable metric)
- Cost: Budget constraints? ($/request or $/month)
- Data: Existing datasets for evaluation? Sensitivity level?
- Compliance: Regulatory requirements? (HIPAA, GDPR, SOC2, etc.)
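One way to carry these answers forward is a small structure whose fields line up with the selection logic in the next step. The field names and example values here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Requirements:
    """Captures the answers gathered during requirements gathering."""
    latency_p95: int              # P95 response time requirement, in milliseconds
    requests_per_day: int         # expected scale
    min_accuracy: float           # minimum acceptable quality, e.g. 0.90
    budget: str                   # "unlimited" or a cost ceiling description
    task_complexity: str          # "simple" or "complex"
    domain_specific: bool         # needs domain knowledge beyond the base model
    accuracy_critical: bool       # quality outweighs cost for this workload
    compliance: list[str] = field(default_factory=list)  # e.g. ["HIPAA", "GDPR"]

reqs = Requirements(
    latency_p95=1500,
    requests_per_day=50_000,
    min_accuracy=0.90,
    budget="limited",
    task_complexity="complex",
    domain_specific=True,
    accuracy_critical=False,
    compliance=["GDPR"],
)
```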
2. Model Selection
```python
def select_model(requirements):
    if requirements.latency_p95 < 100:  # milliseconds
        if requirements.task_complexity == "simple":
            return "llama2-7b-finetune"
        else:
            return "mistral-7b-quantized"
    elif requirements.latency_p95 < 2000:
        if requirements.budget == "unlimited":
            return "claude-3-opus"
        elif requirements.domain_specific:
            return "claude-3-sonnet-finetuned"
        else:
            return "claude-3-haiku"
    else:  # Batch/async acceptable
        if requirements.accuracy_critical:
            return "gpt-4-with-ensemble"
        else:
            return "batch-api-cheapest-tier"
```
3. Prototype & Evaluate
```bash
# Run benchmark on eval dataset
python scripts/evaluate_model.py \
  --model claude-3-sonnet \
  --dataset data/eval_1000_examples.jsonl \
  --metrics accuracy,latency,cost
```
Expected output:

```
Accuracy: 94.3%
P95 Latency: 1,245ms
Cost per 1K requests: $2.15
```
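The evaluation script itself is not shown here; the core of such a harness can be sketched as below. The example model, dataset, and per-call cost are stand-ins, and a real harness would call the deployed model over its API:

```python
import time

def evaluate(model_fn, examples, cost_per_call):
    """Run model_fn over eval examples; report accuracy, P95 latency, cost per 1K."""
    correct, latencies = 0, []
    for ex in examples:
        start = time.perf_counter()
        answer = model_fn(ex["input"])
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
        correct += int(answer == ex["expected"])
    latencies.sort()
    p95 = latencies[max(int(len(latencies) * 0.95) - 1, 0)]
    return {
        "accuracy": correct / len(examples),
        "p95_latency_ms": p95,
        "cost_per_1k": cost_per_call * 1000,
    }

# Toy stand-ins for a model call and an eval dataset
examples = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
report = evaluate(lambda q: str(eval(q)), examples, cost_per_call=0.002)
```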
4. Iteration Checklist
- Latency P95 meets requirement? If no → optimize serving (quantization, caching)
- Accuracy meets threshold? If no → improve prompts, fine-tune, or upgrade model
- Cost within budget? If no → aggressive caching, smaller model routing, batching
- Safety guardrails tested? If no → add content filters, PII detection
- Monitoring dashboards live? If no → set up Prometheus + Grafana
- Runbook documented? If no → document common failures and fixes
Cost Optimization Strategies
| Strategy | Savings | When to Use |
|---|---|---|
| Semantic caching | 40-80% | 60%+ similar queries |
| Multi-model routing | 30-50% | Mixed complexity queries |
| Prompt compression | 10-20% | Long context inputs |
| Batching | 20-40% | Async-tolerant workloads |
| Smaller model cascade | 40-60% | Simple queries first |
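Semantic caching, the highest-savings row above, reuses a previous response when a new query is close enough to a cached one. A minimal sketch: a real system would compare embedding vectors, but token-overlap (Jaccard) similarity stands in here, and the 0.8 threshold is illustrative:

```python
class SemanticCache:
    """Returns a cached response for near-duplicate queries, else None."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[set, str]] = []  # (query tokens, response)

    @staticmethod
    def _tokens(text: str) -> set:
        return set(text.lower().split())

    def get(self, query: str):
        q = self._tokens(query)
        for tokens, response in self.entries:
            similarity = len(q & tokens) / len(q | tokens)  # Jaccard
            if similarity >= self.threshold:
                return response  # cache hit: no model call, no token cost
        return None

    def put(self, query: str, response: str):
        self.entries.append((self._tokens(query), response))

cache = SemanticCache()
cache.put("what is your refund policy", "Refunds within 14 days.")
hit = cache.get("what is your refund policy ?")   # near-duplicate -> hit
miss = cache.get("how do I reset my password")    # unrelated -> None
```

The threshold trades hit rate against the risk of serving a stale or mismatched answer, so it should be tuned against the eval dataset.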
Safety Checklist
- Content filtering tested against adversarial examples
- PII detection and redaction validated
- Prompt injection defenses in place
- Output validation rules implemented
- Audit logging configured for all requests
- Compliance requirements documented and validated
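To make the PII item above concrete, here is an illustrative redaction pass covering only emails and US-style phone numbers; real systems need broader, locale-aware coverage (names, addresses, national IDs) and should be validated against labeled data:

```python
import re

# Minimal, illustrative patterns -- not exhaustive PII coverage
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

clean = redact("Contact jane.doe@example.com or 555-123-4567.")
```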
Red Flags - When to Escalate
| Observation | Action |
|---|---|
| Accuracy <80% after prompt iteration | Consider fine-tuning |
| Latency 2x requirement | Review infrastructure |
| Cost >2x budget | Aggressive caching/routing |
| Hallucination rate >5% | Add RAG or stronger guardrails |
| Safety bypass detected | Immediate security review |
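Several of these escalations are avoidable at request time with a resilient call pattern (EXAMPLES.md covers fuller versions): retry the primary model, then fall back to a secondary model rather than failing the request. The call signature and model stand-ins below are placeholders:

```python
def call_with_fallback(prompt, primary, fallback, retries: int = 1):
    """Try the primary model (with retries), then the fallback, then re-raise."""
    attempts = [primary] * (retries + 1) + [fallback]
    last_error = None
    for model in attempts:
        try:
            return model(prompt)
        except Exception as exc:  # narrow this to real API error types in practice
            last_error = exc
    raise last_error

# Toy stand-ins: the primary always fails, the fallback answers.
def flaky_primary(prompt):
    raise RuntimeError("rate limited")

result = call_with_fallback("hello", flaky_primary, lambda p: f"fallback: {p}")
```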
Quick Reference: Performance Targets
| Metric | Target | Critical |
|---|---|---|
| P95 Latency | <2x requirement | <3x requirement |
| Accuracy | >90% | >80% |
| Cache Hit Rate | >60% | >40% |
| Error Rate | <1% | <5% |
| Cost/1K requests | Within budget | <150% budget |
Additional Resources
- Detailed Technical Reference: See REFERENCE.md
  - RAG implementation workflow
  - Semantic caching patterns
  - Deployment configurations
- Code Examples & Patterns: See EXAMPLES.md
  - Anti-patterns (fine-tuning when prompting suffices, no fallback)
  - Quality checklist for LLM systems
  - Resilient LLM call patterns