llm-architect
LLM Architect
Purpose
Provides expert large language model system architecture for designing, deploying, and optimizing LLM applications at scale. Specializes in model selection, RAG (Retrieval Augmented Generation) pipelines, fine-tuning strategies, serving infrastructure, cost optimization, and safety guardrails for production LLM systems.
When to Use
- Designing end-to-end LLM systems from requirements to production
- Selecting models and serving infrastructure for specific use cases
- Implementing RAG (Retrieval Augmented Generation) pipelines
- Optimizing LLM costs while maintaining quality thresholds
- Building safety guardrails and compliance mechanisms
- Planning fine-tuning vs RAG vs prompt engineering strategies
- Scaling LLM inference for high-throughput applications
Quick Start
Invoke this skill when:
- Designing end-to-end LLM systems from requirements to production
- Selecting models and serving infrastructure for specific use cases
- Implementing RAG (Retrieval Augmented Generation) pipelines
- Optimizing LLM costs while maintaining quality thresholds
- Building safety guardrails and compliance mechanisms
Do NOT invoke when:
- Simple API integration exists (use backend-developer instead)
- Only prompt engineering needed without architecture decisions
- Training foundation models from scratch (almost always wrong approach)
- Generic ML tasks unrelated to language models (use ml-engineer)
Decision Framework
Model Selection Quick Guide
| Requirement | Recommended Approach |
|---|---|
| Latency <100ms | Small fine-tuned model (7B quantized) |
| Latency <2s, budget unlimited | Claude 3 Opus / GPT-4 |
| Latency <2s, domain-specific | Claude 3 Sonnet fine-tuned |
| Latency <2s, cost-sensitive | Claude 3 Haiku |
| Batch/async acceptable | Batch API, cheapest tier |
RAG vs Fine-Tuning Decision Tree
Need to customize LLM behavior?
│
├─ Need domain-specific knowledge?
│ ├─ Knowledge changes frequently?
│ │ └─ RAG (Retrieval Augmented Generation)
│ └─ Knowledge is static?
│ └─ Fine-tuning OR RAG (test both)
│
├─ Need specific output format/style?
│ ├─ Can describe in prompt?
│ │ └─ Prompt engineering (try first)
│ └─ Format too complex for prompt?
│ └─ Fine-tuning
│
└─ Need latency <100ms?
└─ Fine-tuned small model (7B-13B)
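When the tree lands on the RAG branch, the core loop is: retrieve the most relevant snippet, then prepend it to the prompt. A minimal sketch of that loop, using toy bag-of-words cosine similarity as a stand-in for a real embedding model (the corpus and query are illustrative):

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "Refund requests are processed within 14 days.",
    "The API rate limit is 100 requests per minute.",
]
context = retrieve("what is the api rate limit", docs)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: what is the api rate limit?"
```

A production pipeline swaps the toy scorer for an embedding model plus a vector store, but the retrieve-then-stuff shape stays the same.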
Architecture Pattern
[Client] → [API Gateway + Rate Limiting]
↓
[Request Router]
(Route by intent/complexity)
↓
┌────────┴────────┐
↓ ↓
[Fast Model] [Powerful Model]
(Haiku/Small) (Sonnet/Large)
↓ ↓
[Cache Layer] ← [Response Aggregator]
↓
[Logging & Monitoring]
↓
[Response to Client]
Core Workflow: Design LLM System
1. Requirements Gathering
Ask these questions:
- Latency: What's the P95 response time requirement?
- Scale: Expected requests/day and growth trajectory?
- Accuracy: What's the minimum acceptable quality? (measurable metric)
- Cost: Budget constraints? ($/request or $/month)
- Data: Existing datasets for evaluation? Sensitivity level?
- Compliance: Regulatory requirements? (HIPAA, GDPR, SOC2, etc.)
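One way to carry these answers forward is a small structure whose fields line up with the selection logic in the next step. The field names and example values here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Requirements:
    """Captures the answers gathered during requirements gathering."""
    latency_p95: int              # P95 response time requirement, in milliseconds
    requests_per_day: int         # expected scale
    min_accuracy: float           # minimum acceptable quality, e.g. 0.90
    budget: str                   # "unlimited" or a cost ceiling description
    task_complexity: str          # "simple" or "complex"
    domain_specific: bool         # needs domain knowledge beyond the base model
    accuracy_critical: bool       # quality outweighs cost for this workload
    compliance: list[str] = field(default_factory=list)  # e.g. ["HIPAA", "GDPR"]

reqs = Requirements(
    latency_p95=1500,
    requests_per_day=50_000,
    min_accuracy=0.90,
    budget="limited",
    task_complexity="complex",
    domain_specific=True,
    accuracy_critical=False,
    compliance=["GDPR"],
)
```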
2. Model Selection
```python
def select_model(requirements):
    if requirements.latency_p95 < 100:  # milliseconds
        if requirements.task_complexity == "simple":
            return "llama2-7b-finetune"
        else:
            return "mistral-7b-quantized"
    elif requirements.latency_p95 < 2000:
        if requirements.budget == "unlimited":
            return "claude-3-opus"
        elif requirements.domain_specific:
            return "claude-3-sonnet-finetuned"
        else:
            return "claude-3-haiku"
    else:  # Batch/async acceptable
        if requirements.accuracy_critical:
            return "gpt-4-with-ensemble"
        else:
            return "batch-api-cheapest-tier"
```
3. Prototype & Evaluate
```bash
# Run benchmark on eval dataset
python scripts/evaluate_model.py \
  --model claude-3-sonnet \
  --dataset data/eval_1000_examples.jsonl \
  --metrics accuracy,latency,cost
```
Expected output:

```
Accuracy: 94.3%
P95 Latency: 1,245ms
Cost per 1K requests: $2.15
```
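The evaluation script itself is not shown here; the core of such a harness can be sketched as below. The example model, dataset, and per-call cost are stand-ins, and a real harness would call the deployed model over its API:

```python
import time

def evaluate(model_fn, examples, cost_per_call):
    """Run model_fn over eval examples; report accuracy, P95 latency, cost per 1K."""
    correct, latencies = 0, []
    for ex in examples:
        start = time.perf_counter()
        answer = model_fn(ex["input"])
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
        correct += int(answer == ex["expected"])
    latencies.sort()
    p95 = latencies[max(int(len(latencies) * 0.95) - 1, 0)]
    return {
        "accuracy": correct / len(examples),
        "p95_latency_ms": p95,
        "cost_per_1k": cost_per_call * 1000,
    }

# Toy stand-ins for a model call and an eval dataset
examples = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
report = evaluate(lambda q: str(eval(q)), examples, cost_per_call=0.002)
```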
4. Iteration Checklist
- Latency P95 meets requirement? If no → optimize serving (quantization, caching)
- Accuracy meets threshold? If no → improve prompts, fine-tune, or upgrade model
- Cost within budget? If no → aggressive caching, smaller model routing, batching
- Safety guardrails tested? If no → add content filters, PII detection
- Monitoring dashboards live? If no → set up Prometheus + Grafana
- Runbook documented? If no → document common failures and fixes
Cost Optimization Strategies
| Strategy | Savings | When to Use |
|---|---|---|
| Semantic caching | 40-80% | 60%+ similar queries |
| Multi-model routing | 30-50% | Mixed complexity queries |
| Prompt compression | 10-20% | Long context inputs |
| Batching | 20-40% | Async-tolerant workloads |
| Smaller model cascade | 40-60% | Simple queries first |
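Semantic caching, the highest-savings row above, reuses a previous response when a new query is close enough to a cached one. A minimal sketch: a real system would compare embedding vectors, but token-overlap (Jaccard) similarity stands in here, and the 0.8 threshold is illustrative:

```python
class SemanticCache:
    """Returns a cached response for near-duplicate queries, else None."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[set, str]] = []  # (query tokens, response)

    @staticmethod
    def _tokens(text: str) -> set:
        return set(text.lower().split())

    def get(self, query: str):
        q = self._tokens(query)
        for tokens, response in self.entries:
            similarity = len(q & tokens) / len(q | tokens)  # Jaccard
            if similarity >= self.threshold:
                return response  # cache hit: no model call, no token cost
        return None

    def put(self, query: str, response: str):
        self.entries.append((self._tokens(query), response))

cache = SemanticCache()
cache.put("what is your refund policy", "Refunds within 14 days.")
hit = cache.get("what is your refund policy ?")   # near-duplicate -> hit
miss = cache.get("how do I reset my password")    # unrelated -> None
```

The threshold trades hit rate against the risk of serving a stale or mismatched answer, so it should be tuned against the eval dataset.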
Safety Checklist
- Content filtering tested against adversarial examples
- PII detection and redaction validated
- Prompt injection defenses in place
- Output validation rules implemented
- Audit logging configured for all requests
- Compliance requirements documented and validated
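To make the PII item above concrete, here is an illustrative redaction pass covering only emails and US-style phone numbers; real systems need broader, locale-aware coverage (names, addresses, national IDs) and should be validated against labeled data:

```python
import re

# Minimal, illustrative patterns -- not exhaustive PII coverage
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

clean = redact("Contact jane.doe@example.com or 555-123-4567.")
```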
Red Flags - When to Escalate
| Observation | Action |
|---|---|
| Accuracy <80% after prompt iteration | Consider fine-tuning |
| Latency 2x requirement | Review infrastructure |
| Cost >2x budget | Aggressive caching/routing |
| Hallucination rate >5% | Add RAG or stronger guardrails |
| Safety bypass detected | Immediate security review |
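Several of these escalations are avoidable at request time with a resilient call pattern (EXAMPLES.md covers fuller versions): retry the primary model, then fall back to a secondary model rather than failing the request. The call signature and model stand-ins below are placeholders:

```python
def call_with_fallback(prompt, primary, fallback, retries: int = 1):
    """Try the primary model (with retries), then the fallback, then re-raise."""
    attempts = [primary] * (retries + 1) + [fallback]
    last_error = None
    for model in attempts:
        try:
            return model(prompt)
        except Exception as exc:  # narrow this to real API error types in practice
            last_error = exc
    raise last_error

# Toy stand-ins: the primary always fails, the fallback answers.
def flaky_primary(prompt):
    raise RuntimeError("rate limited")

result = call_with_fallback("hello", flaky_primary, lambda p: f"fallback: {p}")
```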
Quick Reference: Performance Targets
| Metric | Target | Critical |
|---|---|---|
| P95 Latency | <2x requirement | <3x requirement |
| Accuracy | >90% | >80% |
| Cache Hit Rate | >60% | >40% |
| Error Rate | <1% | <5% |
| Cost/1K requests | Within budget | <150% budget |
Additional Resources
- Detailed Technical Reference: See REFERENCE.md
  - RAG implementation workflow
  - Semantic caching patterns
  - Deployment configurations
- Code Examples & Patterns: See EXAMPLES.md
  - Anti-patterns (fine-tuning when prompting suffices, no fallback)
  - Quality checklist for LLM systems
  - Resilient LLM call patterns