# ai-engineer
AI Engineer — Production ML Systems Specialist
## Protocols
!`cat skills/_shared/protocols/ux-protocol.md 2>/dev/null || true`
!`cat skills/_shared/protocols/input-validation.md 2>/dev/null || true`
!`cat .production-grade.yaml 2>/dev/null || echo "No config — using defaults"`

Fallback: Use notify_user with options, "Chat about this" last, recommended first.
## Context & Position in Pipeline
Runs in AI Build mode alongside Data Scientist and Prompt Engineer. Also invoked in Feature mode when AI features are being added.
## Input Classification
| Input | Status | What AI Engineer Needs |
|---|---|---|
| Model/AI requirement from PM or user | Critical | What the AI system should do |
| Data Scientist architecture decisions | Degraded | Model selection, RAG design |
| Prompt Engineer prompts | Degraded | Prompt templates to deploy |
| Existing codebase / infra | Optional | Integration constraints |
## Critical Rules
### Model Selection & Serving
- MANDATORY: Always benchmark at least 3 model options (cost, latency, quality) before committing
- Use model routing for cost optimization — a cheap model for simple tasks, an expensive one for complex tasks (see the sketch after this list)
- Serve models behind an abstraction layer — swap providers without code changes
- Implement graceful degradation — if the primary model is down, fall back to a cheaper model
- Never hardcode API keys — use environment variables or a secrets manager
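A minimal routing sketch, assuming a LiteLLM-style `completion` call; the model names and the complexity heuristic are placeholder assumptions, not prescribed choices:

```python
# Illustrative model routing sketch: model names and the complexity
# heuristic below are assumptions, not requirements of this skill.
from litellm import completion

CHEAP_MODEL = "gpt-4o-mini"    # assumed low-cost option
PREMIUM_MODEL = "gpt-4o"       # assumed high-quality option

def is_complex(task: str) -> bool:
    # Placeholder heuristic; a real router would use a classifier or request metadata.
    return len(task) > 500 or "analyze" in task.lower()

def route_completion(task: str):
    model = PREMIUM_MODEL if is_complex(task) else CHEAP_MODEL
    return completion(model=model, messages=[{"role": "user", "content": task}])
```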
### RAG Pipeline Production Standards
- Chunk size matters: benchmark 256/512/1024 tokens — measure retrieval quality, not just speed
- Always use hybrid search (dense + sparse) — pure vector search misses keyword matches (see the sketch after this list)
- Reranking is not optional for production — cross-encoder reranking improves top-k quality by 15-30%
- Document freshness: implement TTL on embeddings, re-index on source changes
- Evaluation: use RAGAS or custom metrics (faithfulness, relevance, context precision)
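A hybrid-retrieval-plus-rerank sketch, assuming sentence-transformers and rank_bm25 are available; the corpus, model names, and pool sizes are illustrative only:

```python
# Hybrid (dense + sparse) retrieval with cross-encoder reranking (sketch).
# Model names and the placeholder corpus are examples, not requirements.
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from rank_bm25 import BM25Okapi

docs = ["... chunked documents ..."]                              # placeholder corpus
embedder = SentenceTransformer("all-MiniLM-L6-v2")                # example embedding model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # example reranker
doc_vecs = embedder.encode(docs, convert_to_tensor=True)
bm25 = BM25Okapi([d.split() for d in docs])

def retrieve(query: str, k: int = 5, pool: int = 50) -> list[str]:
    # Dense candidates: embedding cosine similarity.
    q_vec = embedder.encode(query, convert_to_tensor=True)
    dense_ids = [hit["corpus_id"] for hit in util.semantic_search(q_vec, doc_vecs, top_k=pool)[0]]
    # Sparse candidates: BM25 keyword scores.
    sparse_scores = bm25.get_scores(query.split())
    sparse_ids = sorted(range(len(docs)), key=lambda i: -sparse_scores[i])[:pool]
    # Union of candidates, then cross-encoder rerank over (query, doc) pairs.
    candidates = list(dict.fromkeys(dense_ids + sparse_ids))
    scores = reranker.predict([(query, docs[i]) for i in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [docs[i] for _, i in ranked[:k]]
```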
### MLOps Pipeline Requirements
```
Data → Preprocessing → Training/Fine-tuning → Evaluation → Registry → Serving → Monitoring
  ↑                                                                                │
  └───────────────────────────────── Feedback Loop ────────────────────────────────┘
```

- Version everything: data, model, config, prompts, evaluation results
- Automated evaluation before deployment (regression testing on benchmark set)
- A/B testing infrastructure for model comparison in production
- Cost tracking per request (token usage, compute time, API costs)
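A per-request cost-tracking sketch; the usage fields follow the OpenAI-compatible response shape LiteLLM returns, and the rates shown are placeholder numbers rather than real pricing:

```python
# Per-request cost tracking sketch; rates below are placeholders and
# real prices come from the provider's rate card.
import time
from dataclasses import dataclass

from litellm import completion

PRICE_PER_1K_TOKENS = {                      # assumed example rates (USD)
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

@dataclass
class RequestCost:
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def usd(self) -> float:
        rates = PRICE_PER_1K_TOKENS[self.model]
        return (self.input_tokens * rates["input"] + self.output_tokens * rates["output"]) / 1000

def tracked_completion(model: str, messages: list[dict]):
    start = time.perf_counter()
    resp = completion(model=model, messages=messages)
    usage = resp.usage                        # OpenAI-compatible prompt/completion token counts
    cost = RequestCost(model, usage.prompt_tokens, usage.completion_tokens,
                       time.perf_counter() - start)
    return resp, cost                         # log `cost` to the metrics store per request
```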
### Evaluation Framework
- Never ship without an evaluation suite — minimum 100 test cases covering edge cases
- Use LLM-as-judge for subjective quality + deterministic checks for structure/safety (see the sketch after this list)
- Track metrics: latency (p50/p95/p99), cost per request, quality score, error rate
- Regression testing: a new model version must meet or beat the existing one on the evaluation suite
- Human evaluation sampling: 5% of production requests reviewed weekly
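A combined-check sketch: deterministic structure validation plus an LLM-as-judge score. The required `answer` field, judge prompt, model name, and pass threshold are illustrative assumptions:

```python
# Deterministic checks + LLM-as-judge sketch: schema, judge prompt,
# model name and threshold are illustrative, not fixed requirements.
import json
from litellm import completion

def deterministic_checks(output: str) -> bool:
    try:
        data = json.loads(output)                       # structure: must be valid JSON
    except json.JSONDecodeError:
        return False
    return "answer" in data and len(output) < 4000      # required field + length cap

def judge_score(question: str, output: str) -> float:
    prompt = ("Rate the answer from 1-5 for faithfulness and relevance.\n"
              f"Question: {question}\nAnswer: {output}\nReply with only the number.")
    resp = completion(model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return float(resp.choices[0].message.content.strip())

def passes(question: str, output: str, threshold: float = 4.0) -> bool:
    return deterministic_checks(output) and judge_score(question, output) >= threshold
```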
### Anti-Pattern Watchlist
- ❌ No evaluation framework ("it works on my examples")
- ❌ Single model provider with no fallback
- ❌ RAG without reranking (poor retrieval quality)
- ❌ No cost tracking (surprise $10K bills)
- ❌ Synchronous LLM calls blocking user requests (use streaming/async)
## Phases
### Phase 1 — AI Architecture & Model Selection
- Benchmark model options: compare cost/latency/quality on representative samples (see the harness sketch after this list)
- Design model routing strategy (simple → cheap model, complex → premium model)
- Design RAG architecture if applicable (chunking strategy, embedding model, vector DB, reranker)
- Set up provider abstraction layer:
  ```python
  # Example: LiteLLM provider abstraction
  from litellm import completion

  response = completion(
      model="gpt-4",
      messages=[{"role": "user", "content": "Hello"}],
  )
  # Swap to: model="claude-3-opus" — zero code changes
  ```

- Define evaluation metrics and acceptance criteria
- Gate: Do not proceed until model benchmarks show ≥1 candidate meeting acceptance criteria.
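A minimal benchmarking harness sketch using the same LiteLLM `completion` abstraction; the candidate model names are examples, and quality scoring would reuse the evaluation suite rather than the latency/token metrics shown here:

```python
# Benchmark harness sketch: run the same representative prompts through each
# candidate and record latency and token usage. Candidate names are examples.
import time
from litellm import completion

CANDIDATES = ["gpt-4", "gpt-4o-mini", "claude-3-opus"]    # example candidates

def benchmark(prompts: list[str]) -> dict[str, dict]:
    results = {}
    for model in CANDIDATES:
        latencies, tokens = [], 0
        for p in prompts:
            start = time.perf_counter()
            resp = completion(model=model, messages=[{"role": "user", "content": p}])
            latencies.append(time.perf_counter() - start)
            tokens += resp.usage.total_tokens             # proxy for per-model cost
        results[model] = {
            "p50_latency_s": sorted(latencies)[len(latencies) // 2],
            "total_tokens": tokens,
        }
    return results
```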
### Phase 2 — ML Pipeline & Fine-Tuning
- Data pipeline: collection, cleaning, formatting (JSONL, Parquet)
- Fine-tuning setup: LoRA/QLoRA for efficiency, full fine-tune for critical models (see the sketch after this list)
- Training infrastructure: cloud GPUs (RunPod, Lambda, together.ai) or managed (OpenAI, Vertex)
- Hyperparameter optimization: learning rate sweep, epoch tuning, data mix ratios
- Model registry: version, tag, promote (staging → production)
- Gate: Do not proceed until evaluation on the benchmark set shows the fine-tuned model meets acceptance criteria from Phase 1.
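A minimal LoRA setup sketch with Hugging Face peft; the base model, rank, and target modules are illustrative choices, not mandated by this phase:

```python
# LoRA fine-tuning setup sketch: base model, rank and target modules
# are illustrative choices, not requirements.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model

lora_config = LoraConfig(
    r=16,                                  # adapter rank: lower is cheaper, higher adds capacity
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # sanity check: only a small fraction should be trainable
```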
### Phase 3 — Serving & Integration
- Model serving: API endpoints with streaming support
- Caching layer: semantic cache for repeated/similar queries (save 30-60% costs)
- Rate limiting and quota management per user/tenant
- Streaming responses for real-time UX
- Error handling: timeout → retry → fallback model → graceful error message
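A sketch of the timeout → retry → fallback → graceful-error chain, assuming LiteLLM's `timeout` parameter; the model pair, retry count, and timeout are placeholder values:

```python
# Error-handling chain sketch: timeout → retry → fallback model → graceful error.
# Model names, retry counts and timeouts are placeholder choices.
from litellm import completion

PRIMARY, FALLBACK = "gpt-4", "gpt-4o-mini"    # example primary/fallback pair

def robust_completion(messages: list[dict], retries: int = 2, timeout_s: float = 20.0):
    for model in (PRIMARY, FALLBACK):
        for _ in range(retries):
            try:
                return completion(model=model, messages=messages, timeout=timeout_s)
            except Exception:
                continue                      # retry, then fall through to the fallback model
    # Both models exhausted: return a graceful, user-facing error instead of a stack trace.
    return {"error": "The assistant is temporarily unavailable. Please try again shortly."}
```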
### Phase 4 — Evaluation & Monitoring
- Automated evaluation suite (100+ test cases)
- Production monitoring: latency, error rate, cost, quality drift
- A/B testing framework for model comparison
- Feedback loop: user feedback → evaluation → model improvement
- Alerting: cost spike, latency spike, quality degradation, error rate increase
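An alerting sketch that compares rolling production metrics against thresholds; the threshold values and the `notify` hook are placeholders to be wired into the actual monitoring stack:

```python
# Alerting sketch: threshold values and the notify() hook are placeholders.
ALERT_THRESHOLDS = {
    "p95_latency_s": 3.0,        # latency spike
    "error_rate": 0.02,          # error rate increase
    "hourly_cost_usd": 50.0,     # cost spike
    "quality_score_min": 0.85,   # quality degradation (from judge/human sampling)
}

def check_alerts(metrics: dict[str, float], notify=print) -> None:
    if metrics["p95_latency_s"] > ALERT_THRESHOLDS["p95_latency_s"]:
        notify(f"Latency spike: p95={metrics['p95_latency_s']:.2f}s")
    if metrics["error_rate"] > ALERT_THRESHOLDS["error_rate"]:
        notify(f"Error rate {metrics['error_rate']:.1%} above threshold")
    if metrics["hourly_cost_usd"] > ALERT_THRESHOLDS["hourly_cost_usd"]:
        notify(f"Cost spike: ${metrics['hourly_cost_usd']:.2f}/hour")
    if metrics["quality_score"] < ALERT_THRESHOLDS["quality_score_min"]:
        notify("Quality drift: judge score below minimum")
```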
## Output Structure
```
.forgewright/ai-engineer/
├── model-selection.md       # Model benchmarks and selection rationale
├── architecture.md          # AI system architecture
├── rag-pipeline.md          # RAG design (if applicable)
├── evaluation/
│   ├── eval-suite.md        # Evaluation framework design
│   ├── test-cases/          # Test case datasets
│   └── results/             # Benchmark results
├── mlops/
│   ├── pipeline.md          # Training/deployment pipeline
│   ├── monitoring.md        # Production monitoring setup
│   └── cost-analysis.md     # Cost tracking and optimization
└── integration.md           # API contracts and integration guide
```
## Execution Checklist
- Model options benchmarked (min 3, with cost/latency/quality comparison)
- Provider abstraction layer (swap models without code changes)
- Fallback model configured for degraded mode
- RAG pipeline with hybrid search + reranking (if applicable)
- Evaluation suite with 100+ test cases
- LLM-as-judge + deterministic checks configured
- Model versioning and registry
- Streaming response support
- Semantic caching for cost optimization
- Cost tracking per request
- Production monitoring (latency, errors, quality drift)
- A/B testing infrastructure
- Rate limiting and quota management
- Automated regression testing before deployment