ai-cost-optimizer
Skill: AI Cost Optimizer (Standard 2026)
Role: The AI Cost Optimizer is a specialized "Token Economist" responsible for maximizing the reasoning output of AI agents while minimizing the operational expense. In 2026, this role masters the pricing tiers of Gemini 3 Flash and Lite models, implementing "Thinking-Level" routing and multi-layered caching to achieve up to 90% cost reduction on high-volume apps.
🎯 Primary Objectives
- Economic Orchestration: Dynamically routing prompts between Gemini 3 Pro, Flash, and Lite based on complexity.
- Context Caching Mastery: Implementing implicit and explicit caching for system instructions and long documents (v1.35.0+).
- Token Engineering: Reducing "Noise tokens" through XML-tagging and strict response schemas.
- Usage Governance: Implementing granular quotas and attribution to prevent runaway API billing.
🏗️ The 2026 Economic Stack
1. Target Models
- Gemini 3 Pro: Reserved for "Mission Critical" reasoning and deep architecture mapping.
- Gemini 3 Flash-Preview: The "Workhorse" for most coding and extraction tasks ($0.50/1M input).
- Gemini Flash-Lite-Latest: The "Utility" agent for real-time validation and short-burst responses.
2. Optimization Tools
- Google GenAI Context Caching: Reducing input fees for stable context blocks.
- Thinking Level Param: Controlling reasoning depth for cost/latency trade-offs.
- Prompt Registry: Deduplicating and optimizing recurring system instructions.
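The Prompt Registry idea above can be sketched in a few lines: deduplicate recurring system instructions by content hash so each one is stored (and cached upstream) only once. A minimal illustration, not a Google GenAI feature — the class name and hashing scheme are assumptions:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: registering the same instruction text twice
// yields the same key and stores only one copy.
class PromptRegistry {
  private prompts = new Map<string, string>();

  register(text: string): string {
    const key = createHash("sha256").update(text.trim()).digest("hex").slice(0, 12);
    if (!this.prompts.has(key)) this.prompts.set(key, text.trim());
    return key;
  }

  get(key: string): string | undefined {
    return this.prompts.get(key);
  }

  get size(): number {
    return this.prompts.size;
  }
}

const registry = new PromptRegistry();
const a = registry.register("You are a Senior Architect.");
const b = registry.register("You are a Senior Architect."); // duplicate
console.log(a === b, registry.size); // true 1
```

Keying by content hash means callers never have to coordinate on prompt names; identical instructions collapse automatically.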
🛠️ Implementation Patterns
1. The "Thinking Level" Router
Adjusting the model's internal reasoning effort based on the task type.
```typescript
// 2026 Pattern: Cost-Aware Generation
const model = genAI.getGenerativeModel({
  model: "gemini-3-flash",
  generationConfig: {
    thinkingLevel: taskComplexity === 'high' ? 'standard' : 'low',
    responseMimeType: "application/json",
  }
});
```
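One way to derive `taskComplexity` is a simple allow-list classifier in front of the router. A minimal sketch — `classifyComplexity` and the task-type names are illustrative assumptions, not part of the SDK:

```typescript
// Hypothetical helper (not an SDK feature): map a task type to the
// complexity tier the router translates into a thinkingLevel.
type Complexity = "low" | "high";

function classifyComplexity(taskType: string): Complexity {
  // Only a small set of tasks justifies deeper (pricier) reasoning.
  const highEffort = new Set(["architecture", "refactor", "security-review"]);
  return highEffort.has(taskType) ? "high" : "low";
}

console.log(classifyComplexity("architecture")); // "high"
console.log(classifyComplexity("lint-fix"));     // "low"
```

Defaulting unknown task types to `"low"` keeps the cost floor down; escalation happens only for explicitly listed heavyweight tasks.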
2. Explicit Context Caching (v1.35.0+)
Crucial for large codebases or stable documentation.
```typescript
// Squaads Standard: 1M+ token repository caching
const codebaseCache = await cacheManager.create({
  model: "gemini-flash-lite-latest",
  contents: [{ role: "user", parts: [{ text: fullRepoData }] }],
  ttlSeconds: 86400, // Cache for 24 hours
});

// Subsequent calls use cachedContent to avoid full re-billing
const result = await model.generateContent({
  cachedContent: codebaseCache.name,
  contents: [{ role: "user", parts: [{ text: "Explain the auth flow." }] }],
});
```
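Whether a 24-hour cache actually saves money depends on the reuse rate versus the storage fee. A back-of-envelope sketch — the cached-read discount and storage rate below are hypothetical placeholders, since this document only states $0.50/1M for uncached Flash input:

```typescript
// Illustrative cost model, not published Gemini pricing.
interface CachePlan {
  contextTokens: number;      // size of the stable context block
  reuses: number;             // expected calls within the TTL
  inputPricePerM: number;     // $ per 1M uncached input tokens
  cachedPricePerM: number;    // $ per 1M cached input tokens (assumed discount)
  storagePerMPerHour: number; // $ per 1M tokens per hour of cache storage
  ttlHours: number;
}

function cachingSavings(p: CachePlan): number {
  const m = p.contextTokens / 1_000_000;
  const withoutCache = p.reuses * m * p.inputPricePerM;
  const withCache =
    m * p.inputPricePerM +                   // first (priming) call at full price
    (p.reuses - 1) * m * p.cachedPricePerM + // discounted re-reads
    m * p.storagePerMPerHour * p.ttlHours;   // storage over the TTL
  return withoutCache - withCache;           // positive => caching wins
}

const savings = cachingSavings({
  contextTokens: 1_000_000,
  reuses: 50,
  inputPricePerM: 0.5,
  cachedPricePerM: 0.125,   // assumed 75% discount
  storagePerMPerHour: 0.01, // assumed storage rate
  ttlHours: 24,
});
console.log(savings > 0); // true for this workload
```

The takeaway: a 1M-token cache read 50 times a day easily amortizes its storage fee; a cache read once or twice may not.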
3. XML System Instruction Packing
Using XML tags to reduce instruction drift and token wastage in multi-turn chats.
```xml
<system_instruction>
  <role>Senior Architect</role>
  <constraints>No legacy PHP, use Property Hooks</constraints>
</system_instruction>
```
🚫 The "Do Not List" (Anti-Patterns)
- NEVER send a full codebase in every prompt. Use Repomix for pruning and Context Caching for reuse.
- NEVER use high-resolution video frames (280 tokens) for tasks that only need low-res (70 tokens).
- NEVER default to Gemini 3 Pro. Always start with Flash-Lite and escalate only if validation fails.
- NEVER allow agents to run in an infinite loop without a "Kill Switch" based on token accumulation.
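The token-accumulation "Kill Switch" from the last rule can be sketched as a hard budget that aborts the agent loop once cumulative usage crosses a cap. An illustrative guard, not an SDK feature:

```typescript
// Thrown when the agent loop must be terminated.
class TokenKillSwitch extends Error {}

class TokenBudget {
  private used = 0;
  constructor(private readonly maxTokens: number) {}

  // Call after every model response with the tokens it consumed.
  charge(tokens: number): void {
    this.used += tokens;
    if (this.used > this.maxTokens) {
      throw new TokenKillSwitch(
        `Token budget exceeded: ${this.used}/${this.maxTokens}`
      );
    }
  }

  get remaining(): number {
    return Math.max(0, this.maxTokens - this.used);
  }
}

const budget = new TokenBudget(10_000);
budget.charge(6_000);
console.log(budget.remaining); // 4000

let tripped = false;
try {
  budget.charge(5_000); // would push usage to 11,000
} catch (e) {
  tripped = e instanceof TokenKillSwitch;
}
console.log(tripped); // true
```

Throwing (rather than returning a flag) guarantees the loop cannot silently continue past the cap, no matter where the charge happens.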
🛠️ Troubleshooting & Usage Audit
| Issue | Likely Cause | 2026 Corrective Action |
|---|---|---|
| Billing Spikes | Unoptimized multimodal input | Downsample images/video before sending to the model. |
| Low Quality (Lite) | Insufficient reasoning depth | Switch to a higher thinkingLevel, or escalate the task to Flash. |
| Cache Misses | Context drift in dynamic files | Isolate stable imports/types from volatile business logic. |
| Hallucination | Instruction drift in long context | Use XML-tagged system instructions to pin the role and constraints. |
📚 Reference Library
📚 参考库
- Model Selection Matrix: Choosing the right model for the job.
- Advanced Caching: Mastering TTL and cache warming.
- Monitoring & Governance: Tools for tracking ROI.
📊 Economic Metrics
📊 经济指标
- Cost per Feature: < $0.05 (Target for Squaads agents).
- Token Efficiency: > 80% (Knowledge vs Boilerplate).
- Cache Hit Rate: > 75% for codebase queries.
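The three targets above can be checked mechanically from raw usage counters. A minimal sketch — the counter fields and function name are hypothetical; the thresholds mirror the targets in this section:

```typescript
// Illustrative metrics check against the stated Squaads targets.
interface UsageCounters {
  totalCostUsd: number;
  featuresShipped: number;
  knowledgeTokens: number;   // tokens carrying task-relevant content
  boilerplateTokens: number; // scaffolding, retries, repeated instructions
  cacheHits: number;
  cacheLookups: number;
}

function meetsTargets(u: UsageCounters): boolean {
  const costPerFeature = u.totalCostUsd / u.featuresShipped;
  const tokenEfficiency =
    u.knowledgeTokens / (u.knowledgeTokens + u.boilerplateTokens);
  const cacheHitRate = u.cacheHits / u.cacheLookups;
  return costPerFeature < 0.05 && tokenEfficiency > 0.8 && cacheHitRate > 0.75;
}

const ok = meetsTargets({
  totalCostUsd: 0.4,       // $0.04 per feature
  featuresShipped: 10,
  knowledgeTokens: 9_000,  // 90% efficiency
  boilerplateTokens: 1_000,
  cacheHits: 80,           // 80% hit rate
  cacheLookups: 100,
});
console.log(ok); // true
```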
🔄 Evolution of AI Pricing
🔄 AI定价的演变
- 2023: Fixed per-token pricing (Prohibitive for large context).
- 2024: First-gen Context Caching (Pro-only).
- 2025-2026: Ubiquitous Caching and "Reasoning-on-Demand" (Thinking Level parameters).
End of AI Cost Optimizer Standard (v1.1.0)
Updated: January 22, 2026 - 23:45