ai-cost-optimizer


Skill: AI Cost Optimizer (Standard 2026)


Role: The AI Cost Optimizer is a specialized "Token Economist" responsible for maximizing the reasoning output of AI agents while minimizing operational expense. In 2026, this role masters the pricing tiers of the Gemini 3 Flash and Lite models, implementing "Thinking-Level" routing and multi-layered caching to achieve up to 90% cost reduction on high-volume applications.

🎯 Primary Objectives


  1. Economic Orchestration: Dynamically routing prompts between Gemini 3 Pro, Flash, and Lite based on complexity.
  2. Context Caching Mastery: Implementing implicit and explicit caching for system instructions and long documents (v1.35.0+).
  3. Token Engineering: Reducing "Noise tokens" through XML-tagging and strict response schemas.
  4. Usage Governance: Implementing granular quotas and attribution to prevent runaway API billing.


🏗️ The 2026 Economic Stack


1. Target Models


  • Gemini 3 Pro: Reserved for "Mission Critical" reasoning and deep architecture mapping.
  • Gemini 3 Flash-Preview: The "Workhorse" for most coding and extraction tasks ($0.50 per 1M input tokens).
  • Gemini Flash-Lite-Latest: The "Utility" agent for real-time validation and short-burst responses.
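The three-tier split above can be expressed as a small routing function. This is a minimal sketch: the model IDs come from the list above, but the complexity score and thresholds are illustrative assumptions you would tune per workload.

```typescript
// Sketch: complexity-scored model routing across the three 2026 tiers.
// The complexity score (0..1) and cutoffs are assumptions, not fixed values.
type ModelTier = "gemini-3-pro" | "gemini-3-flash-preview" | "gemini-flash-lite-latest";

export function pickModel(complexity: number): ModelTier {
  if (complexity >= 0.8) return "gemini-3-pro";           // "Mission Critical" reasoning
  if (complexity >= 0.3) return "gemini-3-flash-preview"; // the "Workhorse" tier
  return "gemini-flash-lite-latest";                      // the "Utility" tier
}
```

In practice the complexity score could be derived from prompt length, tool-call depth, or a cheap classifier call to the Lite model itself.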

2. Optimization Tools


  • Google GenAI Context Caching: Reducing input fees for stable context blocks.
  • Thinking Level Param: Controlling reasoning depth for cost/latency trade-offs.
  • Prompt Registry: Deduplicating and optimizing recurring system instructions.

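The Prompt Registry idea can be sketched as a content-hash deduplication map. The class below is an assumption built from the description above (the document only names the pattern); it returns a stable key per unique instruction so callers reuse the key instead of resending identical system prompts.

```typescript
import { createHash } from "node:crypto";

// Sketch of a Prompt Registry: dedupe recurring system instructions by content hash.
// The class and its API are illustrative assumptions, not an existing library.
export class PromptRegistry {
  private store = new Map<string, string>();

  register(instruction: string): string {
    const normalized = instruction.trim();
    const key = createHash("sha256").update(normalized).digest("hex").slice(0, 12);
    if (!this.store.has(key)) this.store.set(key, normalized);
    return key; // stable ID: identical instructions always map to the same key
  }

  resolve(key: string): string | undefined {
    return this.store.get(key);
  }

  get size(): number {
    return this.store.size;
  }
}
```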

🛠️ Implementation Patterns


1. The "Thinking Level" Router


Adjusting the model's internal reasoning effort based on the task type.
```typescript
// 2026 Pattern: Cost-Aware Generation
const model = genAI.getGenerativeModel({
  model: "gemini-3-flash",
  generationConfig: {
    thinkingLevel: taskComplexity === 'high' ? 'standard' : 'low',
    responseMimeType: "application/json",
  }
});
```
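The `taskComplexity` input above has to come from somewhere. A minimal sketch, assuming a small fixed task taxonomy (the categories and mapping are illustrative, not part of any SDK):

```typescript
// Sketch: deriving the thinkingLevel from a task category before calling the model.
// Task categories and the mapping are assumptions for illustration.
type TaskType = "validation" | "extraction" | "coding" | "architecture";

export function thinkingLevelFor(task: TaskType): "low" | "standard" {
  // Validation and extraction are shallow; coding and architecture warrant
  // deeper reasoning (architecture work also escalates to the Pro tier).
  return task === "validation" || task === "extraction" ? "low" : "standard";
}
```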

2. Explicit Context Caching (v1.35.0+)


Crucial for large codebases or stable documentation.
```typescript
// Squaads Standard: 1M+ token repository caching
const codebaseCache = await cacheManager.create({
  model: "gemini-flash-lite-latest",
  contents: [{ role: "user", parts: [{ text: fullRepoData }] }],
  ttlSeconds: 86400, // Cache for 24 hours
});

// Subsequent calls use cachedContent to avoid full re-billing
const result = await model.generateContent({
  cachedContent: codebaseCache.name,
  contents: [{ role: "user", parts: [{ text: "Explain the auth flow." }] }],
});
```
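A 24-hour TTL only pays off if the cached repository snapshot is still current; serving a stale cache is a correctness bug, not a saving. A minimal drift check, assuming you store a content fingerprint alongside the cache name (the helper functions are illustrative):

```typescript
import { createHash } from "node:crypto";

// Sketch: detect context drift so a stale explicit cache can be recreated.
// Storing the fingerprint next to the cache name is an assumed convention.
export function contentFingerprint(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

export function cacheIsStale(storedFingerprint: string, currentRepoData: string): boolean {
  // If the "stable" context block changed, recreate the cache before querying.
  return storedFingerprint !== contentFingerprint(currentRepoData);
}
```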

3. XML System Instruction Packing


Using XML tags to reduce instruction drift and token wastage in multi-turn chats.
```xml
<system_instruction>
  <role>Senior Architect</role>
  <constraints>No legacy PHP, use Property Hooks</constraints>
</system_instruction>
```

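Packing the instruction programmatically keeps the tag structure consistent across agents. A small sketch: the tag names mirror the example above, while the builder function itself is an assumption for illustration.

```typescript
// Sketch: build the XML-packed system instruction from structured inputs.
// Tag names follow the example above; the function is illustrative, not an SDK API.
export function packSystemInstruction(role: string, constraints: string[]): string {
  return [
    "<system_instruction>",
    `  <role>${role}</role>`,
    `  <constraints>${constraints.join(", ")}</constraints>`,
    "</system_instruction>",
  ].join("\n");
}
```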

🚫 The "Do Not List" (Anti-Patterns)


  1. NEVER send a full codebase in every prompt. Use Repomix for pruning and Context Caching for reuse.
  2. NEVER use high-resolution video frames (280 tokens) for tasks that only need low-res (70 tokens).
  3. NEVER default to Gemini 3 Pro. Always start with Flash-Lite and escalate only if validation fails.
  4. NEVER allow agents to run in an infinite loop without a "Kill Switch" based on token accumulation.

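Anti-pattern #4 can be enforced with a few lines of guard code. A minimal "Kill Switch" sketch, where the budget value and class shape are illustrative assumptions:

```typescript
// Sketch of a token Kill Switch: abort an agent loop once accumulated
// token usage crosses a hard budget. The budget value is caller-chosen.
export class TokenKillSwitch {
  private used = 0;

  constructor(private readonly budget: number) {}

  record(tokens: number): void {
    this.used += tokens;
    if (this.used > this.budget) {
      throw new Error(`Token budget exceeded: ${this.used}/${this.budget}`);
    }
  }

  get remaining(): number {
    return Math.max(0, this.budget - this.used);
  }
}
```

Call `record()` with each response's token count inside the agent loop; the thrown error is the loop's hard stop.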

🛠️ Troubleshooting & Usage Audit


| Issue | Likely Cause | 2026 Corrective Action |
| --- | --- | --- |
| Billing Spikes | Unoptimized multimodal input | Downsample images/video before sending to the model. |
| Low Quality (Lite) | Insufficient reasoning depth | Switch `thinkingLevel` to `standard` or route to Flash-Preview. |
| Cache Misses | Context drift in dynamic files | Isolate stable imports/types from volatile business logic. |
| Hallucination | Instruction drift in long context | Use `<system_instruction>` tags and explicit "Do Not" lists. |


📚 Reference Library


  • Model Selection Matrix: Choosing the right model for the job.
  • Advanced Caching: Mastering TTL and cache warming.
  • Monitoring & Governance: Tools for tracking ROI.


📊 Economic Metrics


  • Cost per Feature: < $0.05 (Target for Squaads agents).
  • Token Efficiency: > 80% (Knowledge vs Boilerplate).
  • Cache Hit Rate: > 75% for codebase queries.

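The cache-hit-rate metric can be computed trivially from hit/miss counters. A sketch, assuming you tally hits and misses from your API client's response metadata (the function names are illustrative):

```typescript
// Sketch: tracking the cache-hit-rate metric against the >75% target.
// Wire hits/misses to your client's cache metadata; these helpers are illustrative.
export function cacheHitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

export function meetsTarget(hits: number, misses: number, target = 0.75): boolean {
  return cacheHitRate(hits, misses) >= target;
}
```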

🔄 Evolution of AI Pricing


  • 2023: Fixed per-token pricing (Prohibitive for large context).
  • 2024: First-gen Context Caching (Pro-only).
  • 2025-2026: Ubiquitous Caching and "Reasoning-on-Demand" (Thinking Level parameters).

End of AI Cost Optimizer Standard (v1.1.0)
Updated: January 22, 2026 - 23:45