ai-cost-optimizer


Skill: AI Cost Optimizer (Standard 2026)


Role: The AI Cost Optimizer is a specialized "Token Economist" responsible for maximizing the reasoning output of AI agents while minimizing operational expense. In 2026, this role masters the pricing tiers of the Gemini 3 Flash and Lite models, implementing "Thinking-Level" routing and multi-layered caching to achieve up to 90% cost reduction on high-volume applications.

🎯 Primary Objectives


  1. Economic Orchestration: Dynamically routing prompts between Gemini 3 Pro, Flash, and Lite based on complexity.
  2. Context Caching Mastery: Implementing implicit and explicit caching for system instructions and long documents (v1.35.0+).
  3. Token Engineering: Reducing "Noise tokens" through XML-tagging and strict response schemas.
  4. Usage Governance: Implementing granular quotas and attribution to prevent runaway API billing.


🏗️ The 2026 Economic Stack


1. Target Models


  • Gemini 3 Pro: Reserved for "Mission Critical" reasoning and deep architecture mapping.
  • Gemini 3 Flash-Preview: The "Workhorse" for most coding and extraction tasks ($0.50 per 1M input tokens).
  • Gemini Flash-Lite-Latest: The "Utility" agent for real-time validation and short-burst responses.
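The three-tier split above can be expressed as a small routing function. This is a minimal sketch: the model IDs come from the list above, but the complexity score and thresholds are illustrative assumptions you would tune per workload.

```typescript
// Sketch: complexity-scored model routing across the three 2026 tiers.
// The complexity score (0..1) and cutoffs are assumptions, not fixed values.
type ModelTier = "gemini-3-pro" | "gemini-3-flash-preview" | "gemini-flash-lite-latest";

export function pickModel(complexity: number): ModelTier {
  if (complexity >= 0.8) return "gemini-3-pro";           // "Mission Critical" reasoning
  if (complexity >= 0.3) return "gemini-3-flash-preview"; // the "Workhorse" tier
  return "gemini-flash-lite-latest";                      // the "Utility" tier
}
```

In practice the complexity score could be derived from prompt length, tool-call depth, or a cheap classifier call to the Lite model itself.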

2. Optimization Tools


  • Google GenAI Context Caching: Reducing input fees for stable context blocks.
  • Thinking Level Param: Controlling reasoning depth for cost/latency trade-offs.
  • Prompt Registry: Deduplicating and optimizing recurring system instructions.

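The Prompt Registry idea can be sketched as a content-hash deduplication map. The class below is an assumption built from the description above (the document only names the pattern); it returns a stable key per unique instruction so callers reuse the key instead of resending identical system prompts.

```typescript
import { createHash } from "node:crypto";

// Sketch of a Prompt Registry: dedupe recurring system instructions by content hash.
// The class and its API are illustrative assumptions, not an existing library.
export class PromptRegistry {
  private store = new Map<string, string>();

  register(instruction: string): string {
    const normalized = instruction.trim();
    const key = createHash("sha256").update(normalized).digest("hex").slice(0, 12);
    if (!this.store.has(key)) this.store.set(key, normalized);
    return key; // stable ID: identical instructions always map to the same key
  }

  resolve(key: string): string | undefined {
    return this.store.get(key);
  }

  get size(): number {
    return this.store.size;
  }
}
```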

🛠️ Implementation Patterns


1. The "Thinking Level" Router


Adjusting the model's internal reasoning effort based on the task type.
```typescript
// 2026 Pattern: Cost-Aware Generation
const model = genAI.getGenerativeModel({
  model: "gemini-3-flash",
  generationConfig: {
    thinkingLevel: taskComplexity === 'high' ? 'standard' : 'low',
    responseMimeType: "application/json",
  }
});
```
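The `taskComplexity` input above has to come from somewhere. A minimal sketch, assuming a small fixed task taxonomy (the categories and mapping are illustrative, not part of any SDK):

```typescript
// Sketch: deriving the thinkingLevel from a task category before calling the model.
// Task categories and the mapping are assumptions for illustration.
type TaskType = "validation" | "extraction" | "coding" | "architecture";

export function thinkingLevelFor(task: TaskType): "low" | "standard" {
  // Validation and extraction are shallow; coding and architecture warrant
  // deeper reasoning (architecture work also escalates to the Pro tier).
  return task === "validation" || task === "extraction" ? "low" : "standard";
}
```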

2. Explicit Context Caching (v1.35.0+)


Crucial for large codebases or stable documentation.
```typescript
// Squaads Standard: 1M+ token repository caching
const codebaseCache = await cacheManager.create({
  model: "gemini-flash-lite-latest",
  contents: [{ role: "user", parts: [{ text: fullRepoData }] }],
  ttlSeconds: 86400, // Cache for 24 hours
});

// Subsequent calls use cachedContent to avoid full re-billing
const result = await model.generateContent({
  cachedContent: codebaseCache.name,
  contents: [{ role: "user", parts: [{ text: "Explain the auth flow." }] }],
});
```
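A 24-hour TTL only pays off if the cached repository snapshot is still current; serving a stale cache is a correctness bug, not a saving. A minimal drift check, assuming you store a content fingerprint alongside the cache name (the helper functions are illustrative):

```typescript
import { createHash } from "node:crypto";

// Sketch: detect context drift so a stale explicit cache can be recreated.
// Storing the fingerprint next to the cache name is an assumed convention.
export function contentFingerprint(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

export function cacheIsStale(storedFingerprint: string, currentRepoData: string): boolean {
  // If the "stable" context block changed, recreate the cache before querying.
  return storedFingerprint !== contentFingerprint(currentRepoData);
}
```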

3. XML System Instruction Packing


Using XML tags to reduce instruction drift and token wastage in multi-turn chats.
```xml
<system_instruction>
  <role>Senior Architect</role>
  <constraints>No legacy PHP, use Property Hooks</constraints>
</system_instruction>
```

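Packing the instruction programmatically keeps the tag structure consistent across agents. A small sketch: the tag names mirror the example above, while the builder function itself is an assumption for illustration.

```typescript
// Sketch: build the XML-packed system instruction from structured inputs.
// Tag names follow the example above; the function is illustrative, not an SDK API.
export function packSystemInstruction(role: string, constraints: string[]): string {
  return [
    "<system_instruction>",
    `  <role>${role}</role>`,
    `  <constraints>${constraints.join(", ")}</constraints>`,
    "</system_instruction>",
  ].join("\n");
}
```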

🚫 The "Do Not List" (Anti-Patterns)


  1. NEVER send a full codebase in every prompt. Use Repomix for pruning and Context Caching for reuse.
  2. NEVER use high-resolution video frames (280 tokens) for tasks that only need low-res (70 tokens).
  3. NEVER default to Gemini 3 Pro. Always start with Flash-Lite and escalate only if validation fails.
  4. NEVER allow agents to run in an infinite loop without a "Kill Switch" based on token accumulation.

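Anti-pattern #4 can be enforced with a few lines of guard code. A minimal "Kill Switch" sketch, where the budget value and class shape are illustrative assumptions:

```typescript
// Sketch of a token Kill Switch: abort an agent loop once accumulated
// token usage crosses a hard budget. The budget value is caller-chosen.
export class TokenKillSwitch {
  private used = 0;

  constructor(private readonly budget: number) {}

  record(tokens: number): void {
    this.used += tokens;
    if (this.used > this.budget) {
      throw new Error(`Token budget exceeded: ${this.used}/${this.budget}`);
    }
  }

  get remaining(): number {
    return Math.max(0, this.budget - this.used);
  }
}
```

Call `record()` with each response's token count inside the agent loop; the thrown error is the loop's hard stop.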

🛠️ Troubleshooting & Usage Audit


| Issue | Likely Cause | 2026 Corrective Action |
| --- | --- | --- |
| Billing Spikes | Unoptimized multimodal input | Downsample images/video before sending to the model. |
| Low Quality (Lite) | Insufficient reasoning depth | Switch `thinkingLevel` to `standard` or route to Flash-Preview. |
| Cache Misses | Context drift in dynamic files | Isolate stable imports/types from volatile business logic. |
| Hallucination | Instruction drift in long context | Use `<system_instruction>` tags and explicit "Do Not" lists. |


📚 Reference Library


  • Model Selection Matrix: Choosing the right model for the job.
  • Advanced Caching: Mastering TTL and cache warming.
  • Monitoring & Governance: Tools for tracking ROI.


📊 Economic Metrics


  • Cost per Feature: < $0.05 (Target for Squaads agents).
  • Token Efficiency: > 80% (Knowledge vs Boilerplate).
  • Cache Hit Rate: > 75% for codebase queries.

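The cache-hit-rate metric can be computed trivially from hit/miss counters. A sketch, assuming you tally hits and misses from your API client's response metadata (the function names are illustrative):

```typescript
// Sketch: tracking the cache-hit-rate metric against the >75% target.
// Wire hits/misses to your client's cache metadata; these helpers are illustrative.
export function cacheHitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

export function meetsTarget(hits: number, misses: number, target = 0.75): boolean {
  return cacheHitRate(hits, misses) >= target;
}
```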

🔄 Evolution of AI Pricing


  • 2023: Fixed per-token pricing (Prohibitive for large context).
  • 2024: First-gen Context Caching (Pro-only).
  • 2025-2026: Ubiquitous Caching and "Reasoning-on-Demand" (Thinking Level parameters).

End of AI Cost Optimizer Standard (v1.1.0)
Updated: January 22, 2026 - 23:45