token-budget-tracking
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseToken Budget Tracking & Optimization
令牌预算跟踪与优化
Overview
概述
In production multi-agent systems, token costs are the new infrastructure bill — and they can spiral fast. An agent in a loop can burn through hundreds of dollars in minutes. This skill covers how to set budgets, track consumption in real time, attribute costs to specific agents and tasks, and optimize token usage without sacrificing quality.
在生产环境的多Agent系统中,令牌成本已成为新的基础设施开销——而且成本增长速度极快。一个陷入循环的Agent可能在几分钟内消耗数百美元。本技能介绍如何设置预算、实时跟踪消耗、将成本归因于特定Agent和任务,并在不牺牲质量的前提下优化令牌使用。
Core Concepts
核心概念
Token Cost Economics
令牌成本经济学
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Cost per 100K tasks (4K avg) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$1,250 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | ~$1,800 |
| GPT-4o-mini | $0.15 | $0.60 | ~$75 |
| Claude 3 Haiku | $0.25 | $1.25 | ~$150 |
A single runaway agent consuming 50K tokens per loop for 100 iterations = $50-$150 in minutes.
| 模型 | 输入成本(美元/百万令牌) | 输出成本(美元/百万令牌) | 每10万次任务成本(平均4K令牌) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$1,250 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | ~$1,800 |
| GPT-4o-mini | $0.15 | $0.60 | ~$75 |
| Claude 3 Haiku | $0.25 | $1.25 | ~$150 |
单个失控Agent每次循环消耗5万令牌,运行100次迭代 = 几分钟内消耗50-150美元。
Budget Dimensions
预算维度
| Dimension | What It Tracks | Why It Matters |
|---|---|---|
| Per Agent | Tokens consumed by each agent | Identify expensive agents |
| Per Task | Cost per completed task | Measure ROI per task type |
| Per User | Cost attributed to a user/session | Bill-back, abuse detection |
| Per Model | Cost by LLM provider/model | Model selection decisions |
| Daily/Weekly | Aggregate burn rate | Budget forecasting |
| Per Step | Tokens per reasoning step | Detect inefficient reasoning |
| 维度 | 跟踪内容 | 重要性 |
|---|---|---|
| 按Agent | 每个Agent消耗的令牌 | 识别高成本Agent |
| 按任务 | 每项已完成任务的成本 | 衡量不同任务类型的投资回报率 |
| 按用户 | 归因于单个用户/会话的成本 | 成本回溯、滥用检测 |
| 按模型 | 各LLM提供商/模型的成本 | 模型选型决策 |
| 每日/每周 | 总消耗速率 | 预算预测 |
| 按步骤 | 每个推理步骤的令牌数 | 检测低效推理 |
Step-by-Step Implementation
分步实现
Step 1: Build a Token Counter
步骤1:构建令牌计数器
python
from dataclasses import dataclass, field
from collections import defaultdict
import time
import threading
@dataclass
class TokenUsage:
prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
def __add__(self, other: "TokenUsage"):
return TokenUsage(
prompt_tokens=self.prompt_tokens + other.prompt_tokens,
completion_tokens=self.completion_tokens + other.completion_tokens,
total_tokens=self.total_tokens + other.total_tokens
)
class TokenCounter:
"""Tracks token usage across all agents with attribution."""
def __init__(self):
self.usage: dict[str, dict[str, TokenUsage]] = defaultdict(
lambda: defaultdict(TokenUsage)
)
self._lock = threading.Lock()
def record(self, agent_name: str, task_id: str,
prompt_tokens: int, completion_tokens: int):
"""Record token usage for an agent-task pair."""
with self._lock:
usage = TokenUsage(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens
)
self.usage[agent_name][task_id] = usage
def agent_total(self, agent_name: str) -> TokenUsage:
"""Get total tokens for an agent."""
with self._lock:
total = TokenUsage()
for task_usage in self.usage[agent_name].values():
total += task_usage
return total
def task_cost(self, agent_name: str, task_id: str,
input_rate: float, output_rate: float) -> float:
"""Calculate monetary cost for a specific task."""
usage = self.usage[agent_name].get(task_id)
if not usage:
return 0.0
return (
usage.prompt_tokens * input_rate / 1_000_000 +
usage.completion_tokens * output_rate / 1_000_000
)
def top_agents(self, n: int = 10) -> list[tuple[str, TokenUsage]]:
"""Get the n highest-consuming agents."""
with self._lock:
totals = [
(agent, self.agent_total(agent))
for agent in self.usage
]
totals.sort(key=lambda x: x[1].total_tokens, reverse=True)
return totals[:n]python
from dataclasses import dataclass, field
from collections import defaultdict
import time
import threading
@dataclass
class TokenUsage:
prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
def __add__(self, other: "TokenUsage"):
return TokenUsage(
prompt_tokens=self.prompt_tokens + other.prompt_tokens,
completion_tokens=self.completion_tokens + other.completion_tokens,
total_tokens=self.total_tokens + other.total_tokens
)
class TokenCounter:
"""Tracks token usage across all agents with attribution."""
def __init__(self):
self.usage: dict[str, dict[str, TokenUsage]] = defaultdict(
lambda: defaultdict(TokenUsage)
)
self._lock = threading.Lock()
def record(self, agent_name: str, task_id: str,
prompt_tokens: int, completion_tokens: int):
"""Record token usage for an agent-task pair."""
with self._lock:
usage = TokenUsage(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens
)
self.usage[agent_name][task_id] = usage
def agent_total(self, agent_name: str) -> TokenUsage:
"""Get total tokens for an agent."""
with self._lock:
total = TokenUsage()
for task_usage in self.usage[agent_name].values():
total += task_usage
return total
def task_cost(self, agent_name: str, task_id: str,
input_rate: float, output_rate: float) -> float:
"""Calculate monetary cost for a specific task."""
usage = self.usage[agent_name].get(task_id)
if not usage:
return 0.0
return (
usage.prompt_tokens * input_rate / 1_000_000 +
usage.completion_tokens * output_rate / 1_000_000
)
def top_agents(self, n: int = 10) -> list[tuple[str, TokenUsage]]:
"""Get the n highest-consuming agents."""
with self._lock:
totals = [
(agent, self.agent_total(agent))
for agent in self.usage
]
totals.sort(key=lambda x: x[1].total_tokens, reverse=True)
return totals[:n]Step 2: Implement Budget Enforcements
步骤2:实现预算强制执行
python
class TokenBudget:
"""Enforce per-agent and global token budgets."""
def __init__(self, counter: TokenCounter):
self.counter = counter
self.agent_limits: dict[str, int] = {} # agent -> max tokens
self.period_limits: dict[str, int] = {} # period -> max tokens
# Current period tracking
self.period_start = time.time()
self.period_usage: dict[str, int] = defaultdict(int)
def set_agent_limit(self, agent_name: str, max_tokens: int):
"""Set a per-agent token budget."""
self.agent_limits[agent_name] = max_tokens
def set_period_limit(self, period_name: str, max_tokens: int):
"""Set a global period budget (e.g., daily, weekly)."""
self.period_limits[period_name] = max_tokens
async def check_and_apply_budget(self, agent_name: str,
estimated_tokens: int) -> bool:
"""Check if this request would exceed any budget. Returns True if allowed."""
# 1. Check per-agent limit
agent_limit = self.agent_limits.get(agent_name)
if agent_limit:
current = self.counter.agent_total(agent_name).total_tokens
if current + estimated_tokens > agent_limit:
return False # Agent budget exceeded
# 2. Check period limits
current_period = self._current_period()
for period, limit in self.period_limits.items():
if self.period_usage[period] + estimated_tokens > limit:
return False # Global period budget exceeded
# 3. Reserve tokens
self.period_usage[current_period] += estimated_tokens
return True
def _current_period(self) -> str:
"""Get the current period label (e.g., 'daily:2024-01-15')."""
now = datetime.now()
return f"daily:{now.strftime('%Y-%m-%d')}"
def reset_period(self):
"""Reset period tracking (call daily, weekly)."""
self.period_start = time.time()
self.period_usage.clear()python
class TokenBudget:
"""Enforce per-agent and global token budgets."""
def __init__(self, counter: TokenCounter):
self.counter = counter
self.agent_limits: dict[str, int] = {} # agent -> max tokens
self.period_limits: dict[str, int] = {} # period -> max tokens
# Current period tracking
self.period_start = time.time()
self.period_usage: dict[str, int] = defaultdict(int)
def set_agent_limit(self, agent_name: str, max_tokens: int):
"""Set a per-agent token budget."""
self.agent_limits[agent_name] = max_tokens
def set_period_limit(self, period_name: str, max_tokens: int):
"""Set a global period budget (e.g., daily, weekly)."""
self.period_limits[period_name] = max_tokens
async def check_and_apply_budget(self, agent_name: str,
estimated_tokens: int) -> bool:
"""Check if this request would exceed any budget. Returns True if allowed."""
# 1. Check per-agent limit
agent_limit = self.agent_limits.get(agent_name)
if agent_limit:
current = self.counter.agent_total(agent_name).total_tokens
if current + estimated_tokens > agent_limit:
return False # Agent budget exceeded
# 2. Check period limits
current_period = self._current_period()
for period, limit in self.period_limits.items():
if self.period_usage[period] + estimated_tokens > limit:
return False # Global period budget exceeded
# 3. Reserve tokens
self.period_usage[current_period] += estimated_tokens
return True
def _current_period(self) -> str:
"""Get the current period label (e.g., 'daily:2024-01-15')."""
now = datetime.now()
return f"daily:{now.strftime('%Y-%m-%d')}"
def reset_period(self):
"""Reset period tracking (call daily, weekly)."""
self.period_start = time.time()
self.period_usage.clear()Step 3: Token Optimization Strategies
步骤3:令牌优化策略
python
class TokenOptimizer:
"""Apply optimization strategies to reduce token consumption."""
def __init__(self, llm):
self.llm = llm
async def compress_context(self, messages: list[dict],
max_tokens: int) -> list[dict]:
"""Compress conversation history to fit within token budget."""
total = self._count_tokens(messages)
if total <= max_tokens:
return messages # No compression needed
# Strategy 1: Remove low-signal turns
messages = self._remove_greetings(messages)
# Strategy 2: Summarize older messages
if self._count_tokens(messages) > max_tokens:
messages = await self._summarize_older(messages)
# Strategy 3: Truncate long messages
if self._count_tokens(messages) > max_tokens:
messages = self._truncate_longest(messages, max_tokens)
return messages
def _remove_greetings(self, messages: list[dict]) -> list[dict]:
"""Remove low-value greetings and acknowledgments."""
greetings = {"hi", "hello", "thanks", "okay", "sure", "got it"}
return [
m for m in messages
if not (m["role"] == "assistant" and
m["content"].strip().lower() in greetings)
]
async def _summarize_older(self, messages: list[dict]) -> list[dict]:
"""Summarize messages beyond a threshold."""
keep_recent = messages[-4:] # Keep last 4 exchanges verbatim
to_summarize = messages[:-4]
if not to_summarize:
return messages
text = "\n".join(
f"[{m['role']}]: {m['content'][:500]}"
for m in to_summarize
)
summary = await self.llm.generate(
f"Summarize this conversation history concisely while preserving "
f"all key facts, decisions, and user preferences:\n\n{text}",
max_tokens=200
)
return [
{"role": "system", "content": f"[Summarized Context]: {summary}"}
] + keep_recent
def _truncate_longest(self, messages: list[dict],
max_tokens: int) -> list[dict]:
"""Truncate the longest messages to fit budget."""
while self._count_tokens(messages) > max_tokens:
longest = max(
messages,
key=lambda m: len(m["content"].split())
)
# Truncate by half
words = longest["content"].split()
longest["content"] = " ".join(words[:len(words)//2])
return messages
def _count_tokens(self, messages: list[dict]) -> int:
"""Rough token estimation (4 chars ≈ 1 token)."""
total = sum(len(m["content"]) for m in messages)
return total // 4python
class TokenOptimizer:
"""Apply optimization strategies to reduce token consumption."""
def __init__(self, llm):
self.llm = llm
async def compress_context(self, messages: list[dict],
max_tokens: int) -> list[dict]:
"""Compress conversation history to fit within token budget."""
total = self._count_tokens(messages)
if total <= max_tokens:
return messages # No compression needed
# Strategy 1: Remove low-signal turns
messages = self._remove_greetings(messages)
# Strategy 2: Summarize older messages
if self._count_tokens(messages) > max_tokens:
messages = await self._summarize_older(messages)
# Strategy 3: Truncate long messages
if self._count_tokens(messages) > max_tokens:
messages = self._truncate_longest(messages, max_tokens)
return messages
def _remove_greetings(self, messages: list[dict]) -> list[dict]:
"""Remove low-value greetings and acknowledgments."""
greetings = {"hi", "hello", "thanks", "okay", "sure", "got it"}
return [
m for m in messages
if not (m["role"] == "assistant" and
m["content"].strip().lower() in greetings)
]
async def _summarize_older(self, messages: list[dict]) -> list[dict]:
"""Summarize messages beyond a threshold."""
keep_recent = messages[-4:] # Keep last 4 exchanges verbatim
to_summarize = messages[:-4]
if not to_summarize:
return messages
text = "\n".join(
f"[{m['role']}]: {m['content'][:500]}"
for m in to_summarize
)
summary = await self.llm.generate(
f"Summarize this conversation history concisely while preserving "
f"all key facts, decisions, and user preferences:\n\n{text}",
max_tokens=200
)
return [
{"role": "system", "content": f"[Summarized Context]: {summary}"}
] + keep_recent
def _truncate_longest(self, messages: list[dict],
max_tokens: int) -> list[dict]:
"""Truncate the longest messages to fit budget."""
while self._count_tokens(messages) > max_tokens:
longest = max(
messages,
key=lambda m: len(m["content"].split())
)
# Truncate by half
words = longest["content"].split()
longest["content"] = " ".join(words[:len(words)//2])
return messages
def _count_tokens(self, messages: list[dict]) -> int:
"""Rough token estimation (4 chars ≈ 1 token)."""
total = sum(len(m["content"]) for m in messages)
return total // 4Step 4: Real-Time Budget Dashboard
步骤4:实时预算仪表盘
python
class BudgetDashboard:
"""Real-time token budget monitoring."""
def __init__(self, counter: TokenCounter, budgets: TokenBudget):
self.counter = counter
self.budgets = budgets
def current_status(self) -> dict:
"""Get current budget status for all agents."""
agents = {}
for agent_name in self.counter.usage:
usage = self.counter.agent_total(agent_name)
limit = self.budgets.agent_limits.get(agent_name, float('inf'))
agents[agent_name] = {
"total_tokens": usage.total_tokens,
"prompt_tokens": usage.prompt_tokens,
"completion_tokens": usage.completion_tokens,
"budget_limit": limit,
"percent_used": (usage.total_tokens / limit * 100) if limit != float('inf') else 0,
"estimated_cost": self._estimate_cost(usage),
}
return {
"agents": agents,
"global": {
"total_tokens": sum(a["total_tokens"] for a in agents.values()),
"total_cost": sum(a["estimated_cost"] for a in agents.values()),
},
"period_usage": dict(self.budgets.period_usage)
}
def _estimate_cost(self, usage: TokenUsage) -> float:
"""Estimate cost at blended rate ($0.003/1K tokens)."""
return usage.total_tokens * 0.003 / 1000
def budget_alert(self, threshold: float = 0.8) -> list[str]:
"""Get alerts for agents approaching budget limits."""
alerts = []
for agent_name, info in self.current_status()["agents"].items():
if info["percent_used"] > threshold * 100:
alerts.append(
f"{agent_name}: {info['percent_used']:.0f}% of budget used "
f"({info['total_tokens']:,} tokens)"
)
return alertspython
class BudgetDashboard:
"""Real-time token budget monitoring."""
def __init__(self, counter: TokenCounter, budgets: TokenBudget):
self.counter = counter
self.budgets = budgets
def current_status(self) -> dict:
"""Get current budget status for all agents."""
agents = {}
for agent_name in self.counter.usage:
usage = self.counter.agent_total(agent_name)
limit = self.budgets.agent_limits.get(agent_name, float('inf'))
agents[agent_name] = {
"total_tokens": usage.total_tokens,
"prompt_tokens": usage.prompt_tokens,
"completion_tokens": usage.completion_tokens,
"budget_limit": limit,
"percent_used": (usage.total_tokens / limit * 100) if limit != float('inf') else 0,
"estimated_cost": self._estimate_cost(usage),
}
return {
"agents": agents,
"global": {
"total_tokens": sum(a["total_tokens"] for a in agents.values()),
"total_cost": sum(a["estimated_cost"] for a in agents.values()),
},
"period_usage": dict(self.budgets.period_usage)
}
def _estimate_cost(self, usage: TokenUsage) -> float:
"""Estimate cost at blended rate ($0.003/1K tokens)."""
return usage.total_tokens * 0.003 / 1000
def budget_alert(self, threshold: float = 0.8) -> list[str]:
"""Get alerts for agents approaching budget limits."""
alerts = []
for agent_name, info in self.current_status()["agents"].items():
if info["percent_used"] > threshold * 100:
alerts.append(
f"{agent_name}: {info['percent_used']:.0f}% of budget used "
f"({info['total_tokens']:,} tokens)"
)
return alertsStep 5: Proactive Cost Controls
步骤5:主动成本控制
python
class CostController:
"""Proactive cost optimization controls."""
def __init__(self, optimizer: TokenOptimizer, budgets: TokenBudget):
self.optimizer = optimizer
self.budgets = budgets
async def optimize_request(self, agent_name: str,
messages: list[dict]) -> list[dict]:
"""Optimize a request before sending to LLM."""
# Apply optimizations based on agent's budget status
agent_usage = self.budgets.counter.agent_total(agent_name)
agent_limit = self.budgets.agent_limits.get(agent_name, float('inf'))
usage_ratio = agent_usage.total_tokens / agent_limit if agent_limit else 0
if usage_ratio > 0.9:
# Critical: aggressive compression
return await self.optimizer.compress_context(
messages, max_tokens=2000
)
elif usage_ratio > 0.7:
# Warning: moderate compression
return await self.optimizer.compress_context(
messages, max_tokens=4000
)
return messages # Within budget, no optimization needed
def model_selection(self, task_difficulty: str,
available_budget: float) -> str:
"""Select the most cost-effective model for the task."""
models = {
"easy": {"model": "gpt-4o-mini", "cost_per_1k": 0.00015},
"medium": {"model": "gpt-4o", "cost_per_1k": 0.0025},
"hard": {"model": "gpt-4o", "cost_per_1k": 0.0025},
}
recommended = models.get(task_difficulty, models["medium"])
# If budget is very tight, downgrade
if available_budget < recommended["cost_per_1k"] * 100:
return models["easy"]["model"]
return recommended["model"]python
class CostController:
"""Proactive cost optimization controls."""
def __init__(self, optimizer: TokenOptimizer, budgets: TokenBudget):
self.optimizer = optimizer
self.budgets = budgets
async def optimize_request(self, agent_name: str,
messages: list[dict]) -> list[dict]:
"""Optimize a request before sending to LLM."""
# Apply optimizations based on agent's budget status
agent_usage = self.budgets.counter.agent_total(agent_name)
agent_limit = self.budgets.agent_limits.get(agent_name, float('inf'))
usage_ratio = agent_usage.total_tokens / agent_limit if agent_limit else 0
if usage_ratio > 0.9:
# Critical: aggressive compression
return await self.optimizer.compress_context(
messages, max_tokens=2000
)
elif usage_ratio > 0.7:
# Warning: moderate compression
return await self.optimizer.compress_context(
messages, max_tokens=4000
)
return messages # Within budget, no optimization needed
def model_selection(self, task_difficulty: str,
available_budget: float) -> str:
"""Select the most cost-effective model for the task."""
models = {
"easy": {"model": "gpt-4o-mini", "cost_per_1k": 0.00015},
"medium": {"model": "gpt-4o", "cost_per_1k": 0.0025},
"hard": {"model": "gpt-4o", "cost_per_1k": 0.0025},
}
recommended = models.get(task_difficulty, models["medium"])
# If budget is very tight, downgrade
if available_budget < recommended["cost_per_1k"] * 100:
return models["easy"]["model"]
return recommended["model"]Step 6: Budget Alerting Rules
步骤6:预算告警规则
yaml
undefinedyaml
undefinedbudget-alerts.yaml
budget-alerts.yaml
budget_rules:
-
name: daily_budget_exceeded condition: daily_tokens > daily_limit severity: P1 action: pause_all_noncritical_agents message: "Daily token budget exceeded: {used}/{limit}"
-
name: agent_spike condition: agent_hourly_tokens > agent_hourly_limit * 2 severity: P2 action: investigate_agent message: "Agent {name} token usage spiked: {hourly_used}/hr"
-
name: runaway_detected condition: agent_tokens_per_task > task_limit * 3 severity: P1 action: kill_agent_task message: "Agent {name} may be looping: {tokens_per_task} tokens/task"
-
name: budget_forecast condition: projected_daily_tokens > daily_limit * 0.9 severity: P3 action: notify_team message: "On track to exceed daily budget by {overshoot}%"
---budget_rules:
-
name: daily_budget_exceeded condition: daily_tokens > daily_limit severity: P1 action: pause_all_noncritical_agents message: "Daily token budget exceeded: {used}/{limit}"
-
name: agent_spike condition: agent_hourly_tokens > agent_hourly_limit * 2 severity: P2 action: investigate_agent message: "Agent {name} token usage spiked: {hourly_used}/hr"
-
name: runaway_detected condition: agent_tokens_per_task > task_limit * 3 severity: P1 action: kill_agent_task message: "Agent {name} may be looping: {tokens_per_task} tokens/task"
-
name: budget_forecast condition: projected_daily_tokens > daily_limit * 0.9 severity: P3 action: notify_team message: "On track to exceed daily budget by {overshoot}%"
---Token Budget Planning
令牌预算规划
Setting Budgets
设置预算
| Agent Type | Suggested Daily Limit | Monthly Cost Cap |
|---|---|---|
| Customer Support | 500K tokens | ~$150 |
| Research Agent | 2M tokens | ~$600 |
| Code Generation | 3M tokens | ~$900 |
| Data Analysis | 1M tokens | ~$300 |
| Internal Tooling | 200K tokens | ~$60 |
| Agent类型 | 建议每日限额 | 月度成本上限 |
|---|---|---|
| 客户支持 | 50万令牌 | ~$150 |
| 研究Agent | 200万令牌 | ~$600 |
| 代码生成 | 300万令牌 | ~$900 |
| 数据分析 | 100万令牌 | ~$300 |
| 内部工具 | 20万令牌 | ~$60 |
Optimization Impact
优化效果
| Strategy | Typical Savings | Quality Impact |
|---|---|---|
| Shorter system prompts | 15-30% | Low (with good editing) |
| Context compression | 20-40% | Medium (summarization loss) |
| Cheaper models for easy tasks | 40-60% | Low (task-dependent) |
| Fewer reasoning steps | 10-25% | Medium (may reduce quality) |
| Caching common responses | 5-15% | None (exact cache hits) |
| Limiting conversation history | 30-50% | Low (recent context suffices) |
| 策略 | 典型节省比例 | 质量影响 |
|---|---|---|
| 缩短系统提示词 | 15-30% | 低(编辑得当的情况下) |
| 上下文压缩 | 20-40% | 中等(存在摘要信息损失) |
| 简单任务使用低成本模型 | 40-60% | 低(取决于任务类型) |
| 减少推理步骤 | 10-25% | 中等(可能降低质量) |
| 缓存常见响应 | 5-15% | 无(精确缓存命中时) |
| 限制对话历史长度 | 30-50% | 低(近期上下文足够) |
Trigger Phrases
触发短语
| Phrase | Action |
|---|---|
| "Show token usage" | Display current consumption by agent |
| "What's the budget status?" | Show budget vs. actual for all agents |
| "Set budget for agent X" | Configure per-agent token limit |
| "What's costing the most?" | Show top spenders and cost breakdown |
| "Optimize this request" | Compress context to reduce tokens |
| "Run budget report" | Generate daily/weekly cost report |
| "Alert on budget thresholds" | Set up budget alerting rules |
| "Switch to cheaper model" | Downgrade model for cost savings |
| 短语 | 操作 |
|---|---|
| "Show token usage" | 显示各Agent当前消耗情况 |
| "What's the budget status?" | 显示所有Agent的预算与实际消耗对比 |
| "Set budget for agent X" | 配置单个Agent的令牌限额 |
| "What's costing the most?" | 显示高消耗Agent及成本明细 |
| "Optimize this request" | 压缩上下文以减少令牌消耗 |
| "Run budget report" | 生成每日/每周成本报告 |
| "Alert on budget thresholds" | 设置预算告警规则 |
| "Switch to cheaper model" | 降级模型以节省成本 |
Anti-Patterns
反模式
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| No budget tracking | Bills are a surprise each month | Real-time per-agent tracking |
| Same model for all tasks | Overpaying for simple tasks | Model tiering by difficulty |
| No per-agent limits | One runaway agent costs everything | Hard per-agent budgets |
| Ignoring output tokens | Output costs more than input (2-4x) | Track input/output separately |
| No proactive alerts | Find out about spikes at billing | Real-time budget alerts |
| Caching nothing | Pay for identical prompts repeatedly | Implement response caching |
| 反模式 | 失败原因 | 修复方案 |
|---|---|---|
| 无预算跟踪 | 每月账单超出预期 | 实现按Agent实时跟踪 |
| 所有任务使用同一模型 | 为简单任务过度付费 | 根据任务难度分层选择模型 |
| 无按Agent限额 | 单个失控Agent耗尽全部预算 | 设置严格的按Agent预算 |
| 忽略输出令牌成本 | 输出成本是输入的2-4倍 | 分别跟踪输入/输出令牌 |
| 无主动告警 | 账单出来才发现消耗激增 | 实现实时预算告警 |
| 不做任何缓存 | 重复为相同提示词付费 | 实现响应缓存 |