Loading...
Loading...
Track, optimize, and control token consumption across multi-agent systems. Covers budget allocation, real-time monitoring, cost attribution, per-agent limits, and proactive cost optimization for production LLM deployments.
npx skill4agent add cosmicstack-labs/mercury-agent-skills token-budget-tracking| Model | Input ($/1M tokens) | Output ($/1M tokens) | Cost per 100K tasks (4K avg) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$1,250 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | ~$1,800 |
| GPT-4o-mini | $0.15 | $0.60 | ~$75 |
| Claude 3 Haiku | $0.25 | $1.25 | ~$150 |
| Dimension | What It Tracks | Why It Matters |
|---|---|---|
| Per Agent | Tokens consumed by each agent | Identify expensive agents |
| Per Task | Cost per completed task | Measure ROI per task type |
| Per User | Cost attributed to a user/session | Bill-back, abuse detection |
| Per Model | Cost by LLM provider/model | Model selection decisions |
| Daily/Weekly | Aggregate burn rate | Budget forecasting |
| Per Step | Tokens per reasoning step | Detect inefficient reasoning |
from dataclasses import dataclass, field
from collections import defaultdict
import time
import threading
@dataclass
class TokenUsage:
prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
def __add__(self, other: "TokenUsage"):
return TokenUsage(
prompt_tokens=self.prompt_tokens + other.prompt_tokens,
completion_tokens=self.completion_tokens + other.completion_tokens,
total_tokens=self.total_tokens + other.total_tokens
)
class TokenCounter:
"""Tracks token usage across all agents with attribution."""
def __init__(self):
self.usage: dict[str, dict[str, TokenUsage]] = defaultdict(
lambda: defaultdict(TokenUsage)
)
self._lock = threading.Lock()
def record(self, agent_name: str, task_id: str,
prompt_tokens: int, completion_tokens: int):
"""Record token usage for an agent-task pair."""
with self._lock:
usage = TokenUsage(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens
)
self.usage[agent_name][task_id] = usage
def agent_total(self, agent_name: str) -> TokenUsage:
"""Get total tokens for an agent."""
with self._lock:
total = TokenUsage()
for task_usage in self.usage[agent_name].values():
total += task_usage
return total
def task_cost(self, agent_name: str, task_id: str,
input_rate: float, output_rate: float) -> float:
"""Calculate monetary cost for a specific task."""
usage = self.usage[agent_name].get(task_id)
if not usage:
return 0.0
return (
usage.prompt_tokens * input_rate / 1_000_000 +
usage.completion_tokens * output_rate / 1_000_000
)
def top_agents(self, n: int = 10) -> list[tuple[str, TokenUsage]]:
"""Get the n highest-consuming agents."""
with self._lock:
totals = [
(agent, self.agent_total(agent))
for agent in self.usage
]
totals.sort(key=lambda x: x[1].total_tokens, reverse=True)
return totals[:n]class TokenBudget:
"""Enforce per-agent and global token budgets."""
def __init__(self, counter: TokenCounter):
self.counter = counter
self.agent_limits: dict[str, int] = {} # agent -> max tokens
self.period_limits: dict[str, int] = {} # period -> max tokens
# Current period tracking
self.period_start = time.time()
self.period_usage: dict[str, int] = defaultdict(int)
def set_agent_limit(self, agent_name: str, max_tokens: int):
"""Set a per-agent token budget."""
self.agent_limits[agent_name] = max_tokens
def set_period_limit(self, period_name: str, max_tokens: int):
"""Set a global period budget (e.g., daily, weekly)."""
self.period_limits[period_name] = max_tokens
async def check_and_apply_budget(self, agent_name: str,
estimated_tokens: int) -> bool:
"""Check if this request would exceed any budget. Returns True if allowed."""
# 1. Check per-agent limit
agent_limit = self.agent_limits.get(agent_name)
if agent_limit:
current = self.counter.agent_total(agent_name).total_tokens
if current + estimated_tokens > agent_limit:
return False # Agent budget exceeded
# 2. Check period limits
current_period = self._current_period()
for period, limit in self.period_limits.items():
if self.period_usage[period] + estimated_tokens > limit:
return False # Global period budget exceeded
# 3. Reserve tokens
self.period_usage[current_period] += estimated_tokens
return True
def _current_period(self) -> str:
"""Get the current period label (e.g., 'daily:2024-01-15')."""
now = datetime.now()
return f"daily:{now.strftime('%Y-%m-%d')}"
def reset_period(self):
"""Reset period tracking (call daily, weekly)."""
self.period_start = time.time()
self.period_usage.clear()class TokenOptimizer:
"""Apply optimization strategies to reduce token consumption."""
def __init__(self, llm):
self.llm = llm
async def compress_context(self, messages: list[dict],
max_tokens: int) -> list[dict]:
"""Compress conversation history to fit within token budget."""
total = self._count_tokens(messages)
if total <= max_tokens:
return messages # No compression needed
# Strategy 1: Remove low-signal turns
messages = self._remove_greetings(messages)
# Strategy 2: Summarize older messages
if self._count_tokens(messages) > max_tokens:
messages = await self._summarize_older(messages)
# Strategy 3: Truncate long messages
if self._count_tokens(messages) > max_tokens:
messages = self._truncate_longest(messages, max_tokens)
return messages
def _remove_greetings(self, messages: list[dict]) -> list[dict]:
"""Remove low-value greetings and acknowledgments."""
greetings = {"hi", "hello", "thanks", "okay", "sure", "got it"}
return [
m for m in messages
if not (m["role"] == "assistant" and
m["content"].strip().lower() in greetings)
]
async def _summarize_older(self, messages: list[dict]) -> list[dict]:
"""Summarize messages beyond a threshold."""
keep_recent = messages[-4:] # Keep last 4 exchanges verbatim
to_summarize = messages[:-4]
if not to_summarize:
return messages
text = "\n".join(
f"[{m['role']}]: {m['content'][:500]}"
for m in to_summarize
)
summary = await self.llm.generate(
f"Summarize this conversation history concisely while preserving "
f"all key facts, decisions, and user preferences:\n\n{text}",
max_tokens=200
)
return [
{"role": "system", "content": f"[Summarized Context]: {summary}"}
] + keep_recent
def _truncate_longest(self, messages: list[dict],
max_tokens: int) -> list[dict]:
"""Truncate the longest messages to fit budget."""
while self._count_tokens(messages) > max_tokens:
longest = max(
messages,
key=lambda m: len(m["content"].split())
)
# Truncate by half
words = longest["content"].split()
longest["content"] = " ".join(words[:len(words)//2])
return messages
def _count_tokens(self, messages: list[dict]) -> int:
"""Rough token estimation (4 chars ≈ 1 token)."""
total = sum(len(m["content"]) for m in messages)
return total // 4class BudgetDashboard:
"""Real-time token budget monitoring."""
def __init__(self, counter: TokenCounter, budgets: TokenBudget):
self.counter = counter
self.budgets = budgets
def current_status(self) -> dict:
"""Get current budget status for all agents."""
agents = {}
for agent_name in self.counter.usage:
usage = self.counter.agent_total(agent_name)
limit = self.budgets.agent_limits.get(agent_name, float('inf'))
agents[agent_name] = {
"total_tokens": usage.total_tokens,
"prompt_tokens": usage.prompt_tokens,
"completion_tokens": usage.completion_tokens,
"budget_limit": limit,
"percent_used": (usage.total_tokens / limit * 100) if limit != float('inf') else 0,
"estimated_cost": self._estimate_cost(usage),
}
return {
"agents": agents,
"global": {
"total_tokens": sum(a["total_tokens"] for a in agents.values()),
"total_cost": sum(a["estimated_cost"] for a in agents.values()),
},
"period_usage": dict(self.budgets.period_usage)
}
def _estimate_cost(self, usage: TokenUsage) -> float:
"""Estimate cost at blended rate ($0.003/1K tokens)."""
return usage.total_tokens * 0.003 / 1000
def budget_alert(self, threshold: float = 0.8) -> list[str]:
"""Get alerts for agents approaching budget limits."""
alerts = []
for agent_name, info in self.current_status()["agents"].items():
if info["percent_used"] > threshold * 100:
alerts.append(
f"{agent_name}: {info['percent_used']:.0f}% of budget used "
f"({info['total_tokens']:,} tokens)"
)
return alertsclass CostController:
"""Proactive cost optimization controls."""
def __init__(self, optimizer: TokenOptimizer, budgets: TokenBudget):
self.optimizer = optimizer
self.budgets = budgets
async def optimize_request(self, agent_name: str,
messages: list[dict]) -> list[dict]:
"""Optimize a request before sending to LLM."""
# Apply optimizations based on agent's budget status
agent_usage = self.budgets.counter.agent_total(agent_name)
agent_limit = self.budgets.agent_limits.get(agent_name, float('inf'))
usage_ratio = agent_usage.total_tokens / agent_limit if agent_limit else 0
if usage_ratio > 0.9:
# Critical: aggressive compression
return await self.optimizer.compress_context(
messages, max_tokens=2000
)
elif usage_ratio > 0.7:
# Warning: moderate compression
return await self.optimizer.compress_context(
messages, max_tokens=4000
)
return messages # Within budget, no optimization needed
def model_selection(self, task_difficulty: str,
available_budget: float) -> str:
"""Select the most cost-effective model for the task."""
models = {
"easy": {"model": "gpt-4o-mini", "cost_per_1k": 0.00015},
"medium": {"model": "gpt-4o", "cost_per_1k": 0.0025},
"hard": {"model": "gpt-4o", "cost_per_1k": 0.0025},
}
recommended = models.get(task_difficulty, models["medium"])
# If budget is very tight, downgrade
if available_budget < recommended["cost_per_1k"] * 100:
return models["easy"]["model"]
return recommended["model"]# budget-alerts.yaml
budget_rules:
- name: daily_budget_exceeded
condition: daily_tokens > daily_limit
severity: P1
action: pause_all_noncritical_agents
message: "Daily token budget exceeded: {used}/{limit}"
- name: agent_spike
condition: agent_hourly_tokens > agent_hourly_limit * 2
severity: P2
action: investigate_agent
message: "Agent {name} token usage spiked: {hourly_used}/hr"
- name: runaway_detected
condition: agent_tokens_per_task > task_limit * 3
severity: P1
action: kill_agent_task
message: "Agent {name} may be looping: {tokens_per_task} tokens/task"
- name: budget_forecast
condition: projected_daily_tokens > daily_limit * 0.9
severity: P3
action: notify_team
message: "On track to exceed daily budget by {overshoot}%"| Agent Type | Suggested Daily Limit | Monthly Cost Cap |
|---|---|---|
| Customer Support | 500K tokens | ~$150 |
| Research Agent | 2M tokens | ~$600 |
| Code Generation | 3M tokens | ~$900 |
| Data Analysis | 1M tokens | ~$300 |
| Internal Tooling | 200K tokens | ~$60 |
| Strategy | Typical Savings | Quality Impact |
|---|---|---|
| Shorter system prompts | 15-30% | Low (with good editing) |
| Context compression | 20-40% | Medium (summarization loss) |
| Cheaper models for easy tasks | 40-60% | Low (task-dependent) |
| Fewer reasoning steps | 10-25% | Medium (may reduce quality) |
| Caching common responses | 5-15% | None (exact cache hits) |
| Limiting conversation history | 30-50% | Low (recent context suffices) |
| Phrase | Action |
|---|---|
| "Show token usage" | Display current consumption by agent |
| "What's the budget status?" | Show budget vs. actual for all agents |
| "Set budget for agent X" | Configure per-agent token limit |
| "What's costing the most?" | Show top spenders and cost breakdown |
| "Optimize this request" | Compress context to reduce tokens |
| "Run budget report" | Generate daily/weekly cost report |
| "Alert on budget thresholds" | Set up budget alerting rules |
| "Switch to cheaper model" | Downgrade model for cost savings |
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| No budget tracking | Bills are a surprise each month | Real-time per-agent tracking |
| Same model for all tasks | Overpaying for simple tasks | Model tiering by difficulty |
| No per-agent limits | One runaway agent costs everything | Hard per-agent budgets |
| Ignoring output tokens | Output costs more than input (2-4x) | Track input/output separately |
| No proactive alerts | Find out about spikes at billing | Real-time budget alerts |
| Caching nothing | Pay for identical prompts repeatedly | Implement response caching |