llm-cost-optimization


LLM Cost Optimization

Cut LLM costs by 50–90% with the right combination of caching, model selection, prompt optimization, and self-hosting.

When to Use This Skill

Use this skill when:
  • LLM API spend is growing faster than revenue
  • You need to attribute AI costs to teams, products, or customers
  • Implementing caching to avoid redundant LLM calls
  • Deciding when to switch from API providers to self-hosted models
  • Optimizing prompt length without sacrificing quality

Cost Levers by Impact

| Strategy | Typical Savings | Effort |
|---|---|---|
| Semantic caching | 20–50% | Low |
| Model right-sizing | 30–70% | Low |
| Prompt compression | 10–30% | Medium |
| Provider caching (prompt cache) | 10–25% | Low |
| Batching offline workloads | 50% (Batch API) | Medium |
| Self-hosting 7–8B models | 80–95% at scale | High |
| Quantization | 30–50% VRAM cost | Medium |
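The savings above are not additive: each lever acts on the spend left over after the previous one. A rough way to estimate the combined effect (a sketch that assumes the levers act independently; the inputs are illustrative numbers, not measurements):

```python
def stacked_savings(base_monthly_usd: float, savings_fractions: list[float]) -> dict:
    """Estimate combined savings when each lever applies to the spend
    remaining after the previous one (multiplicative assumption)."""
    remaining = base_monthly_usd
    for s in savings_fractions:
        remaining *= (1 - s)
    return {
        "original": base_monthly_usd,
        "optimized": round(remaining, 2),
        "total_savings_pct": round(100 * (1 - remaining / base_monthly_usd), 1),
    }

# e.g. semantic caching (30%) + model right-sizing (50%) + prompt compression (20%)
result = stacked_savings(10_000, [0.30, 0.50, 0.20])  # roughly 72% combined savings
```

Real levers overlap (a cached call can't also be right-sized), so treat the output as an upper bound on combined savings.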

Track Costs First

```python
# Use LiteLLM's cost tracking (automatic per-model pricing)
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
cost = litellm.completion_cost(response)
print(f"Cost: ${cost:.6f}")

# Add custom cost callbacks
def log_cost(kwargs, completion_response, start_time, end_time):
    cost = kwargs.get("response_cost", 0)
    model = kwargs.get("model")
    user = kwargs.get("user")
    # Send to your analytics DB
    db.record_cost(user=user, model=model, cost=cost)

litellm.success_callback = [log_cost]
```

Model Right-Sizing

```python
# Route by task complexity — don't use GPT-4o for everything
def get_model_for_task(task_type: str) -> str:
    routing = {
        "classification": "gpt-4o-mini",  # ~30× cheaper than gpt-4o
        "summarization": "gpt-4o-mini",
        "extraction": "gpt-4o-mini",
        "simple_qa": "gpt-4o-mini",
        "complex_reasoning": "gpt-4o",
        "code_generation": "claude-sonnet-4-6",
        "creative_writing": "claude-opus-4-6",
    }
    return routing.get(task_type, "gpt-4o-mini")

# Cost comparison (per 1M tokens, 2025 approx.)
# gpt-4o-mini: input $0.15 / output $0.60
# gpt-4o: input $2.50 / output $10.00
# claude-sonnet-4-6: input $3.00 / output $15.00
# llama-3.1-8b (self): ~$0.05–0.10 all-in (GPU amortized)
```
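The comparison can be turned into a quick per-request estimator. This is a sketch: the price table hard-codes the approximate 2025 figures listed above, which will drift, so treat it as an assumption to keep updated:

```python
# $ per 1M tokens (input, output) — approximate 2025 prices from the list above
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one request at the assumed prices."""
    in_price, out_price = PRICES[model]
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000

# Same workload, two models: ~17× difference
cheap = estimate_cost("gpt-4o-mini", 2_000, 500)  # ≈ $0.0006
big = estimate_cost("gpt-4o", 2_000, 500)         # ≈ $0.01
```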

Prompt Caching (Provider-Side)

```python
# Anthropic — cache long system prompts (saves 90% on cached tokens)
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant.",
        },
        {
            "type": "text",
            "text": open("large-context.txt").read(),  # large doc
            "cache_control": {"type": "ephemeral"},  # cache this!
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key points."}],
)

# First call: full price. Subsequent calls: 90% discount on cached part.
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

# OpenAI — prompt caching is automatic for repeated prefixes >1024 tokens
# No code change needed; check usage.prompt_tokens_details.cached_tokens
```
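Cached prefixes are not free: on Anthropic, writing the cache costs more than plain input (roughly 1.25x the base input price, with reads at about 0.1x), so caching only pays off once the prefix is reused. A small model of that trade-off; the multipliers are assumptions to verify against current pricing:

```python
def cached_vs_uncached(prefix_tokens: int, n_calls: int,
                       input_price_per_mtok: float = 3.00,
                       write_mult: float = 1.25,   # assumed cache-write premium
                       read_mult: float = 0.10) -> dict:  # assumed cache-read discount
    """Compare the input cost of a shared prefix with and without caching."""
    per_tok = input_price_per_mtok / 1_000_000
    uncached = n_calls * prefix_tokens * per_tok
    # One cache write, then (n_calls - 1) discounted reads
    cached = prefix_tokens * per_tok * (write_mult + (n_calls - 1) * read_mult)
    return {"uncached": uncached, "cached": cached, "worth_it": cached < uncached}

# 50k-token doc reused across 10 calls: caching wins easily
print(cached_vs_uncached(50_000, 10))
# A prefix used exactly once costs MORE with caching (the write premium)
print(cached_vs_uncached(50_000, 1))
```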

Batching with OpenAI Batch API (50% Discount)

```python
import json
from openai import OpenAI

client = OpenAI()

# Prepare batch requests
requests = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Classify: {text}"}],
            "max_tokens": 50,
        },
    }
    for i, text in enumerate(texts)
]

# Write JSONL file
with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch
batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")  # poll status with client.batches.retrieve(batch.id)
```
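Once the batch completes, results arrive as a JSONL file with one line per request, matched back by `custom_id`. A sketch of parsing that output, assuming the documented response shape (a `status_code` plus a `body` mirroring a normal chat completion):

```python
import json

def parse_batch_output(jsonl_text: str) -> dict:
    """Map custom_id -> completion text, or None for per-request failures."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        resp = row.get("response") or {}
        if resp.get("status_code") == 200:
            body = resp["body"]
            results[row["custom_id"]] = body["choices"][0]["message"]["content"]
        else:
            results[row["custom_id"]] = None
    return results

# output_text = client.files.content(batch.output_file_id).text
# labels = parse_batch_output(output_text)
```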

Semantic Caching

```python
import hashlib
import json
import redis
import numpy as np
from sentence_transformers import SentenceTransformer

r = redis.Redis(host="localhost", port=6379)
embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

SIMILARITY_THRESHOLD = 0.92
CACHE_TTL = 3600 * 24  # 24 hours

def cached_llm_call(prompt: str, llm_fn) -> str:
    # 1. Exact match (free)
    exact_key = f"exact:{hashlib.sha256(prompt.encode()).hexdigest()}"
    if cached := r.get(exact_key):
        return cached.decode()

    # 2. Semantic match
    query_vec = embed_model.encode(prompt)
    cached_keys = r.keys("sem:*")
    for key in cached_keys:
        data = json.loads(r.get(key))
        similarity = np.dot(query_vec, data["embedding"]) / (
            np.linalg.norm(query_vec) * np.linalg.norm(data["embedding"])
        )
        if similarity >= SIMILARITY_THRESHOLD:
            return data["response"]

    # 3. Cache miss — call LLM
    response = llm_fn(prompt)

    # Store exact match
    r.setex(exact_key, CACHE_TTL, response)

    # Store semantic embedding
    sem_key = f"sem:{hashlib.sha256(prompt.encode()).hexdigest()}"
    r.setex(sem_key, CACHE_TTL, json.dumps({
        "embedding": query_vec.tolist(),
        "response": response,
        "prompt": prompt,
    }))
    return response
```
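Note that the `KEYS` scan above is O(n) per lookup, fine for small caches but worth moving into a Redis vector index (e.g. RediSearch) at scale. To see the threshold logic in isolation, here is a toy in-memory variant with hand-written 3-dimensional "embeddings" (purely illustrative, no Redis or model needed):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

SIMILARITY_THRESHOLD = 0.92
cache = []  # list of (embedding, response)

def lookup(query_vec):
    """Return the cached response if any entry clears the threshold, else None."""
    best = max(cache, key=lambda e: cosine(query_vec, e[0]), default=None)
    if best and cosine(query_vec, best[0]) >= SIMILARITY_THRESHOLD:
        return best[1]
    return None

cache.append(([1.0, 0.2, 0.0], "cached answer"))
print(lookup([0.9, 0.25, 0.05]))  # near-duplicate query → hit
print(lookup([0.0, 1.0, 0.0]))    # unrelated query → None
```

Tuning `SIMILARITY_THRESHOLD` is the main lever: too low returns stale answers for genuinely different questions, too high turns the cache into an exact-match cache.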

Prompt Compression

```python
# LLMLingua — compress long prompts by 3–20× with minimal quality loss
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    device_map="cpu",
)
compressed = compressor.compress_prompt(
    long_context,
    ratio=0.5,  # keep 50% of tokens
    rank_method="longllmlingua",
)
print(f"Original: {len(long_context.split())} words")
print(f"Compressed: {len(compressed['compressed_prompt'].split())} words")
print(f"Savings: {compressed['saving']}")
```

Self-Hosting Break-Even Calculator

```python
def break_even_analysis(
    monthly_api_spend_usd: float,
    gpu_cost_per_hour_usd: float = 2.50,   # e.g., A10G on AWS
    utilization: float = 0.70,             # 70% GPU utilization
) -> dict:
    monthly_gpu_cost = gpu_cost_per_hour_usd * 24 * 30 * utilization
    break_even = monthly_gpu_cost / monthly_api_spend_usd
    recommendation = (
        "Self-host now — strong ROI" if break_even < 0.5 else
        "Self-host if traffic grows 2×" if break_even < 0.8 else
        "Stick with API — not enough scale yet"
    )
    return {
        "monthly_gpu_cost": f"${monthly_gpu_cost:.0f}",
        "monthly_api_spend": f"${monthly_api_spend_usd:.0f}",
        "gpu_as_pct_of_api": f"{break_even*100:.0f}%",
        "recommendation": recommendation,
    }

# Example: $5k/month on OpenAI, $2.50/hr A10G
print(break_even_analysis(5000))
# → gpu_cost ~$1,260/mo = 25% of API spend → self-host now
```

Cost Dashboard (Grafana)

```python
# Emit cost metrics to Prometheus
from prometheus_client import Counter

llm_cost_total = Counter(
    "llm_cost_usd_total",
    "Total LLM spend in USD",
    ["model", "team", "task_type"],
)
llm_tokens_total = Counter(
    "llm_tokens_total",
    "Total tokens used",
    ["model", "token_type"],  # token_type: prompt, completion, cached
)

def track_call(model, team, task_type, response):
    cost = calculate_cost(model, response.usage)
    llm_cost_total.labels(model=model, team=team, task_type=task_type).inc(cost)
    llm_tokens_total.labels(model=model, token_type="prompt").inc(
        response.usage.prompt_tokens)
    llm_tokens_total.labels(model=model, token_type="completion").inc(
        response.usage.completion_tokens)
```
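`calculate_cost` is referenced above but never defined. A minimal sketch against a static price table; the prices repeat the approximate figures from earlier in this doc and are assumptions to keep current:

```python
# $ per 1M tokens (input, output) — assumed prices; unknown models cost $0
PRICE_TABLE = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

def calculate_cost(model: str, usage) -> float:
    """Dollar cost of one call from a usage object with
    prompt_tokens and completion_tokens attributes."""
    in_p, out_p = PRICE_TABLE.get(model, (0.0, 0.0))
    return (usage.prompt_tokens * in_p + usage.completion_tokens * out_p) / 1_000_000
```

In production, prefer a maintained pricing source (e.g. LiteLLM's `completion_cost`) over a hand-rolled table.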

Best Practices

  • Use gpt-4o-mini or claude-haiku for 80% of tasks — they're 10–30× cheaper.
  • Enable prompt caching for system prompts over 1,024 tokens (the minimum cacheable prefix for both Anthropic and OpenAI).
  • Audit your top 5 prompts by token count — compress or cache them.
  • Set hard budget limits with LiteLLM virtual keys before costs spiral.
  • Self-host 7B–8B models when monthly API spend exceeds $2k.

Related Skills

  • llm-gateway - Centralized cost control
  • llm-caching - Semantic caching patterns
  • vllm-server - Self-hosted inference
  • agent-observability - Token and cost telemetry