ai-cutting-costs
Cut Your AI Costs
Guide the user through reducing AI API costs without sacrificing quality. Multiple strategies, from quick wins to advanced techniques.
Step 1: Understand where the money goes
Ask the user:
- Which provider/model are you using? (GPT-4o, Claude, etc.)
- How many API calls per day/month?
- Is there a specific module or step that's most expensive?
Quick cost audit
```python
import dspy

# Run your program and check token usage
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
result = my_program(question="test")
dspy.inspect_history(n=3)  # Shows token counts per call
```
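To turn token counts into a dollar figure, a quick back-of-envelope estimate helps. The function and prices below are illustrative, not official rates; check your provider's current price sheet:

```python
def monthly_cost(calls_per_day: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimate monthly API spend in dollars from per-call token counts."""
    per_call = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return calls_per_day * per_call * 30

# Example: 10,000 calls/day at 1,500 input + 300 output tokens each,
# with GPT-4o-style pricing (~$5/M input, ~$15/M output)
print(f"${monthly_cost(10_000, 1_500, 300, 5.0, 15.0):,.0f}/month")  # $3,600/month
```

Re-running the same estimate with gpt-4o-mini-style pricing (~$0.15/M in, ~$0.60/M out) drops the figure to roughly $120/month, which is why Step 2 starts with model choice.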
Step 2: Quick wins
Use a cheaper model everywhere
The simplest fix is to switch to a cheaper model and see if quality holds:

```python
# Instead of GPT-4o (~$5/M input tokens)
lm = dspy.LM("openai/gpt-4o-mini")  # ~$0.15/M input tokens — 33x cheaper

# Or use an open-source model
lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
```

Always measure quality before and after with `/ai-improving-accuracy`. When you switch models, re-optimize your prompts; they don't transfer. See `/ai-switching-models` for the full workflow.

Enable caching
DSPy caches LM calls by default. Make sure you're not disabling it:
```python
# Caching is ON by default — same inputs won't re-call the API
lm = dspy.LM("openai/gpt-4o-mini")  # cached automatically

# To verify caching is working, run the same input twice
# and check that the second call is instant
```
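The cache keys on the exact input, so even a one-character difference is a miss. DSPy's cache is internal, but the principle is plain memoization; here is a stdlib sketch of the idea (`call_lm` is a hypothetical stand-in, not a real API client):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def call_lm(prompt: str) -> str:
    # Stand-in for a paid API call; the body runs only on a cache miss
    return f"response to: {prompt}"

call_lm("What is 2+2?")   # miss: would hit the API
call_lm("What is 2+2?")   # hit: served from cache, effectively instant
call_lm("What is 2+2 ?")  # miss again: input differs by one space
print(call_lm.cache_info())  # hits=1, misses=2
```

The practical consequence: normalize inputs (trim whitespace, canonicalize casing where safe) before calling the LM, or near-identical requests will silently miss the cache.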
Step 3: Use different models for different tasks
Not every step in your pipeline needs the expensive model. Use `dspy.context` or `set_lm` to assign cheaper models to simpler steps:

```python
expensive_lm = dspy.LM("openai/gpt-4o")
cheap_lm = dspy.LM("openai/gpt-4o-mini")

dspy.configure(lm=expensive_lm)  # default

class MyPipeline(dspy.Module):
    def __init__(self):
        self.classify = dspy.ChainOfThought(ClassifySignature)
        self.generate = dspy.ChainOfThought(GenerateSignature)

    def forward(self, text):
        # Use cheap model for simple classification
        with dspy.context(lm=cheap_lm):
            category = self.classify(text=text)
        # Use expensive model only for complex generation
        return self.generate(text=text, category=category.label)
```
Per-module LM assignment
```python
# Set LM on specific modules permanently
my_program.classify.lm = cheap_lm
my_program.generate.lm = expensive_lm
```
Step 4: Smart routing — cheap model for easy inputs, expensive for hard ones
Instead of sending everything to the expensive model, classify inputs by difficulty and route accordingly. This is the pattern behind FrugalGPT (up to 90% cost savings matching GPT-4 quality):
Route by complexity
```python
from typing import Literal

class AssessComplexity(dspy.Signature):
    """Assess if this question needs a powerful model or a simple one can handle it."""
    question: str = dspy.InputField()
    complexity: Literal["simple", "complex"] = dspy.OutputField(
        desc="simple = factual/straightforward, complex = reasoning/nuanced"
    )

class ComplexityRouter(dspy.Module):
    def __init__(self):
        self.assess = dspy.Predict(AssessComplexity)
        self.simple_handler = dspy.Predict(AnswerQuestion)
        self.complex_handler = dspy.ChainOfThought(AnswerQuestion)

    def forward(self, question):
        # Use the cheap model to decide complexity
        with dspy.context(lm=cheap_lm):
            assessment = self.assess(question=question)
        # Route to the right model
        if assessment.complexity == "simple":
            with dspy.context(lm=cheap_lm):
                return self.simple_handler(question=question)
        else:
            with dspy.context(lm=expensive_lm):
                return self.complex_handler(question=question)
```
Cascading — try cheap first, fall back to expensive
```python
class CheckConfidence(dspy.Signature):
    """Is this answer confident and complete, or should we escalate to a better model?"""
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    is_confident: bool = dspy.OutputField()

class CascadingPipeline(dspy.Module):
    def __init__(self):
        self.answer = dspy.ChainOfThought(AnswerQuestion)
        self.verify = dspy.Predict(CheckConfidence)

    def forward(self, question):
        # Try cheap model first
        with dspy.context(lm=cheap_lm):
            result = self.answer(question=question)
            check = self.verify(question=question, answer=result.answer)
        # If cheap model isn't confident, escalate to expensive
        if not check.is_confident:
            with dspy.context(lm=expensive_lm):
                result = self.answer(question=question)
        return result
```

**Typical savings:** 50-90% cost reduction. Most real-world traffic is simple questions that a cheap model handles fine.
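The 50-90% range is easy to sanity-check with arithmetic. Assuming, purely for illustration, that 80% of traffic is simple and the cheap model is ~33x cheaper per call:

```python
def blended_cost(simple_frac: float, cheap: float, expensive: float) -> float:
    """Average per-call cost when simple inputs are routed to the cheap model."""
    return simple_frac * cheap + (1 - simple_frac) * expensive

expensive = 1.0   # normalize the expensive model's per-call cost to 1
cheap = 1 / 33    # ~33x cheaper, as with gpt-4o-mini vs gpt-4o input pricing
blended = blended_cost(0.8, cheap, expensive)
print(f"{1 - blended:.0%} saved vs. sending everything to the expensive model")  # 78% saved
```

The savings are dominated by the simple-traffic fraction: at 95% simple traffic the same formula gives roughly 92%, which is where the upper end of the quoted range comes from.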
Step 5: Reduce prompt length
Long prompts = more tokens = more cost.
Reduce few-shot examples
```python
# Fewer demos = shorter prompts = lower cost
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=2,  # down from 4
    max_labeled_demos=2,       # down from 4
)
```
Reduce retrieved passages
```python
# Fewer passages = shorter context
class DocSearch(dspy.Module):
    def __init__(self):
        self.retrieve = dspy.Retrieve(k=2)  # down from 5
        self.answer = dspy.ChainOfThought(AnswerSignature)
```
Simplify signatures
```python
# Verbose — costs more tokens
class Verbose(dspy.Signature):
    """Given the following text, carefully analyze the content and provide a detailed classification."""
    text: str = dspy.InputField(desc="The full text content to be analyzed and classified")
    label: str = dspy.OutputField(desc="The classification label for this text")

# Concise — same quality, fewer tokens
class Concise(dspy.Signature):
    """Classify the text."""
    text: str = dspy.InputField()
    label: str = dspy.OutputField()
```
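Rough arithmetic for what demo-trimming buys. Per-demo token counts vary widely by task, so the numbers below are purely illustrative:

```python
def prompt_tokens(base: int, demos: int, tokens_per_demo: int) -> int:
    """Prompt size = fixed instructions + few-shot demos."""
    return base + demos * tokens_per_demo

before = prompt_tokens(200, 4, 250)  # 4 demos of ~250 tokens: 1,200 tokens
after = prompt_tokens(200, 2, 250)   # trimmed to 2 demos: 700 tokens
print(f"{1 - after / before:.0%} fewer input tokens on every call")  # 42% fewer
```

Because the saving applies to every call, even a modest per-prompt reduction compounds across monthly traffic.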
Step 6: Fine-tune a cheap model (advanced)
The biggest cost saver: train a small, cheap model to do what the expensive model does. Distill from an expensive teacher to a cheap student:

```python
# Build and optimize with the expensive model, then fine-tune a cheap one
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(my_program, trainset=trainset, teacher=teacher_optimized)
```

**Requirements:** 500+ training examples, a fine-tunable model.

**Typical savings:** 10-50x cost reduction with 85-95% quality retention.

For the complete model distillation workflow (decision framework, prerequisites, BetterTogether, troubleshooting), see `/ai-fine-tuning`.
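Whether distillation is worth the setup effort comes down to break-even arithmetic. A sketch with entirely hypothetical numbers (your teacher-model spend, cost ratio, and fine-tuning bill will differ):

```python
def breakeven_months(teacher_monthly: float, cost_ratio: float, finetune_cost: float) -> float:
    """Months until a one-time fine-tuning spend is repaid by cheaper serving."""
    monthly_savings = teacher_monthly * (1 - 1 / cost_ratio)
    return finetune_cost / monthly_savings

# Hypothetical: $3,000/month on the teacher, a 20x cheaper student,
# $500 one-time fine-tuning spend
months = breakeven_months(3_000, 20, 500)
print(f"pays for itself in about {months:.2f} months")  # about 0.18 months
```

At high monthly spend the one-time cost is repaid almost immediately; at low spend the engineering effort, not the tuning bill, is usually the real constraint.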
Step 7: Use `Predict` instead of `ChainOfThought` where possible

`ChainOfThought` spends extra tokens on reasoning before every answer; `Predict` skips that step:

```python
# ChainOfThought — more tokens, better for complex tasks
classifier = dspy.ChainOfThought(ClassifySignature)

# Predict — fewer tokens, fine for simple tasks
classifier = dspy.Predict(ClassifySignature)
```

Test with `/ai-improving-accuracy` to make sure quality doesn't drop.

Cost reduction checklist
- Switch to a cheaper model (measure quality first)
- Verify caching is enabled
- Use cheap models for simple steps, expensive for complex
- Route easy inputs to cheap models, hard ones to expensive (Step 4)
- Reduce few-shot examples (2 instead of 4)
- Reduce retrieved passages
- Use `Predict` instead of `ChainOfThought` for simple tasks
- Fine-tune a cheap model for production (if 500+ examples available)
Additional resources
- Use `/ai-building-pipelines` to design multi-step systems with per-stage model assignment
- Use `/ai-improving-accuracy` to make sure quality holds after cost cuts
- Use `/ai-fixing-errors` if things break during cost optimization