ai-cutting-costs


Cut Your AI Costs


Guide the user through reducing AI API costs without sacrificing quality. Multiple strategies, from quick wins to advanced techniques.

Step 1: Understand where the money goes


Ask the user:
  1. Which provider/model are you using? (GPT-4o, Claude, etc.)
  2. How many API calls per day/month?
  3. Is there a specific module or step that's most expensive?
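A back-of-envelope estimate often makes the answers to these questions concrete. A minimal sketch (the call volumes and $/1M-token prices below are illustrative, not current list prices):

```python
def monthly_cost(calls_per_day, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m, days=30):
    """Estimate monthly API spend from per-call token counts and $/1M-token prices."""
    per_call = (input_tokens * input_price_per_m
                + output_tokens * output_price_per_m) / 1_000_000
    return calls_per_day * days * per_call

# e.g. 10,000 calls/day, 1,500 input + 300 output tokens per call,
# at $5/M input and $15/M output:
print(round(monthly_cost(10_000, 1_500, 300, 5.00, 15.00), 2))  # → 3600.0
```

Running the numbers this way usually reveals which variable (call volume, prompt length, or price per token) dominates, which tells you which of the steps below to try first.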

Quick cost audit


```python
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Run your program and check token usage
result = my_program(question="test")
dspy.inspect_history(n=3)  # shows token counts per call
```
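To total usage across a whole run rather than eyeballing individual calls, you can sum over the LM's call history. This sketch assumes each history entry carries an OpenAI-style `usage` dict with `prompt_tokens` and `completion_tokens` (the shape recorded in `lm.history` for OpenAI-compatible providers; verify the exact keys against your provider):

```python
def summarize_usage(history):
    """Sum prompt and completion tokens across a list of LM call records."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for call in history:
        usage = call.get("usage", {}) or {}
        totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
        totals["completion_tokens"] += usage.get("completion_tokens", 0)
    return totals

# With a real LM: summarize_usage(lm.history)
print(summarize_usage([
    {"usage": {"prompt_tokens": 120, "completion_tokens": 40}},
    {"usage": {"prompt_tokens": 95, "completion_tokens": 33}},
]))  # → {'prompt_tokens': 215, 'completion_tokens': 73}
```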

Step 2: Quick wins


Use a cheaper model everywhere


The simplest fix: switch to a cheaper model and see if quality holds:

```python
# Instead of GPT-4o (~$5/M input tokens):
lm = dspy.LM("openai/gpt-4o-mini")  # ~$0.15/M input tokens, 33x cheaper

# Or use an open-source model:
lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
```

Always measure quality before and after with `/ai-improving-accuracy`. When you switch models, re-optimize your prompts; they don't transfer. See `/ai-switching-models` for the full workflow.

Enable caching


DSPy caches LM calls by default. Make sure you're not disabling it:

```python
# Caching is ON by default; same inputs won't re-call the API
lm = dspy.LM("openai/gpt-4o-mini")  # cached automatically

# To verify caching is working, run the same input twice
# and check that the second call is instant
```
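One way to check the cache in practice is to time two identical calls. A minimal sketch of such a timing helper (`my_program` stands in for your own dspy module):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# First call hits the API; the cached second call should be near-instant:
# _, cold = timed(my_program, question="What is DSPy?")
# _, warm = timed(my_program, question="What is DSPy?")
# print(f"cold={cold:.2f}s warm={warm:.4f}s")
```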

Step 3: Use different models for different tasks


Not every step in your pipeline needs the expensive model. Use `dspy.context` or `set_lm` to assign cheaper models to simpler steps:

```python
expensive_lm = dspy.LM("openai/gpt-4o")
cheap_lm = dspy.LM("openai/gpt-4o-mini")

dspy.configure(lm=expensive_lm)  # default

class MyPipeline(dspy.Module):
    def __init__(self):
        self.classify = dspy.ChainOfThought(ClassifySignature)
        self.generate = dspy.ChainOfThought(GenerateSignature)

    def forward(self, text):
        # Use the cheap model for simple classification
        with dspy.context(lm=cheap_lm):
            category = self.classify(text=text)

        # Use the expensive model only for complex generation
        return self.generate(text=text, category=category.label)
```

Per-module LM assignment


```python
# Set LM on specific modules permanently
my_program.classify.lm = cheap_lm
my_program.generate.lm = expensive_lm
```

Step 4: Smart routing — cheap model for easy inputs, expensive for hard ones


Instead of sending everything to the expensive model, classify inputs by difficulty and route accordingly. This is the pattern behind FrugalGPT (up to 90% cost savings matching GPT-4 quality):

Route by complexity


```python
from typing import Literal

class AssessComplexity(dspy.Signature):
    """Assess whether this question needs a powerful model or a simple one can handle it."""
    question: str = dspy.InputField()
    complexity: Literal["simple", "complex"] = dspy.OutputField(
        desc="simple = factual/straightforward, complex = reasoning/nuanced"
    )

class ComplexityRouter(dspy.Module):
    def __init__(self):
        self.assess = dspy.Predict(AssessComplexity)
        self.simple_handler = dspy.Predict(AnswerQuestion)
        self.complex_handler = dspy.ChainOfThought(AnswerQuestion)

    def forward(self, question):
        # Use the cheap model to decide complexity
        with dspy.context(lm=cheap_lm):
            assessment = self.assess(question=question)

        # Route to the right model
        if assessment.complexity == "simple":
            with dspy.context(lm=cheap_lm):
                return self.simple_handler(question=question)
        else:
            with dspy.context(lm=expensive_lm):
                return self.complex_handler(question=question)
```

Cascading — try cheap first, fall back to expensive


```python
class CheckConfidence(dspy.Signature):
    """Is this answer confident and complete, or should we escalate to a better model?"""
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    is_confident: bool = dspy.OutputField()

class CascadingPipeline(dspy.Module):
    def __init__(self):
        self.answer = dspy.ChainOfThought(AnswerQuestion)
        self.verify = dspy.Predict(CheckConfidence)

    def forward(self, question):
        # Try the cheap model first
        with dspy.context(lm=cheap_lm):
            result = self.answer(question=question)
            check = self.verify(question=question, answer=result.answer)

        # If the cheap model isn't confident, escalate to the expensive one
        if not check.is_confident:
            with dspy.context(lm=expensive_lm):
                result = self.answer(question=question)

        return result
```

**Typical savings:** 50-90% cost reduction. Most real-world traffic is simple questions that a cheap model handles fine.
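The arithmetic behind those savings: every call pays for the cheap attempt, and only escalated calls also pay for the expensive model. A quick sketch with illustrative per-call costs (not real prices):

```python
def cascade_cost(p_escalate, cheap_per_call, expensive_per_call):
    """Expected per-call cost of a cascade: the cheap model always runs,
    and a fraction p_escalate of calls also hit the expensive model."""
    return cheap_per_call + p_escalate * expensive_per_call

cheap, expensive = 0.0005, 0.015  # illustrative $/call
cascade = cascade_cost(0.2, cheap, expensive)  # 20% of traffic escalates
print(f"{1 - cascade / expensive:.0%} saved vs. always-expensive")  # → 77% saved vs. always-expensive
```

The savings grow with the fraction of traffic the cheap model can handle on its own, which is why measuring your real escalation rate matters before committing to a cascade.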

Step 5: Reduce prompt length


Long prompts = more tokens = more cost.
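For a quick sense of what a prompt costs before making any API call, a crude character-based estimate is enough (roughly 4 characters per token for English text; use the provider's tokenizer, e.g. tiktoken, for exact counts):

```python
def approx_tokens(text):
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

verbose = "Given the following text, carefully analyze the content and provide a detailed classification."
concise = "Classify the text."
print(approx_tokens(verbose), approx_tokens(concise))  # → 23 4
```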

Reduce few-shot examples


```python
# Fewer demos = shorter prompts = lower cost
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=2,  # down from 4
    max_labeled_demos=2,       # down from 4
)
```

Reduce retrieved passages


```python
# Fewer passages = shorter context
class DocSearch(dspy.Module):
    def __init__(self):
        self.retrieve = dspy.Retrieve(k=2)  # down from 5
        self.answer = dspy.ChainOfThought(AnswerSignature)
```

Simplify signatures


```python
# Verbose: costs more tokens
class Verbose(dspy.Signature):
    """Given the following text, carefully analyze the content and provide a detailed classification."""
    text: str = dspy.InputField(desc="The full text content to be analyzed and classified")
    label: str = dspy.OutputField(desc="The classification label for this text")

# Concise: same quality, fewer tokens
class Concise(dspy.Signature):
    """Classify the text."""
    text: str = dspy.InputField()
    label: str = dspy.OutputField()
```

Step 6: Fine-tune a cheap model (advanced)


The biggest cost saver: train a small, cheap model to do what the expensive model does. Distill from an expensive teacher into a cheap student:

```python
# Build and optimize with the expensive model, then fine-tune a cheap one
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(my_program, trainset=trainset, teacher=teacher_optimized)
```

**Requirements:** 500+ training examples, a fine-tunable model.
**Typical savings:** 10-50x cost reduction with 85-95% quality retention.

For the complete model distillation workflow (decision framework, prerequisites, BetterTogether, troubleshooting), see `/ai-fine-tuning`.
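Whether fine-tuning pays off depends on call volume. A hypothetical break-even sketch (the dollar figures are illustrative, not real prices):

```python
def breakeven_calls(finetune_cost, cost_per_call_before, cost_per_call_after):
    """Number of calls after which a one-off fine-tuning spend pays for itself."""
    return finetune_cost / (cost_per_call_before - cost_per_call_after)

# e.g. a $200 fine-tuning job, $0.010/call before vs $0.0005/call after:
print(round(breakeven_calls(200, 0.010, 0.0005)))  # → 21053
```

At high daily volumes the break-even point arrives within days, which is why this is worth the setup effort mainly for production workloads.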

Step 7: Use `Predict` instead of `ChainOfThought` where possible

`ChainOfThought` adds a reasoning step, which uses extra tokens. For simple tasks, `Predict` may be sufficient:

```python
# ChainOfThought: more tokens, better for complex tasks
classifier = dspy.ChainOfThought(ClassifySignature)

# Predict: fewer tokens, fine for simple tasks
classifier = dspy.Predict(ClassifySignature)
```

Test with `/ai-improving-accuracy` to make sure quality doesn't drop.

Cost reduction checklist


  1. Switch to a cheaper model (measure quality first)
  2. Verify caching is enabled
  3. Use cheap models for simple steps, expensive for complex
  4. Route easy inputs to cheap models, hard ones to expensive (Step 4)
  5. Reduce few-shot examples (2 instead of 4)
  6. Reduce retrieved passages
  7. Use `Predict` instead of `ChainOfThought` for simple tasks
  8. Fine-tune a cheap model for production (if 500+ examples available)

Additional resources


  • Use `/ai-building-pipelines` to design multi-step systems with per-stage model assignment
  • Use `/ai-improving-accuracy` to make sure quality holds after cost cuts
  • Use `/ai-fixing-errors` if things break during cost optimization