llm-gateway-routing


LLM Gateway & Routing

Configure multi-model access, fallbacks, cost optimization, and A/B testing.

Why Use a Gateway?


Without gateway:
  • Vendor lock-in (one provider)
  • No fallbacks (provider down = app down)
  • Hard to A/B test models
  • Scattered API keys and configs
With gateway:
  • Single API for 400+ models
  • Automatic fallbacks
  • Easy model switching
  • Unified cost tracking

Quick Decision


Need                          Solution
Fastest setup, multi-model    OpenRouter
Full control, self-hosted     LiteLLM
Observability + routing       Helicone
Enterprise, guardrails        Portkey

OpenRouter (Recommended)


Why OpenRouter


  • 400+ models: OpenAI, Anthropic, Google, Meta, Mistral, and more
  • Single API: One key for all providers
  • Automatic fallbacks: Built-in reliability
  • A/B testing: Easy model comparison
  • Cost tracking: Unified billing dashboard
  • Free credits: $1 free to start
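The automatic fallbacks mentioned above also work at the request level: OpenRouter's API documents a `models` array that the router tries in order when an earlier entry fails. A minimal sketch of such a request body (the `models` field is from OpenRouter's docs — verify against the current API reference before relying on it):

```typescript
// Fallback list inside the request itself: OpenRouter tries each
// entry of `models` in order until one succeeds.
const body = {
  models: [
    "anthropic/claude-3-5-sonnet",  // primary
    "openai/gpt-4o",                // fallback 1
    "google/gemini-pro-1.5",        // fallback 2
  ],
  messages: [{ role: "user", content: "Hello!" }],
};

// POST this as JSON to https://openrouter.ai/api/v1/chat/completions
// with the usual Authorization header (see Basic Usage below).
const payload = JSON.stringify(body);
```

The Fallback Chains section below shows the client-side equivalent, which gives you more control over logging and retry behavior.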

Setup



1. Sign up at openrouter.ai


2. Get API key from dashboard


3. Add to .env:


bash
OPENROUTER_API_KEY=sk-or-v1-...

Basic Usage


typescript
// Using fetch
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-3-5-sonnet',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});

With Vercel AI SDK (Recommended)


typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});

Model IDs


typescript
// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",

  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",

  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",

  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",

  // Auto (OpenRouter picks best)
  auto: "openrouter/auto",
};

Fallback Chains


typescript
// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet",   // Primary
  "openai/gpt-4o",                  // Fallback 1
  "google/gemini-pro-1.5",          // Fallback 2
];

// `openrouter.chat` stands in for whatever gateway client you use;
// `Message` is your application's chat message type.
async function callWithFallback(messages: Message[]) {
  for (const model of modelChain) {
    try {
      return await openrouter.chat({ model, messages });
    } catch (error) {
      console.warn(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}

Cost Routing


typescript
// Route based on query complexity
function selectModel(query: string): string {
  const complexity = analyzeComplexity(query);

  if (complexity === "simple") {
    // Simple queries → cheap model
    return "openai/gpt-4o-mini";  // ~$0.15/1M tokens
  } else if (complexity === "medium") {
    // Medium → balanced
    return "google/gemini-flash-1.5";  // ~$0.075/1M tokens
  } else {
    // Complex → best quality
    return "anthropic/claude-3-5-sonnet";  // ~$3/1M tokens
  }
}

function analyzeComplexity(query: string): "simple" | "medium" | "complex" {
  // Simple heuristics
  if (query.length < 50) return "simple";
  if (query.includes("explain") || query.includes("analyze")) return "complex";
  return "medium";
}

A/B Testing


typescript
// Deterministic assignment: hash the full user ID so the 50/50 split
// isn't decided by the first character alone.
function getModel(userId: string): string {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) % 100;
  }

  if (hash < 50) {
    return "anthropic/claude-3-5-sonnet";  // ~50%
  } else {
    return "openai/gpt-4o";  // ~50%
  }
}

// Track which model was used (latency and cost measured by the caller)
const model = getModel(userId);
const response = await openrouter.chat({ model, messages });
await analytics.track("llm_call", { model, userId, latency, cost });

LiteLLM (Self-Hosted)


Why LiteLLM


  • Self-hosted: Full control over data
  • 100+ providers: Same coverage as OpenRouter
  • Load balancing: Distribute across providers
  • Cost tracking: Built-in spend management
  • Caching: Redis or in-memory
  • Rate limiting: Per-user limits

Setup



Install

bash
pip install litellm[proxy]

Run proxy

bash
litellm --config config.yaml

Use as OpenAI-compatible endpoint

bash
export OPENAI_API_BASE=http://localhost:4000

Configuration


yaml
# config.yaml
model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced (multiple providers): entries sharing a model_name
  # have requests distributed across them
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...

# Routing
router_settings:
  routing_strategy: simple-shuffle  # or latency-based-routing
  num_retries: 3
  timeout: 30

# Rate limiting
litellm_settings:
  max_budget: 100  # $100/month
  budget_duration: monthly

Fallbacks in LiteLLM


yaml
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
    fallbacks:
      - model_name: fallback-1
        litellm_params:
          model: openai/gpt-4o
      - model_name: fallback-2
        litellm_params:
          model: google/gemini-pro

Usage


typescript
// Use like OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet",  // Maps to configured model
  messages: [{ role: "user", content: "Hello!" }],
});

Routing Strategies


1. Cost-Based Routing


typescript
const costTiers = {
  cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"],
  balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"],
  premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"],
};

function routeByCost(budget: "cheap" | "balanced" | "premium"): string {
  const models = costTiers[budget];
  return models[Math.floor(Math.random() * models.length)];
}

2. Latency-Based Routing


typescript
// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(defaultModel = "openai/gpt-4o-mini"): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);

  // Until we have samples, fall back to a default
  if (avgLatencies.length === 0) return defaultModel;
  return avgLatencies[0].model;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}

3. Task-Based Routing


typescript
const taskModels = {
  coding: "anthropic/claude-3-5-sonnet",  // Best for code
  reasoning: "openai/o1-preview",          // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini",            // Cheap and fast
  multimodal: "google/gemini-pro-1.5",     // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}

4. Hybrid Routing


typescript
interface RoutingConfig {
  task: string;
  maxCost: number;     // budget ceiling per call
  maxLatency: number;  // ms
}

// Assumes a `models` registry of { id, cost, avgLatency } entries and a
// `getTaskScore(modelId, task)` quality score defined elsewhere.
function hybridRoute(config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter(m => m.cost <= config.maxCost);

  // Filter by latency
  const fast = affordable.filter(m => m.avgLatency <= config.maxLatency);

  // Select best for task
  const taskScores = fast.map(m => ({
    model: m.id,
    score: getTaskScore(m.id, config.task),
  }));

  if (taskScores.length === 0) {
    throw new Error("No model satisfies the cost/latency constraints");
  }
  return taskScores.sort((a, b) => b.score - a.score)[0].model;
}

Best Practices


1. Always Have Fallbacks


typescript
// Bad: single point of failure
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: fallback chain
async function chatWithFallback(messages: Message[]) {
  const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
  for (const model of models) {
    try {
      return await gateway.chat({ model, messages });
    } catch (e) {
      continue;  // try the next model
    }
  }
  throw new Error("All models failed");
}

2. Pin Model Versions


typescript
// Bad: Model can change
const model = "gpt-4";

// Good: Pinned version
const model = "openai/gpt-4-0125-preview";

3. Track Costs


typescript
// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;

  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });

  return response;
}
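The `calculateCost` helper above isn't defined in this guide. A minimal sketch with a hypothetical price table — the per-token numbers below are illustrative only, since provider pricing changes often:

```typescript
// Hypothetical prices in $ per 1M tokens; check current provider
// pricing before relying on these numbers.
const PRICES: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3, output: 15 },
};

interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

function calculateCost(model: string, usage: Usage): number {
  const price = PRICES[model];
  if (!price) return 0; // unknown model: report zero rather than guess
  return (
    (usage.prompt_tokens / 1_000_000) * price.input +
    (usage.completion_tokens / 1_000_000) * price.output
  );
}
```

If you use a gateway like OpenRouter or LiteLLM, prefer the cost figures from its usage/billing API over a hand-maintained table.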

4. Set Token Limits


typescript
// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500,  // Limit output length
});

5. Use Caching


yaml
# LiteLLM caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour

References


  • references/openrouter-guide.md
    - OpenRouter deep dive
  • references/litellm-guide.md
    - LiteLLM self-hosting
  • references/routing-strategies.md
    - Advanced routing patterns
  • references/alternatives.md
    - Helicone, Portkey, etc.

Templates


  • templates/openrouter-config.ts
    - TypeScript OpenRouter setup
  • templates/litellm-config.yaml
    - LiteLLM proxy config
  • templates/fallback-chain.ts
    - Fallback implementation