llm-app-development


LLM Application Development


Overview


This skill covers the full spectrum of building applications powered by large language models. It addresses retrieval-augmented generation (RAG) pipelines, vector database integration, prompt engineering techniques, structured output generation, tool use and agentic patterns, evaluation frameworks, cost optimization, and streaming response handling.
Use this skill when building chatbots, AI assistants, knowledge retrieval systems, content generation tools, autonomous agents, or any application that integrates LLM capabilities into its core functionality.


Core Principles


  1. Retrieval over memorization - Use RAG to ground LLM responses in real data rather than relying on model parametric memory. This reduces hallucination and keeps answers current.
  2. Structured I/O boundaries - Define strict schemas for both inputs (system prompts, context) and outputs (typed responses via function calling or Zod schemas). Never trust raw LLM text for downstream logic.
  3. Evaluate before shipping - Every LLM feature needs automated evaluation. Model outputs are non-deterministic; without eval, you cannot measure regressions or improvements.
  4. Cost-aware architecture - Token usage drives cost. Cache aggressively, choose the smallest model that meets quality requirements, and batch where possible.
  5. Fail gracefully - LLMs can refuse, hallucinate, or timeout. Every call path needs fallback behavior, retry logic, and user-visible error states.

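Principle 5 can be made concrete with a small helper: retry with exponential backoff and jitter, surfacing the error only after all attempts fail. A minimal sketch — the `withRetry` name and defaults are illustrative, not from any library:

```typescript
// Retry an async operation with exponential backoff and full jitter.
// Intended to wrap LLM API calls, which fail transiently (rate limits, timeouts).
async function withRetry<T>(
  fn: () => Promise<T>,
  { attempts = 3, baseDelayMs = 500 }: { attempts?: number; baseDelayMs?: number } = {}
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Full jitter: sleep a random duration in [0, base * 2^i)
        const delayMs = Math.random() * baseDelayMs * 2 ** i;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Callers wrap each LLM call (`withRetry(() => openai.chat.completions.create(...))`) and catch the final error to render a user-visible fallback state.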

Key Patterns


Pattern 1: RAG Pipeline Architecture


When to use: When the LLM needs access to private, large, or frequently updated knowledge that exceeds context window limits.
Implementation:
typescript
import { OpenAIEmbeddings } from "@langchain/openai";
import { PGVectorStore } from "@langchain/community/vectorstores/pgvector";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// 1. Chunking - Split documents into retrieval-friendly segments
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ["\n\n", "\n", ". ", " "],
});

const chunks = await splitter.splitDocuments(documents);

// 2. Embedding - Convert chunks to vectors
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
  dimensions: 1536,
});

// 3. Storage - Index in vector database
const vectorStore = await PGVectorStore.fromDocuments(chunks, embeddings, {
  postgresConnectionOptions: {
    connectionString: process.env.DATABASE_URL,
  },
  tableName: "documents",
  columns: {
    idColumnName: "id",
    vectorColumnName: "embedding",
    contentColumnName: "content",
    metadataColumnName: "metadata",
  },
});

// 4. Retrieval - Find relevant context for a query
async function retrieve(query: string, k: number = 5) {
  const results = await vectorStore.similaritySearchWithScore(query, k);

  // Filter by relevance threshold
  return results
    .filter(([_, score]) => score > 0.7)
    .map(([doc]) => doc);
}

// 5. Generation - Augment prompt with retrieved context
async function generateAnswer(query: string): Promise<string> {
  const context = await retrieve(query);

  const contextText = context
    .map((doc) => doc.pageContent)
    .join("\n---\n");

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Answer questions using ONLY the provided context. If the context doesn't contain the answer, say "I don't have information about that."

Context:
${contextText}`,
      },
      { role: "user", content: query },
    ],
    temperature: 0.1,
  });

  return response.choices[0].message.content ?? "";
}
Why: RAG separates knowledge storage from reasoning. The LLM focuses on synthesizing answers while the vector database handles retrieval at scale. This architecture supports updating knowledge without retraining and keeps token costs manageable by only including relevant context.

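Retrieved chunks can still overflow a token budget when `k` is large or chunks are long. A rough guard, using the ~4-characters-per-token heuristic (a real tokenizer such as tiktoken is more accurate; `capContext` is an illustrative helper, not part of the pipeline above):

```typescript
// Keep retrieved chunks, in relevance order, until an approximate token
// budget is reached. Uses the rough 4-chars-per-token heuristic.
function capContext(chunks: string[], maxTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const approxTokens = Math.ceil(chunk.length / 4);
    if (used + approxTokens > maxTokens) break;
    kept.push(chunk);
    used += approxTokens;
  }
  return kept;
}
```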

Pattern 2: Structured Output with Zod Schemas


When to use: When LLM output feeds into downstream logic, APIs, or UI rendering that requires typed, validated data.
Implementation:
typescript
import { z } from "zod";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

// Define the output schema
const ProductReviewAnalysis = z.object({
  sentiment: z.enum(["positive", "negative", "neutral", "mixed"]),
  confidence: z.number().min(0).max(1),
  themes: z.array(z.object({
    name: z.string(),
    sentiment: z.enum(["positive", "negative", "neutral"]),
    mentions: z.number(),
  })),
  summary: z.string().max(200),
  actionItems: z.array(z.string()),
});

type ProductReviewAnalysis = z.infer<typeof ProductReviewAnalysis>;

const openai = new OpenAI();

async function analyzeReviews(
  reviews: string[]
): Promise<ProductReviewAnalysis> {
  const response = await openai.beta.chat.completions.parse({
    model: "gpt-4o-2024-08-06",
    messages: [
      {
        role: "system",
        content: "Analyze product reviews and extract structured insights.",
      },
      {
        role: "user",
        content: `Analyze these reviews:\n${reviews.join("\n---\n")}`,
      },
    ],
    response_format: zodResponseFormat(ProductReviewAnalysis, "review_analysis"),
  });

  const parsed = response.choices[0].message.parsed;
  if (!parsed) {
    throw new Error("Failed to parse structured output");
  }
  return parsed;
}
Why: Structured outputs eliminate brittle regex parsing of LLM text. The model is constrained to produce valid JSON matching your schema, giving you type-safe data for rendering UI components, storing in databases, or passing to other services.

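For providers without native structured outputs, a defensive parse of raw model text is a common fallback. A minimal dependency-free sketch — a Zod `safeParse` on the result would be stricter, and `parseModelJson` is a hypothetical helper:

```typescript
// Fallback for providers without native structured outputs: strip markdown
// fences the model may add despite instructions, parse, and check required
// keys before trusting the result.
function parseModelJson<T extends object>(
  raw: string,
  requiredKeys: (keyof T)[]
): T {
  const stripped = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/\s*```$/, "")
    .trim();
  const parsed = JSON.parse(stripped) as T;
  for (const key of requiredKeys) {
    if (!(key in parsed)) {
      throw new Error(`Missing required key: ${String(key)}`);
    }
  }
  return parsed;
}
```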

Pattern 3: Tool Use and AI Agents


When to use: When the LLM needs to take actions (search, calculate, call APIs, modify data) rather than just generate text.
Implementation:
typescript
import OpenAI from "openai";

const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "search_knowledge_base",
      description: "Search the internal knowledge base for relevant articles",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "Search query" },
          category: {
            type: "string",
            enum: ["billing", "technical", "account"],
            description: "Category to filter by",
          },
        },
        required: ["query"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "create_support_ticket",
      description: "Create a support ticket for unresolved issues",
      parameters: {
        type: "object",
        properties: {
          title: { type: "string" },
          description: { type: "string" },
          priority: { type: "string", enum: ["low", "medium", "high", "urgent"] },
        },
        required: ["title", "description", "priority"],
      },
    },
  },
];

// Tool implementations
const toolHandlers: Record<string, (args: unknown) => Promise<string>> = {
  search_knowledge_base: async (args) => {
    const { query, category } = args as { query: string; category?: string };
    const results = await knowledgeBase.search(query, { category });
    return JSON.stringify(results);
  },
  create_support_ticket: async (args) => {
    const ticket = args as { title: string; description: string; priority: string };
    const created = await ticketSystem.create(ticket);
    return JSON.stringify({ ticketId: created.id, status: "created" });
  },
};

// Agentic loop - let the model decide which tools to call
async function agentLoop(
  messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[]
): Promise<string> {
  const MAX_ITERATIONS = 10;

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
      tools,
      tool_choice: "auto",
    });

    const message = response.choices[0].message;
    messages.push(message);

    // If no tool calls, we have the final answer
    if (!message.tool_calls || message.tool_calls.length === 0) {
      return message.content ?? "";
    }

    // Execute each tool call
    for (const toolCall of message.tool_calls) {
      const handler = toolHandlers[toolCall.function.name];
      if (!handler) {
        throw new Error(`Unknown tool: ${toolCall.function.name}`);
      }

      const args = JSON.parse(toolCall.function.arguments);
      const result = await handler(args);

      messages.push({
        role: "tool",
        tool_call_id: toolCall.id,
        content: result,
      });
    }
  }

  throw new Error("Agent exceeded maximum iterations");
}
Why: Tool use transforms LLMs from text generators into actors that can query databases, call APIs, and orchestrate workflows. The agentic loop pattern gives the model autonomy to decide which tools to use and when, while the iteration limit prevents runaway execution.

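One fragile point in the loop above: it trusts `JSON.parse` on model-generated arguments, so malformed output aborts the whole conversation. A sketch of a defensive argument parser — `parseToolArgs` is a hypothetical helper, not part of the OpenAI SDK:

```typescript
// Parse model-generated tool arguments defensively: reject invalid JSON and
// missing required fields with errors the caller can feed back to the model.
function parseToolArgs(
  raw: string,
  required: string[]
): Record<string, unknown> {
  let args: Record<string, unknown>;
  try {
    args = JSON.parse(raw);
  } catch {
    throw new Error(`Tool arguments are not valid JSON: ${raw.slice(0, 80)}`);
  }
  const missing = required.filter((key) => !(key in args));
  if (missing.length > 0) {
    throw new Error(`Missing tool arguments: ${missing.join(", ")}`);
  }
  return args;
}
```

Instead of throwing out of the loop, the error message can be pushed back as the tool result, giving the model a chance to correct its call.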

Pattern 4: Semantic Caching for Cost Optimization


When to use: When similar queries are common and freshness requirements allow caching.
Implementation:
typescript
import { Redis } from "ioredis";
import { createHash } from "crypto";

interface CacheEntry {
  response: string;
  embedding: number[];
  createdAt: number;
}

class SemanticCache {
  private redis: Redis;
  private ttlSeconds: number;
  private similarityThreshold: number;

  constructor(options: {
    redisUrl: string;
    ttlSeconds?: number;
    similarityThreshold?: number;
  }) {
    this.redis = new Redis(options.redisUrl);
    this.ttlSeconds = options.ttlSeconds ?? 3600;
    this.similarityThreshold = options.similarityThreshold ?? 0.95;
  }

  // Exact match cache (fast, cheap)
  private exactKey(prompt: string): string {
    const hash = createHash("sha256").update(prompt).digest("hex");
    return `llm:exact:${hash}`;
  }

  // Check exact cache first, then semantic similarity
  async get(prompt: string): Promise<string | null> {
    // 1. Try exact match
    const exact = await this.redis.get(this.exactKey(prompt));
    if (exact) return exact;

    // 2. Try semantic match (more expensive, but catches paraphrases)
    const embedding = await getEmbedding(prompt);
    const candidates = await this.redis.smembers("llm:semantic:keys");

    for (const candidateKey of candidates) {
      const raw = await this.redis.get(`llm:semantic:${candidateKey}`);
      if (!raw) continue;

      const entry: CacheEntry = JSON.parse(raw);
      const similarity = cosineSimilarity(embedding, entry.embedding);

      if (similarity >= this.similarityThreshold) {
        return entry.response;
      }
    }

    return null;
  }

  async set(prompt: string, response: string): Promise<void> {
    const embedding = await getEmbedding(prompt);
    const key = createHash("sha256").update(prompt).digest("hex");

    // Store exact match
    await this.redis.setex(this.exactKey(prompt), this.ttlSeconds, response);

    // Store semantic entry
    const entry: CacheEntry = {
      response,
      embedding,
      createdAt: Date.now(),
    };
    await this.redis.setex(
      `llm:semantic:${key}`,
      this.ttlSeconds,
      JSON.stringify(entry)
    );
    await this.redis.sadd("llm:semantic:keys", key);
  }
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
Why: LLM API calls are expensive and slow. Semantic caching catches not just identical queries but paraphrased ones, dramatically reducing costs for applications with repetitive query patterns (FAQ bots, customer support, search).

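For local development and tests where Redis is unavailable, the exact-match tier can be swapped for an in-memory map with the same interface shape. A minimal sketch with an injectable clock so expiry is deterministic in tests (`MemoryExactCache` is an illustrative name):

```typescript
// In-memory exact-match cache tier with TTL, for local dev and unit tests.
// The clock is injectable so expiry can be tested without real waiting.
class MemoryExactCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(prompt: string): string | null {
    const entry = this.store.get(prompt);
    if (!entry) return null;
    if (this.now() > entry.expiresAt) {
      this.store.delete(prompt); // lazy expiry on read
      return null;
    }
    return entry.value;
  }

  set(prompt: string, value: string): void {
    this.store.set(prompt, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```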

Pattern 5: Streaming Responses


When to use: Any user-facing LLM interaction where perceived latency matters.
Implementation:
typescript
// Server - Next.js Route Handler with streaming
import { OpenAI } from "openai";
import { NextRequest } from "next/server";

export async function POST(req: NextRequest) {
  const { messages } = await req.json();
  const openai = new OpenAI();

  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    stream: true,
  });

  // Convert OpenAI stream to ReadableStream
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content ?? "";
        if (text) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
        }
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}

// Client - React hook for consuming streams
function useStreamingChat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = async (content: string) => {
    const userMessage = { role: "user" as const, content };
    const updatedMessages = [...messages, userMessage];
    setMessages(updatedMessages);
    setIsStreaming(true);

    const response = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ messages: updatedMessages }),
    });

    const reader = response.body?.getReader();
    const decoder = new TextDecoder();
    let assistantContent = "";

    setMessages((prev) => [...prev, { role: "assistant", content: "" }]);

    while (reader) {
      const { done, value } = await reader.read();
      if (done) break;

      const text = decoder.decode(value);
      const lines = text.split("\n").filter((line) => line.startsWith("data: "));

      for (const line of lines) {
        const data = line.slice(6);
        if (data === "[DONE]") break;

        const parsed = JSON.parse(data);
        assistantContent += parsed.text;

        setMessages((prev) => {
          const updated = [...prev];
          updated[updated.length - 1] = {
            role: "assistant",
            content: assistantContent,
          };
          return updated;
        });
      }
    }

    setIsStreaming(false);
  };

  return { messages, sendMessage, isStreaming };
}
Why: Without streaming, users stare at a blank screen for 2-10 seconds while the full response generates. Streaming renders tokens as they arrive, typically bringing time to first token under 500ms. This is table stakes for any user-facing LLM feature.

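One caveat in the client hook above: network chunks are not guaranteed to contain whole `data:` lines, so a line split across two reads would corrupt `JSON.parse`. A small incremental parser that buffers the trailing partial line fixes this (illustrative sketch; `SseLineParser` is a hypothetical helper):

```typescript
// Incremental SSE line parser: accumulates chunks, emits only complete
// `data:` payloads, and holds any trailing partial line for the next chunk.
class SseLineParser {
  private buffer = "";

  // Feed a decoded chunk; returns the payloads of completed `data:` lines.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line
    return lines
      .filter((line) => line.startsWith("data: "))
      .map((line) => line.slice(6));
  }
}
```

In the hook, `decoder.decode(value, { stream: true })` plus `parser.push(text)` replaces the naive `split("\n")`.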

Pattern 6: LLM Output Evaluation


When to use: Before deploying any LLM feature to production, and as ongoing regression testing.
Implementation:
typescript
import { evaluate } from "braintrust";

// Define evaluation dataset
const dataset = [
  {
    input: "What is our refund policy?",
    expected: "30-day money-back guarantee",
    tags: ["policy", "refund"],
  },
  {
    input: "How do I reset my password?",
    expected: "Go to Settings > Security > Reset Password",
    tags: ["account", "password"],
  },
];

// Custom scoring functions
function factualAccuracy(output: string, expected: string): number {
  const expectedFacts = expected.toLowerCase().split(/[,.]/).map((s) => s.trim());
  const matchedFacts = expectedFacts.filter((fact) =>
    output.toLowerCase().includes(fact)
  );
  return matchedFacts.length / expectedFacts.length;
}

function answerRelevance(output: string, input: string): number {
  // Use an LLM as a judge
  // Returns 0-1 score for how relevant the answer is to the question
  return llmJudge(input, output, "relevance");
}

// Run evaluation
const results = await evaluate("support-bot-v2", {
  data: dataset,
  task: async (input) => {
    return await generateAnswer(input.input);
  },
  scores: [
    {
      name: "factual_accuracy",
      scorer: (args) => factualAccuracy(args.output, args.expected),
    },
    {
      name: "answer_relevance",
      scorer: (args) => answerRelevance(args.output, args.input),
    },
    {
      name: "no_hallucination",
      scorer: (args) => {
        const hallucinationCheck = detectHallucination(args.output, args.context);
        return hallucinationCheck ? 0 : 1;
      },
    },
  ],
});

console.log(`Average factual accuracy: ${results.scores.factual_accuracy}`);
console.log(`Average relevance: ${results.scores.answer_relevance}`);
Why: LLM outputs are probabilistic. Without evaluation, you cannot measure whether prompt changes, model upgrades, or context modifications improve or degrade quality. Evals are the tests of LLM engineering.

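Eval scores are most useful when they gate deployment. A sketch of a CI check that fails the build when any average score drops below its threshold — `checkEvalGate` is a hypothetical helper, and the score shape matches the logging above:

```typescript
// Compare average eval scores against per-metric thresholds.
// Returns human-readable failure messages; an empty array means the gate passes.
function checkEvalGate(
  scores: Record<string, number>,
  thresholds: Record<string, number>
): string[] {
  const failures: string[] = [];
  for (const [name, min] of Object.entries(thresholds)) {
    const actual = scores[name];
    if (actual === undefined || actual < min) {
      failures.push(`${name}: ${actual ?? "missing"} < ${min}`);
    }
  }
  return failures;
}
```

In CI, a non-empty result exits non-zero, blocking the prompt or model change that caused the regression.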

Vector Database Selection Guide


| Database | Best For | Scaling | Managed | Local Dev |
|---|---|---|---|---|
| pgvector | Existing Postgres users, < 10M vectors | Vertical | Via Supabase/Neon | Docker |
| Pinecone | Serverless, large-scale production | Horizontal | Yes (managed only) | Mock client |
| Chroma | Prototyping, small datasets | Limited | Chroma Cloud | In-memory |
| Weaviate | Hybrid search (vector + BM25) | Horizontal | Yes + self-host | Docker |
| Qdrant | Performance-critical, filtering | Horizontal | Yes + self-host | Docker |


Prompt Engineering Quick Reference


| Technique | When to Use | Example Prefix |
|---|---|---|
| Zero-shot | Simple, well-defined tasks | "Classify this email as spam or not:" |
| Few-shot | Pattern demonstration needed | "Here are examples:\n..." |
| Chain-of-thought | Reasoning tasks | "Think step by step:" |
| System prompt | Persistent behavior rules | `role: "system"` message |
| Delimiters | Separating context from instructions | `"""context"""` or `<context>` tags |

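The few-shot and delimiter rows combine naturally when assembling prompts. A sketch — `buildFewShotPrompt` is a hypothetical helper, and the `<input>` tag choice is arbitrary:

```typescript
// Assemble a few-shot prompt: instruction, worked examples, then the query
// wrapped in delimiters so the model can't confuse it with the instructions.
function buildFewShotPrompt(
  instruction: string,
  examples: { input: string; output: string }[],
  query: string
): string {
  const shots = examples
    .map((ex) => `Input: ${ex.input}\nOutput: ${ex.output}`)
    .join("\n\n");
  return `${instruction}\n\nHere are examples:\n\n${shots}\n\n<input>\n${query}\n</input>`;
}
```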

Anti-Patterns


| Anti-Pattern | Why It's Bad | Better Approach |
|---|---|---|
| Stuffing entire documents into context | Wastes tokens, dilutes relevance | Use RAG with chunking and retrieval |
| Parsing LLM text output with regex | Brittle, breaks on format changes | Use structured outputs (function calling, Zod) |
| No evaluation before production | Cannot measure quality or catch regressions | Build an eval suite with scoring functions |
| Single huge prompt for everything | Hard to debug, expensive, slow | Decompose into smaller, focused prompts |
| Ignoring token costs during development | Surprise bills at scale | Track cost per query, set budgets, cache |
| Synchronous LLM calls in request path | Slow user experience, timeout risk | Stream responses; use background jobs for heavy work |
| Hardcoded model names everywhere | Painful to upgrade or A/B test | Config-driven model selection |
| No retry logic on API calls | Transient failures cause user-visible errors | Exponential backoff with jitter |


Checklist


  • RAG pipeline: chunking strategy chosen and tested (size, overlap, separators)
  • Vector database selected and connection pooling configured
  • Embedding model chosen (dimension, cost, quality tradeoff)
  • Structured output schemas defined for all LLM-to-code boundaries
  • Streaming implemented for user-facing responses
  • Evaluation suite with at least 20 test cases per feature
  • Semantic or exact caching for repeated queries
  • Error handling: retries, fallbacks, user-visible error states
  • Cost tracking: per-query token usage logged
  • Rate limiting on LLM API calls
  • System prompts versioned and stored outside code
  • Prompt injection defenses (input validation, output filtering)


Related Resources


  • Skills: application-security (prompt injection), monitoring-observability (LLM call tracing)
  • Rules: docs/reference/stacks/react-typescript.md (frontend patterns for chat UI)
  • Rules: docs/reference/checklists/verification-template.md (eval checklist)