llm-integration


LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in `rules/`, loaded on-demand.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 2 | HIGH | CoT, few-shot, versioning, DSPy optimization |

Total: 18 rules across 7 categories

Quick Start


Function calling: strict mode tool definition

```python
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"},
            },
            "required": ["query", "limit"],
            "additionalProperties": False,
        },
    },
}]
```

Streaming: SSE endpoint with FastAPI

```python
@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}
    return EventSourceResponse(generate())
```

Local inference: Ollama with LangChain

```python
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,
)
```

Fine-tuning: QLoRA with Unsloth

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```

Function Calling

Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results.
  • calling-tool-definition.md
    -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
  • calling-parallel.md
    -- Parallel tool execution, asyncio.gather, strict mode constraints
  • calling-validation.md
    -- Input validation, error handling, tool execution loops
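
The parallel-execution and error-as-result guidance above can be sketched with stdlib `asyncio` alone; `call_tool` here is a hypothetical dispatcher standing in for real tool implementations (production code would validate arguments with Pydantic as noted):

```python
import asyncio

async def call_tool(name: str, args: dict) -> dict:
    """Hypothetical tool dispatcher; real tools would do I/O here."""
    await asyncio.sleep(0)  # simulate an async call
    if "query" not in args:
        # Return the error as a tool result so the model can self-correct
        return {"tool": name, "error": "missing required field: query"}
    return {"tool": name, "result": f"hits for {args['query']!r}"}

async def execute_parallel(tool_calls: list[tuple[str, dict]]) -> list[dict]:
    """Run independent tool calls concurrently with asyncio.gather."""
    return await asyncio.gather(*(call_tool(n, a) for n, a in tool_calls))

results = asyncio.run(execute_parallel([
    ("search_documents", {"query": "vector db"}),
    ("search_documents", {}),  # invalid call: error comes back as data
]))
```

Because the failed call returns an error payload instead of raising, the whole batch completes and the model sees which call to retry.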

Streaming

Deliver LLM responses in real-time for better UX. Use SSE for web, WebSocket for bidirectional. Handle backpressure with bounded queues.
  • streaming-sse.md
    -- FastAPI SSE endpoints, frontend consumers, async iterators
  • streaming-structured.md
    -- Streaming with tool calls, partial JSON parsing, chunk accumulation
  • streaming-backpressure.md
    -- Backpressure handling, bounded buffers, cancellation
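
A minimal sketch of the bounded-queue pattern using only `asyncio.Queue`; the hard-coded token list stands in for a real model stream:

```python
import asyncio

async def producer(queue: asyncio.Queue, tokens: list[str]) -> None:
    """put() suspends when the bounded queue is full, pacing the producer."""
    for tok in tokens:
        await queue.put(tok)
    await queue.put(None)  # sentinel: stream finished

async def consume(tokens: list[str], maxsize: int = 8) -> list[str]:
    """Drain the stream through a queue that buffers at most maxsize tokens."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    task = asyncio.create_task(producer(queue, tokens))
    out: list[str] = []
    while (tok := await queue.get()) is not None:
        out.append(tok)
    await task  # surface any producer exception
    return out

received = asyncio.run(consume(["Hel", "lo", " wor", "ld"]))
```

The `maxsize` bound is what creates backpressure: a slow consumer makes `put()` wait instead of letting the buffer grow without limit.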

Local Inference

Run LLMs locally with Ollama for cost savings (93% cheaper than cloud), privacy, and offline development. Pre-warm models, and use a provider factory to switch between cloud and local backends.
  • local-ollama-setup.md
    -- Installation, model pulling, environment configuration
  • local-model-selection.md
    -- Model comparison by task, hardware profiles, quantization
  • local-gpu-optimization.md
    -- Apple Silicon tuning, keep-alive, CI integration
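
The provider-factory idea can be sketched as below; the `LLM_PROVIDER` environment variable, the `LLMConfig` fields, and the default model names are illustrative assumptions, not a fixed API:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMConfig:
    """Resolved provider settings (field names are illustrative)."""
    provider: str
    model: str
    base_url: Optional[str] = None

def make_llm_config(env: Optional[dict] = None) -> LLMConfig:
    """Factory: choose local Ollama or a cloud provider from one setting."""
    env = dict(os.environ) if env is None else env
    if env.get("LLM_PROVIDER", "local") == "local":
        return LLMConfig("ollama", "deepseek-r1:70b", "http://localhost:11434")
    return LLMConfig("cloud", env.get("LLM_MODEL", "gpt-4o"))

cfg = make_llm_config({"LLM_PROVIDER": "local"})
```

Defaulting to local keeps development and CI off metered APIs; flipping one variable routes the same code to a cloud model.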

Fine-Tuning

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.
  • tuning-lora.md
    -- LoRA/QLoRA configuration, Unsloth training, adapter merging
  • tuning-dataset-prep.md
    -- Synthetic data generation, quality validation, deduplication
  • tuning-evaluation.md
    -- DPO alignment, evaluation metrics, anti-patterns
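
As one small piece of the dataset-preparation step, exact duplicates can be dropped by hashing each prompt/completion pair; the field names `prompt` and `completion` are assumed:

```python
import hashlib

def dedup_examples(examples: list[dict]) -> list[dict]:
    """Drop exact-duplicate training pairs by hashing prompt + completion."""
    seen: set[str] = set()
    unique: list[dict] = []
    for ex in examples:
        key = hashlib.sha256(
            (ex["prompt"] + "\x00" + ex["completion"]).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"prompt": "Q1", "completion": "A1"},
    {"prompt": "Q1", "completion": "A1"},  # exact duplicate: dropped
    {"prompt": "Q2", "completion": "A2"},
]
clean = dedup_examples(data)
```

Near-duplicate detection (e.g. embedding similarity) is a separate, fuzzier step; exact-hash dedup is just the cheap first pass.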

Context Optimization

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.
  • context-window-management.md
    -- Five-layer architecture, anchored summarization, compression triggers
  • context-caching.md
    -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
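
The compression-trigger numbers from the Key Decisions table (compress at 70% utilization, down to a 50% target) reduce to two pure functions:

```python
def should_compress(used_tokens: int, window: int, trigger: float = 0.70) -> bool:
    """True once context utilization crosses the trigger (70% by default)."""
    return used_tokens / window >= trigger

def compression_target(window: int, target: float = 0.50) -> int:
    """Token count to compress down to (50% of the window by default)."""
    return int(window * target)
```

For a 128k window this fires at ~90k tokens and compresses back to 64k, leaving headroom before the next trigger.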

Evaluation

Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.
  • evaluation-metrics.md
    -- LLM-as-judge, RAGAS metrics, hallucination detection
  • evaluation-benchmarks.md
    -- Quality gates, batch evaluation, pairwise comparison
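
The quality-gate idea (0.7 production threshold) is independent of the judge model; the dimension names and scores below are made up, and in practice would come from an LLM-as-judge call:

```python
def quality_gate(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Pass only if every judged dimension clears the threshold."""
    return all(score >= threshold for score in scores.values())

# Scores would come from an LLM-as-judge call; these values are illustrative
judged = {"faithfulness": 0.92, "relevance": 0.81, "coherence": 0.65}
passed = quality_gate(judged)  # coherence is below 0.7, so the gate fails
```

Requiring every dimension to pass (rather than averaging) keeps one strong score from masking a failing one.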

Prompt Engineering

Design, version, and optimize prompts for production LLM applications.
  • prompt-design.md
    -- Chain-of-Thought, few-shot learning, pattern selection guide
  • prompt-testing.md
    -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
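
A minimal few-shot prompt assembler illustrating the 3-5 example guideline; the prompt layout is one reasonable choice, not a prescribed format:

```python
def build_few_shot_prompt(
    task: str, examples: list[tuple[str, str]], query: str
) -> str:
    """Assemble a few-shot prompt; the guide suggests 3-5 diverse examples."""
    if not 1 <= len(examples) <= 5:
        raise ValueError("keep the example count small (3-5 recommended)")
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{task}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("great product", "positive"), ("arrived broken", "negative")],
    "works as advertised",
)
```

Keeping assembly in one function makes the prompt easy to version (e.g. in Langfuse) and to swap example sets during A/B tests.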

Key Decisions

| Decision | Recommendation |
|---|---|
| Tool schema mode | `strict: true` (2026 best practice) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | `deepseek-r1:70b` |
| Local model (coding) | `qwen2.5-coder:32b` |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |

Related Skills

  • rag-retrieval
    -- Embedding patterns, when RAG is better than fine-tuning
  • agent-loops
    -- Multi-step tool use with reasoning
  • llm-evaluation
    -- Evaluate fine-tuned and local models
  • langfuse-observability
    -- Track training experiments

Capability Details

function-calling

Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools

Solves:
  • Define tools with clear descriptions and strict schemas
  • Execute tool calls in parallel with asyncio.gather
  • Validate inputs and handle errors in tool execution loops

streaming

Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream

Solves:
  • Stream LLM tokens via SSE endpoints
  • Handle tool calls within streams
  • Manage backpressure with bounded queues

local-inference

Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon

Solves:
  • Set up Ollama for local LLM inference
  • Select models based on task and hardware
  • Optimize GPU usage and CI integration

fine-tuning

Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment

Solves:
  • Configure LoRA/QLoRA for parameter-efficient training
  • Generate and validate synthetic training data
  • Align models with DPO and evaluate results