# LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in `rules/`, loaded on-demand.

## Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 2 | HIGH | CoT, few-shot, versioning, DSPy optimization |
Total: 18 rules across 7 categories
## Quick Start

```python
# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]
```
```python
# Streaming: SSE endpoint with FastAPI (EventSourceResponse from sse-starlette)
@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}
    return EventSourceResponse(generate())
```
```python
# Local inference: Ollama with LangChain (ChatOllama from langchain_ollama)
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,
)
```
```python
# Fine-tuning: QLoRA with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```
## Function Calling
Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results.
- `calling-tool-definition.md` -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
- `calling-parallel.md` -- Parallel tool execution, asyncio.gather, strict mode constraints
- `calling-validation.md` -- Input validation, error handling, tool execution loops
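The validation pattern above can be sketched with Pydantic: parse the raw JSON arguments the model produced, and return any failure as a tool result so the model can self-correct instead of crashing the loop. The tool, its schema fields, and the result shape are illustrative, not part of this skill's rule files.

```python
# Sketch: validate tool inputs with Pydantic, return errors as tool results.
# SearchArgs / execute_search are hypothetical names for illustration.
from pydantic import BaseModel, Field, ValidationError

class SearchArgs(BaseModel):
    query: str = Field(min_length=1, description="Search query")
    limit: int = Field(ge=1, le=50, description="Max results")

def execute_search(raw_args: str) -> dict:
    """Parse and validate the raw JSON arguments emitted by the model."""
    try:
        args = SearchArgs.model_validate_json(raw_args)
    except ValidationError as e:
        # Return the error as a tool result so the model can retry.
        return {"error": f"Invalid arguments: {e.errors()[0]['msg']}"}
    return {"results": [f"doc matching {args.query!r}"][: args.limit]}
```

Passing `{"limit": 0}` violates the `ge=1` constraint and comes back as an `error` payload rather than an exception.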
## Streaming
Deliver LLM responses in real-time for better UX. Use SSE for web, WebSocket for bidirectional. Handle backpressure with bounded queues.
- `streaming-sse.md` -- FastAPI SSE endpoints, frontend consumers, async iterators
- `streaming-structured.md` -- Streaming with tool calls, partial JSON parsing, chunk accumulation
- `streaming-backpressure.md` -- Backpressure handling, bounded buffers, cancellation
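The bounded-queue idea can be sketched in a few lines: the producer's `put()` blocks once the buffer is full, so a fast token stream cannot outrun a slow consumer. The token values and queue size here are illustrative.

```python
# Sketch: backpressure via a bounded asyncio.Queue between producer and consumer.
import asyncio

async def produce(queue: asyncio.Queue, tokens):
    for tok in tokens:
        await queue.put(tok)   # blocks when the queue is full (backpressure)
    await queue.put(None)      # sentinel: stream finished

async def consume(queue: asyncio.Queue):
    out = []
    while (tok := await queue.get()) is not None:
        out.append(tok)
    return out

async def main():
    queue = asyncio.Queue(maxsize=100)  # bounded buffer (50-200 tokens typical)
    _, received = await asyncio.gather(
        produce(queue, ["Hel", "lo", "!"]), consume(queue)
    )
    return received

# asyncio.run(main()) -> ["Hel", "lo", "!"]
```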
## Local Inference
Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models, use provider factory for cloud/local switching.
- `local-ollama-setup.md` -- Installation, model pulling, environment configuration
- `local-model-selection.md` -- Model comparison by task, hardware profiles, quantization
- `local-gpu-optimization.md` -- Apple Silicon tuning, keep-alive, CI integration
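The provider factory mentioned above can be sketched as a small function that resolves cloud vs. local settings from an environment variable; pass the resulting config to `ChatOllama` or your cloud client. The variable name `LLM_PROVIDER` and the model ids are assumptions for illustration.

```python
# Sketch: provider factory for cloud/local switching via an env var.
# LLM_PROVIDER and the model names are assumed, not defined by this skill.
import os

def llm_config() -> dict:
    """Resolve provider settings; feed these to ChatOllama or a cloud client."""
    if os.getenv("LLM_PROVIDER", "local") == "local":
        return {
            "provider": "ollama",
            "model": "deepseek-r1:70b",
            "base_url": "http://localhost:11434",
        }
    return {"provider": "openai", "model": "gpt-4o-mini"}
```

Defaulting to local keeps development offline and cheap; CI or production sets `LLM_PROVIDER` to flip to the cloud path without code changes.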
## Fine-Tuning
Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.
- `tuning-lora.md` -- LoRA/QLoRA configuration, Unsloth training, adapter merging
- `tuning-dataset-prep.md` -- Synthetic data generation, quality validation, deduplication
- `tuning-evaluation.md` -- DPO alignment, evaluation metrics, anti-patterns
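The quality-validation and deduplication steps can be sketched as a single pass over chat-format examples: drop malformed records, then drop exact duplicates. The `{"messages": [{"role", "content"}]}` shape is the common convention, assumed here rather than mandated by the rule files.

```python
# Sketch: minimal dataset prep - validate chat-format examples, dedupe exactly.
import json

def prepare(examples: list[dict]) -> list[dict]:
    seen, clean = set(), []
    for ex in examples:
        msgs = ex.get("messages", [])
        if not msgs or any("role" not in m or "content" not in m for m in msgs):
            continue                       # drop malformed examples
        key = json.dumps(msgs, sort_keys=True)
        if key in seen:
            continue                       # drop exact duplicates
        seen.add(key)
        clean.append(ex)
    return clean
```

Real pipelines add near-duplicate detection and quality scoring on top; exact-match dedup is only the floor.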
## Context Optimization
Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.
- `context-window-management.md` -- Five-layer architecture, anchored summarization, compression triggers
- `context-caching.md` -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
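A compression trigger can be sketched as: measure utilization, and once it crosses 70%, keep only the most recent messages that fit a 50% target budget and summarize the rest. The character-count token proxy and the `summarize` stub are simplifications; a real implementation would use a tokenizer and an anchored-summarization call.

```python
# Sketch: compress at 70% utilization down to a 50% target.
# Character counts stand in for tokens; summarize() is a placeholder.
def maybe_compress(messages: list[str], window: int,
                   trigger: float = 0.7, target: float = 0.5,
                   summarize=lambda msgs: "[summary] " + msgs[0][:20]):
    used = sum(len(m) for m in messages)   # crude token proxy: characters
    if used / window < trigger:
        return messages                    # under threshold: leave as-is
    # Keep recent messages within the target budget; summarize the rest.
    budget, kept = int(window * target), []
    for m in reversed(messages):
        if sum(len(k) for k in kept) + len(m) > budget:
            break
        kept.append(m)
    kept = list(reversed(kept))
    older = messages[: len(messages) - len(kept)]
    return ([summarize(older)] if older else []) + kept
```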
## Evaluation
Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.
- `evaluation-metrics.md` -- LLM-as-judge, RAGAS metrics, hallucination detection
- `evaluation-benchmarks.md` -- Quality gates, batch evaluation, pairwise comparison
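A quality gate built on the thresholds in Key Decisions (0.7 for production, 0.6 for drafts) can be sketched as below. The `judge` stub scores by token overlap purely so the example runs; in practice it would call a small judge model with a scoring rubric.

```python
# Sketch: LLM-as-judge quality gate. judge() is a stand-in scorer; swap in a
# call to a judge model returning a score in [0, 1].
def judge(answer: str, reference: str) -> float:
    """Stub scorer: token overlap with the reference, in [0, 1]."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)

def quality_gate(answer: str, reference: str, *, production: bool = True) -> bool:
    threshold = 0.7 if production else 0.6
    return judge(answer, reference) >= threshold
```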
## Prompt Engineering
Design, version, and optimize prompts for production LLM applications.
- `prompt-design.md` -- Chain-of-Thought, few-shot learning, pattern selection guide
- `prompt-testing.md` -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
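Combining the two patterns above, a few-shot Chain-of-Thought prompt can be assembled from (question, reasoning, answer) triples, capped at the 3-5 examples recommended in Key Decisions. The format is one illustrative convention, not the one mandated by `prompt-design.md`.

```python
# Sketch: assemble a few-shot CoT prompt from (question, reasoning, answer)
# triples. The Q/A template is an assumption for illustration.
def build_prompt(examples: list[tuple[str, str, str]], question: str) -> str:
    parts = []
    for q, reasoning, a in examples[:5]:   # cap at 5 few-shot examples
        parts.append(f"Q: {q}\nThink step by step: {reasoning}\nA: {a}")
    parts.append(f"Q: {question}\nThink step by step:")
    return "\n\n".join(parts)
```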
## Key Decisions
| Decision | Recommendation |
|---|---|
| Tool schema mode | |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | |
| Local model (coding) | |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |
## Related Skills
- `rag-retrieval` -- Embedding patterns, when RAG is better than fine-tuning
- `agent-loops` -- Multi-step tool use with reasoning
- `llm-evaluation` -- Evaluate fine-tuned and local models
- `langfuse-observability` -- Track training experiments
## Capability Details
### function-calling
Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools
Solves:
- Define tools with clear descriptions and strict schemas
- Execute tool calls in parallel with asyncio.gather
- Validate inputs and handle errors in tool execution loops
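The parallel execution and error-handling points above can be sketched together: independent tool calls fan out through `asyncio.gather`, and failures come back as results in the same list rather than raising. The tool registry here is hypothetical.

```python
# Sketch: parallel tool execution with asyncio.gather, errors as tool results.
# The "search"/"weather" tools are stand-ins for real implementations.
import asyncio

async def run_tool(name: str, args: dict) -> dict:
    tools = {"search": lambda a: {"hits": [a["query"]]},
             "weather": lambda a: {"temp_c": 21}}
    try:
        return {"tool": name, "result": tools[name](args)}
    except Exception as e:
        return {"tool": name, "error": str(e)}   # error as a tool result

async def run_all(calls: list[tuple[str, dict]]) -> list[dict]:
    return await asyncio.gather(*(run_tool(n, a) for n, a in calls))
```

Because errors are returned rather than raised, one bad call (an unknown tool name, say) does not cancel its siblings.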
### streaming
Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream
Solves:
- Stream LLM tokens via SSE endpoints
- Handle tool calls within streams
- Manage backpressure with bounded queues
### local-inference
Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon
Solves:
- Set up Ollama for local LLM inference
- Select models based on task and hardware
- Optimize GPU usage and CI integration
### fine-tuning
Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment
Solves:
- Configure LoRA/QLoRA for parameter-efficient training
- Generate and validate synthetic training data
- Align models with DPO and evaluate results