# LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in `rules/`, loaded on-demand.

## Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 2 | HIGH | CoT, few-shot, versioning, DSPy optimization |
Total: 18 rules across 7 categories
## Quick Start

```python
# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]
```
```python
# Streaming: SSE endpoint with FastAPI (EventSourceResponse from sse-starlette)
@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}
    return EventSourceResponse(generate())
```
```python
# Local inference: Ollama with LangChain (ChatOllama from langchain_ollama)
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,
)
```
```python
# Fine-tuning: QLoRA with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```
## Function Calling
Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results.
- `calling-tool-definition.md` -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
- `calling-parallel.md` -- Parallel tool execution, asyncio.gather, strict mode constraints
- `calling-validation.md` -- Input validation, error handling, tool execution loops
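The validation pattern above can be sketched with Pydantic: parse the raw JSON arguments the model produced, and return any failure as a tool result so the model can self-correct instead of crashing the loop. The tool, its schema fields, and the result shape are illustrative, not part of this skill's rule files.

```python
# Sketch: validate tool inputs with Pydantic, return errors as tool results.
# SearchArgs / execute_search are hypothetical names for illustration.
from pydantic import BaseModel, Field, ValidationError

class SearchArgs(BaseModel):
    query: str = Field(min_length=1, description="Search query")
    limit: int = Field(ge=1, le=50, description="Max results")

def execute_search(raw_args: str) -> dict:
    """Parse and validate the raw JSON arguments emitted by the model."""
    try:
        args = SearchArgs.model_validate_json(raw_args)
    except ValidationError as e:
        # Return the error as a tool result so the model can retry.
        return {"error": f"Invalid arguments: {e.errors()[0]['msg']}"}
    return {"results": [f"doc matching {args.query!r}"][: args.limit]}
```

Passing `{"limit": 0}` violates the `ge=1` constraint and comes back as an `error` payload rather than an exception.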
## Streaming
Deliver LLM responses in real-time for better UX. Use SSE for web, WebSocket for bidirectional. Handle backpressure with bounded queues.
- `streaming-sse.md` -- FastAPI SSE endpoints, frontend consumers, async iterators
- `streaming-structured.md` -- Streaming with tool calls, partial JSON parsing, chunk accumulation
- `streaming-backpressure.md` -- Backpressure handling, bounded buffers, cancellation
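The bounded-queue idea can be sketched in a few lines: the producer's `put()` blocks once the buffer is full, so a fast token stream cannot outrun a slow consumer. The token values and queue size here are illustrative.

```python
# Sketch: backpressure via a bounded asyncio.Queue between producer and consumer.
import asyncio

async def produce(queue: asyncio.Queue, tokens):
    for tok in tokens:
        await queue.put(tok)   # blocks when the queue is full (backpressure)
    await queue.put(None)      # sentinel: stream finished

async def consume(queue: asyncio.Queue):
    out = []
    while (tok := await queue.get()) is not None:
        out.append(tok)
    return out

async def main():
    queue = asyncio.Queue(maxsize=100)  # bounded buffer (50-200 tokens typical)
    _, received = await asyncio.gather(
        produce(queue, ["Hel", "lo", "!"]), consume(queue)
    )
    return received

# asyncio.run(main()) -> ["Hel", "lo", "!"]
```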
## Local Inference
Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models, use provider factory for cloud/local switching.
- `local-ollama-setup.md` -- Installation, model pulling, environment configuration
- `local-model-selection.md` -- Model comparison by task, hardware profiles, quantization
- `local-gpu-optimization.md` -- Apple Silicon tuning, keep-alive, CI integration
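The provider factory mentioned above can be sketched as a small function that resolves cloud vs. local settings from an environment variable; pass the resulting config to `ChatOllama` or your cloud client. The variable name `LLM_PROVIDER` and the model ids are assumptions for illustration.

```python
# Sketch: provider factory for cloud/local switching via an env var.
# LLM_PROVIDER and the model names are assumed, not defined by this skill.
import os

def llm_config() -> dict:
    """Resolve provider settings; feed these to ChatOllama or a cloud client."""
    if os.getenv("LLM_PROVIDER", "local") == "local":
        return {
            "provider": "ollama",
            "model": "deepseek-r1:70b",
            "base_url": "http://localhost:11434",
        }
    return {"provider": "openai", "model": "gpt-4o-mini"}
```

Defaulting to local keeps development offline and cheap; CI or production sets `LLM_PROVIDER` to flip to the cloud path without code changes.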
## Fine-Tuning
Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.
- `tuning-lora.md` -- LoRA/QLoRA configuration, Unsloth training, adapter merging
- `tuning-dataset-prep.md` -- Synthetic data generation, quality validation, deduplication
- `tuning-evaluation.md` -- DPO alignment, evaluation metrics, anti-patterns
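The quality-validation and deduplication steps can be sketched as a single pass over chat-format examples: drop malformed records, then drop exact duplicates. The `{"messages": [{"role", "content"}]}` shape is the common convention, assumed here rather than mandated by the rule files.

```python
# Sketch: minimal dataset prep - validate chat-format examples, dedupe exactly.
import json

def prepare(examples: list[dict]) -> list[dict]:
    seen, clean = set(), []
    for ex in examples:
        msgs = ex.get("messages", [])
        if not msgs or any("role" not in m or "content" not in m for m in msgs):
            continue                       # drop malformed examples
        key = json.dumps(msgs, sort_keys=True)
        if key in seen:
            continue                       # drop exact duplicates
        seen.add(key)
        clean.append(ex)
    return clean
```

Real pipelines add near-duplicate detection and quality scoring on top; exact-match dedup is only the floor.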
## Context Optimization
Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.
- `context-window-management.md` -- Five-layer architecture, anchored summarization, compression triggers
- `context-caching.md` -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
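A compression trigger can be sketched as: measure utilization, and once it crosses 70%, keep only the most recent messages that fit a 50% target budget and summarize the rest. The character-count token proxy and the `summarize` stub are simplifications; a real implementation would use a tokenizer and an anchored-summarization call.

```python
# Sketch: compress at 70% utilization down to a 50% target.
# Character counts stand in for tokens; summarize() is a placeholder.
def maybe_compress(messages: list[str], window: int,
                   trigger: float = 0.7, target: float = 0.5,
                   summarize=lambda msgs: "[summary] " + msgs[0][:20]):
    used = sum(len(m) for m in messages)   # crude token proxy: characters
    if used / window < trigger:
        return messages                    # under threshold: leave as-is
    # Keep recent messages within the target budget; summarize the rest.
    budget, kept = int(window * target), []
    for m in reversed(messages):
        if sum(len(k) for k in kept) + len(m) > budget:
            break
        kept.append(m)
    kept = list(reversed(kept))
    older = messages[: len(messages) - len(kept)]
    return ([summarize(older)] if older else []) + kept
```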
## Evaluation
Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.
- `evaluation-metrics.md` -- LLM-as-judge, RAGAS metrics, hallucination detection
- `evaluation-benchmarks.md` -- Quality gates, batch evaluation, pairwise comparison
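A quality gate built on the thresholds in Key Decisions (0.7 for production, 0.6 for drafts) can be sketched as below. The `judge` stub scores by token overlap purely so the example runs; in practice it would call a small judge model with a scoring rubric.

```python
# Sketch: LLM-as-judge quality gate. judge() is a stand-in scorer; swap in a
# call to a judge model returning a score in [0, 1].
def judge(answer: str, reference: str) -> float:
    """Stub scorer: token overlap with the reference, in [0, 1]."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)

def quality_gate(answer: str, reference: str, *, production: bool = True) -> bool:
    threshold = 0.7 if production else 0.6
    return judge(answer, reference) >= threshold
```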
## Prompt Engineering
Design, version, and optimize prompts for production LLM applications.
- `prompt-design.md` -- Chain-of-Thought, few-shot learning, pattern selection guide
- `prompt-testing.md` -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
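Combining the two patterns above, a few-shot Chain-of-Thought prompt can be assembled from (question, reasoning, answer) triples, capped at the 3-5 examples recommended in Key Decisions. The format is one illustrative convention, not the one mandated by `prompt-design.md`.

```python
# Sketch: assemble a few-shot CoT prompt from (question, reasoning, answer)
# triples. The Q/A template is an assumption for illustration.
def build_prompt(examples: list[tuple[str, str, str]], question: str) -> str:
    parts = []
    for q, reasoning, a in examples[:5]:   # cap at 5 few-shot examples
        parts.append(f"Q: {q}\nThink step by step: {reasoning}\nA: {a}")
    parts.append(f"Q: {question}\nThink step by step:")
    return "\n\n".join(parts)
```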
## Key Decisions
| Decision | Recommendation |
|---|---|
| Tool schema mode | |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | |
| Local model (coding) | |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |
## Related Skills
- `rag-retrieval` -- Embedding patterns, when RAG is better than fine-tuning
- `agent-loops` -- Multi-step tool use with reasoning
- `llm-evaluation` -- Evaluate fine-tuned and local models
- `langfuse-observability` -- Track training experiments
## Capability Details
### function-calling
Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools
Solves:
- Define tools with clear descriptions and strict schemas
- Execute tool calls in parallel with asyncio.gather
- Validate inputs and handle errors in tool execution loops
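The parallel execution and error-handling points above can be sketched together: independent tool calls fan out through `asyncio.gather`, and failures come back as results in the same list rather than raising. The tool registry here is hypothetical.

```python
# Sketch: parallel tool execution with asyncio.gather, errors as tool results.
# The "search"/"weather" tools are stand-ins for real implementations.
import asyncio

async def run_tool(name: str, args: dict) -> dict:
    tools = {"search": lambda a: {"hits": [a["query"]]},
             "weather": lambda a: {"temp_c": 21}}
    try:
        return {"tool": name, "result": tools[name](args)}
    except Exception as e:
        return {"tool": name, "error": str(e)}   # error as a tool result

async def run_all(calls: list[tuple[str, dict]]) -> list[dict]:
    return await asyncio.gather(*(run_tool(n, a) for n, a in calls))
```

Because errors are returned rather than raised, one bad call (an unknown tool name, say) does not cancel its siblings.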
### streaming
Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream
Solves:
- Stream LLM tokens via SSE endpoints
- Handle tool calls within streams
- Manage backpressure with bounded queues
### local-inference
Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon
Solves:
- Set up Ollama for local LLM inference
- Select models based on task and hardware
- Optimize GPU usage and CI integration
### fine-tuning
Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment
Solves:
- Configure LoRA/QLoRA for parameter-efficient training
- Generate and validate synthetic training data
- Align models with DPO and evaluate results