LLM integration patterns for function calling, streaming responses, local inference with Ollama, and fine-tuning customization. Use when implementing tool use, SSE streaming, local model deployment, LoRA/QLoRA fine-tuning, or multi-provider LLM APIs.
```shell
npx skill4agent add yonatangross/orchestkit llm-integration
```

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 2 | HIGH | CoT, few-shot, versioning, DSPy optimization |
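The CRITICAL input-validation rule means tool arguments from the model should be checked against the tool's schema before execution. A stdlib-only sketch (production code would use a real JSON Schema library; `parse_tool_args` is our name):

```python
# Validate tool-call arguments against a strict schema by hand
# (stdlib-only sketch; use a JSON Schema library in production).
import json

SEARCH_SCHEMA = {
    "type": "object",
    "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
    "required": ["query", "limit"],
    "additionalProperties": False,
}

PY_TYPES = {"string": str, "integer": int}

def parse_tool_args(raw_arguments: str, schema: dict) -> dict:
    """Parse the model-produced JSON and enforce the schema."""
    args = json.loads(raw_arguments)
    for key in schema["required"]:
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        prop = schema["properties"].get(key)
        if prop is None:
            raise ValueError(f"unexpected argument: {key}")
        if not isinstance(value, PY_TYPES[prop["type"]]):
            raise ValueError(f"bad type for argument: {key}")
    return args

args = parse_tool_args('{"query": "vector db", "limit": 5}', SEARCH_SCHEMA)
```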
```python
# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]
```
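The parallel-execution rule applies when the model returns several tool calls in one turn: await them concurrently instead of serially. A minimal sketch with `asyncio.gather` (the tool registry and call dicts are illustrative, not the provider SDK's types):

```python
# Run multiple model-requested tool calls concurrently (illustrative
# registry; real tool calls come from the provider's response object).
import asyncio

async def search_documents(query: str, limit: int) -> list:
    await asyncio.sleep(0)  # stand-in for real I/O
    return [f"doc for {query!r}"][:limit]

TOOL_REGISTRY = {"search_documents": search_documents}

async def run_tool_calls(tool_calls: list[dict]) -> list:
    """Dispatch all tool calls in parallel; results keep call order."""
    tasks = [TOOL_REGISTRY[c["name"]](**c["arguments"]) for c in tool_calls]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_tool_calls([
    {"name": "search_documents", "arguments": {"query": "a", "limit": 1}},
    {"name": "search_documents", "arguments": {"query": "b", "limit": 1}},
]))
```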
```python
# Streaming: SSE endpoint with FastAPI (EventSourceResponse from sse-starlette)
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        # async_stream: your provider's async token generator
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}

    return EventSourceResponse(generate())
```
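The backpressure rule pairs with the 50-200 token buffer size in the decision table: coalesce tokens into batches before flushing an SSE event rather than emitting one event per token. A sketch (synchronous for illustration; the function name is ours):

```python
# Coalesce tokens into fixed-size batches so slow clients receive
# fewer, larger SSE events (buffer size per the decision table).
from typing import Iterable, Iterator

def buffered(tokens: Iterable[str], buffer_size: int = 50) -> Iterator[str]:
    buf: list[str] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) >= buffer_size:
            yield "".join(buf)
            buf.clear()
    if buf:  # flush the remainder on stream end
        yield "".join(buf)

chunks = list(buffered(["a"] * 120, buffer_size=50))  # 3 chunks: 50, 50, 20
```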
```python
# Local inference: Ollama with LangChain
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,
)
```
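The Ollama config above sets `num_ctx=32768`; the context-optimization rules recommend compressing at 70% utilization of that window down to a 50% target. The trigger arithmetic is a one-liner (token counting itself is tokenizer-specific and elided; function name is ours):

```python
# Decide when to compress the context window: trigger at 70%
# utilization, compress down to a 50% target (per the decision table).
def compression_plan(used_tokens: int, num_ctx: int,
                     trigger: float = 0.70, target: float = 0.50):
    utilization = used_tokens / num_ctx
    if utilization < trigger:
        return None  # below trigger, no compression needed yet
    return {"tokens_to_free": used_tokens - int(num_ctx * target)}

assert compression_plan(10_000, 32_768) is None  # ~31% used
plan = compression_plan(24_000, 32_768)          # ~73% used, compress
```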
```python
# Fine-tuning: QLoRA with Unsloth
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```

Rules:
- calling-tool-definition.md, calling-parallel.md, calling-validation.md
- streaming-sse.md, streaming-structured.md, streaming-backpressure.md
- local-ollama-setup.md, local-model-selection.md, local-gpu-optimization.md
- tuning-lora.md, tuning-dataset-prep.md, tuning-evaluation.md
- context-window-management.md, context-caching.md
- evaluation-metrics.md, evaluation-benchmarks.md
- prompt-design.md, prompt-testing.md

| Decision | Recommendation |
|---|---|
| Tool schema mode | Strict (`"strict": True`, `additionalProperties: false`) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | deepseek-r1 (as in the Ollama example) |
| Local model (coding) | |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |
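The judge-model thresholds above (0.7 production, 0.6 drafts) reduce to a small gate on the judge's score; a sketch with the judge call itself stubbed out (names are ours):

```python
# Gate generated output on an LLM-as-judge score: 0.7 for
# production, 0.6 for drafts (thresholds from the table above).
THRESHOLDS = {"production": 0.7, "draft": 0.6}

def passes_quality_gate(judge_score: float, stage: str = "production") -> bool:
    """judge_score: 0.0-1.0 rating from the judge model."""
    return judge_score >= THRESHOLDS[stage]

assert passes_quality_gate(0.72) is True
assert passes_quality_gate(0.65, stage="draft") is True
assert passes_quality_gate(0.65) is False
```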
Related skills: rag-retrieval, agent-loops, llm-evaluation, langfuse-observability