Use when the user needs LLM system architecture, model deployment, optimization strategies, or production serving infrastructure. Designs scalable large language model applications with a focus on performance, cost efficiency, and safety.
```shell
npx skill4agent add 404kidwiz/claude-supercode-skills llm-architect
```

| Requirement | Recommended Approach |
|---|---|
| Latency <100ms | Small fine-tuned model (7B quantized) |
| Latency <2s, budget unlimited | Claude 3 Opus / GPT-4 |
| Latency <2s, domain-specific | Claude 3 Sonnet fine-tuned |
| Latency <2s, cost-sensitive | Claude 3 Haiku |
| Batch/async acceptable | Batch API, cheapest tier |
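The cost-sensitive rows above come down to simple per-token arithmetic. A minimal sketch of estimating cost per 1K requests; the token counts and per-million-token prices below are illustrative assumptions, not current vendor pricing:

```python
def cost_per_1k_requests(input_tokens, output_tokens,
                         price_in_per_mtok, price_out_per_mtok):
    """Estimated USD cost of 1,000 requests given average token counts
    per request and per-million-token prices (all values assumed)."""
    per_request = (input_tokens * price_in_per_mtok +
                   output_tokens * price_out_per_mtok) / 1_000_000
    return per_request * 1000

# Illustrative comparison of a cheap-tier vs a premium-tier price point:
cheap = cost_per_1k_requests(1500, 300, 0.25, 1.25)    # ≈ $0.75 per 1K
premium = cost_per_1k_requests(1500, 300, 15.0, 75.0)  # ≈ $45.00 per 1K
```

The spread between tiers is typically one to two orders of magnitude, which is why the routing and caching strategies later in this document pay for themselves quickly.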
```
Need to customize LLM behavior?
│
├─ Need domain-specific knowledge?
│  ├─ Knowledge changes frequently?
│  │  └─ RAG (Retrieval-Augmented Generation)
│  └─ Knowledge is static?
│     └─ Fine-tuning OR RAG (test both)
│
├─ Need specific output format/style?
│  ├─ Can describe in prompt?
│  │  └─ Prompt engineering (try first)
│  └─ Format too complex for prompt?
│     └─ Fine-tuning
│
└─ Need latency <100ms?
   └─ Fine-tuned small model (7B-13B)
```

```
[Client] → [API Gateway + Rate Limiting]
                 ↓
         [Request Router]
    (Route by intent/complexity)
                 ↓
        ┌────────┴────────┐
        ↓                 ↓
  [Fast Model]     [Powerful Model]
 (Haiku/Small)     (Sonnet/Large)
        ↓                 ↓
  [Cache Layer] ← [Response Aggregator]
                 ↓
     [Logging & Monitoring]
                 ↓
      [Response to Client]
```

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    """Assumed request-profile shape; adapt field names to your system."""
    latency_p95: int        # milliseconds
    task_complexity: str    # "simple" | "complex"
    budget: str             # "unlimited" | "constrained"
    domain_specific: bool
    accuracy_critical: bool

def select_model(requirements: Requirements) -> str:
    if requirements.latency_p95 < 100:  # milliseconds
        if requirements.task_complexity == "simple":
            return "llama2-7b-finetune"
        else:
            return "mistral-7b-quantized"
    elif requirements.latency_p95 < 2000:
        if requirements.budget == "unlimited":
            return "claude-3-opus"
        elif requirements.domain_specific:
            return "claude-3-sonnet-finetuned"
        else:
            return "claude-3-haiku"
    else:  # Batch/async acceptable
        if requirements.accuracy_critical:
            return "gpt-4-with-ensemble"
        else:
            return "batch-api-cheapest-tier"
```

```shell
# Run benchmark on eval dataset
python scripts/evaluate_model.py \
  --model claude-3-sonnet \
  --dataset data/eval_1000_examples.jsonl \
  --metrics accuracy,latency,cost

# Expected output:
# Accuracy: 94.3%
# P95 Latency: 1,245ms
# Cost per 1K requests: $2.15
```

| Strategy | Savings | When to Use |
|---|---|---|
| Semantic caching | 40-80% | 60%+ similar queries |
| Multi-model routing | 30-50% | Mixed complexity queries |
| Prompt compression | 10-20% | Long context inputs |
| Batching | 20-40% | Async-tolerant workloads |
| Smaller model cascade | 40-60% | Simple queries first |
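Semantic caching is usually the highest-leverage row above. A minimal sketch of the idea, using a stand-in similarity function (`difflib` ratio) where a production system would compare embedding vectors; the 0.9 threshold is an assumption to tune against your own traffic:

```python
from difflib import SequenceMatcher

class SemanticCache:
    """Serve a cached answer when a new query is close enough to one
    seen before. Stand-in similarity: difflib ratio; production systems
    use embedding cosine similarity instead."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (query, response)

    def get(self, query):
        for cached_query, response in self.entries:
            ratio = SequenceMatcher(None, query.lower(),
                                    cached_query.lower()).ratio()
            if ratio >= self.threshold:
                return response  # cache hit: skip the model call
        return None              # cache miss: call the model, then put()

    def put(self, query, response):
        self.entries.append((query, response))

cache = SemanticCache(threshold=0.9)
cache.put("What is our refund policy?", "Refunds within 30 days.")
hit = cache.get("What is our refund policy??")   # near-duplicate → hit
miss = cache.get("How do I reset my password?")  # unrelated → miss
```

The linear scan here is fine for a sketch; at production scale the lookup would go through an approximate-nearest-neighbor index over embeddings.
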

| Observation | Action |
|---|---|
| Accuracy <80% after prompt iteration | Consider fine-tuning |
| Latency 2x requirement | Review infrastructure |
| Cost >2x budget | Aggressive caching/routing |
| Hallucination rate >5% | Add RAG or stronger guardrails |
| Safety bypass detected | Immediate security review |
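The cost and latency actions above often combine into a smaller-model cascade: answer with the cheap model and escalate only when its confidence is low. A sketch with stubbed model calls; `call_haiku`, `call_sonnet`, and the 0.7 confidence cutoff are hypothetical placeholders for your real API clients and tuning:

```python
# Stubs standing in for real API clients (hypothetical):
def call_haiku(query):
    return f"haiku answer to: {query}", 0.9 if len(query) < 50 else 0.4

def call_sonnet(query):
    return f"sonnet answer to: {query}", 0.95

def cascade(query, confidence_cutoff=0.7):
    """Try the cheap model first; escalate to the strong model only
    when the cheap answer's confidence falls below the cutoff."""
    answer, confidence = call_haiku(query)   # cheap, fast path
    if confidence >= confidence_cutoff:
        return answer, "fast-model"
    answer, _ = call_sonnet(query)           # expensive fallback
    return answer, "powerful-model"
```

If most traffic is simple, the cascade sends only the hard tail to the expensive model, which is where the 40-60% savings figure in the table comes from.
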

| Metric | Target | Critical |
|---|---|---|
| P95 Latency | <2x requirement | <3x requirement |
| Accuracy | >90% | >80% |
| Cache Hit Rate | >60% | >40% |
| Error Rate | <1% | <5% |
| Cost/1K requests | Within budget | <150% budget |
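The target/critical thresholds above can be enforced mechanically in monitoring. A sketch of a health check over those thresholds; the function shape and the sample values are illustrative:

```python
def check_metrics(p95_ms, latency_req_ms, accuracy, cache_hit_rate,
                  error_rate, cost_per_1k, budget_per_1k):
    """Classify each metric as 'ok', 'warn' (missed target), or
    'critical' (missed the critical bound), per the table above."""
    def grade(value, target_ok, critical_ok):
        if critical_ok(value):
            return "ok" if target_ok(value) else "warn"
        return "critical"

    return {
        "p95_latency": grade(p95_ms,
                             lambda v: v < 2 * latency_req_ms,
                             lambda v: v < 3 * latency_req_ms),
        "accuracy": grade(accuracy,
                          lambda v: v > 0.90, lambda v: v > 0.80),
        "cache_hit_rate": grade(cache_hit_rate,
                                lambda v: v > 0.60, lambda v: v > 0.40),
        "error_rate": grade(error_rate,
                            lambda v: v < 0.01, lambda v: v < 0.05),
        "cost_per_1k": grade(cost_per_1k,
                             lambda v: v <= budget_per_1k,
                             lambda v: v < 1.5 * budget_per_1k),
    }

# Sample reading: latency within 2x requirement, cache and cost
# below target but above critical → two warnings, no critical.
report = check_metrics(p95_ms=3000, latency_req_ms=2000, accuracy=0.943,
                       cache_hit_rate=0.55, error_rate=0.008,
                       cost_per_1k=2.15, budget_per_1k=2.00)
```

A "warn" maps to the observation/action table earlier (e.g. a low cache hit rate suggests more aggressive caching or routing); a "critical" should page.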