runtime-skills
Universal Runtime Skills
Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server.
Overview
The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models:
- Text generation (Causal LMs: GPT, Llama, Mistral, Qwen)
- Text embeddings (BERT, sentence-transformers, ModernBERT)
- Classification, NER, and reranking
- OCR and document understanding
- Anomaly detection
Directory: runtimes/universal/
Python: 3.11+
Key Dependencies: PyTorch, Transformers, FastAPI, llama-cpp-python
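Because the endpoints are OpenAI-compatible, request bodies follow the standard OpenAI wire format. A minimal sketch of two request payloads (the model IDs and base URL below are illustrative assumptions, not values from this document):

```python
import json

# Assumed local server address; actual host/port depend on deployment.
BASE_URL = "http://localhost:8000/v1"

# Chat completion request in the OpenAI-compatible schema.
chat_request = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",  # any HuggingFace causal LM id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what a reranker does."},
    ],
    "max_tokens": 128,
    "stream": False,
}

# Embedding request in the OpenAI-compatible schema.
embedding_request = {
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": ["first text", "second text"],
}

print(json.dumps(chat_request, indent=2))
```

Any OpenAI client library can be pointed at the runtime by overriding its base URL.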
Links to Shared Skills
This skill extends the shared Python practices. Always apply these first:
| Topic | File | Priority |
|---|---|---|
| Patterns | python-skills/patterns.md | Medium |
| Async | python-skills/async.md | High |
| Typing | python-skills/typing.md | Medium |
| Testing | python-skills/testing.md | Medium |
| Errors | python-skills/error-handling.md | High |
| Security | python-skills/security.md | Critical |
Runtime-Specific Checklists
| Topic | File | Key Points |
|---|---|---|
| PyTorch | pytorch.md | Device management, dtype, memory cleanup |
| Transformers | transformers.md | Model loading, tokenization, inference |
| FastAPI | fastapi.md | API design, streaming, lifespan |
| Performance | performance.md | Batching, caching, optimizations |
Architecture
```
runtimes/universal/
├── server.py                  # FastAPI app, model caching, endpoints
├── core/
│   └── logging.py             # UniversalRuntimeLogger (structlog)
├── models/
│   ├── base.py                # BaseModel ABC with device management
│   ├── language_model.py      # Transformers text generation
│   ├── gguf_language_model.py # llama-cpp-python for GGUF
│   ├── encoder_model.py       # Embeddings, classification, NER, reranking
│   └── ...                    # OCR, anomaly, document models
├── routers/
│   └── chat_completions/      # Chat completions with streaming
├── utils/
│   ├── device.py              # Device detection (CUDA/MPS/CPU)
│   ├── model_cache.py         # TTL-based model caching
│   ├── model_format.py        # GGUF vs transformers detection
│   └── context_calculator.py  # GGUF context size computation
└── tests/
```

Key Patterns
1. Model Loading with Double-Checked Locking
```python
_model_load_lock = asyncio.Lock()

async def load_encoder(model_id: str, task: str = "embedding"):
    cache_key = f"encoder:{task}:{model_id}"
    if cache_key not in _models:
        async with _model_load_lock:
            # Double-check after acquiring lock
            if cache_key not in _models:
                model = EncoderModel(model_id, device, task=task)
                await model.load()
                _models[cache_key] = model
    return _models.get(cache_key)
2. Device-Aware Tensor Operations
```python
class BaseModel(ABC):
    def get_dtype(self, force_float32: bool = False):
        if force_float32:
            return torch.float32
        if self.device in ("cuda", "mps"):
            return torch.float16
        return torch.float32

    def to_device(self, tensor: torch.Tensor, dtype=None):
        # Don't change dtype for integer tensors
        if tensor.dtype in (torch.int32, torch.int64, torch.long):
            return tensor.to(device=self.device)
        dtype = dtype or self.get_dtype()
        return tensor.to(device=self.device, dtype=dtype)
```
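The dtype policy can be exercised without torch. This hypothetical reduction mirrors `get_dtype()` with plain strings, which makes the case table easy to reason about:

```python
def pick_dtype(device: str, force_float32: bool = False) -> str:
    """Mirror of BaseModel.get_dtype() using dtype names instead of torch types."""
    if force_float32:
        return "float32"
    # Half precision on accelerators, full precision on CPU:
    # many float16 ops are slow or unsupported on CPU in PyTorch.
    if device in ("cuda", "mps"):
        return "float16"
    return "float32"

print(pick_dtype("cuda"))       # accelerator default
print(pick_dtype("cpu"))        # CPU stays full precision
print(pick_dtype("mps", True))  # force_float32 wins everywhere
```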
3. TTL-Based Model Caching
```python
_models: ModelCache[BaseModel] = ModelCache(ttl=300)  # 5 min TTL

async def _cleanup_idle_models():
    while True:
        await asyncio.sleep(CLEANUP_CHECK_INTERVAL)
        for cache_key, model in _models.pop_expired():
            await model.unload()
```
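A minimal sketch of what a cache with this interface could look like. The real `model_cache.py` may differ; the injectable `clock` parameter here is an assumption added for deterministic testing:

```python
import time

class TTLCache:
    """Evicts entries not touched within `ttl` seconds (illustrative sketch)."""
    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[object, float]] = {}

    def __contains__(self, key: str) -> bool:
        return key in self._entries

    def __setitem__(self, key: str, value: object) -> None:
        self._entries[key] = (value, self._clock())

    def __getitem__(self, key: str) -> object:
        value, _ = self._entries[key]
        self._entries[key] = (value, self._clock())  # refresh TTL on access
        return value

    def pop_expired(self) -> list[tuple[str, object]]:
        now = self._clock()
        expired = [(k, v) for k, (v, ts) in self._entries.items()
                   if now - ts > self.ttl]
        for k, _ in expired:
            del self._entries[k]
        return expired

# Usage with a fake clock so expiry is deterministic:
t = [0.0]
cache = TTLCache(ttl=300, clock=lambda: t[0])
cache["encoder:embedding:bert"] = "model-object"
t[0] = 301.0
expired = cache.pop_expired()
```

Refreshing the timestamp on access is what makes this an idle-timeout cache rather than a fixed-lifetime one: only models unused for the full TTL are handed to the cleanup task.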
4. Async Generation with Thread Pools
```python
# GGUF models use blocking llama-cpp, run in executor
self._executor = ThreadPoolExecutor(max_workers=1)

async def generate(self, messages, max_tokens=512, ...):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(self._executor, self._generate_sync)
```
Review Priority
When reviewing Universal Runtime code:

- Critical - Security
  - Path traversal prevention in file endpoints
  - Input sanitization for model IDs
- High - Memory & Device
  - Proper CUDA/MPS cache clearing on unload
  - torch.no_grad() for inference
  - Correct dtype for the device
- Medium - Performance
  - Model caching patterns
  - Batch processing where applicable
  - Streaming implementation
- Low - Code Style
  - Consistent with patterns.md
  - Proper type hints
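For the Critical items, one common way to guard file endpoints against path traversal is to resolve the candidate path and require it to stay under an allowed root. This is a sketch, not the runtime's actual check; `MODELS_ROOT` and `safe_model_path` are hypothetical names:

```python
from pathlib import Path

MODELS_ROOT = Path("/var/llamafarm/models").resolve()  # illustrative root

def safe_model_path(model_id: str) -> Path:
    """Reject model IDs that escape MODELS_ROOT via '..' or absolute paths."""
    candidate = (MODELS_ROOT / model_id).resolve()
    # resolve() normalizes '..' segments and follows symlinks;
    # is_relative_to (Python 3.9+) then enforces containment.
    if not candidate.is_relative_to(MODELS_ROOT):
        raise ValueError(f"Invalid model id: {model_id!r}")
    return candidate

ok = safe_model_path("qwen/qwen2.5-0.5b")
try:
    safe_model_path("../../etc/passwd")
    escaped = True
except ValueError:
    escaped = False
```

Note that joining an absolute path onto `MODELS_ROOT` with `/` discards the root entirely in pathlib, which is exactly why the containment check after `resolve()` is the part that matters.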