runtime-skills


Universal Runtime Skills


Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server.

Overview


The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models:
  • Text generation (Causal LMs: GPT, Llama, Mistral, Qwen)
  • Text embeddings (BERT, sentence-transformers, ModernBERT)
  • Classification, NER, and reranking
  • OCR and document understanding
  • Anomaly detection
Directory: runtimes/universal/
Python: 3.11+
Key Dependencies: PyTorch, Transformers, FastAPI, llama-cpp-python
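For illustration, a chat-completions request against these OpenAI-compatible endpoints could be built as below; the base URL, port, and model ID are assumptions for the sketch, not values taken from this document:

```python
import json

# Assumed local server address; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"

# OpenAI-style chat completions payload (model ID is an example).
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
    "stream": False,
}

body = json.dumps(payload)
# Send with e.g.:
#   requests.post(f"{BASE_URL}/chat/completions", data=body,
#                 headers={"Content-Type": "application/json"})
print(body)
```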

Links to Shared Skills


This skill extends the shared Python practices. Always apply these first:
| Topic | File | Priority |
| --- | --- | --- |
| Patterns | python-skills/patterns.md | Medium |
| Async | python-skills/async.md | High |
| Typing | python-skills/typing.md | Medium |
| Testing | python-skills/testing.md | Medium |
| Errors | python-skills/error-handling.md | High |
| Security | python-skills/security.md | Critical |

Runtime-Specific Checklists


| Topic | File | Key Points |
| --- | --- | --- |
| PyTorch | pytorch.md | Device management, dtype, memory cleanup |
| Transformers | transformers.md | Model loading, tokenization, inference |
| FastAPI | fastapi.md | API design, streaming, lifespan |
| Performance | performance.md | Batching, caching, optimizations |

Architecture


runtimes/universal/
├── server.py              # FastAPI app, model caching, endpoints
├── core/
│   └── logging.py         # UniversalRuntimeLogger (structlog)
├── models/
│   ├── base.py            # BaseModel ABC with device management
│   ├── language_model.py  # Transformers text generation
│   ├── gguf_language_model.py  # llama-cpp-python for GGUF
│   ├── encoder_model.py   # Embeddings, classification, NER, reranking
│   └── ...                # OCR, anomaly, document models
├── routers/
│   └── chat_completions/  # Chat completions with streaming
├── utils/
│   ├── device.py          # Device detection (CUDA/MPS/CPU)
│   ├── model_cache.py     # TTL-based model caching
│   ├── model_format.py    # GGUF vs transformers detection
│   └── context_calculator.py  # GGUF context size computation
└── tests/
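The GGUF-vs-transformers split handled by model_format.py can be sketched roughly as follows; the function name and heuristics here are illustrative assumptions, not the runtime's actual implementation:

```python
from pathlib import Path

def detect_model_format(model_path: str) -> str:
    """Guess whether a local model path is GGUF or a transformers
    checkpoint. Illustrative heuristic only."""
    path = Path(model_path)
    # A single .gguf file is GGUF.
    if path.suffix == ".gguf":
        return "gguf"
    if path.is_dir():
        # A directory containing a .gguf file is GGUF.
        if any(path.glob("*.gguf")):
            return "gguf"
        # transformers checkpoints ship a config.json.
        if (path / "config.json").exists():
            return "transformers"
    return "unknown"
```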

Key Patterns


1. Model Loading with Double-Checked Locking


```python
_model_load_lock = asyncio.Lock()

async def load_encoder(model_id: str, task: str = "embedding"):
    cache_key = f"encoder:{task}:{model_id}"
    if cache_key not in _models:
        async with _model_load_lock:
            # Double-check after acquiring lock
            if cache_key not in _models:
                model = EncoderModel(model_id, device, task=task)
                await model.load()
                _models[cache_key] = model
    return _models.get(cache_key)
```

2. Device-Aware Tensor Operations


```python
class BaseModel(ABC):
    def get_dtype(self, force_float32: bool = False):
        if force_float32:
            return torch.float32
        if self.device in ("cuda", "mps"):
            return torch.float16
        return torch.float32

    def to_device(self, tensor: torch.Tensor, dtype=None):
        # Don't change dtype for integer tensors
        if tensor.dtype in (torch.int32, torch.int64, torch.long):
            return tensor.to(device=self.device)
        dtype = dtype or self.get_dtype()
        return tensor.to(device=self.device, dtype=dtype)
```

3. TTL-Based Model Caching


```python
_models: ModelCache[BaseModel] = ModelCache(ttl=300)  # 5 min TTL

async def _cleanup_idle_models():
    while True:
        await asyncio.sleep(CLEANUP_CHECK_INTERVAL)
        for cache_key, model in _models.pop_expired():
            await model.unload()
```
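ModelCache itself is not shown in this document; a minimal sketch of the idea (not the actual implementation) — entries expire a fixed interval after their last access, and `pop_expired()` hands back what should be unloaded:

```python
import time

class TTLCacheSketch:
    """Minimal TTL cache: entries expire `ttl` seconds after their
    last access. Illustrative only, not the real ModelCache."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._entries: dict = {}  # key -> (value, last_access)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, _ = entry
        self._entries[key] = (value, time.monotonic())  # refresh on access
        return value

    def put(self, key, value):
        self._entries[key] = (value, time.monotonic())

    def pop_expired(self):
        now = time.monotonic()
        expired = [(k, v) for k, (v, t) in self._entries.items()
                   if now - t > self.ttl]
        for k, _ in expired:
            del self._entries[k]
        return expired
```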

4. Async Generation with Thread Pools


GGUF models use the blocking llama-cpp API, so generation is dispatched to a single-worker executor:

```python
self._executor = ThreadPoolExecutor(max_workers=1)

async def generate(self, messages, max_tokens=512, ...):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(self._executor, self._generate_sync)
```
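A runnable sketch of this executor pattern, with a stand-in blocking function in place of llama-cpp's generation call:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_generate(prompt: str) -> str:
    """Stand-in for llama-cpp's blocking generation call."""
    time.sleep(0.05)  # simulate CPU-bound token generation
    return f"echo: {prompt}"

class GGUFModelSketch:
    def __init__(self):
        # One worker: a llama-cpp context is not thread-safe, so all
        # generation for a model is serialized on a single thread.
        self._executor = ThreadPoolExecutor(max_workers=1)

    async def generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        # Off-load the blocking call so the event loop stays responsive.
        return await loop.run_in_executor(
            self._executor, blocking_generate, prompt
        )

result = asyncio.run(GGUFModelSketch().generate("hi"))
print(result)  # echo: hi
```

The single-worker pool also doubles as a per-model queue: concurrent requests line up behind one another instead of racing on the same context.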

Review Priority


When reviewing Universal Runtime code:
  1. Critical - Security
    • Path traversal prevention in file endpoints
    • Input sanitization for model IDs
  2. High - Memory & Device
    • Proper CUDA/MPS cache clearing on unload
    • torch.no_grad() for inference
    • Correct dtype for device
  3. Medium - Performance
    • Model caching patterns
    • Batch processing where applicable
    • Streaming implementation
  4. Low - Code Style
    • Consistent with patterns.md
    • Proper type hints
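The path-traversal check called out under the Critical items can be sketched as follows; `resolve_model_path` and `MODELS_ROOT` are hypothetical names for this sketch, not the runtime's actual API:

```python
from pathlib import Path

MODELS_ROOT = Path("/var/lib/models").resolve()  # hypothetical root

def resolve_model_path(model_id: str) -> Path:
    """Resolve a user-supplied model ID under MODELS_ROOT, rejecting
    IDs that escape the root via '..' or absolute paths."""
    candidate = (MODELS_ROOT / model_id).resolve()
    # Path.is_relative_to (Python 3.9+) catches '../../etc/passwd'
    # and absolute model_ids, which pathlib joins as replacements.
    if not candidate.is_relative_to(MODELS_ROOT):
        raise ValueError(f"model_id escapes models root: {model_id!r}")
    return candidate
```

Resolving before the containment check matters: a raw string-prefix test on the unresolved path would accept `..` segments that later escape the root.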