runtime-skills
Universal Runtime Skills
Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server.
Overview
The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models:
- Text generation (Causal LMs: GPT, Llama, Mistral, Qwen)
- Text embeddings (BERT, sentence-transformers, ModernBERT)
- Classification, NER, and reranking
- OCR and document understanding
- Anomaly detection
Directory: runtimes/universal/
Python: 3.11+
Key Dependencies: PyTorch, Transformers, FastAPI, llama-cpp-python
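Because the endpoints are OpenAI-compatible, request bodies follow the standard OpenAI wire format. A minimal sketch of two request payloads (the model IDs and base URL below are illustrative assumptions, not values from this document):

```python
import json

# Assumed local server address; actual host/port depend on deployment.
BASE_URL = "http://localhost:8000/v1"

# Chat completion request in the OpenAI-compatible schema.
chat_request = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",  # any HuggingFace causal LM id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what a reranker does."},
    ],
    "max_tokens": 128,
    "stream": False,
}

# Embedding request in the OpenAI-compatible schema.
embedding_request = {
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": ["first text", "second text"],
}

print(json.dumps(chat_request, indent=2))
```

Any OpenAI client library can be pointed at the runtime by overriding its base URL.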
Links to Shared Skills
This skill extends the shared Python practices. Always apply these first:
| Topic | File | Priority |
|---|---|---|
| Patterns | python-skills/patterns.md | Medium |
| Async | python-skills/async.md | High |
| Typing | python-skills/typing.md | Medium |
| Testing | python-skills/testing.md | Medium |
| Errors | python-skills/error-handling.md | High |
| Security | python-skills/security.md | Critical |
Runtime-Specific Checklists
| Topic | File | Key Points |
|---|---|---|
| PyTorch | pytorch.md | Device management, dtype, memory cleanup |
| Transformers | transformers.md | Model loading, tokenization, inference |
| FastAPI | fastapi.md | API design, streaming, lifespan |
| Performance | performance.md | Batching, caching, optimizations |
Architecture
```
runtimes/universal/
├── server.py                  # FastAPI app, model caching, endpoints
├── core/
│   └── logging.py             # UniversalRuntimeLogger (structlog)
├── models/
│   ├── base.py                # BaseModel ABC with device management
│   ├── language_model.py      # Transformers text generation
│   ├── gguf_language_model.py # llama-cpp-python for GGUF
│   ├── encoder_model.py       # Embeddings, classification, NER, reranking
│   └── ...                    # OCR, anomaly, document models
├── routers/
│   └── chat_completions/      # Chat completions with streaming
├── utils/
│   ├── device.py              # Device detection (CUDA/MPS/CPU)
│   ├── model_cache.py         # TTL-based model caching
│   ├── model_format.py        # GGUF vs transformers detection
│   └── context_calculator.py  # GGUF context size computation
└── tests/
```

Key Patterns
1. Model Loading with Double-Checked Locking
```python
_model_load_lock = asyncio.Lock()

async def load_encoder(model_id: str, task: str = "embedding"):
    cache_key = f"encoder:{task}:{model_id}"
    if cache_key not in _models:
        async with _model_load_lock:
            # Double-check after acquiring lock
            if cache_key not in _models:
                model = EncoderModel(model_id, device, task=task)
                await model.load()
                _models[cache_key] = model
    return _models.get(cache_key)
2. Device-Aware Tensor Operations
```python
class BaseModel(ABC):
    def get_dtype(self, force_float32: bool = False):
        if force_float32:
            return torch.float32
        if self.device in ("cuda", "mps"):
            return torch.float16
        return torch.float32

    def to_device(self, tensor: torch.Tensor, dtype=None):
        # Don't change dtype for integer tensors
        if tensor.dtype in (torch.int32, torch.int64, torch.long):
            return tensor.to(device=self.device)
        dtype = dtype or self.get_dtype()
        return tensor.to(device=self.device, dtype=dtype)
```
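The dtype policy can be exercised without torch. This hypothetical reduction mirrors `get_dtype()` with plain strings, which makes the case table easy to reason about:

```python
def pick_dtype(device: str, force_float32: bool = False) -> str:
    """Mirror of BaseModel.get_dtype() using dtype names instead of torch types."""
    if force_float32:
        return "float32"
    # Half precision on accelerators, full precision on CPU:
    # many float16 ops are slow or unsupported on CPU in PyTorch.
    if device in ("cuda", "mps"):
        return "float16"
    return "float32"

print(pick_dtype("cuda"))       # accelerator default
print(pick_dtype("cpu"))        # CPU stays full precision
print(pick_dtype("mps", True))  # force_float32 wins everywhere
```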
3. TTL-Based Model Caching
```python
_models: ModelCache[BaseModel] = ModelCache(ttl=300)  # 5 min TTL

async def _cleanup_idle_models():
    while True:
        await asyncio.sleep(CLEANUP_CHECK_INTERVAL)
        for cache_key, model in _models.pop_expired():
            await model.unload()
```
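A minimal sketch of what a cache with this interface could look like. The real `model_cache.py` may differ; the injectable `clock` parameter here is an assumption added for deterministic testing:

```python
import time

class TTLCache:
    """Evicts entries not touched within `ttl` seconds (illustrative sketch)."""
    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[object, float]] = {}

    def __contains__(self, key: str) -> bool:
        return key in self._entries

    def __setitem__(self, key: str, value: object) -> None:
        self._entries[key] = (value, self._clock())

    def __getitem__(self, key: str) -> object:
        value, _ = self._entries[key]
        self._entries[key] = (value, self._clock())  # refresh TTL on access
        return value

    def pop_expired(self) -> list[tuple[str, object]]:
        now = self._clock()
        expired = [(k, v) for k, (v, ts) in self._entries.items()
                   if now - ts > self.ttl]
        for k, _ in expired:
            del self._entries[k]
        return expired

# Usage with a fake clock so expiry is deterministic:
t = [0.0]
cache = TTLCache(ttl=300, clock=lambda: t[0])
cache["encoder:embedding:bert"] = "model-object"
t[0] = 301.0
expired = cache.pop_expired()
```

Refreshing the timestamp on access is what makes this an idle-timeout cache rather than a fixed-lifetime one: only models unused for the full TTL are handed to the cleanup task.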
4. Async Generation with Thread Pools
```python
# GGUF models use blocking llama-cpp, run in executor
self._executor = ThreadPoolExecutor(max_workers=1)

async def generate(self, messages, max_tokens=512, ...):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(self._executor, self._generate_sync)
```
Review Priority
When reviewing Universal Runtime code:

- Critical - Security
  - Path traversal prevention in file endpoints
  - Input sanitization for model IDs
- High - Memory & Device
  - Proper CUDA/MPS cache clearing on unload
  - torch.no_grad() for inference
  - Correct dtype for the device
- Medium - Performance
  - Model caching patterns
  - Batch processing where applicable
  - Streaming implementation
- Low - Code Style
  - Consistent with patterns.md
  - Proper type hints
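For the Critical items, one common way to guard file endpoints against path traversal is to resolve the candidate path and require it to stay under an allowed root. This is a sketch, not the runtime's actual check; `MODELS_ROOT` and `safe_model_path` are hypothetical names:

```python
from pathlib import Path

MODELS_ROOT = Path("/var/llamafarm/models").resolve()  # illustrative root

def safe_model_path(model_id: str) -> Path:
    """Reject model IDs that escape MODELS_ROOT via '..' or absolute paths."""
    candidate = (MODELS_ROOT / model_id).resolve()
    # resolve() normalizes '..' segments and follows symlinks;
    # is_relative_to (Python 3.9+) then enforces containment.
    if not candidate.is_relative_to(MODELS_ROOT):
        raise ValueError(f"Invalid model id: {model_id!r}")
    return candidate

ok = safe_model_path("qwen/qwen2.5-0.5b")
try:
    safe_model_path("../../etc/passwd")
    escaped = True
except ValueError:
    escaped = False
```

Note that joining an absolute path onto `MODELS_ROOT` with `/` discards the root entirely in pathlib, which is exactly why the containment check after `resolve()` is the part that matters.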