# llmfit Hardware Model Matcher

Skill by ara.so — Daily 2026 Skills collection.

llmfit detects your system's RAM, CPU, and GPU, then scores hundreds of LLM models across quality, speed, fit, and context dimensions — telling you exactly which models will run well on your hardware. It ships with an interactive TUI and a CLI, and supports multi-GPU setups, MoE architectures, dynamic quantization, and local runtime providers (Ollama, llama.cpp, MLX, Docker Model Runner).
## Installation

### macOS / Linux (Homebrew)

```sh
brew install llmfit
```

### Quick install script

```sh
curl -fsSL https://llmfit.axjns.dev/install.sh | sh

# Without sudo, installs to ~/.local/bin
curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local
```

### Windows (Scoop)

```sh
scoop install llmfit
```

### Docker / Podman

```sh
docker run ghcr.io/alexsjones/llmfit

# With jq for scripting
podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'
```

### From source (Rust)

```sh
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# Binary at target/release/llmfit
```
---

## Core Concepts

- Fit tiers: `perfect` (runs great), `good` (runs well), `marginal` (runs but tight), `too_tight` (won't run)
- Scoring dimensions: quality, speed (tok/s estimate), fit (memory headroom), context capacity
- Run modes: GPU, CPU+GPU offload, CPU-only, MoE
- Quantization: automatically selects the best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware
- Providers: Ollama, llama.cpp, MLX, Docker Model Runner
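The fit tiers are driven by memory headroom. As an illustrative sketch of the idea (the thresholds below are invented for this example, not llmfit's actual cutoffs, and real scoring also weighs quality, speed, and context):

```python
def fit_tier(required_gb: float, available_gb: float) -> str:
    """Classify a model's fit from memory headroom.

    Thresholds are hypothetical, for illustration only.
    """
    if required_gb > available_gb:
        return "too_tight"   # won't run
    headroom = (available_gb - required_gb) / available_gb
    if headroom >= 0.30:
        return "perfect"     # runs great
    if headroom >= 0.15:
        return "good"        # runs well
    return "marginal"        # runs but tight

print(fit_tier(8, 24))   # an 8 GB model on a 24 GB GPU: plenty of headroom
```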
## Key Commands

### Launch Interactive TUI

```sh
llmfit
```

### CLI Table Output

```sh
llmfit --cli
```

### Show System Hardware Detection

```sh
llmfit system
llmfit --json system   # JSON output
```

### List All Models

```sh
llmfit list
```

### Search Models

```sh
llmfit search "llama 8b"
llmfit search "mistral"
llmfit search "qwen coding"
```

### Fit Analysis
```sh
# All runnable models ranked by fit
llmfit fit

# Only perfect fits, top 5
llmfit fit --perfect -n 5

# JSON output
llmfit --json fit -n 10
```
### Model Detail

```sh
llmfit info "Mistral-7B"
llmfit info "Llama-3.1-70B"
```

### Recommendations
```sh
# Top 5 recommendations (JSON by default)
llmfit recommend --json --limit 5

# Filter by use case: general, coding, reasoning, chat, multimodal, embedding
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 5
```

### Hardware Planning (invert: what hardware do I need?)
```sh
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json
```

### REST API Server (for cluster scheduling)
```sh
llmfit serve
llmfit serve --host 0.0.0.0 --port 8787
```

### Hardware Overrides

When autodetection fails (VMs, broken nvidia-smi, passthrough setups):
undefinedOverride GPU VRAM
覆盖GPU显存配置
llmfit --memory=32G
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G recommend --json
llmfit --memory=32G
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G recommend --json
Megabytes
以兆字节为单位
llmfit --memory=32000M
llmfit --memory=32000M
Works with any subcommand
该参数可与任何子命令配合使用
llmfit --memory=16G info "Llama-3.1-70B"
Accepted suffixes: `G`/`GB`/`GiB`, `M`/`MB`/`MiB`, `T`/`TB`/`TiB` (case-insensitive).llmfit --memory=16G info "Llama-3.1-70B"
支持的单位后缀:`G`/`GB`/`GiB`、`M`/`MB`/`MiB`、`T`/`TB`/`TiB`(不区分大小写)。Context Length Cap
上下文长度限制
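Parsing these suffixes might look like the sketch below. This is a hypothetical re-implementation: llmfit's actual unit semantics (whether `GB` and `GiB` differ) are not documented here, so the sketch treats every suffix as binary (1G = 1024M).

```python
import re

# MiB multiplier per unit letter; assumes binary units throughout.
_SUFFIX_MIB = {"t": 1024 * 1024, "g": 1024, "m": 1}

def parse_memory(spec: str) -> int:
    """Parse '32G', '32000M', '1.5TiB', etc. into MiB."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([tgm])(?:i?b)?",
                     spec.strip(), re.IGNORECASE)
    if not m:
        raise ValueError(f"bad memory spec: {spec!r}")
    value, unit = float(m.group(1)), m.group(2).lower()
    return int(value * _SUFFIX_MIB[unit])

print(parse_memory("32G"))      # 32768
print(parse_memory("32000M"))   # 32000
```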
```sh
# Estimate memory fit at 4K context
llmfit --max-context 4096 --cli

# With subcommands
llmfit --max-context 8192 fit --perfect -n 5
llmfit --max-context 16384 recommend --json --limit 5

# Environment variable alternative
export OLLAMA_CONTEXT_LENGTH=8192
llmfit recommend --json
```
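Capping context helps because KV-cache memory grows linearly with context length. A back-of-the-envelope estimate, assuming a hypothetical 32-layer model with 8 KV heads of dimension 128 at fp16 (none of these numbers come from llmfit):

```python
def kv_cache_gib(context: int, layers: int = 32, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim
    x context tokens x bytes per element."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.2f} GiB")
```

For this hypothetical model, going from 8K to 32K context quadruples the KV cache, which is why a model can fit at 4K context and show as `too_tight` at 32K.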
---

## REST API Reference

Start the server:

```sh
llmfit serve --host 0.0.0.0 --port 8787
```

### Endpoints

- Health check
- `/api/v1/system`: node hardware info
- `/api/v1/models`: full model list with filters
- `/api/v1/models/top`: top runnable models for this node (the key scheduling endpoint)
- `/api/v1/models/{query}`: search by model name/provider

### Query Parameters for `/models` and `/models/top`
/models/models/top| Param | Values | Description |
|---|---|---|
| integer | Max rows returned |
| | Minimum fit tier |
| | Force perfect-only |
| | Filter by runtime |
| | Use case filter |
| string | Substring match on provider |
| string | Free-text across name/provider/size/use-case |
| | Sort column |
| | Include non-runnable models |
| integer | Per-request context cap |
| 参数 | 可选值 | 描述 |
|---|---|---|
| 整数 | 返回结果的最大行数 |
| | 最小适配等级 |
| | 仅显示完美适配的模型 |
| | 按运行时筛选 |
| | 按使用场景筛选 |
| 字符串 | 按提供商名称子串匹配 |
| 字符串 | 按名称/提供商/规模/使用场景进行自由文本搜索 |
| | 排序字段 |
| | 是否包含无法运行的模型 |
| 整数 | 单次请求的上下文长度限制 |
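Combining these parameters into a request URL might look like the following sketch (endpoint path and parameter names as used elsewhere on this page; defaults are illustrative):

```python
from urllib.parse import urlencode

def top_models_url(base="http://localhost:8787", limit=5, min_fit="good",
                   use_case="coding", sort="score"):
    """Build a /api/v1/models/top query from the documented parameters."""
    qs = urlencode({"limit": limit, "min_fit": min_fit,
                    "use_case": use_case, "sort": sort})
    return f"{base}/api/v1/models/top?{qs}"

print(top_models_url(use_case="reasoning", limit=3))
```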
## Scripting & Automation Examples

### Bash: Get top coding models as JSON

```bash
#!/bin/bash
# Get top 3 coding models that fit perfectly
llmfit recommend --json --use-case coding --limit 3 |
  jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'
```
### Bash: Check if a specific model fits

```bash
#!/bin/bash
MODEL="Mistral-7B"
RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
FIT=$(echo "$RESULT" | jq -r '.fit')
if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
  echo "$MODEL will run well (fit: $FIT)"
else
  echo "$MODEL may not run well (fit: $FIT)"
fi
```

### Bash: Auto-pull top Ollama model
```bash
#!/bin/bash
# Get the top fitting model name and pull it with Ollama
TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
echo "Pulling: $TOP_MODEL"
ollama pull "$TOP_MODEL"
```

### Python: Query the REST API
```python
import requests

BASE_URL = "http://localhost:8787"

def get_system_info():
    resp = requests.get(f"{BASE_URL}/api/v1/system")
    return resp.json()

def get_top_models(use_case="coding", limit=5, min_fit="good"):
    params = {
        "use_case": use_case,
        "limit": limit,
        "min_fit": min_fit,
        "sort": "score",
    }
    resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
    return resp.json()

def search_models(query, runtime="any"):
    resp = requests.get(
        f"{BASE_URL}/api/v1/models/{query}",
        params={"runtime": runtime},
    )
    return resp.json()

# Example usage
system = get_system_info()
print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')}GB")

models = get_top_models(use_case="reasoning", limit=3)
for m in models.get("models", []):
    print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")
```

### Python: Hardware-aware model selector for agents
```python
import subprocess
import json

def get_best_model_for_task(use_case: str, min_fit: str = "good") -> dict:
    """Use llmfit to select the best model for a given task."""
    result = subprocess.run(
        ["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
        capture_output=True,
        text=True,
    )
    data = json.loads(result.stdout)
    models = data.get("models", [])
    return models[0] if models else None

def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
    """Get hardware requirements for running a specific model."""
    result = subprocess.run(
        ["llmfit", "plan", model_name, "--context", str(context), "--json"],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)

# Select the best coding model
best = get_best_model_for_task("coding")
if best:
    print(f"Best coding model: {best['name']}")
    print(f"  Quantization: {best['quantization']}")
    print(f"  Estimated tok/s: {best['tps']}")
    print(f"  Memory usage: {best['mem_pct']}%")

# Plan hardware for a specific model
plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']}GB")
print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']}GB")
```

### Docker Compose: Node scheduler pattern
```yaml
version: "3.8"
services:
  llmfit-api:
    image: ghcr.io/alexsjones/llmfit
    command: serve --host 0.0.0.0 --port 8787
    ports:
      - "8787:8787"
    environment:
      - OLLAMA_CONTEXT_LENGTH=8192
    devices:
      - /dev/nvidia0:/dev/nvidia0  # pass GPU through
```

## TUI Key Reference
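On top of a fleet of such services, a scheduler can poll each node's `/api/v1/models/top` and route work to the node reporting the highest-scoring runnable model. A minimal sketch of the selection step, assuming the JSON shape used elsewhere on this page (`models[].score`):

```python
def pick_node(node_reports):
    """Given {node_url: parsed /models/top JSON}, return the node whose
    best runnable model has the highest score, or None if no node fits."""
    best_node, best_score = None, float("-inf")
    for node, report in node_reports.items():
        models = report.get("models", [])
        if not models:
            continue  # this node can't run anything suitable
        score = max(m["score"] for m in models)
        if score > best_score:
            best_node, best_score = node, score
    return best_node

reports = {
    "http://node1:8787": {"models": [{"name": "a", "score": 71}]},
    "http://node2:8787": {"models": [{"name": "b", "score": 88}]},
}
print(pick_node(reports))   # http://node2:8787
```

Fetching each report is just a `GET` per node (e.g. with `requests`), so the scheduler stays stateless: every poll re-reads the nodes' current hardware fit.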
| Key | Action |
|---|---|
| | Navigate models |
| | Search (name, provider, params, use case) |
| | Exit search |
| | Clear search |
| `f` | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| `a` | Cycle availability: All → GGUF Avail → Installed |
| | Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case |
| `t` | Cycle color theme (auto-saved) |
| | Visual mode (multi-select for comparison) |
| | Select mode (column-based filtering) |
| | Plan mode (what hardware is needed for this model?) |
| | Provider filter popup |
| | Use-case filter popup |
| | Capability filter popup |
| | Mark model for comparison |
| | Compare view (marked vs selected) |
| | Download model (via detected runtime) |
| | Refresh installed models from runtimes |
| | Toggle detail view |
| | Jump to top/bottom |
| | Quit |
## Themes

Press `t` to cycle the color theme. The theme is saved to `~/.config/llmfit/theme`.

## GPU Detection Details
| GPU Vendor | Detection Method |
|---|---|
| NVIDIA | `nvidia-smi` |
| AMD | |
| Intel Arc | sysfs (discrete) / |
| Apple Silicon | |
| Ascend | |
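For NVIDIA, detection goes through `nvidia-smi` (see also Troubleshooting below). A sketch of reading total VRAM that way, using standard `nvidia-smi` query flags (the helper names here are illustrative, not llmfit internals):

```python
import subprocess

def parse_vram(csv_out: str):
    """Parse nvidia-smi 'memory.total' CSV output: one MiB value per GPU line."""
    return [int(line.strip()) for line in csv_out.splitlines() if line.strip()]

def nvidia_vram_mib():
    """Total VRAM per detected NVIDIA GPU in MiB, or [] if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []  # no driver or broken nvidia-smi: fall back to --memory=
    return parse_vram(out)
```

The empty-list fallback mirrors the manual-override workflow: when detection fails, pass `--memory=` instead.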
## Common Patterns

### "What can I run on my 16GB M2 Mac?"

```sh
llmfit fit --perfect -n 10

# Or interactively: launch the TUI and press 'f' to filter to Perfect fit
llmfit
```

### "I have a 3090 (24GB VRAM), what coding models fit?"
```sh
llmfit recommend --json --use-case coding | jq '.models[]'

# Or with a manual override if detection fails
llmfit --memory=24G recommend --json --use-case coding
```

### "Can Llama 70B run on my machine?"
```sh
llmfit info "Llama-3.1-70B"

# Plan what hardware you'd need
llmfit plan "Llama-3.1-70B" --context 4096 --json
```

### "Show me only models already installed in Ollama"
```sh
# Launch the TUI, then press 'a' to cycle to the Installed filter
llmfit

# Or
llmfit fit -n 20   # run, press 'i' in TUI for installed-first
```

### "Script: find best model and start Ollama"
```bash
MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
ollama serve &
ollama run "$MODEL"
```

### "API: poll node capabilities for cluster scheduler"
```bash
# Check a node: get the top 3 good-or-better models for reasoning
curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" |
  jq '.models[].name'
```

---

## Troubleshooting
**GPU not detected / wrong VRAM reported**

```sh
# Verify detection
llmfit system

# Manual override
llmfit --memory=24G --cli
```

**`nvidia-smi` not found but you have an NVIDIA GPU**

```sh
# Install the CUDA toolkit or nvidia-utils, then retry.
# Or override manually:
llmfit --memory=8G fit --perfect
```

**Models show as too_tight but you have enough RAM**

```sh
# llmfit may be using context-inflated estimates; cap the context
llmfit --max-context 2048 fit --perfect -n 10
```

**REST API: test endpoints**

```sh
# Spawn server and run validation suite
python3 scripts/test_api.py --spawn

# Test an already-running server
python3 scripts/test_api.py --base-url http://127.0.0.1:8787
```

**Apple Silicon: VRAM shows as system RAM (expected)**

```sh
# This is correct — Apple Silicon uses unified memory;
# llmfit accounts for this automatically.
llmfit system   # should show backend: Metal
```

**Context length environment variable**

```sh
export OLLAMA_CONTEXT_LENGTH=4096
llmfit recommend --json   # uses 4096 as the context cap
```