
llmfit Hardware Model Matcher

Skill by ara.so, part of the Daily 2026 Skills collection.

llmfit detects your system's RAM, CPU, and GPU, then scores hundreds of LLM models across quality, speed, fit, and context dimensions, telling you exactly which models will run well on your hardware. It ships with an interactive TUI and a CLI, and supports multi-GPU setups, MoE architectures, dynamic quantization, and local runtime providers (Ollama, llama.cpp, MLX, Docker Model Runner).

Installation

macOS / Linux (Homebrew)

```sh
brew install llmfit
```

Quick install script

```sh
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
```

Without sudo, installs to ~/.local/bin:

```sh
curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local
```

Windows (Scoop)

```sh
scoop install llmfit
```

Docker / Podman

```sh
docker run ghcr.io/alexsjones/llmfit
```

With jq for scripting:

```sh
podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'
```

From source (Rust)

```sh
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
```

The binary is at target/release/llmfit.

---

Core Concepts

  • Fit tiers: perfect (runs great), good (runs well), marginal (runs but tight), too_tight (won't run)
  • Scoring dimensions: quality, speed (tok/s estimate), fit (memory headroom), context capacity
  • Run modes: GPU, CPU+GPU offload, CPU-only, MoE
  • Quantization: automatically selects the best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware
  • Providers: Ollama, llama.cpp, MLX, Docker Model Runner
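The fit tiers above come down to memory headroom. A minimal sketch, assuming made-up headroom thresholds (llmfit's real scoring also weighs quality, speed, and context, so treat this as illustration only):

```python
# Hypothetical fit-tier classifier; thresholds are illustrative
# assumptions, not llmfit's actual scoring rules.
def fit_tier(model_mem_gb: float, available_gb: float) -> str:
    """Classify how well a model fits into available memory."""
    if model_mem_gb > available_gb:
        return "too_tight"      # won't run at all
    headroom = (available_gb - model_mem_gb) / available_gb
    if headroom >= 0.40:
        return "perfect"        # runs great, lots of headroom
    if headroom >= 0.20:
        return "good"           # runs well
    return "marginal"           # runs but tight

print(fit_tier(4.2, 16.0))     # small model on a 16GB machine
print(fit_tier(70.0, 24.0))    # 70B-class weights on 24GB VRAM
```

The same headroom idea is what the `fit` scoring dimension captures: how much memory is left over after weights and KV cache are accounted for.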

Key Commands

Launch Interactive TUI

```sh
llmfit
```

CLI Table Output

```sh
llmfit --cli
```

Show System Hardware Detection

```sh
llmfit system
llmfit --json system   # JSON output
```

List All Models

```sh
llmfit list
```

Search Models

```sh
llmfit search "llama 8b"
llmfit search "mistral"
llmfit search "qwen coding"
```

Fit Analysis

```sh
# All runnable models ranked by fit
llmfit fit

# Only perfect fits, top 5
llmfit fit --perfect -n 5

# JSON output
llmfit --json fit -n 10
```

Model Detail

```sh
llmfit info "Mistral-7B"
llmfit info "Llama-3.1-70B"
```

Recommendations

```sh
# Top 5 recommendations (JSON default)
llmfit recommend --json --limit 5

# Filter by use case: general, coding, reasoning, chat, multimodal, embedding
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 5
```

Hardware Planning (invert: what hardware do I need?)

```sh
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json
```

REST API Server (for cluster scheduling)

```sh
llmfit serve
llmfit serve --host 0.0.0.0 --port 8787
```

Hardware Overrides

When autodetection fails (VMs, broken nvidia-smi, passthrough setups):

```sh
# Override GPU VRAM
llmfit --memory=32G
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G recommend --json

# Megabytes
llmfit --memory=32000M

# Works with any subcommand
llmfit --memory=16G info "Llama-3.1-70B"
```

Accepted suffixes: `G`/`GB`/`GiB`, `M`/`MB`/`MiB`, `T`/`TB`/`TiB` (case-insensitive).
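Parsing those suffixes can be sketched as follows. This is an illustrative parser, not llmfit's implementation; in particular, treating `G`/`GB` as decimal and `GiB` as binary is an assumption:

```python
import re

# Illustrative size-string parser; a sketch, not llmfit's code.
# Assumption: bare G/M/T and GB/MB/TB are decimal, GiB/MiB/TiB binary.
_UNITS = {
    "M": 10**6,  "MB": 10**6,  "MIB": 2**20,
    "G": 10**9,  "GB": 10**9,  "GIB": 2**30,
    "T": 10**12, "TB": 10**12, "TIB": 2**40,
}

def parse_size(text: str) -> int:
    """Parse '24G', '32000M', '1GiB', ... into bytes (case-insensitive)."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", text.strip())
    if not m:
        raise ValueError(f"bad size: {text!r}")
    value, unit = float(m.group(1)), m.group(2).upper()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(value * _UNITS[unit])

print(parse_size("24G"))    # bytes for --memory=24G
print(parse_size("1GiB"))   # binary gigabyte
```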

Context Length Cap

```sh
# Estimate memory fit at 4K context
llmfit --max-context 4096 --cli

# With subcommands
llmfit --max-context 8192 fit --perfect -n 5
llmfit --max-context 16384 recommend --json --limit 5

# Environment variable alternative
export OLLAMA_CONTEXT_LENGTH=8192
llmfit recommend --json
```

---

REST API Reference

Start the server:

```sh
llmfit serve --host 0.0.0.0 --port 8787
```

Endpoints

  • Health check
  • Node hardware info
  • Full model list with filters
  • Top runnable models for this node (key scheduling endpoint)
  • Search by model name/provider

Query Parameters for /models and /models/top

| Param | Values | Description |
| --- | --- | --- |
| limit / n | integer | Max rows returned |
| min_fit | perfect\|good\|marginal\|too_tight | Minimum fit tier |
| perfect | true\|false | Force perfect-only |
| runtime | any\|mlx\|llamacpp | Filter by runtime |
| use_case | general\|coding\|reasoning\|chat\|multimodal\|embedding | Use case filter |
| provider | string | Substring match on provider |
| search | string | Free-text across name/provider/size/use-case |
| sort | score\|tps\|params\|mem\|ctx\|date\|use_case | Sort column |
| include_too_tight | true\|false | Include non-runnable models |
| max_context | integer | Per-request context cap |
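As a concrete example, a request URL for /models/top can be assembled from those parameters with Python's standard library (the host and /api/v1 prefix follow the Python REST client examples later in this document):

```python
from urllib.parse import urlencode

# Build a /models/top query from the parameters in the table above.
params = {
    "limit": 5,
    "min_fit": "good",
    "use_case": "coding",
    "sort": "score",
    "max_context": 8192,
}
url = "http://localhost:8787/api/v1/models/top?" + urlencode(params)
print(url)
```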

Scripting & Automation Examples

Bash: Get top coding models as JSON

```bash
#!/bin/bash
# Get top 3 coding models that fit perfectly
llmfit recommend --json --use-case coding --limit 3 |
  jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'
```

Bash: Check if a specific model fits

```bash
#!/bin/bash
MODEL="Mistral-7B"
RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
FIT=$(echo "$RESULT" | jq -r '.fit')
if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
  echo "$MODEL will run well (fit: $FIT)"
else
  echo "$MODEL may not run well (fit: $FIT)"
fi
```

Bash: Auto-pull top Ollama model

```bash
#!/bin/bash
# Get the top fitting model name and pull it with Ollama
TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
echo "Pulling: $TOP_MODEL"
ollama pull "$TOP_MODEL"
```

Python: Query the REST API

```python
import requests

BASE_URL = "http://localhost:8787"

def get_system_info():
    resp = requests.get(f"{BASE_URL}/api/v1/system")
    return resp.json()

def get_top_models(use_case="coding", limit=5, min_fit="good"):
    params = {
        "use_case": use_case,
        "limit": limit,
        "min_fit": min_fit,
        "sort": "score"
    }
    resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
    return resp.json()

def search_models(query, runtime="any"):
    resp = requests.get(
        f"{BASE_URL}/api/v1/models/{query}",
        params={"runtime": runtime}
    )
    return resp.json()

# Example usage
system = get_system_info()
print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')}GB")

models = get_top_models(use_case="reasoning", limit=3)
for m in models.get("models", []):
    print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")
```

Python: Hardware-aware model selector for agents

```python
import subprocess
import json

def get_best_model_for_task(use_case: str, min_fit: str = "good") -> dict | None:
    """Use llmfit to select the best model for a given task."""
    result = subprocess.run(
        ["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
        capture_output=True,
        text=True
    )
    data = json.loads(result.stdout)
    models = data.get("models", [])
    return models[0] if models else None

def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
    """Get hardware requirements for running a specific model."""
    result = subprocess.run(
        ["llmfit", "plan", model_name, "--context", str(context), "--json"],
        capture_output=True,
        text=True
    )
    return json.loads(result.stdout)

# Select best coding model
best = get_best_model_for_task("coding")
if best:
    print(f"Best coding model: {best['name']}")
    print(f"  Quantization: {best['quantization']}")
    print(f"  Estimated tok/s: {best['tps']}")
    print(f"  Memory usage: {best['mem_pct']}%")

# Plan hardware for a specific model
plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']}GB")
print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']}GB")
```

Docker Compose: Node scheduler pattern

```yaml
version: "3.8"
services:
  llmfit-api:
    image: ghcr.io/alexsjones/llmfit
    command: serve --host 0.0.0.0 --port 8787
    ports:
      - "8787:8787"
    environment:
      - OLLAMA_CONTEXT_LENGTH=8192
    devices:
      - /dev/nvidia0:/dev/nvidia0  # pass GPU through
```

TUI Key Reference

| Key | Action |
| --- | --- |
| ↑/↓ or j/k | Navigate models |
| / | Search (name, provider, params, use case) |
| Esc / Enter | Exit search |
| Ctrl-U | Clear search |
| f | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| a | Cycle availability: All → GGUF Avail → Installed |
| s | Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case |
| t | Cycle color theme (auto-saved) |
| v | Visual mode (multi-select for comparison) |
| V | Select mode (column-based filtering) |
| p | Plan mode (what hardware is needed for this model?) |
| P | Provider filter popup |
| U | Use-case filter popup |
| C | Capability filter popup |
| m | Mark model for comparison |
| c | Compare view (marked vs selected) |
| d | Download model (via detected runtime) |
| r | Refresh installed models from runtimes |
| Enter | Toggle detail view |
| g / G | Jump to top/bottom |
| q | Quit |

Themes

`t` cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox.
The theme is saved to `~/.config/llmfit/theme`.

GPU Detection Details

| GPU Vendor | Detection Method |
| --- | --- |
| NVIDIA | nvidia-smi (multi-GPU, aggregates VRAM) |
| AMD | rocm-smi |
| Intel Arc | sysfs (discrete) / lspci (integrated) |
| Apple Silicon | system_profiler (unified memory = VRAM) |
| Ascend | npu-smi |
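A rough version of that probing order can be scripted by checking which vendor tool is on PATH. This is a simplified sketch; llmfit's actual detection also inspects sysfs and lspci and parses each tool's output:

```python
import shutil

# Map each vendor's detection tool (from the table above) to a vendor.
# Simplified: presence on PATH stands in for full detection.
VENDOR_TOOLS = [
    ("nvidia-smi", "NVIDIA"),
    ("rocm-smi", "AMD"),
    ("system_profiler", "Apple Silicon"),
    ("npu-smi", "Ascend"),
]

def detect_gpu_vendor() -> str:
    """Return the first vendor whose tool is found, else 'unknown'."""
    for tool, vendor in VENDOR_TOOLS:
        if shutil.which(tool):
            return vendor
    return "unknown"

print(detect_gpu_vendor())
```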

Common Patterns

"What can I run on my 16GB M2 Mac?"

```sh
llmfit fit --perfect -n 10

# or interactively
llmfit   # press 'f' to filter to Perfect fit
```

"I have a 3090 (24GB VRAM), what coding models fit?"

```sh
llmfit recommend --json --use-case coding | jq '.models[]'

# or with manual override if detection fails
llmfit --memory=24G recommend --json --use-case coding
```

"Can Llama 70B run on my machine?"

```sh
llmfit info "Llama-3.1-70B"

# Plan what hardware you'd need
llmfit plan "Llama-3.1-70B" --context 4096 --json
```

"Show me only models already installed in Ollama"

```sh
llmfit   # press 'a' to cycle to Installed filter

# or
llmfit fit -n 20   # run, press 'i' in TUI for installed-first
```

"Script: find best model and start Ollama"

```bash
MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
ollama serve &
ollama run "$MODEL"
```

"API: poll node capabilities for cluster scheduler"

```bash
# Check node, get top 3 good+ models for reasoning
```

Troubleshooting

**GPU not detected / wrong VRAM reported**

```sh
# Verify detection
llmfit system

# Manual override
llmfit --memory=24G --cli
```

**`nvidia-smi` not found but you have an NVIDIA GPU**

```sh
# Install CUDA toolkit or nvidia-utils, then retry
# Or override manually:
llmfit --memory=8G fit --perfect
```

**Models show as too_tight but you have enough RAM**

```sh
# llmfit may be using context-inflated estimates; cap context
llmfit --max-context 2048 fit --perfect -n 10
```

**REST API: test endpoints**

```sh
# Spawn server and run validation suite
python3 scripts/test_api.py --spawn

# Test already-running server
python3 scripts/test_api.py --base-url http://127.0.0.1:8787
```

**Apple Silicon: VRAM shows as system RAM (expected)**

```sh
# This is correct: Apple Silicon uses unified memory,
# and llmfit accounts for this automatically
llmfit system   # should show backend: Metal
```

**Context length environment variable**

```sh
export OLLAMA_CONTEXT_LENGTH=4096
llmfit recommend --json  # uses 4096 as context cap
```