# Hugging Face Local Models
Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the right quant, and launch the model with `llama-cli` or `llama-server`.
## Default Workflow
- Search the Hub with `apps=llama.cpp`.
- Open `https://huggingface.co/<repo>?local-app=llama.cpp`.
- Prefer the exact HF local-app snippet and quant recommendation when it is visible.
- Confirm exact `.gguf` filenames with `https://huggingface.co/api/models/<repo>/tree/main?recursive=true` (see the sketch after this list).
- Launch with `llama-cli -hf <repo>:<QUANT>` or `llama-server -hf <repo>:<QUANT>`.
- Fall back to `--hf-repo` plus `--hf-file` when the repo uses custom file naming.
- Convert from Transformers weights only if the repo does not already expose GGUF files.
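The filename check above can be scripted. A minimal sketch with `curl` and `jq`, assuming the tree API returns entries with `path` and `size` fields (verify against the raw JSON if the schema differs):

```bash
# List every .gguf file in a repo, smallest first, before picking a quant.
REPO=unsloth/Qwen3.6-35B-A3B-GGUF   # example repo reused from this guide
curl -s "https://huggingface.co/api/models/$REPO/tree/main?recursive=true" \
  | jq -r '.[] | select(.path | endswith(".gguf")) | "\(.size)\t\(.path)"' \
  | sort -n
```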
## Quick Start
### Install llama.cpp
```bash
brew install llama.cpp
winget install llama.cpp
```

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make
```
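Newer llama.cpp checkouts have replaced the Makefile with CMake; if `make` fails on your tree, this sequence from the upstream build docs is the assumed equivalent:

```bash
# CMake build for checkouts where the legacy Makefile is gone.
cmake -B build
cmake --build build --config Release
# llama-cli and llama-server end up under build/bin/.
```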
### Authenticate for gated repos
```bash
hf auth login
```
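`hf auth login` covers the `hf` CLI; llama.cpp's own downloader authenticates separately. A sketch, assuming your build reads the `HF_TOKEN` environment variable and accepts `--hf-token` (both worth verifying with `llama-server --help`):

```bash
# Placeholder token; llama.cpp's -hf downloads need it for gated repos.
export HF_TOKEN=hf_xxx
llama-server -hf <gated-repo>:<QUANT> --hf-token "$HF_TOKEN"
```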
### Search the Hub
```text
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=Qwen3.6&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
```
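Those URLs are for the browser. For scripting, the Hub's models API exposes a similar search; this sketch assumes `filter=gguf`, `sort=trendingScore`, and `limit` behave as described in the public API docs:

```bash
# Programmatic near-equivalent of the browser search above.
curl -s "https://huggingface.co/api/models?search=Qwen3.6&filter=gguf&sort=trendingScore&limit=10" \
  | jq -r '.[].id'
```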
### Run directly from the Hub
```bash
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```
### Run an exact GGUF file
```bash
llama-server \
  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 4096
```
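If your build has GPU acceleration, the same launch can offload layers with `-ngl` (`--n-gpu-layers`); a large value offloads the whole model when it fits:

```bash
# Same launch with GPU offload (assumes a CUDA, Metal, or ROCm build).
llama-server \
  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 4096 -ngl 99
```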
### Convert only when no GGUF is available
```bash
hf download <repo-without-gguf> --local-dir ./model-src
python convert_hf_to_gguf.py ./model-src \
  --outfile model-f16.gguf \
  --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
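Before serving the converted file, a one-line generation confirms it loads; `-m`, `-p`, and `-n` are standard `llama-cli` flags:

```bash
# Load the quantized file directly and generate a few tokens as a sanity check.
llama-cli -m model-q4_k_m.gguf -p "Hello" -n 32
```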
### Smoke test a local server
```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about exception handling"}
    ]
  }'
```
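On first run the server may still be downloading or loading the model when the request arrives. A small wait loop, assuming the default port and llama-server's `/health` endpoint:

```bash
# Block until the server reports ready, then list the loaded model.
until curl -sf http://localhost:8080/health > /dev/null; do
  sleep 2
done
curl -s http://localhost:8080/v1/models
```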
## Quant Choice
- Prefer the exact quant that HF marks as compatible on the `?local-app=llama.cpp` page.
- Keep repo-native labels such as `UD-Q4_K_M` instead of normalizing them.
- Default to `Q4_K_M` unless the repo page or hardware profile suggests otherwise.
- Prefer `Q5_K_M` or `Q6_K` for code or technical workloads when memory allows.
- Consider `Q3_K_M`, `Q4_K_S`, or repo-specific `IQ`/`UD-*` variants for tighter RAM or VRAM budgets (a rough size estimate follows this list).
- Treat `mmproj-*.gguf` files as projector weights, not the main checkpoint.
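To judge what fits, a back-of-the-envelope file size is parameters × bits per weight ÷ 8. The bits-per-weight figures below are rough averages (assumptions; the repo's actual file sizes from the tree API are authoritative), and leave headroom beyond the file for the KV cache, which grows with `-c`:

```bash
# Rough GGUF size: params * bits-per-weight / 8.
PARAMS=35e9   # hypothetical 35B-parameter model
for q in "Q3_K_M 3.9" "Q4_K_M 4.8" "Q5_K_M 5.5" "Q6_K 6.6"; do
  set -- $q   # $1 = quant name, $2 = approximate bits per weight
  awk -v p="$PARAMS" -v b="$2" -v n="$1" \
    'BEGIN { printf "%-8s ~%.1f GB\n", n, p * b / 8 / 1e9 }'
done
```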
## Load References
- Read hub-discovery.md for URL-first workflows, model search, tree API extraction, and command reconstruction.
- Read quantization.md for format tables, model scaling, quality tradeoffs, and `imatrix`.
- Read hardware.md for Metal, CUDA, ROCm, or CPU build and acceleration details.
## Resources
- llama.cpp: https://github.com/ggml-org/llama.cpp
- Hugging Face GGUF + llama.cpp docs: https://huggingface.co/docs/hub/gguf-llamacpp
- Hugging Face Local Apps docs: https://huggingface.co/docs/hub/main/local-apps
- Hugging Face Local Agents docs: https://huggingface.co/docs/hub/agents-local
- GGUF converter Space: https://huggingface.co/spaces/ggml-org/gguf-my-repo