# Hugging Face Local Models
Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the right quant, and launch the model with `llama-cli` or `llama-server`.
## Default Workflow
- Search the Hub with `apps=llama.cpp`.
- Open `https://huggingface.co/<repo>?local-app=llama.cpp`.
- Prefer the exact HF local-app snippet and quant recommendation when it is visible.
- Confirm exact `.gguf` filenames with `https://huggingface.co/api/models/<repo>/tree/main?recursive=true` (see the sketch after this list).
- Launch with `llama-cli -hf <repo>:<QUANT>` or `llama-server -hf <repo>:<QUANT>`.
- Fall back to `--hf-repo` plus `--hf-file` when the repo uses custom file naming.
- Convert from Transformers weights only if the repo does not already expose GGUF files.
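The filename check above can be scripted. A minimal sketch with `curl` and `jq`, assuming the tree API returns entries with `path` and `size` fields (verify against the raw JSON if the schema differs):

```bash
# List every .gguf file in a repo, smallest first, before picking a quant.
REPO=unsloth/Qwen3.6-35B-A3B-GGUF   # example repo reused from this guide
curl -s "https://huggingface.co/api/models/$REPO/tree/main?recursive=true" \
  | jq -r '.[] | select(.path | endswith(".gguf")) | "\(.size)\t\(.path)"' \
  | sort -n
```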
## Quick Start
### Install llama.cpp
```bash
brew install llama.cpp
winget install llama.cpp
```

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make
```
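Newer llama.cpp checkouts have replaced the Makefile with CMake; if `make` fails on your tree, this sequence from the upstream build docs is the assumed equivalent:

```bash
# CMake build for checkouts where the legacy Makefile is gone.
cmake -B build
cmake --build build --config Release
# llama-cli and llama-server end up under build/bin/.
```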
### Authenticate for gated repos
```bash
hf auth login
```
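`hf auth login` covers the `hf` CLI; llama.cpp's own downloader authenticates separately. A sketch, assuming your build reads the `HF_TOKEN` environment variable and accepts `--hf-token` (both worth verifying with `llama-server --help`):

```bash
# Placeholder token; llama.cpp's -hf downloads need it for gated repos.
export HF_TOKEN=hf_xxx
llama-server -hf <gated-repo>:<QUANT> --hf-token "$HF_TOKEN"
```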
### Search the Hub
```text
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=Qwen3.6&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
```
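Those URLs are for the browser. For scripting, the Hub's models API exposes a similar search; this sketch assumes `filter=gguf`, `sort=trendingScore`, and `limit` behave as described in the public API docs:

```bash
# Programmatic near-equivalent of the browser search above.
curl -s "https://huggingface.co/api/models?search=Qwen3.6&filter=gguf&sort=trendingScore&limit=10" \
  | jq -r '.[].id'
```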
### Run directly from the Hub
```bash
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```
### Run an exact GGUF file
```bash
llama-server \
  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 4096
```
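If your build has GPU acceleration, the same launch can offload layers with `-ngl` (`--n-gpu-layers`); a large value offloads the whole model when it fits:

```bash
# Same launch with GPU offload (assumes a CUDA, Metal, or ROCm build).
llama-server \
  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 4096 -ngl 99
```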
### Convert only when no GGUF is available
```bash
hf download <repo-without-gguf> --local-dir ./model-src
python convert_hf_to_gguf.py ./model-src \
  --outfile model-f16.gguf \
  --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
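Before serving the converted file, a one-line generation confirms it loads; `-m`, `-p`, and `-n` are standard `llama-cli` flags:

```bash
# Load the quantized file directly and generate a few tokens as a sanity check.
llama-cli -m model-q4_k_m.gguf -p "Hello" -n 32
```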
### Smoke test a local server
```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about exception handling"}
    ]
  }'
```
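On first run the server may still be downloading or loading the model when the request arrives. A small wait loop, assuming the default port and llama-server's `/health` endpoint:

```bash
# Block until the server reports ready, then list the loaded model.
until curl -sf http://localhost:8080/health > /dev/null; do
  sleep 2
done
curl -s http://localhost:8080/v1/models
```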
## Quant Choice
- Prefer the exact quant that HF marks as compatible on the `?local-app=llama.cpp` page.
- Keep repo-native labels such as `UD-Q4_K_M` instead of normalizing them.
- Default to `Q4_K_M` unless the repo page or hardware profile suggests otherwise.
- Prefer `Q5_K_M` or `Q6_K` for code or technical workloads when memory allows.
- Consider `Q3_K_M`, `Q4_K_S`, or repo-specific `IQ`/`UD-*` variants for tighter RAM or VRAM budgets (a rough size estimate follows this list).
- Treat `mmproj-*.gguf` files as projector weights, not the main checkpoint.
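To judge what fits, a back-of-the-envelope file size is parameters × bits per weight ÷ 8. The bits-per-weight figures below are rough averages (assumptions; the repo's actual file sizes from the tree API are authoritative), and leave headroom beyond the file for the KV cache, which grows with `-c`:

```bash
# Rough GGUF size: params * bits-per-weight / 8.
PARAMS=35e9   # hypothetical 35B-parameter model
for q in "Q3_K_M 3.9" "Q4_K_M 4.8" "Q5_K_M 5.5" "Q6_K 6.6"; do
  set -- $q   # $1 = quant name, $2 = approximate bits per weight
  awk -v p="$PARAMS" -v b="$2" -v n="$1" \
    'BEGIN { printf "%-8s ~%.1f GB\n", n, p * b / 8 / 1e9 }'
done
```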
## Load References
- Read hub-discovery.md for URL-first workflows, model search, tree API extraction, and command reconstruction.
- Read quantization.md for format tables, model scaling, quality tradeoffs, and `imatrix`.
- Read hardware.md for Metal, CUDA, ROCm, or CPU build and acceleration details.
## Resources
- llama.cpp: https://github.com/ggml-org/llama.cpp
- Hugging Face GGUF + llama.cpp docs: https://huggingface.co/docs/hub/gguf-llamacpp
- Hugging Face Local Apps docs: https://huggingface.co/docs/hub/main/local-apps
- Hugging Face Local Agents docs: https://huggingface.co/docs/hub/agents-local
- GGUF converter Space: https://huggingface.co/spaces/ggml-org/gguf-my-repo