Found 4 Skills
Expert skill for integrating local Large Language Models using llama.cpp and Ollama. Covers secure model loading, inference optimization, prompt handling, and protection against LLM-specific vulnerabilities, including prompt injection, model theft, and denial-of-service attacks.
Use when integrating Foundation Models framework, implementing on-device AI with Apple Intelligence, building tool-calling AI features, working with guided generation schemas, converting models with Core ML and coremltools, or running open-source LLMs on Apple Silicon. Covers Foundation Models (LanguageModelSession, @Generable, @Guide, SystemLanguageModel, structured output, tool calling), Core ML (coremltools, model conversion, quantization, palettization, pruning, Neural Engine, MLTensor), MLX Swift (transformer inference, unified memory), and llama.cpp (GGUF, cross-platform LLM).
Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
Use when the request mentions any of: "LLM inference", "serving LLM", "vLLM", "llama.cpp", "GGUF", "text generation", "model serving", "inference optimization", "KV cache", "continuous batching", "speculative decoding", "local LLM", "CPU inference".