LocalAI - Models

minicpm5-1b-claude-opus-fable5-thinking

# MiniCPM5-1B-Claude-Opus-Fable5-Thinking GGUF quantizations for local deployment: **MiniCPM5-1B-Claude-Opus-Fable5-Thinking-GGUF** 中文说明 **MiniCPM5-1B-Claude-Opus-Fable5-Thinking** is a compact 1B **Thinking** language model built on openbmb/MiniCPM5-1B. It is further fine-tuned on **Fable 5** data to improve **coding** and **instruction-following** while keeping MiniCPM5's native Thinking chat template and tool-call format. For llama.cpp / Ollama / LM Studio deployment, see the **GGUF repository**. ## Overview ## Capabilities - **Coding** — code generation, debugging, and software-engineering-style tasks - **Instruction following** — more reliable adherence to user prompts and structured constraints - **Thinking mode** — chain-of-thought reasoning via the MiniCPM5 chat template - **Tool calling** — inherits MiniCPM5's XML tool-call format - **Long context** — up to **128K tokens** (131,072 tokens per `config.json`) ## Quick start ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch model_id = "GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-Thinking" ...

Links

https://huggingface.co/GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-Thinking-GGUF

Tags

lfm2.5-1.2b-instruct

Try LFM • Docs • LEAP • Discord # LFM2.5-1.2B-Instruct LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning. - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket. - **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM. - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning. Find more information about LFM2.5 in our blog post. ## 🗒️ Model Details LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features: ...

Links

https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF

Tags

qwopus3.6-27b-coder-compat-mtp

🪐 Qwopus-3.6-27B-Coder Coder SFT Release Agentic Coding & Tool-Use Reasoning Model Fine-Tuned on Qwopus3.6-27B-v2 🧬 Trace Inversion & Negentropy 🧠 27B Dense Model ⚡ Agentic Coding 🛠️ Tool Calling & Agent 🏆 SWE-bench Verified: 67.0% (off-thinking) 💡 What is Qwopus-3.6-27B-Coder? 🪐 Qwopus-3.6-27B-Coder is a reasoning-enhanced agentic coding model built on top of Qwopus3.6-27B-v2. It inherits the powerful reasoning foundation of the v2 base — which achieved 87.43% MMLU-Pro and 75.25% SWE-bench Verified — and further specializes it for agentic code generation, structured tool calling, debugging, and instruction-following in developer workflows. The model is designed to excel at repository-level coding tasks, multi-turn tool orchestration, and complex logical reasoning under realistic agent environments. 🧩 Agentic Coding Optimized for repository-level coding, debugging, patch generation, and structured multi-step development workflows. 🛠️ Tool Calling Learns from real agent trajectories with tool definitions, tool calls, and environment feedback for robust multi-turn execution. ...

Links

https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-Compat-MTP-GGUF

Tags

qwopus3.6-27b-coder-mtp-nvfp4

🪐 Qwopus-3.6-27B-Coder Coder SFT Release Agentic Coding & Tool-Use Reasoning Model Fine-Tuned on Qwopus3.6-27B-v2 🧬 Trace Inversion & Negentropy 🧠 27B Dense Model ⚡ Agentic Coding 🛠️ Tool Calling & Agent 🏆 SWE-bench Verified: 67.0% (off-thinking) 💡 What is Qwopus-3.6-27B-Coder? 🪐 Qwopus-3.6-27B-Coder is a reasoning-enhanced agentic coding model built on top of Qwopus3.6-27B-v2. It inherits the powerful reasoning foundation of the v2 base — which achieved 87.43% MMLU-Pro (300ex) and 75.25% SWE-bench Verified — and further specializes it for agentic code generation, structured tool calling, debugging, and instruction-following in developer workflows. The model is designed to excel at repository-level coding tasks, multi-turn tool orchestration, and complex logical reasoning under realistic agent environments. 🧩 Agentic Coding Optimized for repository-level coding, debugging, patch generation, and structured multi-step development workflows. 🛠️ Tool Calling Learns from real agent trajectories with tool definitions, tool calls, and environment feedback for robust multi-turn execution. ...

Links

https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF

Tags

gemma-4-12b-agentic-fable5-composer2.5-v2-3.5x-tau2

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind > [!Note] > This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. ...

Links

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF

Tags

gemma-4-12b-coder-fable5-composer2.5-v1

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind > [!Note] > This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. ...

Links

https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Tags

dark-scarlett-v0.3-26b-a4b

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI. Gemma 4 introduces key **capability and architectural advancements**: * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes. ...

Links

https://huggingface.co/ReadyArt/Dark-Scarlett-v0.3-26B-A4B-GGUF

Tags

gemma-4-e2b-it-qat-q4_0

Gemma 4 E2B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E2B is a MatFormer "effective 2B" elastic variant: it carries a larger backbone but runs at an effective 2B-parameter footprint, making it well suited to lightweight and on-device deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Links

https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf

Tags

gemma-4-e4b-it-qat-q4_0

Gemma 4 E4B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E4B is a MatFormer "effective 4B" elastic variant, balancing quality and footprint for on-device and edge deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Links

https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-gguf

Tags

gemma-4-26b-a4b-it-qat-q4_0

Gemma 4 26B-A4B is a multimodal (text + image) instruction-tuned Mixture-of-Experts model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. With 26B total parameters and ~4B active per token, it delivers large-model quality at a much lower inference cost. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Links

https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf

Tags

gemma-4-31b-it-qat-q4_0

Gemma 4 31B is the largest dense multimodal (text + image) instruction-tuned model in the Gemma 4 family from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality while dramatically reducing the memory required to load the model. This is the official Google Q4_0 GGUF, shipped with its multimodal projector. License: Apache 2.0 | Authors: Google DeepMind

Links

https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf

Tags

lfm2.5-8b-a1b

Try LFM • Docs • LEAP • Discord # LFM2.5-8B-A1B LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning. - **On-device personal assistant**: Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices. - **Compressed performance**: Competitive with much larger dense and MoE models on instruction following and agentic tasks. - **Unmatched throughput**: Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang. Find more information about LFM2.5-8B-A1B in our blog post. **AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.* ## 🗒️ Model Details LFM2.5-8B-A1B is a general-purpose text-only model with the following features: ...

Links

https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF

Tags

qwopus3.5-9b-coder-mtp

# 🌟 Qwopus3.5-9B-v3.5 ## 💡 Model Overview & v3.5 Design Qwopus3.5-9B-v3.5 is a **data-scaled continuation** of the Qwopus3.5-9B-v3 model. The training data in v3.5 is expanded to cover a broader range of domains, including mathematics, programming, puzzle-solving, multilingual dialogue, instruction-following, multi-turn interactions, and STEM-related tasks. Qwopus3.5-9B-v3.5 is a reasoning-enhanced model based on **Qwen3.5-9B**, designed for: - 🧩 Structured reasoning - 🔧 Tool-augmented workflows - 🔁 Multi-step agentic tasks - ⚡ Token-efficient inference Compared with Qwopus3.5-9B-v3, **3.5 version does not introduce a new architecture, RL stage, or template redesign**. This version is trained with approximately **2× more SFT data**. ## 🎯 Motivation & Generalization Insight The motivation behind v3.5 comes from a simple observation: > This work is motivated by the hypothesis that scaling high-quality SFT data may further enhance the generalization ability of large language models. In earlier Qwopus3.5 experiments, structured reasoning was observed to improve both **accuracy and efficiency**: ...

Links

https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF

Tags

qwen3.6-27b-heretic-uncensored-finetune-neo-code-di-imatrix-max

Qwen3.6-27B-Heretic2-Uncensored-Finetune-Thinking Yes... fully uncensored AND fine tuned lightly. Freedom and brainpower. Trained on different Heretic base, with different KLD/Refusals. Model fine tune was used to finalize and "firm up" Heretic / uncensored changes. The goal here was light, minor fixes rather than full / heavy fine tune. That being said, the tuning still raised critical metrics. This is Version 2, using "trohrbaugh" Heretic, which has a lower refusal rate, and tuning bumped up the metrics a bit more too. This has also positively impacted "NEO-Coder Di-Matrix" (dual imatrix) GGUF quants as well (vs heretic/non heretic too). https://huggingface.co/DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF ``` IN HOUSE BENCHMARKS [by Nightmedia]: arc-c arc/e boolq hswag obkqa piqa wino Qwen3.6-27B-Heretic2-Uncensored-Finetune-Thinking mxfp8 0.673,0.846,0.905... [instruct mode] Qwen3.6-27B-Heretic-Uncensored-Finetune-Thinking mxfp8 0.669,0.835,0.906,... [instruct mode] BASE UNTUNED MODEL: Qwen3.6-27B HERETIC (by llmfan46) [instruct mode] mxfp8 0.644,0.788,0.902,... ...

Links

https://huggingface.co/DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF

Tags

qwopus-glm-18b-merged

# 🪐 Qwen3.5-9B-GLM5.1-Distill-v1 ## 📌 Model Overview **Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1` **Base Model:** Qwen3.5-9B **Training Type:** Supervised Fine-Tuning (SFT, Distillation) **Parameter Scale:** 9B **Training Framework:** Unsloth This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**. The primary goals are to: - Improve **structured reasoning ability** - Enhance **instruction-following consistency** - Activate **latent knowledge via better reasoning structure** ## 📊 Training Data ### Main Dataset - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned` - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset. - Generated from a **GLM-5.1 teacher model** - Approximately **700x** the scale of `Qwen3.5-reasoning-700x` - Training used a **filtered subset**, not the full source dataset. ### Auxiliary Dataset - `Jackrong/Qwen3.5-reasoning-700x` ...

Links

https://huggingface.co/KyleHessling1/Qwopus-GLM-18B-Merged-GGUF

Tags

qwen3.5-9b-glm5.1-distill-v1

# 🪐 Qwen3.5-9B-GLM5.1-Distill-v1 ## 📌 Model Overview **Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1` **Base Model:** Qwen3.5-9B **Training Type:** Supervised Fine-Tuning (SFT, Distillation) **Parameter Scale:** 9B **Training Framework:** Unsloth This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**. The primary goals are to: - Improve **structured reasoning ability** - Enhance **instruction-following consistency** - Activate **latent knowledge via better reasoning structure** ## 📊 Training Data ### Main Dataset - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned` - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset. - Generated from a **GLM-5.1 teacher model** - Approximately **700x** the scale of `Qwen3.5-reasoning-700x` - Training used a **filtered subset**, not the full source dataset. ### Auxiliary Dataset - `Jackrong/Qwen3.5-reasoning-700x` ...

Links

https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF

Tags

supergemma4-26b-uncensored-v2

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI. Gemma 4 introduces key **capability and architectural advancements**: * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes. ...

Links

https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2

Tags

qwopus-glm-18b-merged

# 🪐 Qwen3.5-9B-GLM5.1-Distill-v1 ## 📌 Model Overview **Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1` **Base Model:** Qwen3.5-9B **Training Type:** Supervised Fine-Tuning (SFT, Distillation) **Parameter Scale:** 9B **Training Framework:** Unsloth This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**. The primary goals are to: - Improve **structured reasoning ability** - Enhance **instruction-following consistency** - Activate **latent knowledge via better reasoning structure** ## 📊 Training Data ### Main Dataset - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned` - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset. - Generated from a **GLM-5.1 teacher model** - Approximately **700x** the scale of `Qwen3.5-reasoning-700x` - Training used a **filtered subset**, not the full source dataset. ### Auxiliary Dataset - `Jackrong/Qwen3.5-reasoning-700x` ...

Links

https://huggingface.co/KyleHessling1/Qwopus-GLM-18B-Merged-GGUF

Tags

qwen_qwen3.5-0.8b

Qwen 3.5 0.8B parameter model quantized for llama-cpp backend. Supports chat interactions and multimodal image-text inputs.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-0.8B-GGUF

Tags

qwen_qwen3.5-2b

Qwen3.5-2B is a highly efficient, instruction-tuned multilingual language model available in various quantized GGUF formats. Optimized for llama-cpp inference, it supports chat and completion tasks with strong performance on low-RAM hardware. The model is available in multiple quantization levels ranging from Q8_0 to IQ2_M to balance quality and resource usage.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-2B-GGUF

Tags

q3.5-bluestar-27b

Links

https://huggingface.co/mradermacher/Q3.5-BlueStar-27B-GGUF

Tags

Model Gallery

Filter by type:

Filter by tags:

minicpm5-1b-claude-opus-fable5-thinking

lfm2.5-1.2b-instruct

qwopus3.6-27b-coder-compat-mtp

qwopus3.6-27b-coder-mtp-nvfp4

gemma-4-12b-agentic-fable5-composer2.5-v2-3.5x-tau2

gemma-4-12b-coder-fable5-composer2.5-v1

dark-scarlett-v0.3-26b-a4b

gemma-4-e2b-it-qat-q4_0

gemma-4-e4b-it-qat-q4_0

gemma-4-26b-a4b-it-qat-q4_0

gemma-4-31b-it-qat-q4_0

lfm2.5-8b-a1b

qwopus3.5-9b-coder-mtp

qwen3.6-27b-heretic-uncensored-finetune-neo-code-di-imatrix-max

qwopus-glm-18b-merged

qwen3.5-9b-glm5.1-distill-v1

supergemma4-26b-uncensored-v2

qwopus-glm-18b-merged

qwen_qwen3.5-0.8b

qwen_qwen3.5-2b

q3.5-bluestar-27b