Model Gallery

16 models from 1 repository

qwen3-omni-30b-a3b-instruct
Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. This GGUF build runs on llama.cpp with the bundled mmproj for multimodal inputs.

Repository: localai
License: apache-2.0
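The Qwen3-Omni builds ship with a bundled mmproj file that llama.cpp needs for multimodal input. A minimal sketch of assembling the invocation from Python follows; the GGUF filenames are hypothetical examples, and `llama-mtmd-cli` is llama.cpp's multimodal CLI tool.

```python
# Sketch: building a llama.cpp multimodal invocation for this model.
# Filenames are hypothetical; substitute your actual downloaded files.
import shlex

model = "qwen3-omni-30b-a3b-instruct-q4_k_m.gguf"   # hypothetical quant file
mmproj = "mmproj-qwen3-omni-30b-a3b-instruct.gguf"  # bundled projector

cmd = [
    "llama-mtmd-cli",        # llama.cpp's multimodal CLI
    "-m", model,
    "--mmproj", mmproj,      # projector that maps image/audio inputs into the LLM
    "--image", "invoice.png",
    "-p", "Describe this image.",
]
print(shlex.join(cmd))
```

Most GUI frontends (LM Studio, LocalAI itself) locate the mmproj automatically; the explicit flag is only needed when driving llama.cpp directly.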

qwen3-omni-30b-a3b-thinking
Qwen3-Omni-30B-A3B-Thinking is the reasoning-enhanced variant of Qwen3-Omni, a natively end-to-end multilingual omni-modal foundation model. It processes text, images, and audio and produces chain-of-thought reasoning before the final answer. This GGUF build runs on llama.cpp with the bundled mmproj.

Repository: localai
License: apache-2.0

glm-ocr
GLM-OCR is a vision-language model specialized for optical character recognition and document understanding, built on the GLM architecture. This GGUF build runs on llama.cpp with the bundled mmproj.

Repository: localai
License: mit

deepseek-ocr
DeepSeek-OCR is a vision-language model from DeepSeek AI specialized for optical character recognition and document understanding. This GGUF build runs on llama.cpp with the bundled mmproj.

Repository: localai
License: mit

lfm2-vl-450m
LFM2‑VL is Liquid AI's first series of multimodal models, designed to process text and images at variable resolutions. Built on the LFM2 backbone, it is optimized for low-latency and edge AI applications. Liquid AI released two post-trained checkpoints: 450M parameters (for highly constrained devices) and 1.6B parameters (more capable yet still lightweight).

- 2× faster inference on GPUs than existing VLMs while maintaining competitive accuracy
- Flexible architecture with user-tunable speed-quality tradeoffs at inference time
- Native resolution processing up to 512×512, with intelligent patch-based handling of larger images that avoids upscaling and distortion

Repository: localai
License: lfm1.0

openbuddy_openbuddy-r1-0528-distill-qwen3-32b-preview0-qat
OpenBuddy distillation of Qwen3-32B from DeepSeek-R1, featuring 40K context window and multilingual support (zh, en, fr, de, ja, ko, it, fi). GGUF quantized version optimized for local inference with llama.cpp.

Repository: localai
License: apache-2.0

google-gemma-3-27b-it-qat-q4_0-small
This is a requantized version of https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf. The official QAT weights released by Google use fp16 (instead of Q6_K) for the embeddings table, which makes the model take significantly more memory (and storage) than Q4_0 quants are supposed to take. Requantizing with llama.cpp achieves a very similar result.

Note that this model ends up smaller than the Q4_0 from Bartowski: llama.cpp sets some tensors to Q4_1 when quantizing models to Q4_0 with an imatrix, but this is a static quant. The perplexity score is even lower than the original model by Google, but the results are within the margin of error, so it's probably just luck. I also fixed the control-token metadata, which was slightly degrading the performance of the model in instruct mode.

Repository: localai
License: gemma
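To see why the fp16 embeddings table matters, here is a back-of-the-envelope estimate. The vocabulary and hidden sizes below are illustrative approximations for gemma-3-27b, not exact figures; Q6_K stores 256-weight blocks in 210 bytes (~6.56 bits/weight).

```python
# Rough estimate of the memory overhead of an fp16 embeddings table
# versus Q6_K. Dimensions are approximate, for illustration only.
vocab_size = 262_144        # approximate vocabulary size
hidden_size = 5_376         # approximate embedding width
n_weights = vocab_size * hidden_size

fp16_bytes = n_weights * 2                  # fp16: 16 bits per weight
q6k_bytes = n_weights * 210 // 256          # Q6_K: 210 bytes per 256 weights

extra_gib = (fp16_bytes - q6k_bytes) / 2**30
print(f"fp16 embeddings: {fp16_bytes / 2**30:.2f} GiB")
print(f"Q6_K embeddings: {q6k_bytes / 2**30:.2f} GiB")
print(f"extra memory from fp16 table: {extra_gib:.2f} GiB")
```

Even at these rough numbers, the fp16 table costs well over a gigabyte more than Q6_K, which is the overhead this requant removes.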

opencoder-8b-base
The model is a quantized version of infly/OpenCoder-8B-Base created using llama.cpp. It is part of the OpenCoder LLM family which includes 1.5B and 8B base and chat models, supporting both English and Chinese languages. The original OpenCoder model was pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, and supervised finetuned on over 4.5M high-quality SFT examples. It achieves high performance across multiple language model benchmarks and is one of the most comprehensively open-sourced models available.

Repository: localai
License: inf

opencoder-8b-instruct
The model is a quantized version of infly/OpenCoder-8B-Instruct (published as QuantFactory/OpenCoder-8B-Instruct-GGUF), created using llama.cpp and supporting both English and Chinese. The original model was pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, and supervised fine-tuned on over 4.5M high-quality SFT examples. It achieves high performance across multiple language model benchmarks and is one of the leading open-source models for code.

Repository: localai
License: inf

opencoder-1.5b-instruct
The model is a quantized version of [infly/OpenCoder-1.5B-Instruct](https://huggingface.co/infly/OpenCoder-1.5B-Instruct) created using llama.cpp. The original model, infly/OpenCoder-1.5B-Instruct, is an open and reproducible code LLM family which includes 1.5B and 8B base and chat models, supporting both English and Chinese languages. The model is pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, and supervised finetuned on over 4.5M high-quality SFT examples. It achieves high performance across multiple language model benchmarks, positioning it among the leading open-source models for code.

Repository: localai
License: inf

llama-3.2-3b-agent007-coder
The Llama-3.2-3B-Agent007-Coder-GGUF is a quantized version of the EpistemeAI/Llama-3.2-3B-Agent007-Coder model, which is a fine-tuned version of the unsloth/llama-3.2-3b-instruct-bnb-4bit model. It is created using llama.cpp and trained with additional datasets such as the Agent dataset, Code Alpaca 20K, and magpie ultra 0.1. This model is optimized for multilingual dialogue use cases and agentic retrieval and summarization tasks. The model is available for commercial and research use in multiple languages and is best used with the transformers library.

Repository: localai
License: apache-2.0

nihappy-l3.1-8b-v0.09
The model is a quantized version of Arkana08/NIHAPPY-L3.1-8B-v0.09 created using llama.cpp. It is a role-playing model that integrates the finest qualities of various pre-trained language models, focusing on dynamic storytelling.

Repository: localai
License: llama3.1

tlacuilo-12b
**Tlacuilo-12B** is a 12-billion-parameter fine-tuned language model developed by Allura Org, based on **Mistral-Nemo-Base-2407** and **Muse-12B**, optimized for high-quality creative writing, roleplay, and narrative generation. Trained using a three-stage QLoRA process with diverse datasets, including literary texts, roleplay content, and instruction-following data, the model excels at coherent, expressive, and stylistically rich prose.

Key features:
- **Base models**: Built on Mistral-Nemo-Base-2407 and Muse-12B for strong reasoning and narrative capability.
- **Fine-tuned for creativity**: Optimized for roleplay, storytelling, and imaginative writing with natural, fluid prose.
- **Chat template**: Uses **ChatML**, making it compatible with standard conversational interfaces.
- **Recommended settings**: Works well with temperature 1.0–1.3 and min-p 0.02–0.05 for balanced, engaging responses.

Ideal for writers, game masters, and creative professionals seeking a versatile, high-performance model for narrative tasks.

> *Note: The GGUF quantized version (e.g., `Ennthen/Tlacuilo-12B-Q4_K_M-GGUF`) is a conversion of this base model for local inference via llama.cpp.*

Repository: localai
License: apache-2.0
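Since Tlacuilo-12B uses the ChatML chat template, a minimal sketch of that format follows. The helper name `chatml` is ours for illustration; most runtimes apply the template automatically from the GGUF metadata, so hand-formatting is only needed when building raw prompts.

```python
# Minimal sketch of the ChatML prompt format this model expects.
# Token strings follow the standard ChatML convention.
def chatml(messages):
    """Render (role, content) pairs into a ChatML prompt string."""
    out = ""
    for role, content in messages:
        out += f"<|im_start|>{role}\n{content}<|im_end|>\n"
    # Leave the assistant turn open so the model continues from here.
    return out + "<|im_start|>assistant\n"

prompt = chatml([
    ("system", "You are a vivid, collaborative storyteller."),
    ("user", "Open the scene in a rain-soaked market."),
])
print(prompt)
```

When driving the model directly, pair a prompt like this with the recommended temperature (1.0–1.3) and min-p (0.02–0.05) samplers.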

magidonia-24b-v4.2.0-i1
**Model Name:** Magidonia 24B v4.2.0
**Base Model:** mistralai/Magistral-Small-2509
**Author:** TheDrummer
**Model Type:** Fine-tuned large language model (LLM)
**Size:** 24 billion parameters

**Description:** Magidonia 24B v4.2.0 is a creatively oriented, open-weight fine-tuned language model developed by TheDrummer. Built upon the **Magistral-Small-2509** base, this model emphasizes **creativity, narrative dynamism, and expressive language use**, making it ideal for storytelling, roleplay, and imaginative writing. It features enhanced reasoning with a built-in **THINKING MODE**, activated via special thinking tokens that encourage a detailed inner monologue before response generation. Designed for flexibility and minimal alignment constraints, it is well suited for entertainment, world-building, and experimental use cases.

**Key Features:**
- Strong creative and literary capabilities
- Supports structured thinking via special tokens
- Optimized for roleplay and dynamic storytelling
- Available in GGUF format for local inference (via llama.cpp, etc.)
- Includes iMatrix quantization for high-quality low-precision performance

**Use Case:** Ideal for writers, game masters, and AI artists seeking expressive, unfiltered, and imaginative language models.

**Repository:** [TheDrummer/Magidonia-24B-v4.2.0](https://huggingface.co/TheDrummer/Magidonia-24B-v4.2.0)
**Quantized Version (GGUF):** [mradermacher/Magidonia-24B-v4.2.0-i1-GGUF](https://huggingface.co/mradermacher/Magidonia-24B-v4.2.0-i1-GGUF) *(for reference only; see the original for the full description)*

Repository: localai
License: apache-2.0

llama-3.2-3b-small_shiro_roleplay
**Model Name:** Llama-3.2-3B-small_Shiro_roleplay-gguf
**Base Model:** Meta-Llama-3.2-3B-Instruct (via unsloth/Meta-Llama-3.2-3B-Instruct-bnb-4bit)
**Fine-Tuned With:** LoRA (rank 64) using Unsloth for optimized performance
**Task:** Roleplay & creative storytelling
**Format:** GGUF (Q4_K_M, Q8_0), optimized for local inference via llama.cpp, LM Studio, and Ollama
**Context Length:** 4096 tokens

**Description:** A compact yet powerful 3.2B-parameter fine-tuned Llama 3.2 model specialized for immersive, witty, and darkly imaginative roleplay. Trained on creative and absurd narrative scenarios, it excels at generating unique characters, engaging scenes, and high-concept storytelling with a distinct, sarcastic flair. Ideal for writers, game masters, and creative developers seeking a responsive, locally runnable assistant for imaginative storytelling.

Repository: localai
License: llama3.2

simia-tau-sft-qwen3-8b
**Simia-Tau-SFT-Qwen3-8B** is a fine-tuned version of the Qwen3-8B language model, developed by Simia-Agent and adapted for enhanced instruction-following capabilities. The model is optimized for dialogue and task-oriented interactions, making it highly effective for real-world applications requiring nuanced understanding and coherent responses.

The model is available in multiple quantized GGUF formats, including Q4_K_S, Q5_K_M, and Q8_0, enabling efficient deployment across devices with varying computational resources. These quantized versions maintain strong performance while reducing memory footprint and inference latency.

While this repository hosts a quantized variant (designed for GGUF-based inference via tools like llama.cpp), the original base model is **Qwen3-8B**, a large-scale open-source language model from Alibaba Cloud. The supervised fine-tuning (SFT) process improves its alignment with human intent and its ability to follow complex instructions.

> 🔍 **Note**: This is a quantized version; for the full-precision model, refer to [Simia-Agent/Simia-Tau-SFT-Qwen3-8B](https://huggingface.co/Simia-Agent/Simia-Tau-SFT-Qwen3-8B) on Hugging Face.

**Use Case:** Ideal for chatbots, assistant systems, and interactive applications requiring strong reasoning, safety, and fluency.
**Model Size:** 8B parameters (quantized for efficiency).
**License:** See the original model's license (typically Apache 2.0 for the Qwen series).

👉 Recommended for edge deployment with GGUF-compatible tools.

Repository: localai
License: apache-2.0
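When picking among the quant formats listed above, file size is usually the deciding factor. A rough estimate for an 8B model follows; the bits-per-weight figures are approximations (Q8_0 is exact at 34 bytes per 32 weights), and real files also carry metadata and non-quantized tensors.

```python
# Back-of-the-envelope download-size estimates for an 8B model
# at common GGUF quants. Bits-per-weight values are approximate.
PARAMS = 8e9
BPW = {"Q4_K_S": 4.6, "Q5_K_M": 5.7, "Q8_0": 8.5}  # approx bits per weight

for name, bpw in BPW.items():
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB")
```

As a rule of thumb, pick the largest quant that fits comfortably in your available RAM or VRAM alongside the context cache.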