LocalAI - Models

ced-base-q8

CED (Consistent Ensemble Distillation, Xiaomi) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). This is the q8_0 GGUF for the ced backend: smallest footprint (~88 MB, ~6.5x less memory than the PyTorch reference) and near-lossless (identical top-5 tags). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

ced-tiny-q8

CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

ced-mini-q8

CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

ced-small-q8

CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

opengvlab_internvl3_5-30b-a3b-q8_0

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

opengvlab_internvl3_5-14b-q8_0

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

opengvlab_internvl3_5-8b-q8_0

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

opengvlab_internvl3_5-4b-q8_0

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 ×\times× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

Links

Tags

depth-anything-3-base-q8_0

Depth Anything 3 (base), q8_0 — near-lossless 8-bit quant (~149 MB). Same depth + camera pose output as the q4_k default at higher fidelity.

Links

Tags

depth-anything-2-base-q8_0

Depth Anything V2 (base / ViT-B), q8_0 — near-lossless 8-bit quant. Same relative monocular depth output as the q4_k default at higher fidelity. Use GenerateImage (src -> depth PNG) or the Depth endpoint.

Links

Tags

qwen3-reranker-0.6b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining. **Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks No.1 in the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios. **Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios. **Multilingual Capability**: The Qwen3 Embedding series offer support for over 100 languages, thanks to the multilingual capabilites of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities. **Qwen3-Reranker-0.6B** has the following features: - Model Type: Text Reranking - Supported Languages: 100+ Languages - Number of Paramaters: 0.6B - Context Length: 32k - Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Reranker-0.6B

Tags

qwen3-embedding-4b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining. **Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios. **Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios. **Multilingual Capability**: The Qwen3 Embedding series offer support for over 100 languages, thanks to the multilingual capabilites of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities. **Qwen3-Embedding-4B-GGUF** has the following features: - Model Type: Text Embedding - Supported Languages: 100+ Languages - Number of Paramaters: 4B - Context Length: 32k - Embedding Dimension: Up to 2560, supports user-defined output dimensions ranging from 32 to 2560 - Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Embedding-4B-GGUF

Tags

qwen3-embedding-8b

The Qwen3 Embedding series model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining. **Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios. **Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios. **Multilingual Capability**: The Qwen3 Embedding series offer support for over 100 languages, thanks to the multilingual capabilites of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities. **Qwen3-Embedding-8B-GGUF** has the following features: - Model Type: Text Embedding - Supported Languages: 100+ Languages - Number of Paramaters: 8B - Context Length: 32k - Embedding Dimension: Up to 4096, supports user-defined output dimensions ranging from 32 to 4096 - Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Embedding-8B-GGUF

Tags

qwen3-embedding-0.6b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining. **Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios. **Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios. **Multilingual Capability**: The Qwen3 Embedding series offer support for over 100 languages, thanks to the multilingual capabilites of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities. **Qwen3-Embedding-0.6B-GGUF** has the following features: - Model Type: Text Embedding - Supported Languages: 100+ Languages - Number of Paramaters: 0.6B - Context Length: 32k - Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024 - Quantization: q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF

Tags

llama-3.2-3b-instruct:q8_0

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Developer: Meta Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

https://huggingface.co/hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF

Tags

llama-3.2-1b-instruct:q8_0

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Developer: Meta Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF

Tags

nano_imp_1b-q8_0

It's the 10th of May, 2025—lots of progress is being made in the world of AI (DeepSeek, Qwen, etc...)—but still, there has yet to be a fully coherent 1B RP model. Why? Well, at 1B size, the mere fact a model is even coherent is some kind of a marvel—and getting it to roleplay feels like you're asking too much from 1B parameters. Making very small yet smart models is quite hard, making one that does RP is exceedingly hard. I should know. I've made the world's first 3B roleplay model—Impish_LLAMA_3B—and I thought that this was the absolute minimum size for coherency and RP capabilities. I was wrong. One of my stated goals was to make AI accessible and available for everyone—but not everyone could run 13B or even 8B models. Some people only have mid-tier phones, should they be left behind? A growing sentiment often says something along the lines of: If your waifu runs on someone else's hardware—then she's not your waifu. I'm not an expert in waifu culture, but I do agree that people should be able to run models locally, without their data (knowingly or unknowingly) being used for X or Y. I thought my goal of making a roleplay model that everyone could run would only be realized sometime in the future—when mid-tier phones got the equivalent of a high-end Snapdragon chipset. Again I was wrong, as this changes today. Today, the 10th of May 2025, I proudly present to you—Nano_Imp_1B, the world's first and only fully coherent 1B-parameter roleplay model.

Links

Tags

flux.1-dev-ggml-q8_0

FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post. Key Features Cutting-edge output quality, second only to our state-of-the-art model FLUX.1 [pro]. Competitive prompt following, matching the performance of closed source alternatives . Trained using guidance distillation, making FLUX.1 [dev] more efficient. Open weights to drive new scientific research, and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license.

Links

Tags

flux.1-dev-ggml-abliterated-v2-q8_0

FLUX.1 [dev] is an abliterated version of FLUX.1 [dev]

Links

Tags

flux.1-krea-dev-ggml-q8_0

FLUX.1 Krea [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post and Krea's blog post. Cutting-edge output quality, with a focus on aesthetic photography. Competitive prompt following, matching the performance of closed source alternatives. Trained using guidance distillation, making FLUX.1 Krea [dev] more efficient. Open weights to drive new scientific research, and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes, as described in the flux-1-dev-non-commercial-license.

Links

Tags

ideogram-4-q8_0-ggml

Ideogram 4 is a text-to-image diffusion model known for state-of-the-art prompt adherence and exceptional, accurate text rendering inside images. It is driven by a Qwen3-VL-8B text encoder and performs real classifier-free guidance from a separate unconditional diffusion model. This is the Q8_0 (8-bit) quantization for highest quality (~10.1GB diffusion + ~10.1GB unconditional). The bundle also pulls the Qwen3-VL-8B-Instruct text encoder and the FLUX.2 VAE. Quantized GGUF weights by stduhpf for use with stable-diffusion.cpp.

Links

Tags

Model Gallery

Filter by type:

Filter by tags:

ced-base-q8

ced-tiny-q8

ced-mini-q8

ced-small-q8

opengvlab_internvl3_5-30b-a3b-q8_0

opengvlab_internvl3_5-14b-q8_0

opengvlab_internvl3_5-8b-q8_0

opengvlab_internvl3_5-4b-q8_0

depth-anything-3-base-q8_0

depth-anything-2-base-q8_0

qwen3-reranker-0.6b

qwen3-embedding-4b

qwen3-embedding-8b

qwen3-embedding-0.6b

llama-3.2-3b-instruct:q8_0

llama-3.2-1b-instruct:q8_0

nano_imp_1b-q8_0

flux.1-dev-ggml-q8_0

flux.1-dev-ggml-abliterated-v2-q8_0

flux.1-krea-dev-ggml-q8_0

ideogram-4-q8_0-ggml