LocalAI - Models

supergemma4-26b-uncensored-v2

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI. Gemma 4 introduces key **capability and architectural advancements**: * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes. ...

Links

https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2

Tags

gemma-4-26b-a4b-it-apex

AI model: gemma-4-26b-a4b-it-apex

Links

https://huggingface.co/mudler/gemma-4-26B-A4B-it-APEX-GGUF

gemma-4-26b-a4b-it

Google Gemma 4 26B-A4B-IT is an open-source multimodal Mixture-of-Experts model with 26B total parameters and 4B active parameters. It handles text and image input, generating text output, with a 256K context window and support for 140+ languages. The MoE architecture provides strong performance with efficient inference. Well-suited for question answering, summarization, reasoning, and image understanding tasks.

Links

Tags

nemo-parakeet-tdt-0.6b

NVIDIA NeMo Parakeet TDT 0.6B v3 is an automatic speech recognition (ASR) model from NVIDIA's NeMo toolkit. Parakeet models are state-of-the-art ASR models trained on large-scale English audio data.

Links

Tags

qwen3-tts-0.6b-custom-voice

Qwen3-TTS is a high-quality text-to-speech model supporting custom voice, voice design, and voice cloning.

Links

https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice

Tags

qwen3-asr-0.6b

Qwen3-ASR is an automatic speech recognition model supporting multiple languages and batch inference.

Links

https://huggingface.co/Qwen/Qwen3-ASR-0.6B

Tags

liquidai.lfm2-2.6b-transcript

This is a large language model (2.6B parameters) designed for text-generation tasks. It is a quantized version of the original model `LiquidAI/LFM2-2.6B-Transcript`, optimized for efficiency while retaining strong performance. The model is built on the foundation of the base model, with additional optimizations for deployment and use cases like transcription or language modeling. It is trained on large-scale text data and supports multiple languages.

Links

https://huggingface.co/DevQuasar/LiquidAI.LFM2-2.6B-Transcript-GGUF

Tags

qwen3-asr-0.6b

Qwen3-ASR is an automatic speech recognition model supporting multiple languages and batch inference.

Links

https://huggingface.co/Qwen/Qwen3-ASR-0.6B

Tags

lfm2-vl-1.6b

LFM2‑VL is Liquid AI's first series of multimodal models, designed to process text and images with variable resolutions. Built on the LFM2 backbone, it is optimized for low-latency and edge AI applications. We're releasing the weights of two post-trained checkpoints with 450M (for highly constrained devices) and 1.6B (more capable yet still lightweight) parameters. 2× faster inference speed on GPUs compared to existing VLMs while maintaining competitive accuracy Flexible architecture with user-tunable speed-quality tradeoffs at inference time Native resolution processing up to 512×512 with intelligent patch-based handling for larger images, avoiding upscaling and distortion

Links

Tags

qwen3-reranker-0.6b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining. **Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks No.1 in the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios. **Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios. **Multilingual Capability**: The Qwen3 Embedding series offer support for over 100 languages, thanks to the multilingual capabilites of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities. **Qwen3-Reranker-0.6B** has the following features: - Model Type: Text Reranking - Supported Languages: 100+ Languages - Number of Paramaters: 0.6B - Context Length: 32k - Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Reranker-0.6B

Tags

qwen3-0.6b

Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning. Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience. Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks. Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation. Qwen3-0.6B has the following features: Type: Causal Language Models Training Stage: Pretraining & Post-training Number of Parameters: 0.6B Number of Paramaters (Non-Embedding): 0.44B Number of Layers: 28 Number of Attention Heads (GQA): 16 for Q and 8 for KV Context Length: 32,768

Links

Tags

kalomaze_qwen3-16b-a3b

A man-made horror beyond your comprehension. But no, seriously, this is my experiment to: measure the probability that any given expert will activate (over my personal set of fairly diverse calibration data), per layer prune 64/128 of the least used experts per layer (with reordered router and indexing per layer) It can still write semi-coherently without any additional training or distillation done on top of it from the original 30b MoE. The .txt files with the original measurements are provided in the repo along with the exported weights. Custom testing to measure the experts was done on a hacked version of vllm, and then I made a bespoke script to selectively export the weights according to the measurements.

Links

Tags

qwen3-embedding-0.6b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining. **Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios. **Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios. **Multilingual Capability**: The Qwen3 Embedding series offer support for over 100 languages, thanks to the multilingual capabilites of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities. **Qwen3-Embedding-0.6B-GGUF** has the following features: - Model Type: Text Embedding - Supported Languages: 100+ Languages - Number of Paramaters: 0.6B - Context Length: 32k - Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024 - Quantization: q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF

Tags

qwen3-deckard-large-almost-human-6b-i1

A love letter to all things Philip K Dick, trained and fine tuned on an in house dataset. This is V1, "Light", "Large" and "Almost Human". "Almost Human" is about adding (back) the humanity, the real person called Philip K Dick back into the model - with tone, thinking, and a touch of prose. "Deckard" is the main character in Blade Runner.

Links

Tags

gustavecortal_beck-0.6b

A language model that handles delicate life situations and tries to really help you. Beck is based on Piaget and was finetuned on psychotherapeutic preferences from PsychoCounsel-Preference. Methodology Beck was trained using preference optimization (ORPO) and LoRA. You can reproduce the results using my repo for lightweight preference optimization using this config that contains the hyperparameters. This work was performed using HPC resources (Jean Zay supercomputer) from GENCI-IDRIS (Grant 20XX-AD011014205). Inspiration Beck aims to reason about psychological and philosophical concepts such as self-image, emotion, and existence. Beck was inspired by my position paper on emotion analysis: Improving Language Models for Emotion Analysis: Insights from Cognitive Science.

Links

Tags

l3.1-moe-2x8b-v0.2

This model is a Mixture of Experts (MoE) made with mergekit-moe. It uses the following base models: Joseph717171/Llama-3.1-SuperNova-8B-Lite_TIES_with_Base ArliAI/Llama-3.1-8B-ArliAI-RPMax-v1.2 Heavily inspired by mlabonne/Beyonder-4x7B-v3.

Links

Tags

deepseek-coder-v2-lite-instruct

DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from DeepSeek-Coder-V2-Base with 6 trillion tokens sourced from a high-quality and multi-source corpus. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-Coder-V2-Base, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks. The list of supported programming languages can be found in the paper.

Links

Tags

mistral-small-3.2-46b-the-brilliant-raconteur-ii-instruct-2506

WARNING: MADNESS - UN HINGED and... NSFW. Vivid prose. INTENSE. Visceral Details. Violence. HORROR. GORE. Swearing. UNCENSORED... humor, romance, fun. Mistral-Small-3.2-46B-The-Brilliant-Raconteur-II-Instruct-2506 This repo contains the full precision source code, in "safe tensors" format to generate GGUFs, GPTQ, EXL2, AWQ, HQQ and other formats. The source code can also be used directly. ABOUT: A stronger, more creative Mistral (Mistral-Small-3.2-24B-Instruct-2506) extended to 79 layers, 46B parameters with Brainstorm 40x by DavidAU (details at very bottom of the page). This is version II, which has a jump in detail, and raw emotion relative to version 1. This model pushes Mistral's Instruct 2506 to the limit: Regens will be very different, even with same prompt / settings. Output generation will vary vastly on each generation. Reasoning will be changed, and often shorter. Prose, creativity, word choice, and general "flow" are improved. Several system prompts below help push this model even further. Model is partly de-censored / abliterated. Most Mistrals are more uncensored that most other models too. This model can also be used for coding too; even at low quants. Model can be used for all use cases too. As this is an instruct model, this model thrives on instructions - both in the system prompt and/or the prompt itself. One example below with 3 generations using Q4_K_S. Second example below with 2 generations using Q4_K_S. Quick Details: Model is 128k context, Jinja template (embedded) OR Chatml Template. Reasoning can be turned on/off (see system prompts below) and is OFF by default. Temp range .1 to 1 suggested, with 1-2 for enhanced creative. Above temp 2, is strong but can be very different. Rep pen range: 1 (off) or very light 1.01, 1.02 to 1.05. (model is sensitive to rep pen - this affects reasoning / generation length.) For creative/brainstorming use: suggest 2-5 generations due to variations caused by Brainstorm. Observations: Sometimes using Chatml (or Alpaca / others ) template (VS Jinja) will result in stronger creative generation. Model can be operated with NO system prompt; however a system prompt will enhance generation. Longer prompts, that more detailed, with more instructions will result in much stronger generations. For prose directives: You may need to add directions, because the model may follow your instructions too closely. IE: "use short sentences" vs "use short sentences sparsely". Reasoning (on) can lead to better creative generation, however sometimes generation with reasoning off is better. Rep pen of up to 1.05 may be needed on quants Q2k/q3ks for some prompts to address "low bit" issues. Detailed settings, system prompts, how to and examples below. NOTES: Image generation should also be possible with this model, just like the base model. Brainstorm was not applied to the image generation systems of the model... yet. This is Version II and subject to change / revision. This model is a slightly different version of: https://huggingface.co/DavidAU/Mistral-Small-3.2-46B-The-Brilliant-Raconteur-Instruct-2506

Links

Tags

yi-1.5-6b-chat

Yi-1.5-6B-Chat is an instruction-tuned LLM optimized for chat, coding, and reasoning tasks. It leverages a 3M sample fine-tuning corpus for strong instruction-following capabilities. Available in GGUF format for efficient local inference.

Links

Tags

Z-Image-Turbo

Z-Image is a powerful and highly efficient image generation model with 6B parameters. Currently there are three variants of which this is the Turbo edition. 🚀 Z-Image-Turbo – A distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers ⚡️sub-second inference latency⚡️ on enterprise-grade H800 GPUs and fits comfortably within 16G VRAM consumer devices. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.

Links

https://github.com/Tongyi-MAI/Z-Image

Tags

aevum-0.6b-finetuned

**Model Name:** Aevum-0.6B-Finetuned **Base Model:** Qwen3-0.6B **Architecture:** Decoder-only Transformer **Parameters:** 0.6 Billion **Task:** Code Generation, Instruction Following **Languages:** English, Python (optimized for code) **License:** Apache 2.0 **Overview:** Aevum-0.6B-Finetuned is a highly efficient, small-scale language model fine-tuned for code generation and task following. Built on the Qwen3-0.6B foundation, it delivers strong performance—achieving a **HumanEval Pass@1 score of 21.34%**—making it the most parameter-efficient sub-1B model in its category. **Key Features:** - Optimized for low-latency inference on CPU and edge devices. - Fine-tuned on MBPP and DeepMind Code Contests for superior code generation accuracy. - Ideal for lightweight development, education, and prototyping. **Use Case:** Perfect for developers and researchers needing a fast, compact, and open model for Python code generation without requiring high-end hardware. **Performance Benchmark:** Outperforms larger models in efficiency: comparable to models 10x its size in task accuracy. **Cite:** @misc{aveum06B2025, title={aevum-0.6B-Finetuned: Lightweight Python Code Generation Model}, author={anonymous}, year={2025}} **Try it:** Use via Hugging Face `transformers` library with minimal setup. 👉 [Model Page on Hugging Face](https://huggingface.co/Aevum-Official/aveum-0.6B-Finetuned)

Links

https://huggingface.co/mradermacher/Aevum-0.6B-Finetuned-GGUF

Tags

Model Gallery

Filter by type:

Filter by tags:

supergemma4-26b-uncensored-v2

gemma-4-26b-a4b-it-apex

gemma-4-26b-a4b-it

nemo-parakeet-tdt-0.6b

qwen3-tts-0.6b-custom-voice

qwen3-asr-0.6b

liquidai.lfm2-2.6b-transcript

qwen3-asr-0.6b

lfm2-vl-1.6b

qwen3-reranker-0.6b

qwen3-0.6b

kalomaze_qwen3-16b-a3b

qwen3-embedding-0.6b

qwen3-deckard-large-almost-human-6b-i1

gustavecortal_beck-0.6b

l3.1-moe-2x8b-v0.2

deepseek-coder-v2-lite-instruct

mistral-small-3.2-46b-the-brilliant-raconteur-ii-instruct-2506

yi-1.5-6b-chat

Z-Image-Turbo

aevum-0.6b-finetuned