LocalAI - Models

deepseek-v4-flash

# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

## Introduction

We present a preview version of the **DeepSeek-V4** series, including two strong Mixture-of-Experts (MoE) language models, **DeepSeek-V4-Pro** with 1.6T parameters (49B activated) and **DeepSeek-V4-Flash** with 284B parameters (13B activated), both supporting a context length of **one million tokens**.

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:

1. **Hybrid Attention Architecture:** We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only **27% of single-token inference FLOPs** and **10% of KV cache** compared with DeepSeek-V3.2.
2. **Manifold-Constrained Hyper-Connections (mHC):** We incorporate mHC to strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity.
3. **Muon Optimizer:** We employ the Muon optimizer for faster convergence and greater training stability.

...
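
The parameter counts quoted in this description imply that only a small share of each model's weights is active for any single token. A minimal Python sketch of that arithmetic, using only the figures given here (the function and dictionary names are illustrative and not part of any DeepSeek or LocalAI API):

```python
# Share of parameters activated per token, computed from the quoted figures.
# All names below are illustrative only.

def activated_fraction(total_params: float, activated_params: float) -> float:
    """Return the fraction of parameters used for a single token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return activated_params / total_params

# (total parameters, activated parameters) as stated in the description.
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}

for name, (total, active) in MODELS.items():
    share = activated_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

With these inputs the shares come out at roughly 3.1% for the Pro model and 4.6% for the Flash model, which is why per-token compute is far lower than the total parameter count suggests.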

Links

https://huggingface.co/unsloth/DeepSeek-V4-Flash-GGUF

ai21labs_ai21-jamba-reasoning-3b

AI21’s Jamba Reasoning 3B is a top-performing reasoning model that packs leading scores on intelligence benchmarks and highly efficient processing into a compact 3B build. The hybrid design combines Transformer attention with Mamba (a state-space model). Mamba layers are more efficient for sequence processing, while attention layers capture complex dependencies. This mix reduces memory overhead, improves throughput, and makes the model run smoothly on laptops, GPUs, and even mobile devices, while maintaining impressive quality.

ibm-granite_granite-4.0-h-small

Granite-4.0-H-Small is a 32B parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

ibm-granite_granite-4.0-h-tiny

Granite-4.0-H-Tiny is a 7B parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

ibm-granite_granite-4.0-h-micro

Granite-4.0-H-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

ibm-granite_granite-4.0-micro

Granite-4.0-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-Micro-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

baidu_ernie-4.5-21b-a3b-thinking

Over the past three months, we have continued to scale the thinking capability of ERNIE-4.5-21B-A3B, improving both the quality and depth of reasoning, thereby advancing the competitiveness of ERNIE lightweight models in complex reasoning tasks. We are pleased to introduce ERNIE-4.5-21B-A3B-Thinking, featuring the following key enhancements: Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, text generation, and academic benchmarks that typically require human expertise. Efficient tool usage capabilities. Enhanced 128K long-context understanding capabilities. Note: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks. ERNIE-4.5-21B-A3B-Thinking is a text MoE post-trained model, with 21B total parameters and 3B activated parameters for each token.

huggingfacetb_smollm3-3b

SmolLM3 is a 3B parameter language model designed to push the boundaries of small models. It supports 6 languages, advanced reasoning and long context. SmolLM3 is a fully open model that offers strong performance at the 3B–4B scale. The model is a decoder-only transformer using GQA and NoPE (with a 3:1 ratio); it was pretrained on 11.2T tokens with a staged curriculum of web, code, math and reasoning data. Post-training included midtraining on 140B reasoning tokens followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO).
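
One way to read the "3:1 ratio" is that three layers using rotary position embeddings (RoPE) are followed by one layer with no positional encoding (NoPE). The sketch below assumes that reading; the mention of RoPE, the layer counts, and the function name are assumptions for illustration and are not taken from the description above.

```python
# Hypothetical layer layout for a 3:1 RoPE-to-NoPE interleaving.
# Assumption: every fourth layer drops positional encoding.

def position_encoding_pattern(num_layers: int, rope_per_nope: int = 3) -> list[str]:
    """Label each layer 'rope' or 'nope' under the assumed interleaving."""
    period = rope_per_nope + 1  # three RoPE layers plus one NoPE layer
    return ["nope" if (i + 1) % period == 0 else "rope" for i in range(num_layers)]

# Small example with a hypothetical 8-layer stack.
pattern = position_encoding_pattern(8)
print(pattern)
```

Under this assumption a quarter of the layers carry no positional signal, which is the property usually cited as helping long-context generalization.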

qwen3-235b-a22b-instruct-2507

We introduce the updated version of the Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring the following key enhancements: Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage. Substantial gains in long-tail knowledge coverage across multiple languages. Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation. Enhanced capabilities in 256K long-context understanding.

menlo_lucy-128k

Lucy is a compact but capable 1.7B model focused on agentic web search and lightweight browsing. Built on Qwen3-1.7B, Lucy inherits deep research capabilities from larger models while being optimized to run efficiently on mobile devices, even with CPU-only configurations. We achieved this through machine-generated task vectors that optimize thinking processes, smooth reward functions across multiple categories, and pure reinforcement learning without any supervised fine-tuning.

qwen_qwen3-30b-a3b-instruct-2507

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements: Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage. Substantial gains in long-tail knowledge coverage across multiple languages. Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation. Enhanced capabilities in 256K long-context understanding.

qwen_qwen3-30b-a3b-thinking-2507

Over the past three months, we have continued to scale the thinking capability of Qwen3-30B-A3B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-30B-A3B-Thinking-2507, featuring the following key enhancements: Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise. Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences. Enhanced 256K long-context understanding capabilities. NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

qwen_qwen3-4b-instruct-2507

We introduce the updated version of the Qwen3-4B non-thinking mode, named Qwen3-4B-Instruct-2507, featuring the following key enhancements: Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage. Substantial gains in long-tail knowledge coverage across multiple languages. Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation. Enhanced capabilities in 256K long-context understanding.

qwen_qwen3-4b-thinking-2507

Over the past three months, we have continued to scale the thinking capability of Qwen3-4B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-4B-Thinking-2507, featuring the following key enhancements: Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise. Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences. Enhanced 256K long-context understanding capabilities. NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

minicpm3-4b

MiniCPM3-4B is a 4B-parameter language model that surpasses many larger models. It features enhanced long-context capability up to 32K tokens, strong function calling, and improved instruction following.

impish_qwen_14b-1m

Supreme context: one million tokens to play with. Strong roleplay: internet RP format lovers will appreciate it, with medium-size paragraphs. Qwen smarts built in, but naughty and playful; maybe it's even too naughty. VERY compliant with low censorship. VERY high IFEval for a 14B RP model: 78.68.

ibm-granite_granite-3.2-8b-instruct

Granite-3.2-8B-Instruct is an 8-billion-parameter, long-context AI model fine-tuned for thinking capabilities. Built on top of Granite-3.1-8B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required.

ibm-granite_granite-3.2-2b-instruct

Granite-3.2-2B-Instruct is a 2-billion-parameter, long-context AI model fine-tuned for thinking capabilities. Built on top of Granite-3.1-2B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required.

fireball-meta-llama-3.2-8b-instruct-agent-003-128k-code-dpo

This model is a quantized version of EpistemeAI/Fireball-Meta-Llama-3.2-8B-Instruct-agent-003-128k-code-DPO, an experimental fine-tune with a DPO dataset that allows Llama 3.1 8B to act as an agentic coder. It has some built-in agent features such as search, calculator, and ReAct. Other notable features include self-learning using unsloth, RAG applications, and memory. The context window of the model is 128K. It can be integrated into projects using popular libraries like Transformers and vLLM, and is suitable for use with LangChain or LlamaIndex. The model is developed by EpistemeAI and licensed under the Apache 2.0 license.

Links

https://huggingface.co/QuantFactory/Fireball-Meta-Llama-3.2-8B-Instruct-agent-003-128k-code-DPO-GGUF

openbuddy-llama3.1-8b-v22.1-131k

OpenBuddy - Open Multilingual Chatbot

Links

https://huggingface.co/sunnyyy/openbuddy-llama3.1-8b-v22.1-131k-Q4_K_M-GGUF

nvidia_llama-3.1-8b-ultralong-1m-instruct

We introduce UltraLong-8B, a series of ultra-long context language models designed to process extensive sequences of text (up to 1M, 2M, and 4M tokens) while maintaining competitive performance on standard benchmarks. Built on Llama-3.1, UltraLong-8B leverages a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach enables our models to efficiently scale their context windows without sacrificing general performance.
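
To give a sense of what these context lengths mean for memory, here is a rough estimate of the key-value cache size. It assumes the standard Llama-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) and 16-bit cache entries. None of these figures come from the description above, and real usage in a given runtime will depend on cache quantization and implementation details.

```python
# Back-of-the-envelope KV cache size for a Llama-3.1-8B-style model.
# Assumed defaults: 32 layers, 8 KV heads (GQA), head_dim 128, 2 bytes per value.

def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return per_token * tokens

# "1M", "2M", "4M" are treated here as round decimal token counts.
for tokens in (1_000_000, 2_000_000, 4_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens: {gib:,.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, or about 122 GiB at one million tokens, so running the full advertised window locally generally calls for a quantized cache or a much shorter configured context.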
