LocalAI - Models

deepseek-v4-flash

# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

## Introduction

We present a preview version of the **DeepSeek-V4** series, including two strong Mixture-of-Experts (MoE) language models, **DeepSeek-V4-Pro** with 1.6T parameters (49B activated) and **DeepSeek-V4-Flash** with 284B parameters (13B activated), both supporting a context length of **one million tokens**.

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:

1. **Hybrid Attention Architecture:** We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only **27% of single-token inference FLOPs** and **10% of KV cache** compared with DeepSeek-V3.2.
2. **Manifold-Constrained Hyper-Connections (mHC):** We incorporate mHC to strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity.
3. **Muon Optimizer:** We employ the Muon optimizer for faster convergence and greater training stability. ...
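To make the "activated" figures concrete, the short sketch below computes the share of parameters each model uses per token. The parameter counts are taken from the description above; the helper name `active_fraction` and the `MODELS` table are illustrative only and not part of any DeepSeek or LocalAI API.

```python
# Parameter counts as quoted in the description: (total, activated per token).
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Return the share of parameters activated per token in an MoE model."""
    return active_params / total_params


if __name__ == "__main__":
    for name, (total, active) in MODELS.items():
        share = active_fraction(total, active)
        print(f"{name}: {share:.1%} of parameters active per token")
```

Under these numbers, roughly 3.1% of DeepSeek-V4-Pro and 4.6% of DeepSeek-V4-Flash is active for any single token, which is why per-token compute is far lower than the total parameter count suggests.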

Links

https://huggingface.co/unsloth/DeepSeek-V4-Flash-GGUF

Tags

chroma1-hd

Chroma1-HD is an 8.9B-parameter text-to-image foundation model derived from FLUX.1-schnell with reduced parameter count via architectural optimizations. Designed as a base for creators, researchers, and downstream fine-tuning. Recommended inference: 40 steps, CFG 3.0, bfloat16.
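As a rough illustration of what the CFG value controls, the sketch below applies the standard classifier-free guidance combination to two toy prediction vectors. Whether Chroma1-HD's pipeline uses exactly this formulation is an assumption; `RECOMMENDED` and `apply_cfg` are hypothetical names, and only the three settings (40 steps, CFG 3.0, bfloat16) come from the description above.

```python
# Settings quoted in the description; the dictionary itself is illustrative.
RECOMMENDED = {"num_steps": 40, "cfg_scale": 3.0, "dtype": "bfloat16"}


def apply_cfg(uncond, cond, scale=RECOMMENDED["cfg_scale"]):
    """Standard classifier-free guidance: move from the unconditional
    prediction towards the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Toy two-element "predictions" standing in for model outputs.
    print(apply_cfg([0.0, 1.0], [1.0, 1.0]))
```

With a scale of 1.0 the result equals the conditional prediction; larger values push further in the direction of the prompt.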

Links

https://huggingface.co/lodestones/Chroma1-HD

Tags

kimi-k2.6

## 1. Model Introduction

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

### Key Features

- **Long-Horizon Coding**: K2.6 achieves significant improvements on complex, end-to-end coding tasks, generalizing robustly across programming languages (Rust, Go, Python) and domains spanning front-end, DevOps, and performance optimization.
- **Coding-Driven Design**: K2.6 is capable of transforming simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with deliberate aesthetic precision.
- **Elevated Agent Swarm**: Scaling horizontally to 300 sub-agents executing 4,000 coordinated steps, K2.6 can dynamically decompose tasks into parallel, domain-specialized subtasks, delivering end-to-end outputs from documents to websites to spreadsheets in a single autonomous run.
- **Proactive & Open Orchestration**: For autonomous tasks, K2.6 demonstra ...

Links

https://huggingface.co/unsloth/Kimi-K2.6-GGUF

Tags

nanbeige4.1-3b-q8

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and represents an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors.

Key features:

- Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
- Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge.
- Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

Links

Tags

nanbeige4.1-3b-q4

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and represents an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors.

Key features:

- Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
- Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge.
- Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

Links

Tags

liquidai.lfm2-2.6b-transcript

This is a large language model (2.6B parameters) designed for text-generation tasks. It is a quantized version of the original model `LiquidAI/LFM2-2.6B-Transcript`, optimized for efficiency while retaining strong performance. The model is built on the foundation of the base model, with additional optimizations for deployment and use cases like transcription or language modeling. It is trained on large-scale text data and supports multiple languages.

Links

https://huggingface.co/DevQuasar/LiquidAI.LFM2-2.6B-Transcript-GGUF

Tags

zhi-create-qwen3-32b-i1

Zhi-Create-Qwen3-32B is a fine-tuned model derived from Qwen/Qwen3-32B, with a focus on enhancing creative writing capabilities. Through careful optimization, the model shows promising improvements in creative writing performance, as evaluated on WritingBench. In our evaluation, the model attains a score of 82.08 on WritingBench, a significant improvement over the base Qwen3-32B model's score of 78.97. Additionally, to maintain the model's general capabilities such as knowledge and reasoning, we performed fine-grained data mixture experiments combining general knowledge, mathematics, code, and other data types. The final evaluation results show that general capabilities remain stable, with no significant decline compared to the base model.

Links

Tags

gustavecortal_beck-8b

A language model that handles delicate life situations and tries to really help you. Beck is based on Piaget and was finetuned on psychotherapeutic preferences from PsychoCounsel-Preference.

Methodology: Beck was trained using preference optimization (ORPO) and LoRA. You can reproduce the results using my repo for lightweight preference optimization using this config that contains the hyperparameters. This work was performed using HPC resources (Jean Zay supercomputer) from GENCI-IDRIS (Grant 20XX-AD011014205).

Inspiration: Beck aims to reason about psychological and philosophical concepts such as self-image, emotion, and existence. Beck was inspired by my position paper on emotion analysis: Improving Language Models for Emotion Analysis: Insights from Cognitive Science.

Links

Tags

gustavecortal_beck-0.6b

A language model that handles delicate life situations and tries to really help you. Beck is based on Piaget and was finetuned on psychotherapeutic preferences from PsychoCounsel-Preference.

Methodology: Beck was trained using preference optimization (ORPO) and LoRA. You can reproduce the results using my repo for lightweight preference optimization using this config that contains the hyperparameters. This work was performed using HPC resources (Jean Zay supercomputer) from GENCI-IDRIS (Grant 20XX-AD011014205).

Inspiration: Beck aims to reason about psychological and philosophical concepts such as self-image, emotion, and existence. Beck was inspired by my position paper on emotion analysis: Improving Language Models for Emotion Analysis: Insights from Cognitive Science.

Links

Tags

gustavecortal_beck-1.7b

A language model that handles delicate life situations and tries to really help you. Beck is based on Piaget and was finetuned on psychotherapeutic preferences from PsychoCounsel-Preference.

Methodology: Beck was trained using preference optimization (ORPO) and LoRA. You can reproduce the results using my repo for lightweight preference optimization using this config that contains the hyperparameters. This work was performed using HPC resources (Jean Zay supercomputer) from GENCI-IDRIS (Grant 20XX-AD011014205).

Inspiration: Beck aims to reason about psychological and philosophical concepts such as self-image, emotion, and existence. Beck was inspired by my position paper on emotion analysis: Improving Language Models for Emotion Analysis: Insights from Cognitive Science.

Links

Tags

gustavecortal_beck-4b

A language model that handles delicate life situations and tries to really help you. Beck is based on Piaget and was finetuned on psychotherapeutic preferences from PsychoCounsel-Preference.

Methodology: Beck was trained using preference optimization (ORPO) and LoRA. You can reproduce the results using my repo for lightweight preference optimization using this config that contains the hyperparameters. This work was performed using HPC resources (Jean Zay supercomputer) from GENCI-IDRIS (Grant 20XX-AD011014205).

Inspiration: Beck aims to reason about psychological and philosophical concepts such as self-image, emotion, and existence. Beck was inspired by my position paper on emotion analysis: Improving Language Models for Emotion Analysis: Insights from Cognitive Science.

Links

Tags

llmevollama-3.1-8b-v0.1-i1

This project aims to optimize model merging by integrating LLMs into evolutionary strategies in a novel way. Instead of using the CMA-ES approach, the goal is to improve model optimization by leveraging the search capabilities of LLMs to explore the parameter space more efficiently and adjust the search scope based on high-performing solutions. Currently, the project supports optimization only within the Parameter Space, but I plan to extend its functionality to enable merging and optimization in the Data Flow Space as well. This will further enhance model merging by optimizing the interaction between data flow and parameters.

Links

Tags

facebook_kernelllm

We introduce KernelLLM, a large language model based on Llama 3.1 Instruct, which has been trained specifically for the task of authoring GPU kernels using Triton. KernelLLM translates PyTorch modules into Triton kernels and was evaluated on KernelBench-Triton. KernelLLM aims to democratize GPU programming by making kernel development more accessible and efficient.

KernelLLM's vision is to meet the growing demand for high-performance GPU kernels by automating the generation of efficient Triton implementations. As workloads grow larger and more diverse accelerator architectures emerge, the need for tailored kernel solutions has increased significantly. Although a number of works exist, most of them are limited to test-time optimization, while others tune on solution traces of the KernelBench problems themselves, thereby limiting the informativeness of the results towards out-of-distribution generalization. To the best of our knowledge, KernelLLM is the first LLM finetuned on external (torch, triton) pairs, and we hope that making our model available can accelerate progress towards intelligent kernel authoring systems.

KernelLLM Workflow for Triton Kernel Generation: Our approach uses KernelLLM to translate PyTorch code (green) into Triton kernel candidates. Input and output components are marked in bold. The generations are validated against unit tests, which run kernels with random inputs of known shapes. This workflow allows us to evaluate multiple generations (pass@k) by increasing the number of kernel candidate generations. The best kernel implementation is selected and returned (green output).

The model was trained on approximately 25,000 paired examples of PyTorch modules and their equivalent Triton kernel implementations, and additional synthetically generated samples. Our approach combines filtered code from TheStack [Kocetkov et al. 2022] and synthetic examples generated through torch.compile() and additional prompting techniques. The filtered and compiled dataset is [KernelBook](https://huggingface.co/datasets/GPUMODE/KernelBook).

We finetuned Llama3.1-8B-Instruct on the created dataset using supervised instruction tuning and measured its ability to generate correct Triton kernels and corresponding calling code on KernelBench-Triton, our newly created variant of KernelBench [Ouyang et al. 2025] targeting Triton kernel generation. The torch code was used with a prompt template containing a format example as instruction during both training and evaluation. The model was trained for 10 epochs with a batch size of 32 and a standard SFT recipe with hyperparameters selected by perplexity on a held-out subset of the training data. Training took circa 12 hours wall clock time on 16 GPUs (192 GPU hours), and we report the best checkpoint's validation results.
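The pass@k evaluation mentioned above can be sketched with the commonly used unbiased estimator, which computes the probability that at least one of k sampled generations is correct given n generations of which c passed the unit tests. Whether KernelLLM's evaluation used this exact estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total kernel candidates generated for a problem
    c: number of candidates that passed the unit tests
    k: number of candidates one is allowed to try
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 candidates, 3 correct, scored at k = 1 and k = 5.
    print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

Generating more candidates per problem raises the chance that at least one passes validation, which is the effect the workflow relies on when it selects the best kernel.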

Links

Tags

knoveleng_open-rs3

This repository hosts the model for the Open RS project, accompanying the paper Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t. The project explores enhancing reasoning capabilities in small large language models (LLMs) using reinforcement learning (RL) under resource-constrained conditions. We focus on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, trained on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. By adapting the Group Relative Policy Optimization (GRPO) algorithm and leveraging a curated, compact mathematical reasoning dataset, we conducted three experiments to assess performance and behavior. Key findings include:

- Significant reasoning improvements, e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, outperforming o1-preview.
- Efficient training with just 7,000 samples at a cost of $42, compared to thousands of dollars for baseline models.
- Challenges like optimization instability and length constraints with extended training.

These results showcase RL-based fine-tuning as a cost-effective approach for small LLMs, making reasoning capabilities accessible in resource-limited settings. We open-source our code, models, and datasets to support further research.
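The core idea behind GRPO, scoring each sampled answer relative to the other answers drawn for the same prompt, can be sketched as below. This is a minimal illustration of the group-relative advantage step only, not the Open RS training code; the function name, the `eps` constant, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several completions of ONE prompt.
    Returns (r - group mean) / (group std + eps) for each completion.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one math problem: 1.0 = correct, 0.0 = wrong.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline comes from the group itself, no separate value network is needed, which is one reason the approach suits small training budgets such as the one described above.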

Links

Tags

deepseek-ai_deepseek-r1-0528-qwen3-8b

The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528. In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by leveraging increased computational resources and introducing algorithmic optimization mechanisms during post-training. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic. Its overall performance is now approaching that of leading models, such as O3 and Gemini 2.5 Pro.

Links

Tags

llm-compiler-13b-imat

LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning. LLM Compiler is free for both research and commercial use. LLM Compiler is available in two flavors: LLM Compiler, the foundational models, pretrained on over 500B tokens of LLVM-IR, x86_64, ARM, and CUDA assembly codes and trained to predict the effect of LLVM optimizations; and LLM Compiler FTD, which is further fine-tuned to predict the best optimizations for code in LLVM assembly to reduce code size, and to disassemble assembly code to LLVM-IR.

Links

Tags

llm-compiler-13b-ftd

LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning. LLM Compiler is free for both research and commercial use. LLM Compiler is available in two flavors: LLM Compiler, the foundational models, pretrained on over 500B tokens of LLVM-IR, x86_64, ARM, and CUDA assembly codes and trained to predict the effect of LLVM optimizations; and LLM Compiler FTD, which is further fine-tuned to predict the best optimizations for code in LLVM assembly to reduce code size, and to disassemble assembly code to LLVM-IR.

Links

Tags

llm-compiler-7b-imat-GGUF

LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning. LLM Compiler is free for both research and commercial use. LLM Compiler is available in two flavors: LLM Compiler, the foundational models, pretrained on over 500B tokens of LLVM-IR, x86_64, ARM, and CUDA assembly codes and trained to predict the effect of LLVM optimizations; and LLM Compiler FTD, which is further fine-tuned to predict the best optimizations for code in LLVM assembly to reduce code size, and to disassemble assembly code to LLVM-IR.

Links

Tags

llm-compiler-7b-ftd-imat

LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning. LLM Compiler is free for both research and commercial use. LLM Compiler is available in two flavors: LLM Compiler, the foundational models, pretrained on over 500B tokens of LLVM-IR, x86_64, ARM, and CUDA assembly codes and trained to predict the effect of LLVM optimizations; and LLM Compiler FTD, which is further fine-tuned to predict the best optimizations for code in LLVM assembly to reduce code size, and to disassemble assembly code to LLVM-IR.

Links

Tags
