LocalAI - Models

deepseek-v4-flash

# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence Technical Report

## Introduction

We present a preview version of the **DeepSeek-V4** series, including two strong Mixture-of-Experts (MoE) language models, **DeepSeek-V4-Pro** with 1.6T parameters (49B activated) and **DeepSeek-V4-Flash** with 284B parameters (13B activated), both supporting a context length of **one million tokens**.

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:

1. **Hybrid Attention Architecture:** We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only **27% of single-token inference FLOPs** and **10% of KV cache** compared with DeepSeek-V3.2.
2. **Manifold-Constrained Hyper-Connections (mHC):** We incorporate mHC to strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity.
3. **Muon Optimizer:** We employ the Muon optimizer for faster convergence and greater training stability.

...
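As a quick illustration of what "activated" means for the two MoE models described above, the short sketch below computes the share of parameters used per token from the figures quoted in the description. The dictionary layout and the function name `active_fraction` are illustrative choices, not part of any DeepSeek or LocalAI API.

```python
# Parameter counts quoted in the description above, in billions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated for each token."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return active_b / total_b


if __name__ == "__main__":
    for name, params in MODELS.items():
        share = active_fraction(params["total_b"], params["active_b"])
        print(f"{name}: {share:.1%} of parameters active per token")
```

Both models activate only a few percent of their weights per token, which is why per-token compute is far lower than the total parameter count alone would suggest.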

Links

https://huggingface.co/unsloth/DeepSeek-V4-Flash-GGUF

Tags

step-3.7-flash

**[ModelPage]**: https://static.stepfun.com/blog/step-3.7-flash/

## 1. Introduction

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.

We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.

## 2. Capabilities & Performance

### Multimodal Perception and Verification

...

Links

https://huggingface.co/unsloth/Step-3.7-Flash-GGUF

Tags

qwen3.5-9b-deepseek-v4-flash

# Qwen3.5-9B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

## Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- **Efficient Hybrid Architecture**: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

...

Links

https://huggingface.co/Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash-GGUF

Tags

glm-4.7-flash-derestricted

This model is a quantized version of `koute/GLM-4.7-Flash-Derestricted`. It carries the tags "derestricted," "uncensored," and "unlimited." The quantized variants (e.g., Q2_K, Q4_K_S, Q6_K) offer different trade-offs between accuracy and efficiency; Q4_K_S and Q6_K are recommended for balanced performance. The model is optimized for fast inference and supports multiple quantization schemes, though some advanced quantization options (like IQ4_XS) are not available.

Links

https://huggingface.co/mradermacher/GLM-4.7-Flash-Derestricted-GGUF

Tags

huihui-glm-4.7-flash-abliterated-i1

The model is a quantized version of **huihui-ai/Huihui-GLM-4.7-Flash-abliterated**, optimized for efficiency and deployment. It uses GGUF files with various quantization levels (e.g., IQ1_M, IQ2_XXS, Q4_K_M) and is designed for tasks requiring low-resource deployment. Key features include:

- **Base Model**: huihui-ai/Huihui-GLM-4.7-Flash-abliterated.
- **Quantization**: Supports IQ1_M to Q4_K_M, balancing accuracy and efficiency.
- **Use Cases**: Suitable for applications needing lightweight inference, such as edge devices or resource-constrained environments.
- **Downloads**: Available in GGUF format with varying quality and size (e.g., 0.2GB to 18.2GB).
- **Tags**: Abliterated, uncensored, and optimized for specific tasks.

The base model is a modified (abliterated) version of the original GLM-4.7-Flash; this entry packages it with quantized weights for deployment.

Links

https://huggingface.co/mradermacher/Huihui-GLM-4.7-Flash-abliterated-i1-GGUF

Tags

glm-4.7-flash

**GLM-4.7-Flash** is a 30B-A3B MoE (Mixture-of-Experts) model, that is, roughly 30B total parameters with about 3B activated per token, designed for efficient deployment. It reports strong results on benchmarks such as AIME 25, GPQA, and τ²-Bench while balancing performance and efficiency. Optimized for lightweight use cases, it supports inference via frameworks like vLLM and SGLang, with detailed deployment instructions in the official repository. It is suited to applications that need high-quality text generation with minimal resource consumption.

Links

https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF

Tags

deepseek-v4-flash-q2

DeepSeek V4 Flash (IQ2XXS GGUF, ~81 GB) - only loadable via the ds4 backend. Requires >=128 GB RAM. Metal (Darwin) or CUDA (Linux). See https://github.com/antirez/ds4 for details.

Links

https://huggingface.co/antirez/deepseek-v4-gguf

Tags

deepseek-v4-flash-q2-q4

DeepSeek V4 Flash (mixed q2/q4 GGUF, ~91 GB) - only loadable via the ds4 backend. The last 6 expert layers are kept at Q4_K (the rest IQ2XXS), trading a little extra memory for higher quality than the pure-q2 build while still fitting in RAM on a 128 GB machine. imatrix-tuned. Metal (Darwin) or CUDA (Linux). See https://github.com/antirez/ds4 for details.
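The memory trade-off described for the ds4 builds can be checked with simple arithmetic. The sketch below uses the approximate file sizes quoted in this gallery's ds4 entries and a 128 GB machine. The `reserve_gb` headroom for the OS, KV cache, and other buffers is a hypothetical placeholder, not a figure from ds4, so treat the result as a rough sanity check rather than a guarantee.

```python
RAM_GB = 128.0

# Approximate GGUF sizes quoted in the gallery entries, in GB.
BUILD_SIZES_GB = {
    "deepseek-v4-flash-q2": 81.0,
    "deepseek-v4-flash-q2-q4": 91.0,
    "deepseek-v4-flash-q4-ssd": 153.0,
}


def fits_in_ram(weights_gb: float, ram_gb: float = RAM_GB,
                reserve_gb: float = 16.0) -> bool:
    """True if the weights plus an assumed reserve fit in RAM.

    reserve_gb is a hypothetical allowance for everything that is not
    model weights; it is not a documented ds4 number.
    """
    return weights_gb + reserve_gb <= ram_gb


if __name__ == "__main__":
    for name, size_gb in BUILD_SIZES_GB.items():
        verdict = "fits" if fits_in_ram(size_gb) else "does not fit"
        print(f"{name}: ~{size_gb:.0f} GB, {verdict} in {RAM_GB:.0f} GB RAM")
```

Under these assumptions the q2 and mixed q2/q4 builds stay resident in memory, while the full 4-bit build exceeds RAM and has to rely on streaming from disk.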

Links

https://huggingface.co/antirez/deepseek-v4-gguf

Tags

deepseek-v4-flash-q4-ssd

DeepSeek V4 Flash (full 4-bit experts GGUF, ~153 GB) - only loadable via the ds4 backend, with SSD streaming enabled so it runs on a 128 GB machine even though the weights do not fit in RAM: routed MoE experts stream from the GGUF on SSD while the non-routed weights stay resident. SSD streaming is Metal (Darwin) only; generation speed depends on SSD speed and the expert cache. Tune the routed-expert cache with the 'ssd_streaming_cache_experts:NGB' option (default: automatic budget). See https://github.com/antirez/ds4.
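To make the idea of a routed-expert cache concrete, here is a minimal sketch of a least-recently-used (LRU) cache that loads expert weights from disk on a miss. This is not ds4's implementation: the eviction policy, the `ExpertCache` class, the `load_from_disk` callback, and sizing the cache in number of experts rather than GB are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertCache:
    """Toy LRU cache for routed-expert weights (hypothetical design)."""

    def __init__(self, capacity: int,
                 load_from_disk: Callable[[Hashable], bytes]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_from_disk
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> bytes:
        """Return the weights for expert_id, reading from disk on a miss."""
        if expert_id in self._entries:
            # Mark as most recently used.
            self._entries.move_to_end(expert_id)
            self.hits += 1
            return self._entries[expert_id]
        self.misses += 1
        weights = self._load(expert_id)  # the slow path: an SSD read
        self._entries[expert_id] = weights
        if len(self._entries) > self.capacity:
            # Evict the least recently used expert.
            self._entries.popitem(last=False)
        return weights


def fake_loader(expert_id: Hashable) -> bytes:
    """Stand-in for reading one expert's tensors from the GGUF file."""
    return f"weights-for-{expert_id}".encode()


if __name__ == "__main__":
    cache = ExpertCache(capacity=2, load_from_disk=fake_loader)
    for expert in [0, 1, 0, 2, 1]:
        cache.get(expert)
    print(f"hits={cache.hits} misses={cache.misses}")
```

Every miss in such a scheme costs one disk read, which matches the entry's note that generation speed depends on SSD speed and on how large the expert cache is.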

Links

https://huggingface.co/antirez/deepseek-v4-gguf

Tags

deepseek-v4-flash-q2-mtp

DeepSeek V4 Flash (IQ2XXS GGUF, ~81 GB) paired with the optional MTP speculative-decoding weights (~3.5 GB) for a slight speedup. Only loadable via the ds4 backend; requires >=128 GB RAM. MTP helps only with greedy decoding (temperature 0), so the override pins temperature to 0. Metal (Darwin) or CUDA (Linux). See https://github.com/antirez/ds4 for details.
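The link between speculative decoding and greedy decoding can be illustrated with a toy sketch. Under greedy decoding a drafted token is accepted only if it equals the token the main model would have picked, so the output is identical to plain greedy decoding and only the number of main-model steps changes. The code below is not ds4's MTP implementation: `draft_next` stands in for the MTP head, `target_next` for the main model, and both toy models are hypothetical. A real implementation would verify all drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: Sequence[int], k: int) -> List[int]:
    """Draft k tokens, then keep the prefix the target model agrees with."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(list(context) + proposal))

    accepted: List[int] = []
    for token in proposal:
        expected = target_next(list(context) + accepted)
        if token != expected:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched; the target contributes one more token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


def greedy_decode(target_next: NextToken, context: Sequence[int],
                  n: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    out = list(context)
    while len(out) - len(context) < n:
        out.append(target_next(out))
    return out[len(context):]


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       context: Sequence[int], n: int, k: int = 3) -> List[int]:
    """Greedy decoding accelerated by a draft model; same output as greedy."""
    out = list(context)
    while len(out) - len(context) < n:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[len(context):][:n]


def toy_target(ctx: Sequence[int]) -> int:
    """Hypothetical main model: a deterministic function of the context."""
    return (sum(ctx) + 1) % 7


def toy_draft(ctx: Sequence[int]) -> int:
    """Hypothetical draft model: agrees with the target except sometimes."""
    return toy_target(ctx) if len(ctx) % 3 else 0


if __name__ == "__main__":
    prompt = [1, 2]
    print(speculative_decode(toy_target, toy_draft, prompt, n=10))
    print(greedy_decode(toy_target, prompt, n=10))
```

Both calls print the same token sequence. With sampling at a temperature above 0 there is no single expected token to compare against, which is consistent with the entry pinning temperature to 0.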

Links

https://huggingface.co/antirez/deepseek-v4-gguf

Tags
