LocalAI - Models

kimi-k2.7-code

## 1. Model Introduction Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6. ## 2. Model Summary ## 3. Evaluation Results Benchmark Kimi K2.6 Kimi K2.7 Code GPT-5.5 Claude Opus 4.8 Coding Kimi Code Bench v2 50.9 62.0 69.0 67.4 Program Bench 48.3 53.6 69.1 63.8 MLS Bench Lite 26.7 35.1 35.5 42.8 Agentic Kimi Claw 24/7 Bench 42.9 46.9 52.8 50.4 MCP Atlas 69.4 76.0 79.4 81.3 MCP Mark Verified 72.8 81.1 92.9 76.4 Footnotes ...

Links

https://huggingface.co/unsloth/Kimi-K2.7-Code-GGUF

Tags

qwen_qwen3.5-2b

Qwen3.5-2B is a highly efficient, instruction-tuned multilingual language model available in various quantized GGUF formats. Optimized for llama-cpp inference, it supports chat and completion tasks with strong performance on low-RAM hardware. The model is available in multiple quantization levels ranging from Q8_0 to IQ2_M to balance quality and resource usage.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-2B-GGUF

Tags

onerec-8b

The model `mradermacher/OneRec-8B-GGUF` is a quantized version of the base model `OpenOneRec/OneRec-8B`, a large language model designed for tasks like recommendations or content generation. It is optimized for efficiency with various quantization schemes (e.g., Q2_K, Q4_K, Q8_0) and available in multiple sizes (3.5–9.0 GB). The model uses the GGUF format and is licensed under Apache-2.0. Key features include: - **Base Model**: `OpenOneRec/OneRec-8B` (a pre-trained language model for recommendations). - **Quantization**: Supports multiple quantized variants (Q2_K, Q3_K, Q4_K, etc.), with the best quality for `Q4_K_S` and `Q8_0`. - **Sizes**: Available in sizes ranging from 3.5 GB (Q2_K) to 9.0 GB (Q8_0), with faster speeds for lower-bit quantized versions. - **Usage**: Compatible with GGUF files, suitable for deployment in applications requiring efficient model inference. - **Licence**: Apache-2.0, available at [https://huggingface.co/OpenOneRec/OneRec-8B/blob/main/LICENSE](https://huggingface.co/OpenOneRec/OneRec-8B/blob/main/LICENSE). For detailed specifications, refer to the [model page](https://hf.tst.eu/model#OneRec-8B-GGUF).

Links

https://huggingface.co/mradermacher/OneRec-8B-GGUF

Tags

gemma-3-glitter-12b-i1

A creative writing model based on Gemma 3 12B IT. This is a 50/50 merge of two separate trains: ToastyPigeon/g3-12b-rp-system-v0.1 - ~13.5M tokens of instruct-based training related to RP (2:1 human to synthetic) and examples using a system prompt. ToastyPigeon/g3-12b-storyteller-v0.2-textonly - ~20M tokens of completion training on long-form creative writing; 1.6M synthetic from R1, the rest human-created

Links

Tags

watt-ai_watt-tool-70b

watt-tool-70B is a fine-tuned language model based on LLaMa-3.3-70B-Instruct, optimized for tool usage and multi-turn dialogue. It achieves state-of-the-art performance on the Berkeley Function-Calling Leaderboard (BFCL). Model Description This model is specifically designed to excel at complex tool usage scenarios that require multi-turn interactions, making it ideal for empowering platforms like Lupan, an AI-powered workflow building tool. By leveraging a carefully curated and optimized dataset, watt-tool-70B demonstrates superior capabilities in understanding user requests, selecting appropriate tools, and effectively utilizing them across multiple turns of conversation. Target Application: AI Workflow Building as in https://lupan.watt.chat/ and Coze. Key Features Enhanced Tool Usage: Fine-tuned for precise and efficient tool selection and execution. Multi-Turn Dialogue: Optimized for maintaining context and effectively utilizing tools across multiple turns of conversation, enabling more complex task completion. State-of-the-Art Performance: Achieves top performance on the BFCL, demonstrating its capabilities in function calling and tool usage. Based on LLaMa-3.1-70B-Instruct: Inherits the strong language understanding and generation capabilities of the base model.

Links

Tags

yi-1.5-9b-chat

Yi-1.5-9B-Chat is a quantized GGUF model optimized for local inference. It delivers strong performance in coding, math, and reasoning while maintaining excellent instruction-following capabilities. Suitable for chat and completion tasks on consumer hardware.

Links

Tags

openvino-wizardlm2

WizardLM-2 7B instruction-tuned language model optimized for OpenVINO backend. Supports conversational chat and text completion with 8192 context window.

Links

https://huggingface.co/fakezeta/Not-WizardLM-2-7B-ov-int8

Tags

qwen3-coder-reap-25b-a3b-i1

**Model Name:** Qwen3-Coder-REAP-25B-A3B (Base Model: cerebras/Qwen3-Coder-REAP-25B-A3B) **Model Type:** Large Language Model (LLM) for Code Generation **Architecture:** Mixture-of-Experts (MoE) – Qwen3-Coder variant **Size:** 25B parameters (with 3 active experts at inference time) **License:** Apache 2.0 **Library:** Hugging Face Transformers **Language Support:** Primarily English, optimized for coding tasks across multiple programming languages **Description:** The **Qwen3-Coder-REAP-25B-A3B** is a high-performance, open-source, Mixture-of-Experts (MoE) language model developed by Cerebras Systems, specifically fine-tuned for advanced code generation and reasoning. Built on the Qwen3 architecture, this model excels in understanding complex codebases, generating syntactically correct and semantically meaningful code, and solving programming challenges across diverse domains. This version is the **original, unquantized base model** and serves as the foundation for various quantized GGUF variants (e.g., by mradermacher), which are optimized for local inference with reduced memory footprint while preserving strong performance. Ideal for developers, AI researchers, and engineers working on code completion, debugging, documentation generation, and automated software development workflows. ✅ **Key Features:** - State-of-the-art code generation - 25B parameter scale with expert routing - MoE architecture for efficient inference - Full compatibility with Hugging Face Transformers - Designed for real-world coding tasks **Base Model Repository:** [cerebras/Qwen3-Coder-REAP-25B-A3B](https://huggingface.co/cerebras/Qwen3-Coder-REAP-25B-A3B) **Quantized Versions:** Available via [mradermacher/Qwen3-Coder-REAP-25B-A3B-i1-GGUF](https://huggingface.co/mradermacher/Qwen3-Coder-REAP-25B-A3B-i1-GGUF) (for local inference with GGUF) > 🔍 **Note:** The quantized versions (e.g., GGUF) are optimized for performance on consumer hardware and are not the original model. For the full, unquantized model description, refer to the base model above.

Links

https://huggingface.co/mradermacher/Qwen3-Coder-REAP-25B-A3B-i1-GGUF

Tags

ibm-granite.granite-4.0-1b

### **Granite-4.0-1B** *By IBM | Apache 2.0 License* **Overview:** Granite-4.0-1B is a lightweight, instruction-tuned language model designed for efficient on-device and research use. Built on a decoder-only dense transformer architecture, it delivers strong performance in instruction following, code generation, tool calling, and multilingual tasks—making it ideal for applications requiring low latency and minimal resource usage. **Key Features:** - **Size:** 1.6 billion parameters (1B Dense), optimized for efficiency. - **Capabilities:** - Text generation, summarization, question answering - Code completion and function calling (e.g., API integration) - Multilingual support (English, Spanish, French, German, Japanese, Chinese, Arabic, Korean, Portuguese, Italian, Dutch, Czech) - Robust safety and alignment via instruction tuning and reinforcement learning - **Architecture:** Uses GQA (Grouped Query Attention), SwiGLU activation, RMSNorm, shared input/output embeddings, and RoPE position embeddings. - **Context Length:** Up to 128K tokens — suitable for long-form content and complex reasoning. - **Training:** Finetuned from *Granite-4.0-1B-Base* using open-source datasets, synthetic data, and human-curated instruction pairs. **Performance Highlights (1B Dense):** - **MMLU (5-shot):** 59.39 - **HumanEval (pass@1):** 74 - **IFEval (Alignment):** 80.82 - **GSM8K (8-shot):** 76.35 - **SALAD-Bench (Safety):** 93.44 **Use Cases:** - On-device AI applications - Research and prototyping - Fine-tuning for domain-specific tasks - Low-resource environments with high performance expectations **Resources:** - [Hugging Face Model](https://huggingface.co/ibm-granite/granite-4.0-1b) - [Granite Docs](https://www.ibm.com/granite/docs/) - [GitHub Repository](https://github.com/ibm-granite/granite-4.0-nano-language-models) > *“Make knowledge free for everyone.” – IBM Granite Team*

Links

https://huggingface.co/DevQuasar/ibm-granite.granite-4.0-1b-GGUF

Tags

Model Gallery

Filter by type:

Filter by tags:

kimi-k2.7-code

qwen_qwen3.5-2b

onerec-8b

gemma-3-glitter-12b-i1

watt-ai_watt-tool-70b

yi-1.5-9b-chat

openvino-wizardlm2

qwen3-coder-reap-25b-a3b-i1

ibm-granite.granite-4.0-1b