LocalAI - Models

qwen3-8b-jailbroken

This jailbroken LLM is released strictly for academic research purposes in AI safety and model alignment studies. The author bears no responsibility for any misuse or harm resulting from the deployment of this model. Users must comply with all applicable laws and ethical guidelines when conducting research. A jailbroken Qwen3-8B model using weight orthogonalization[1]. Implementation script: https://gist.github.com/cooperleong00/14d9304ba0a4b8dba91b60a873752d25 [1]: Arditi, Andy, et al. "Refusal in language models is mediated by a single direction." arXiv preprint arXiv:2406.11717 (2024).

Links

Tags

granite-3.0-1b-a400m-instruct

Granite 3.0 language models are a new set of lightweight state-of-the-art, open foundation models that natively support multilinguality, coding, reasoning, and tool usage, including the potential to be run on constrained compute resources. All the models are publicly released under an Apache 2.0 license for both research and commercial use. The models' data curation and training procedure were designed for enterprise usage and customization in mind, with a process that evaluates datasets for governance, risk and compliance (GRC) criteria, in addition to IBM's standard data clearance process and document quality checks. Granite 3.0 includes 4 different models of varying sizes: Dense Models: 2B and 8B parameter models, trained on 12 trillion tokens in total. Mixture-of-Expert (MoE) Models: Sparse 1B and 3B MoE models, with 400M and 800M activated parameters respectively, trained on 10 trillion tokens in total. Accordingly, these options provide a range of models with different compute requirements to choose from, with appropriate trade-offs with their performance on downstream tasks. At each scale, we release a base model — checkpoints of models after pretraining, as well as instruct checkpoints — models finetuned for dialogue, instruction-following, helpfulness, and safety.

Links

Tags

llama-3.2-1b-instruct:q4_k_m

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Developer: Meta Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF

Tags

llama-3.2-3b-instruct:q4_k_m

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Developer: Meta Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

https://huggingface.co/hugging-quants/Llama-3.2-3B-Instruct-Q4_K_M-GGUF

Tags

llama-3.2-3b-instruct:q8_0

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Developer: Meta Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

https://huggingface.co/hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF

Tags

llama-3.2-1b-instruct:q8_0

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Developer: Meta Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF

Tags

meta-llama-3.1-8b-instruct

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks. Model developer: Meta Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

Tags

meta-llama-3.1-70b-instruct

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks. Model developer: Meta Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

Tags

meta-llama-3.1-8b-claude-imat

Meta-Llama-3.1-8B-Claude-iMat-GGUF: Quantized from Meta-Llama-3.1-8B-Claude fp16. Weighted quantizations were creating using fp16 GGUF and groups_merged.txt in 88 chunks and n_ctx=512. Static fp16 will also be included in repo. For a brief rundown of iMatrix quant performance, please see this PR. All quants are verified working prior to uploading to repo for your safety and convenience.

Links

Tags

sekhmet_aleph-l3.1-8b-v0.1-i1

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks. Model developer: Meta Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

Tags

llama-guard-3-8b

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated. Llama Guard 3 was aligned to safeguard against the MLCommons standardized hazards taxonomy and designed to support Llama 3.1 capabilities. Specifically, it provides content moderation in 8 languages, and was optimized to support safety and security for search and code interpreter tool calls.

Links

Tags

krutrim-ai-labs_krutrim-2-instruct

Krutrim-2 is a 12B parameter language model developed by the OLA Krutrim team. It is built on the Mistral-NeMo 12B architecture and trained across various domains, including web data, code, math, Indic languages, Indian context data, synthetic data, and books. Following pretraining, the model was finetuned for instruction following on diverse data covering a wide range of tasks, including knowledge recall, math, reasoning, coding, safety, and creative writing.

Links

Tags

cydonia-24b-v4.2.0-i1

**Cydonia-24B-v4.2.0** is a creatively oriented, large language model developed by *TheDrummer*, based on the **Mistral-Small-3.2-24B-Instruct-2507** foundation. Fine-tuned for dynamic storytelling, imaginative writing, and expressive roleplay, it excels in narrative coherence, linguistic flair, and non-aligned, open-ended interaction. Designed for users seeking creativity over strict alignment, the model delivers rich, engaging, and often surprising outputs—ideal for fiction writing, worldbuilding, and entertainment-focused AI use. **Key Features:** - Built on Mistral-Small-3.2-24B-Instruct-2507 base - Optimized for creative writing, roleplay, and narrative depth - Minimal alignment constraints for greater freedom and expression - Available in GGUF, EXL3, and iMatrix formats for local inference > *“This is the best model of yours I've tried yet… It writes superbly well.”* – User testimonial **Best For:** Writers, worldbuilders, and creators who value imagination, voice, and stylistic richness over rigid safety or factual accuracy. *Model Repository:* [TheDrummer/Cydonia-24B-v4.2.0](https://huggingface.co/TheDrummer/Cydonia-24B-v4.2.0)

Links

https://huggingface.co/mradermacher/Cydonia-24B-v4.2.0-i1-GGUF

Tags

qwen-sea-lion-v4-32b-it-i1

**Model Name:** Qwen-SEA-LION-v4-32B-IT **Base Model:** Qwen3-32B **Type:** Instruction-tuned Large Language Model (LLM) **Language Support:** 11 languages including English, Mandarin, Burmese, Indonesian, Malay, Filipino, Tamil, Thai, Vietnamese, Khmer, and Lao **Context Length:** 128,000 tokens **Repository:** [aisingapore/Qwen-SEA-LION-v4-32B-IT](https://huggingface.co/aisingapore/Qwen-SEA-LION-v4-32B-IT) **License:** [Qwen Terms of Service](https://qwen.ai/termsservice) / [Qwen Usage Policy](https://qwen.ai/usagepolicy) **Overview:** Qwen-SEA-LION-v4-32B-IT is a high-performance, multilingual instruction-tuned LLM developed by AI Singapore, specifically optimized for Southeast Asia (SEA). Built on the Qwen3-32B foundation, it underwent continued pre-training on 100B tokens from the SEA-Pile v2 corpus and further fine-tuned on ~8 million question-answer pairs to enhance instruction-following and reasoning. Designed for real-world multilingual applications across government, education, and business sectors in Southeast Asia, it delivers strong performance in dialogue, content generation, and cross-lingual tasks. **Key Features:** - Trained for 11 major SEA languages with high linguistic accuracy - 128K token context for long-form content and complex reasoning - Optimized for instruction following, multi-turn dialogue, and cultural relevance - Available in full precision and quantized variants (4-bit/8-bit) - Not safety-aligned — suitable for downstream safety fine-tuning **Use Cases:** - Multilingual chatbots and virtual assistants in SEA regions - Cross-lingual content generation and translation - Educational tools and public sector applications in Southeast Asia - Research and development in low-resource language modeling **Note:** This model is not safety-aligned. Use with caution and consider additional alignment measures for production deployment. **Contact:** [[email protected]](mailto:[email protected]) for inquiries.

Links

https://huggingface.co/mradermacher/Qwen-SEA-LION-v4-32B-IT-i1-GGUF

Tags

simia-tau-sft-qwen3-8b

The **Simia-Tau-SFT-Qwen3-8B** is a fine-tuned version of the Qwen3-8B language model, developed by Simia-Agent and adapted for enhanced instruction-following capabilities. This model is optimized for dialogue and task-oriented interactions, making it highly effective for real-world applications requiring nuanced understanding and coherent responses. The model is available in multiple quantized formats (GGUF), including Q4_K_S, Q5_K_M, Q8_0, and others, enabling efficient deployment across devices with varying computational resources. These quantized versions maintain strong performance while reducing memory footprint and inference latency. While this repository hosts a quantized variant (specifically designed for GGUF-based inference via tools like llama.cpp), the original base model is **Qwen3-8B**, a large-scale open-source language model from Alibaba Cloud. The fine-tuning (SFT) process improves its alignment with human intent and enhances its ability to follow complex instructions. > 🔍 **Note**: This is a quantized version; for the full-precision base model, refer to [Simia-Agent/Simia-Tau-SFT-Qwen3-8B](https://huggingface.co/Simia-Agent/Simia-Tau-SFT-Qwen3-8B) on Hugging Face. **Use Case**: Ideal for chatbots, assistant systems, and interactive applications requiring strong reasoning, safety, and fluency. **Model Size**: 8B parameters (quantized for efficiency). **License**: See the original model's license (typically Apache 2.0 for Qwen series). 👉 Recommended for edge deployment with GGUF-compatible tools.

Links

https://huggingface.co/mradermacher/Simia-Tau-SFT-Qwen3-8B-GGUF

Tags

ibm-granite.granite-4.0-1b

### **Granite-4.0-1B** *By IBM | Apache 2.0 License* **Overview:** Granite-4.0-1B is a lightweight, instruction-tuned language model designed for efficient on-device and research use. Built on a decoder-only dense transformer architecture, it delivers strong performance in instruction following, code generation, tool calling, and multilingual tasks—making it ideal for applications requiring low latency and minimal resource usage. **Key Features:** - **Size:** 1.6 billion parameters (1B Dense), optimized for efficiency. - **Capabilities:** - Text generation, summarization, question answering - Code completion and function calling (e.g., API integration) - Multilingual support (English, Spanish, French, German, Japanese, Chinese, Arabic, Korean, Portuguese, Italian, Dutch, Czech) - Robust safety and alignment via instruction tuning and reinforcement learning - **Architecture:** Uses GQA (Grouped Query Attention), SwiGLU activation, RMSNorm, shared input/output embeddings, and RoPE position embeddings. - **Context Length:** Up to 128K tokens — suitable for long-form content and complex reasoning. - **Training:** Finetuned from *Granite-4.0-1B-Base* using open-source datasets, synthetic data, and human-curated instruction pairs. **Performance Highlights (1B Dense):** - **MMLU (5-shot):** 59.39 - **HumanEval (pass@1):** 74 - **IFEval (Alignment):** 80.82 - **GSM8K (8-shot):** 76.35 - **SALAD-Bench (Safety):** 93.44 **Use Cases:** - On-device AI applications - Research and prototyping - Fine-tuning for domain-specific tasks - Low-resource environments with high performance expectations **Resources:** - [Hugging Face Model](https://huggingface.co/ibm-granite/granite-4.0-1b) - [Granite Docs](https://www.ibm.com/granite/docs/) - [GitHub Repository](https://github.com/ibm-granite/granite-4.0-nano-language-models) > *“Make knowledge free for everyone.” – IBM Granite Team*

Links

https://huggingface.co/DevQuasar/ibm-granite.granite-4.0-1b-GGUF

Tags

qwen3-nemotron-32b-rlbff-i1

**Model Name:** Qwen3-Nemotron-32B-RLBFF **Base Model:** Qwen/Qwen3-32B **Developer:** NVIDIA **License:** NVIDIA Open Model License **Description:** Qwen3-Nemotron-32B-RLBFF is a high-performance, fine-tuned large language model built on the Qwen3-32B foundation. It is specifically optimized to generate high-quality, helpful responses in a default thinking mode through advanced reinforcement learning with binary flexible feedback (RLBFF). Trained on the HelpSteer3 dataset, this model excels in reasoning, planning, coding, and information-seeking tasks while maintaining strong safety and alignment with human preferences. **Key Performance (as of Sep 2025):** - **MT-Bench:** 9.50 (near GPT-4-Turbo level) - **Arena Hard V2:** 55.6% - **WildBench:** 70.33% **Architecture & Efficiency:** - 32 billion parameters, based on the Qwen3 Transformer architecture - Designed for deployment on NVIDIA GPUs (Ampere, Hopper, Turing) - Achieves performance comparable to DeepSeek R1 and O3-mini at less than 5% of the inference cost **Use Case:** Ideal for applications requiring reliable, thoughtful, and safe responses—such as advanced chatbots, research assistants, and enterprise AI systems. **Access & Usage:** Available on Hugging Face with support for Hugging Face Transformers and vLLM. **Cite:** [Wang et al., 2025 — RLBFF: Binary Flexible Feedback](https://arxiv.org/abs/2509.21319) 👉 *Note: The GGUF version (mradermacher/Qwen3-Nemotron-32B-RLBFF-i1-GGUF) is a user-quantized variant. The original model is available at nvidia/Qwen3-Nemotron-32B-RLBFF.*

Links

https://huggingface.co/mradermacher/Qwen3-Nemotron-32B-RLBFF-i1-GGUF

Tags

nvidia.qwen3-nemotron-32b-rlbff

The **nvidia/Qwen3-Nemotron-32B-RLBFF** is a large language model based on the Qwen3 architecture, fine-tuned by NVIDIA using Reinforcement Learning from Human Feedback (RLHF) for improved alignment with human preferences. With 32 billion parameters, it excels in complex reasoning, instruction following, and natural language generation, making it suitable for advanced tasks such as code generation, dialogue systems, and content creation. This model is part of NVIDIA’s Nemotron series, designed to deliver high performance and safety in real-world applications. It is optimized for efficient deployment while maintaining strong language understanding and generation capabilities. **Key Features:** - **Base Model**: Qwen3-32B - **Fine-tuning**: Reinforcement Learning from Human Feedback (RLBFF) - **Use Case**: Advanced text generation, coding, dialogue, and reasoning - **License**: MIT (check Hugging Face for full details) 👉 [View on Hugging Face](https://huggingface.co/nvidia/Qwen3-Nemotron-32B-RLBFF) *Note: The GGUF version hosted by DevQuasar is a quantized variant for efficient local inference. The original, unquantized model is available at the link above.*

Links

https://huggingface.co/DevQuasar/nvidia.Qwen3-Nemotron-32B-RLBFF-GGUF

Tags

metatune-gpt20b-r1.1-i1

**Model Name:** MetaTune-GPT20B-R1.1 **Base Model:** unsloth/gpt-oss-20b-unsloth-bnb-4bit **Repository:** [EpistemeAI/metatune-gpt20b-R1.1](https://huggingface.co/EpistemeAI/metatune-gpt20b-R1.1) **License:** Apache 2.0 **Description:** MetaTune-GPT20B-R1.1 is a large language model fine-tuned for recursive self-improvement, making it one of the first publicly released models capable of autonomously generating training data, evaluating its own performance, and adjusting its hyperparameters to improve over time. Built upon the open-weight GPT-OSS 20B architecture and trained with Unsloth's optimized 4-bit quantization, this model excels in complex reasoning, agentic tasks, and function calling. It supports tools like web browsing and structured output generation, and is particularly effective in high-reasoning use cases such as scientific problem-solving and math reasoning. **Performance Highlights (Zero-shot):** - **GPQA Diamond:** 93.3% exact match - **GSM8K (Chain-of-Thought):** 100% exact match **Recommended Use:** - Advanced reasoning & planning - Autonomous agent workflows - Research, education, and technical problem-solving **Safety Note:** Use with caution. For safety-critical applications, pair with a safety guardrail model such as [openai/gpt-oss-safeguard-20b](https://huggingface.co/openai/gpt-oss-safeguard-20b). **Fine-Tuned From:** unsloth/gpt-oss-20b-unsloth-bnb-4bit **Training Method:** Recursive Self-Improvement on the [Recursive Self-Improvement Dataset](https://huggingface.co/datasets/EpistemeAI/recursive_self_improvement_dataset) **Framework:** Hugging Face TRL + Unsloth for fast, efficient training **Inference Tip:** Set reasoning level to "high" for best results and to reduce prompt injection risks. 👉 [View on Hugging Face](https://huggingface.co/EpistemeAI/metatune-gpt20b-R1.1) | [GitHub: Recursive Self-Improvement](https://github.com/openai/harmony)

Links

https://huggingface.co/mradermacher/metatune-gpt20b-R1.1-i1-GGUF

Tags

Model Gallery

Filter by type:

Filter by tags:

qwen3-8b-jailbroken

granite-3.0-1b-a400m-instruct

llama-3.2-1b-instruct:q4_k_m

llama-3.2-3b-instruct:q4_k_m

llama-3.2-3b-instruct:q8_0

llama-3.2-1b-instruct:q8_0

meta-llama-3.1-8b-instruct

meta-llama-3.1-70b-instruct

meta-llama-3.1-8b-claude-imat

sekhmet_aleph-l3.1-8b-v0.1-i1

llama-guard-3-8b

krutrim-ai-labs_krutrim-2-instruct

cydonia-24b-v4.2.0-i1

qwen-sea-lion-v4-32b-it-i1

simia-tau-sft-qwen3-8b

ibm-granite.granite-4.0-1b

qwen3-nemotron-32b-rlbff-i1

nvidia.qwen3-nemotron-32b-rlbff

metatune-gpt20b-r1.1-i1