LocalAI - Models

privacy-filter-nemotron

A fine-grained English PII token-classification model: a fine-tune of openai/privacy-filter by OpenMed on NVIDIA's Nemotron-PII dataset. It labels every token with a BIOES tag over 55 PII categories (221 classes), trading the multilingual sibling's language breadth for category depth - identity, contact, address, dates, government IDs, financial, healthcare, enterprise, vehicle and digital entities (including api_key, ipv4/ipv6 and mac_address). For multilingual text prefer privacy-filter-multilingual instead.

In LocalAI this is a PII detector for the NER redactor tier: set known_usecases to [token_classify] (as below), and any model opts into redaction by listing this one under pii.detectors. The detection policy (which categories to mask vs block, and the score threshold) lives on this model's own pii_detection block - see the overrides below. It runs locally with no Python, served by the standalone privacy-filter backend's TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset entity spans).

Architecture: gpt-oss-style sparse MoE (8 layers, d_model 640, 128 experts top-4, ~1.5B total / ~50M active per token), bidirectional banded attention, o200k tokenizer and a 221-way token-classification head; served via the openai-privacy-filter architecture. F16, ~2.8 GB. (A smaller Q8_0 quant exists on the GGUF repo for RAM-constrained use - validate it on your own data, since for PII a single dropped span is a leak.)
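To make the BIOES tagging and the byte-offset spans concrete, here is a minimal Python sketch. It is not the backend's implementation: it assumes tags are written as `B-email`, `I-email`, `E-email`, `S-email` and `O` (a hypothetical spelling), that tokens concatenate exactly to the original text, and it does a plain left-to-right walk rather than the constrained Viterbi decode the backend performs. It also shows where the class count comes from: four positional tags per category plus one shared `O` tag.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (category, byte_start, byte_end), end exclusive


def bioes_label_count(num_categories: int) -> int:
    # B, I, E and S for every category, plus a single shared O tag.
    return 4 * num_categories + 1


def decode_bioes(tokens: List[str], tags: List[str]) -> List[Span]:
    """Turn per-token BIOES tags into spans with UTF-8 byte offsets.

    Assumption: the tokens concatenate exactly to the original text,
    so byte offsets can be accumulated token by token.
    """
    spans: List[Span] = []
    offset = 0
    open_cat = None   # category of the span currently being built, if any
    open_start = 0
    for token, tag in zip(tokens, tags):
        start = offset
        offset += len(token.encode("utf-8"))
        prefix, _, cat = tag.partition("-")
        if prefix == "S":                      # single-token entity
            spans.append((cat, start, offset))
            open_cat = None
        elif prefix == "B":                    # entity begins here
            open_cat, open_start = cat, start
        elif prefix == "E" and open_cat == cat:  # entity ends here
            spans.append((cat, open_start, offset))
            open_cat = None
        elif prefix == "I" and open_cat == cat:  # entity continues
            continue
        else:                                  # "O" or an inconsistent tag
            open_cat = None
    return spans


if __name__ == "__main__":
    print(bioes_label_count(55))
    print(decode_bioes(["Mail", " ", "zoë", "@", "x.io"],
                       ["O", "O", "B-email", "I-email", "E-email"]))
```

Offsets are counted in bytes rather than characters, so a non-ASCII character such as "ë" advances the offset by two.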

privacy-filter-nemotron-q8

Q8_0 quant of privacy-filter-nemotron (~1.64 GB, vs ~2.8 GB for F16) for RAM-constrained / edge use (e.g. a 4 GB Raspberry Pi 5). The MoE expert weights are stored 8-bit; attention, embeddings and the classifier head stay F16. Same model, policy and runtime as the F16 entry - see privacy-filter-nemotron for the full description. Prefer the F16 entry when you can afford it: it is the reference artifact. On a mixed-PII document the publisher measured q8 matching F16 on 99.93% of token labels with an identical span set at threshold 0.5 - but one token flipped, and for PII a single dropped span is a leak. Treat q8 as a deliberate size/speed tradeoff and validate it on your own data.
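One way to validate the quant on your own data is to run the same text through both entries and compare the results. The sketch below assumes you already have per-token labels and decoded spans from each model; the function names and the span tuple shape are hypothetical, not part of LocalAI.

```python
from typing import Sequence, Set, Tuple

Span = Tuple[str, int, int]  # (category, byte_start, byte_end)


def label_agreement(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """Fraction of token positions where both models emit the same label."""
    if len(reference) != len(candidate):
        raise ValueError("label sequences must cover the same tokens")
    if not reference:
        return 1.0
    same = sum(r == c for r, c in zip(reference, candidate))
    return same / len(reference)


def span_diff(reference: Set[Span], candidate: Set[Span]) -> Tuple[Set[Span], Set[Span]]:
    """Return (dropped, extra): spans only in the reference, spans only in the candidate."""
    return reference - candidate, candidate - reference


if __name__ == "__main__":
    f16_spans = {("email", 5, 14), ("ssn", 20, 31)}
    q8_spans = {("email", 5, 14)}
    dropped, extra = span_diff(f16_spans, q8_spans)
    # Any dropped span is a potential leak, however high the label agreement is.
    print("dropped:", dropped, "extra:", extra)
```

Label agreement alone can hide a problem: a single flipped label in a document of roughly 1,400 tokens still rounds to 99.93%, so the span comparison is the check that matters for redaction.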

nemotron-3-nano-omni-30b-a3b-reasoning-apex

# Model Overview

### Description:

NVIDIA Nemotron 3 Nano Omni is a multimodal large language model that unifies video, audio, image, and text understanding to support enterprise-grade Q&A, summarization, transcription, and document intelligence workflows. It extends the Nemotron Nano family with integrated video+speech comprehension, Graphical User Interface (GUI), Optical Character Recognition (OCR), and speech transcription capabilities, enabling end-to-end processing of rich enterprise content such as meeting recordings, M&E assets, training videos, and complex business documents. NVIDIA Nemotron 3 Nano Omni was developed by NVIDIA as part of the Nemotron model family. This model is available for commercial use. This model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b. For more information, please see the Training Dataset section below.

### License/Terms of Use

Governing Terms: Use of this model is governed by the NVIDIA Open Model Agreement

### Deployment Geography:

Global ...

Links

https://huggingface.co/mudler/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-APEX-GGUF

l3.3-ms-nevoria-70b

This model was created as I liked the storytelling of EVA, the prose and details of scenes from EURYALE and Anubis, enhanced with Negative_LLAMA to kill off the positive bias, with a touch of Nemotron sprinkled in. The choice to use the lorablated model as a base was intentional - while it might seem counterintuitive, this approach creates unique interactions between the weights, similar to what was achieved in the original Astoria model and Astoria V2 model. Rather than simply removing refusals, the "weight twisting" effect that occurs when subtracting the lorablated base model from the other models during the merge process creates an interesting balance in the final model's behavior. While this approach differs from traditional sequential application of components, it was chosen for its unique characteristics in the model's responses.

nvidia_llama-3_3-nemotron-super-49b-v1

Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) which is a derivative of Meta Llama-3.3-70B-Instruct (AKA the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling. The model supports a context length of 128K tokens. Llama-3.3-Nemotron-Super-49B-v1 is a model which offers a great tradeoff between model accuracy and efficiency. Efficiency (throughput) directly translates to savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model’s memory footprint, enabling larger workloads, as well as fitting the model on a single GPU at high workloads (H200). This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff. The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and Online RPO checkpoints. For more details on how the model was trained, please see this blog.

thedrummer_valkyrie-49b-v1

it swears unprompted 10/10 model

... characters work well, groups work well, scenarios also work really well so great model overall

This is pretty exciting though. GLM-4 already had me on the verge of deleting all of my other 32b and lower models. I got to test this more but I think this model at Q3m is the death blow lol

Smart Nemotron 49b learned how to roleplay

Even without thinking it rock solid at 4qm.

Without thinking is like 40-70b level. With thinking is 100+b level

This model would have been AGI if it were named properly with a name like "Bob". Alas, it was not.

I think this model is nice. It follows prompts very well. I didn't really note any major issues or repetition

Yeah this is good. I think its clearly smart enough, close to the other L3.3 70b models. It follows directions and formatting very well. I asked it to create the intro message, my first response was formatted differently, and it immediately followed my format on the second message. I also have max tokens at 2k cause I like the model to finish it's thought. But I started trimming the models responses when I felt the last bit was unnecessary and it started replying closer to that length.

It's pretty much uncensored.

Nemotron is my favorite model, and I think you fixed it!!

nvidia_llama-3_3-nemotron-super-49b-genrm-multilingual

Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual is a generative reward model that leverages Llama-3.3-Nemotron-Super-49B-v1 as the foundation and is fine-tuned using Reinforcement Learning to predict the quality of LLM generated responses. Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual can be used to judge the quality of one response, or the ranking between two responses given a multilingual conversation history. It will first generate reasoning traces then output an integer score. A higher score means the response is of higher quality.

llama-3.1-nemotron-70b-instruct-hf

Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries. This model reaches Arena Hard of 85.0, AlpacaEval 2 LC of 57.6 and GPT-4-Turbo MT-Bench of 8.98, which are known to be predictive of LMSys Chatbot Arena Elo. As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet. This model was trained using RLHF (specifically, REINFORCE), Llama-3.1-Nemotron-70B-Reward and HelpSteer2-Preference prompts on a Llama-3.1-70B-Instruct model as the initial policy. Llama-3.1-Nemotron-70B-Instruct-HF has been converted from Llama-3.1-Nemotron-70B-Instruct to support it in the HuggingFace Transformers codebase. Please note that evaluation results might be slightly different from those of Llama-3.1-Nemotron-70B-Instruct as evaluated in NeMo-Aligner, which the evaluation results below are based on.

l3.1-70blivion-v0.1-rc1-70b-i1

70Blivion v0.1 is a model in the release candidate stage, based on a merge of L3.1 Nemotron 70B & Euryale 2.2 with a healing training step. Further training will be needed to get this model to release quality. This model is designed to be suitable for creative writing and roleplay. This RC is not a finished product, but your feedback will drive the creation of better models. This is a release candidate model. It has some known issues and probably some unknown ones too, because the purpose of these early releases is to seek feedback.

l3.1-nemotron-sunfall-v0.7.0-i1

Significant revamping of the dataset metadata generation process, resulting in higher quality dataset overall. The "Diamond Law" experiment has been removed as it didn't seem to affect the model output enough to warrant set up complexity.

Recommended starting point:

Temperature: 1
MinP: 0.05~0.1
DRY: 0.8 1.75 2 0

At early context, I recommend keeping XTC disabled. Once you hit higher context sizes (10k+), enabling XTC at 0.1 / 0.5 seems to significantly improve the output, but YMMV. If the output drones on and is uninspiring, XTC can be extremely effective.

General heuristic:

Lots of slop? Temperature is too low. Raise it, or enable XTC. For early context, temp bump is probably preferred.

Is the model making mistakes about subtle or obvious details in the scene? Temperature is too high, OR XTC is enabled and/or XTC settings are too high. Lower temp and/or disable XTC.
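The recommended starting point can be sketched as a small Python helper that switches XTC on only at larger context sizes. Several things here are assumptions rather than anything the model card states: the key names are hypothetical and do not correspond to a specific frontend, the four DRY numbers are read in the common order multiplier, base, allowed length, penalty range, "0.1 / 0.5" is read as XTC threshold and probability, and "10k+" is taken to mean at least 10,000 tokens.

```python
from typing import Dict


def sunfall_sampler_settings(context_tokens: int, min_p: float = 0.05) -> Dict[str, float]:
    """Return a starting sampler configuration for a given context size."""
    settings: Dict[str, float] = {
        "temperature": 1.0,
        "min_p": min_p,              # the card suggests 0.05 to 0.1
        # DRY values from the card, in the assumed order:
        "dry_multiplier": 0.8,
        "dry_base": 1.75,
        "dry_allowed_length": 2,
        "dry_penalty_range": 0,
        # XTC stays disabled at early context (probability 0 means off).
        "xtc_threshold": 0.1,
        "xtc_probability": 0.0,
    }
    if context_tokens >= 10_000:
        # At higher context sizes the card reports better output with XTC on.
        settings["xtc_probability"] = 0.5
    return settings


if __name__ == "__main__":
    print(sunfall_sampler_settings(2_000))
    print(sunfall_sampler_settings(12_000))
```

From there the general heuristic applies by hand: raise the temperature or enable XTC if the output is full of slop, and lower the temperature or disable XTC if the model starts getting scene details wrong.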

nvidia_llama-3.1-8b-ultralong-4m-instruct

We introduce UltraLong-8B, a series of ultra-long context language models designed to process extensive sequences of text (up to 1M, 2M, and 4M tokens) while maintaining competitive performance on standard benchmarks. Built on the Llama-3.1, UltraLong-8B leverages a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach enables our models to efficiently scale their context windows without sacrificing general performance.

nvidia_llama-3.1-nemotron-nano-4b-v1.1

Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of nvidia/Llama-3.1-Minitron-4B-Width-Base, which is created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling. Llama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K. This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints. This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: Llama-3.3-Nemotron-Ultra-253B-v1, Llama-3.3-Nemotron-Super-49B-v1, Llama-3.1-Nemotron-Nano-8B-v1. This model is ready for commercial use.

nvidia_acereason-nemotron-14b

We're thrilled to introduce AceReason-Nemotron-14B, a math and code reasoning model trained entirely through reinforcement learning (RL), starting from the DeepSeek-R1-Distilled-Qwen-14B. It delivers impressive results, achieving 78.6% on AIME 2024 (+8.9%), 67.4% on AIME 2025 (+17.4%), 61.1% on LiveCodeBench v5 (+8%), 54.9% on LiveCodeBench v6 (+7%), and 2024 on Codeforces (+543). We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first RL training on math-only prompts, then RL training on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks, but also code reasoning tasks. In addition, extended code-only RL further improves code benchmark performance while causing minimal degradation in math results. We find that RL not only elicits the foundational reasoning capabilities acquired during pre-training and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.
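The parenthesized gains can be turned back into the scores they were measured against. The sketch below assumes each figure in parentheses is an absolute improvement (percentage points, or rating points for Codeforces) over the starting DeepSeek-R1-Distilled-Qwen-14B model; the card does not spell this out, so treat the derived baselines as an illustration of the arithmetic rather than reported results.

```python
from typing import Dict, Tuple

# (reported score, reported gain) exactly as quoted in the description above.
REPORTED: Dict[str, Tuple[float, float]] = {
    "AIME 2024": (78.6, 8.9),
    "AIME 2025": (67.4, 17.4),
    "LiveCodeBench v5": (61.1, 8.0),
    "LiveCodeBench v6": (54.9, 7.0),
    "Codeforces": (2024.0, 543.0),
}


def implied_baseline(score: float, gain: float) -> float:
    """Score of the starting model, assuming the gain is an absolute difference."""
    return round(score - gain, 1)


if __name__ == "__main__":
    for benchmark, (score, gain) in REPORTED.items():
        print(f"{benchmark}: {implied_baseline(score, gain)} -> {score}")
```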

nvidia_nemotron-research-reasoning-qwen-1.5b

Nemotron-Research-Reasoning-Qwen-1.5B is the world’s leading 1.5B open-weight model for complex reasoning tasks such as mathematical problems, coding challenges, scientific questions, and logic puzzles. It is trained using the ProRL algorithm on a diverse and comprehensive set of datasets. Our model has achieved impressive results, outperforming Deepseek’s 1.5B model by a large margin on a broad range of tasks, including math, coding, and GPQA. This model is for research and development only.

qwen3-nemotron-32b-rlbff-i1

**Model Name:** Qwen3-Nemotron-32B-RLBFF

**Base Model:** Qwen/Qwen3-32B

**Developer:** NVIDIA

**License:** NVIDIA Open Model License

**Description:** Qwen3-Nemotron-32B-RLBFF is a high-performance, fine-tuned large language model built on the Qwen3-32B foundation. It is specifically optimized to generate high-quality, helpful responses in a default thinking mode through advanced reinforcement learning with binary flexible feedback (RLBFF). Trained on the HelpSteer3 dataset, this model excels in reasoning, planning, coding, and information-seeking tasks while maintaining strong safety and alignment with human preferences.

**Key Performance (as of Sep 2025):**
- **MT-Bench:** 9.50 (near GPT-4-Turbo level)
- **Arena Hard V2:** 55.6%
- **WildBench:** 70.33%

**Architecture & Efficiency:**
- 32 billion parameters, based on the Qwen3 Transformer architecture
- Designed for deployment on NVIDIA GPUs (Ampere, Hopper, Turing)
- Achieves performance comparable to DeepSeek R1 and O3-mini at less than 5% of the inference cost

**Use Case:** Ideal for applications requiring reliable, thoughtful, and safe responses, such as advanced chatbots, research assistants, and enterprise AI systems.

**Access & Usage:** Available on Hugging Face with support for Hugging Face Transformers and vLLM.

**Cite:** [Wang et al., 2025, RLBFF: Binary Flexible Feedback](https://arxiv.org/abs/2509.21319)

👉 *Note: The GGUF version (mradermacher/Qwen3-Nemotron-32B-RLBFF-i1-GGUF) is a user-quantized variant. The original model is available at nvidia/Qwen3-Nemotron-32B-RLBFF.*

Links

https://huggingface.co/mradermacher/Qwen3-Nemotron-32B-RLBFF-i1-GGUF

nvidia.qwen3-nemotron-32b-rlbff

**nvidia/Qwen3-Nemotron-32B-RLBFF** is a large language model based on the Qwen3 architecture, fine-tuned by NVIDIA using reinforcement learning with binary flexible feedback (RLBFF) for improved alignment with human preferences. With 32 billion parameters, it excels in complex reasoning, instruction following, and natural language generation, making it suitable for advanced tasks such as code generation, dialogue systems, and content creation. This model is part of NVIDIA’s Nemotron series, designed to deliver high performance and safety in real-world applications. It is optimized for efficient deployment while maintaining strong language understanding and generation capabilities.

**Key Features:**
- **Base Model**: Qwen3-32B
- **Fine-tuning**: Reinforcement learning with binary flexible feedback (RLBFF)
- **Use Case**: Advanced text generation, coding, dialogue, and reasoning
- **License**: NVIDIA Open Model License (check Hugging Face for full details)

👉 [View on Hugging Face](https://huggingface.co/nvidia/Qwen3-Nemotron-32B-RLBFF)

*Note: The GGUF version hosted by DevQuasar is a quantized variant for efficient local inference. The original, unquantized model is available at the link above.*

Links

https://huggingface.co/DevQuasar/nvidia.Qwen3-Nemotron-32B-RLBFF-GGUF

parakeet-cpp-nemotron-3.5-asr-streaming-0.6b

Multilingual (40+ locales), prompt-conditioned, cache-aware streaming FastConformer RNN-T, 0.6B. Q8_0 GGUF for the parakeet-cpp backend (C++/ggml port of NVIDIA NeMo). Byte-identical to NeMo at WER 0 offline and streaming, about 2.5x faster than NeMo on CPU with no GPU. Select a language with the request "language" field (for example en, de, es, ja-JP), or leave it empty for automatic detection. License OpenMDW-1.1.
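As a rough illustration of the "language" field, the Python helper below builds the non-file form fields for a transcription request. It assumes the fields are sent alongside the audio file to LocalAI's OpenAI-compatible transcription endpoint (commonly `/v1/audio/transcriptions`); the endpoint path and the helper itself are assumptions, not something this entry documents.

```python
from typing import Dict, Optional

MODEL_NAME = "parakeet-cpp-nemotron-3.5-asr-streaming-0.6b"


def transcription_fields(language: Optional[str] = None) -> Dict[str, str]:
    """Build the form fields that accompany the audio file.

    An empty "language" value asks the model to detect the language
    automatically, as described in the entry above.
    """
    return {
        "model": MODEL_NAME,
        "language": (language or "").strip(),
    }


if __name__ == "__main__":
    print(transcription_fields("ja-JP"))   # explicit locale
    print(transcription_fields())          # automatic detection
```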
