LocalAI - Models

vllm-omni-z-image-turbo

Z-Image-Turbo via vLLM-Omni - A distilled version of Z-Image optimized for speed with only 8 NFEs. Offers sub-second inference latency on enterprise-grade H800 GPUs and fits within 16GB VRAM. Excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.

Links

https://huggingface.co/Tongyi-MAI/Z-Image-Turbo

Tags

vllm-omni-wan2.2-t2v

Wan2.2-T2V-A14B via vLLM-Omni - Text-to-video generation model from Wan-AI. Generates high-quality videos from text prompts using a 14B parameter diffusion model.

Links

https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers

Tags

vllm-omni-wan2.2-i2v

Wan2.2-I2V-A14B via vLLM-Omni - Image-to-video generation model from Wan-AI. Generates high-quality videos from images using a 14B parameter diffusion model.

Links

https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers

Tags

vllm-omni-qwen3-omni-30b

Qwen3-Omni-30B-A3B-Instruct via vLLM-Omni - A large multimodal model (30B active, 3B activated per token) from Alibaba Qwen team. Supports text, image, audio, and video understanding with text and speech output. Features native multimodal understanding across all modalities.

Links

https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

Tags

vllm-omni-qwen3-tts-custom-voice

Qwen3-TTS-12Hz-1.7B-CustomVoice via vLLM-Omni - Text-to-speech model from Alibaba Qwen team with custom voice cloning capabilities. Generates natural-sounding speech with voice personalization.

Links

https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Tags

kalomaze_qwen3-16b-a3b

A man-made horror beyond your comprehension. But no, seriously, this is my experiment to: measure the probability that any given expert will activate (over my personal set of fairly diverse calibration data), per layer prune 64/128 of the least used experts per layer (with reordered router and indexing per layer) It can still write semi-coherently without any additional training or distillation done on top of it from the original 30b MoE. The .txt files with the original measurements are provided in the repo along with the exported weights. Custom testing to measure the experts was done on a hacked version of vllm, and then I made a bespoke script to selectively export the weights according to the measurements.

Links

Tags

sparse-llama-3.1-8b-2of4

This is the 2:4 sparse version of Llama-3.1-8B. On the OpenLLM benchmark (version 1), it achieves an average score of 62.16, compared to 63.19 for the dense model—demonstrating a 98.37% accuracy recovery. On the Mosaic Eval Gauntlet benchmark (version v0.3), it achieves an average score of 53.85, versus 55.34 for the dense model—representing a 97.3% accuracy recovery.

Links

Tags

hermes-3-llama-3.1-8b:vllm

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. It is designed to focus on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The model uses ChatML as the prompt format, opening up a much more structured system for engaging the LLM in multi-turn chat dialogue. It also supports function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

Links

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

Tags

hermes-3-llama-3.1-70b:vllm

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. It is designed to focus on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The model uses ChatML as the prompt format, opening up a much more structured system for engaging the LLM in multi-turn chat dialogue. It also supports function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

Links

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B

Tags

hermes-3-llama-3.1-405b:vllm

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. It is designed to focus on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The model uses ChatML as the prompt format, opening up a much more structured system for engaging the LLM in multi-turn chat dialogue. It also supports function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

Links

https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-405B

Tags

orca-agent-v0.1-i1

**Model Name:** Orca-Agent-v0.1 **Base Model:** Qwen3-14B **Repository:** [Danau5tin/Orca-Agent-v0.1](https://huggingface.co/Danau5tin/Orca-Agent-v0.1) **License:** Apache 2.0 **Use Case:** Multi-Agent Orchestration for Complex Code & System Tasks --- ### 🔍 **Overview** Orca-Agent-v0.1 is a powerful **task orchestration agent** designed to manage complex, multi-step workflows—especially in code and system administration—without directly modifying code. Instead, it acts as a strategic planner that coordinates a team of specialized agents. --- ### 🛠️ **Key Features** - **Intelligent Task Breakdown:** Analyzes user requests and decomposes them into focused subtasks. - **Agent Coordination:** Dynamically dispatches: - *Explorer agents* to understand the system state. - *Coder agents* to implement changes with precise instructions. - *Verifier agents* to validate results. - **Context Management:** Maintains a persistent context store to track discoveries across steps. - **High Performance:** Achieves **18.25% on TerminalBench** when paired with Qwen3-Coder-30B, nearing the performance of a 480B model. --- ### 📊 **Performance** | Orchestrator | Subagent | Terminal Bench | |--------------|----------|----------------| | Orca-Agent-v0.1-14B | Qwen3-Coder-30B | **18.25%** | | Qwen3-14B | Qwen3-Coder-30B | 7.0% | > *Trained on 32x H100s using GRPO + curriculum learning, with full open-source training code available.* --- ### 📌 **Example Output** ```xml agent_type: 'coder' title: 'Attempt recovery using the identified backup file' description: | Move the backup file from /tmp/terraform_work/.terraform.tfstate.tmp to /infrastructure/recovered_state.json. Verify file existence, size, and permissions (rw-r--r--). max_turns: 10 context_refs: ['task_003'] ``` --- ### 📁 **Serving** - ✅ **vLLM:** `vllm serve Danau5tin/Orca-Agent-v0.1` - ✅ **SGLang:** `python -m sglang.launch_server --model-path Danau5tin/Orca-Agent-v0.1` --- ### 🌐 **Learn More** - **Training & Code:** [GitHub - Orca-Agent-RL](https://github.com/Danau5tin/Orca-Agent-RL) - **Orchestration Framework:** [multi-agent-coding-system](https://github.com/Danau5tin/multi-agent-coding-system) --- > ✅ *Note: The model at `mradermacher/Orca-Agent-v0.1-i1-GGUF` is a quantized version of this original model. This description reflects the full, unquantized version by the original author.*

Links

https://huggingface.co/mradermacher/Orca-Agent-v0.1-i1-GGUF

Tags

Model Gallery

Filter by type:

Filter by tags:

vllm-omni-z-image-turbo

vllm-omni-wan2.2-t2v

vllm-omni-wan2.2-i2v

vllm-omni-qwen3-omni-30b

vllm-omni-qwen3-tts-custom-voice

kalomaze_qwen3-16b-a3b

sparse-llama-3.1-8b-2of4

hermes-3-llama-3.1-8b:vllm

hermes-3-llama-3.1-70b:vllm

hermes-3-llama-3.1-405b:vllm

orca-agent-v0.1-i1