Model Gallery

12 models from 1 repository

voxcpm-1.5

VoxCPM 1.5 is an end-to-end text-to-speech (TTS) model from ModelBest. It features zero-shot voice cloning and high-quality speech synthesis capabilities.

Repository: localai
License: apache-2.0

neutts-air

NeuTTS Air is billed as the world's first super-realistic, on-device TTS speech language model with instant voice cloning. Built on a 0.5B LLM backbone, it brings natural-sounding speech, real-time performance, and speaker cloning to local devices.

Repository: localai
License: apache-2.0

vllm-omni-qwen3-tts-custom-voice

Qwen3-TTS-12Hz-1.7B-CustomVoice via vLLM-Omni - text-to-speech model from the Alibaba Qwen team with custom voice cloning capabilities. Generates natural-sounding speech with voice personalization.

Repository: localai
License: apache-2.0

vibevoice-cpp

VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.

Repository: localai
License: mit

qwen3-tts-cpp

Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp). Native C++ text-to-speech with streaming output and zero-shot voice cloning (set `voice` to a 24kHz reference .wav). 24kHz mono, 11 languages with Mandarin dialects. Q8_0 (~0.95 GB talker).

Repository: localai
License: mit
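
The qwen3-tts-cpp entry asks for a 24kHz reference .wav as the `voice` value. A minimal sketch of checking a clip before pointing `voice` at it, using only Python's standard `wave` module. The helper names `describe_wav` and `is_24khz_reference` are hypothetical and not part of the gallery or the backend, and the sketch assumes an uncompressed PCM WAV file.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels) for an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_24khz_reference(path):
    """True if the file's sample rate is 24000 Hz, as the entry asks for."""
    rate, _channels = describe_wav(path)
    return rate == 24000
```

Calling `is_24khz_reference("ref.wav")` on a candidate clip (the file name is a placeholder) tells you whether it needs resampling first. Whether the backend resamples other rates on its own is not stated in the entry, so the check is deliberately strict.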

qwen3-tts-cpp-0.6b-base-q4

Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~0.6 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-1.7b-base

Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q8_0 (~2.0 GB talker). Higher-quality streaming + voice cloning, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-1.7b-base-q4

Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~1.2 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.

Repository: localai
License: mit

omnivoice-cpp

OmniVoice (C++ / GGML) - native text-to-speech with voice cloning and voice design. 24kHz mono output, 646 languages, streaming synthesis. Q8_0 GGUFs (~945 MB total): 612M Qwen3 backbone + RVQ audio codec.

Repository: localai
License: apache-2.0

omnivoice-cpp-hq

OmniVoice (C++ / GGML), BF16 high-quality variant - text-to-speech with voice cloning and voice design. 24kHz mono, 646 languages, streaming. BF16 GGUFs (~1.6 GB total).

Repository: localai
License: apache-2.0

fish-speech-s2-pro

Fish Speech S2-Pro is a high-quality text-to-speech model supporting voice cloning via reference audio. It uses a two-stage pipeline: text to semantic tokens (LLaMA-based), then semantic tokens to audio (DAC decoder).

Repository: localai
License: fish-audio-research-license

f5-tts-crispasr

F5-TTS v1 Base (SWivid, MIT) text-to-speech, served through the CrispASR backend. It is a 22-layer DiT flow-matching model with a built-in Vocos vocoder, packaged as a single self-contained GGUF (no separate codec). CrispASR auto-detects the model and runs it end-to-end on CPU, producing 24 kHz mono audio.

F5-TTS is a voice-cloning model with no built-in speaker, so you must supply a reference clip and its transcript. Add `voice:` and `voice_text:` to the model's `options` (paths resolve against the model directory) before synthesizing; without a reference the model cannot generate audio.

The default GGUF is ~945 MB (f16). Quantization below f16 is not recommended for flow-matching models. Synthesis runs a 32-step ODE solver and is compute-heavy on CPU, so expect long generation times without a GPU-enabled CrispASR build.

Repository: localai
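
Because f5-tts-crispasr cannot generate audio without both `voice:` and `voice_text:`, it can help to check a configuration before attempting synthesis. The sketch below assumes the model's `options` is a list of `key:value` strings; that layout is an assumption made for illustration, and the helper name `missing_reference_options` is hypothetical rather than part of LocalAI or CrispASR.

```python
# Keys the f5-tts-crispasr entry says must be present before synthesizing.
REQUIRED_KEYS = ("voice", "voice_text")


def missing_reference_options(options):
    """Return the required keys absent from a list of 'key:value' strings.

    Assumption: each option is a single string such as 'voice:ref.wav'.
    An entry with an empty value (for example 'voice_text:') counts as missing.
    """
    present = set()
    for entry in options:
        key, sep, value = entry.partition(":")
        if sep and value.strip():
            present.add(key.strip())
    return [key for key in REQUIRED_KEYS if key not in present]


# Only the reference clip is set here, so the transcript is reported missing.
print(missing_reference_options(["voice:ref.wav"]))  # ['voice_text']
```

An empty result means both the reference clip and its transcript are configured. The check does not verify that the clip exists in the model directory.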