Qwen is a series of large language models developed by Alibaba Cloud. Most of the models in this series are open source, including Qwen2, Qwen-VL, Qwen-Audio, and Qwen2-Math. Recently, Alibaba Cloud released Qwen3 TTS, an open-source text-to-speech model that delivers high-quality, natural-sounding speech synthesis. The model supports voice cloning, voice design, ultra-high-quality human-like speech generation, and natural-language voice control. It currently supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
ElevenLabs Gets Serious Competition #
When it comes to text-to-speech models, ElevenLabs has been the dominant player in the market, known for its high-quality voice synthesis. It provides a wide range of services including text-to-speech, agents, dubbing, voice cloning, and more, but you have to pay to use those services through its website or API. With the release of Qwen3 TTS as an open-source model, developers and businesses can now host their own text-to-speech services without relying on ElevenLabs, which makes for a more cost-effective and customizable solution for many applications.
Features of Qwen3 TTS #
Qwen3 TTS comes in two model sizes: 0.6B and 1.7B parameters.
1.7B Model #
| Model | Features | Language Support | Streaming | Instruction Control |
|---|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | ✓ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | ✓ |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice cloning from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | |
0.6B Model #
| Model | Features | Language Support | Streaming | Instruction Control |
|---|---|---|---|---|
| Qwen3-TTS-12Hz-0.6B-CustomVoice | Supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | |
| Qwen3-TTS-12Hz-0.6B-Base | Base model capable of 3-second rapid voice cloning from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | |
Core Technological Principles #
1. Dual-Track Streaming Architecture #
One of the standout innovations in Qwen3-TTS is its dual-track language model architecture, which enables extremely fast streaming speech synthesis.
- The model can start emitting audio after receiving just one character of input.
- Full end-to-end latency can be as low as ~97 ms, suitable for real-time interactive applications like conversational agents or live narration.
This architecture breaks the typical “batch, then speak” paradigm in TTS and brings synthesis latency closer to real-time conversational speeds.
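To make the idea concrete, here is a toy Python sketch, not the real architecture: the function name and the two-audio-tokens-per-character ratio are invented purely for illustration. It shows how interleaving text ingestion with audio emission lets audio become available after the very first character instead of after the whole sentence:

```python
def dual_track_stream(text, audio_per_char=2):
    """Toy generator: after each input character is consumed on the
    text track, emit a couple of audio tokens on the audio track,
    instead of waiting for the full sentence ('batch, then speak')."""
    for i, ch in enumerate(text):
        # text track: one character has been consumed
        for k in range(audio_per_char):
            # audio track: emit tokens immediately
            yield f"audio[{i}.{k}]"

stream = dual_track_stream("Hello")
first_token = next(stream)  # already available after only the first character
```

The point is purely structural: because the two tracks interleave, a consumer can start playback as soon as the first audio tokens arrive.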
2. Multi-Codebook Speech Tokenizer #
Another core piece of Qwen3-TTS is the Qwen3-TTS-Tokenizer-12Hz, a custom multi-codebook discretizer optimized for speech.
- Reduces audio to discrete tokens while preserving paralinguistic information like tone, emotion, and prosody.
- Makes reconstruction lightweight and efficient, contributing to both high fidelity and low-latency decoding.
This contrasts with simpler tokenizers that may lose expressive detail or require heavier decoding.
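The multi-codebook idea can be sketched as a toy residual quantizer in Python. This is an illustration only, with invented random codebooks and sizes; the real Qwen3-TTS-Tokenizer-12Hz is far more sophisticated. The key property shown here is that each codebook quantizes the residual left over by the previous stages, so several small codebooks together preserve fine detail while keeping decoding a cheap sum:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Encode one feature frame into one index per codebook,
    each stage quantizing the residual left by the previous one."""
    ids, residual = [], frame.astype(float)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(idx)
        residual = residual - cb[idx]
    return ids

def rvq_decode(ids, codebooks):
    """Reconstruct the frame by summing the chosen codewords."""
    return sum(cb[i] for i, cb in zip(ids, codebooks))

rng = np.random.default_rng(0)
# 4 toy codebooks, 16 entries each, feature dimension 8 (all invented)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
frame = rng.normal(size=8)
ids = rvq_encode(frame, codebooks)      # 4 small integers per frame
recon = rvq_decode(ids, codebooks)      # lightweight reconstruction
```

Decoding is just a sum of table lookups, which hints at why a multi-codebook design can keep reconstruction lightweight.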
3. 3-Second Voice Cloning & Voice Design #
One of the most widely discussed capabilities, in both the official blog and the dev.to guide, is voice cloning:
Voice Cloning #
- Qwen3-TTS can clone a speaker’s voice using just a few seconds (as little as 3 seconds) of reference audio.
- The cloned voice preserves timbre, prosody, and other unique speaker traits, making the synthetic speech feel highly personalized.
Voice cloning of this quality from such a short reference sample places Qwen3-TTS among the most capable freely available models in the space.
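A quick back-of-envelope calculation shows just how compact such a reference is: at the tokenizer's 12 Hz frame rate, a 3-second clip yields only a few dozen frames. (Total token counts would scale this by the number of codebooks, which isn't specified here.)

```python
token_rate_hz = 12   # frame rate of Qwen3-TTS-Tokenizer-12Hz
ref_seconds = 3      # minimum reference audio for cloning
frames = token_rate_hz * ref_seconds
print(frames)  # 36 frames of discrete speech tokens
```

That the speaker's timbre and prosody can be captured from roughly 36 frames underlines how much paralinguistic information the tokenizer packs into each token.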
Voice Design #
Beyond cloning, Qwen3-TTS includes models geared toward voice design via natural language instruction.
- Developers and creators can instruct the model with descriptions like “deep male voice with slight rasp and warmth” or “energetic, excited female narration.”
- This lets you create brand new voices from scratch without any reference audio, expanding creative possibilities.
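Instruction prompts like the ones above are plain natural-language strings, so they are easy to assemble programmatically. The helper below is entirely hypothetical and not part of any Qwen3-TTS API; it only illustrates composing a description from attributes:

```python
def voice_description(tone, gender, age, extras=()):
    """Hypothetical helper: build a natural-language voice-design prompt
    from a few attributes (not a Qwen3-TTS function)."""
    parts = [f"{tone} {gender} voice", f"sounds {age}", *extras]
    return ", ".join(parts)

prompt = voice_description("deep", "male", "middle-aged",
                           ["slight rasp", "warm delivery"])
print(prompt)
```

Whatever string you end up with is simply passed to the VoiceDesign model as the description.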
How to Use Qwen3-TTS #
Demo on Hugging Face #
The quickest and easiest way to demo Qwen3-TTS is through the Hugging Face Spaces interface. You can input text and a voice description to generate speech, or upload a short audio clip of someone talking and have the model clone that voice to read your text. It works terrifyingly well for a free demo!
Local Installation #
If you have a recent Python environment and a CUDA-enabled GPU, you can install and run Qwen3-TTS locally by following these steps.
- Install PyTorch with CUDA:

```shell
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
```

- Install Qwen3-TTS:

```shell
pip install qwen3-tts
```

- Launch the demo interface:

```shell
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --no-flash-attn --ip 127.0.0.1 --port 8000
```
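Before installing the CUDA wheel, a quick stdlib-only sanity check can confirm an NVIDIA driver is even visible on your machine. This only looks for the `nvidia-smi` binary on the PATH; it says nothing about CUDA toolkit or driver versions:

```python
import shutil

# Is the NVIDIA driver utility on PATH? A rough proxy for a usable GPU.
has_nvidia = shutil.which("nvidia-smi") is not None
print("NVIDIA driver detected:", has_nvidia)
```

If this prints `False`, the `cu128` wheel above will still install, but the demo will not be able to use the GPU.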