Qwen is a series of large language models developed by Alibaba Cloud. Most of the models in this series are open source, including Qwen2, Qwen-VL, Qwen-Audio, and Qwen2-Math. Recently, Alibaba Cloud released Qwen3 TTS, an open-source text-to-speech model that delivers high-quality, natural-sounding speech synthesis. The model supports voice cloning, voice design, ultra-high-quality human-like speech generation, and natural-language voice control. It currently supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
ElevenLabs Gets Serious Competition #
When it comes to text-to-speech models, ElevenLabs has been the dominant player in the market, known for its high-quality voice synthesis. It provides a wide range of services including text-to-speech, agents, dubbing, voice cloning, and more, but you have to pay to use those services through its website or API. With the release of Qwen3 TTS as an open-source model, developers and businesses can now host their own text-to-speech services without relying on ElevenLabs, which makes for a more cost-effective and customizable solution for many applications.
Features of Qwen3 TTS #
Qwen3 TTS comes in two model sizes: 0.6B and 1.7B parameters.
1.7B Model #
| Model | Features | Language Support | Streaming | Instruction Control |
|---|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | ✓ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | ✓ |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice cloning from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | |
0.6B Model #
| Model | Features | Language Support | Streaming | Instruction Control |
|---|---|---|---|---|
| Qwen3-TTS-12Hz-0.6B-CustomVoice | Supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | |
| Qwen3-TTS-12Hz-0.6B-Base | Base model capable of 3-second rapid voice cloning from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✓ | |
Core Technological Principles #
1. Dual-Track Streaming Architecture #
One of the standout innovations in Qwen3-TTS is its dual-track language model architecture, which enables extremely fast streaming speech synthesis.
- The model can start emitting audio after receiving just one character of input.
- Full end-to-end latency can be as low as ~97 ms, suitable for real-time interactive applications like conversational agents or live narration.
This architecture breaks the typical “batch, then speak” paradigm in TTS and brings synthesis latency closer to real-time conversational speeds.
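To make the idea concrete, here is a toy Python sketch, not the real architecture: the function name and the two-audio-tokens-per-character ratio are invented purely for illustration. It shows how interleaving text ingestion with audio emission lets audio become available after the very first character instead of after the whole sentence:

```python
def dual_track_stream(text, audio_per_char=2):
    """Toy generator: after each input character is consumed on the
    text track, emit a couple of audio tokens on the audio track,
    instead of waiting for the full sentence ('batch, then speak')."""
    for i, ch in enumerate(text):
        # text track: one character has been consumed
        for k in range(audio_per_char):
            # audio track: emit tokens immediately
            yield f"audio[{i}.{k}]"

stream = dual_track_stream("Hello")
first_token = next(stream)  # already available after only the first character
```

The point is purely structural: because the two tracks interleave, a consumer can start playback as soon as the first audio tokens arrive.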
2. Multi-Codebook Speech Tokenizer #
Another core piece of Qwen3-TTS is the Qwen3-TTS-Tokenizer-12Hz, a custom multi-codebook discretizer optimized for speech.
- Reduces audio to discrete tokens while preserving paralinguistic information like tone, emotion, and prosody.
- Makes reconstruction lightweight and efficient, contributing to both high fidelity and low-latency decoding.
This contrasts with simpler tokenizers that may lose expressive detail or require heavier decoding.
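The multi-codebook idea can be sketched as a toy residual quantizer in Python. This is an illustration only, with invented random codebooks and sizes; the real Qwen3-TTS-Tokenizer-12Hz is far more sophisticated. The key property shown here is that each codebook quantizes the residual left over by the previous stages, so several small codebooks together preserve fine detail while keeping decoding a cheap sum:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Encode one feature frame into one index per codebook,
    each stage quantizing the residual left by the previous one."""
    ids, residual = [], frame.astype(float)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(idx)
        residual = residual - cb[idx]
    return ids

def rvq_decode(ids, codebooks):
    """Reconstruct the frame by summing the chosen codewords."""
    return sum(cb[i] for i, cb in zip(ids, codebooks))

rng = np.random.default_rng(0)
# 4 toy codebooks, 16 entries each, feature dimension 8 (all invented)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
frame = rng.normal(size=8)
ids = rvq_encode(frame, codebooks)      # 4 small integers per frame
recon = rvq_decode(ids, codebooks)      # lightweight reconstruction
```

Decoding is just a sum of table lookups, which hints at why a multi-codebook design can keep reconstruction lightweight.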
3. 3-Second Voice Cloning & Voice Design #
One of the most widely discussed capabilities, in both the official blog and the dev.to guide, is voice cloning:
Voice Cloning #
- Qwen3-TTS can clone a speaker’s voice using just a few seconds (as little as 3 seconds) of reference audio.
- The cloned voice preserves timbre, prosody, and other unique speaker traits, making the synthetic speech feel highly personalized.
Voice cloning of this quality from such a short reference sample places Qwen3-TTS among the most capable freely available models in the space.
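A quick back-of-envelope calculation shows just how compact such a reference is: at the tokenizer's 12 Hz frame rate, a 3-second clip yields only a few dozen frames. (Total token counts would scale this by the number of codebooks, which isn't specified here.)

```python
token_rate_hz = 12   # frame rate of Qwen3-TTS-Tokenizer-12Hz
ref_seconds = 3      # minimum reference audio for cloning
frames = token_rate_hz * ref_seconds
print(frames)  # 36 frames of discrete speech tokens
```

That the speaker's timbre and prosody can be captured from roughly 36 frames underlines how much paralinguistic information the tokenizer packs into each token.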
Voice Design #
Beyond cloning, Qwen3-TTS includes models geared toward voice design via natural language instruction.
- Developers and creators can instruct the model with descriptions like “deep male voice with slight rasp and warmth” or “energetic, excited female narration.”
- This lets you create brand new voices from scratch without any reference audio, expanding creative possibilities.
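Instruction prompts like the ones above are plain natural-language strings, so they are easy to assemble programmatically. The helper below is entirely hypothetical and not part of any Qwen3-TTS API; it only illustrates composing a description from attributes:

```python
def voice_description(tone, gender, age, extras=()):
    """Hypothetical helper: build a natural-language voice-design prompt
    from a few attributes (not a Qwen3-TTS function)."""
    parts = [f"{tone} {gender} voice", f"sounds {age}", *extras]
    return ", ".join(parts)

prompt = voice_description("deep", "male", "middle-aged",
                           ["slight rasp", "warm delivery"])
print(prompt)
```

Whatever string you end up with is simply passed to the VoiceDesign model as the description.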
How to Use Qwen3-TTS #
Demo on Hugging Face #
The quickest and easiest way to demo Qwen3-TTS is through the Hugging Face Spaces interface. You can input text and a voice description to generate speech, or upload a short audio clip of someone talking and have the model clone that voice to read your text. It works terrifyingly well for a free demo!
Local Installation #
If you have a recent Python environment and a CUDA-enabled GPU, you can install and run Qwen3-TTS locally by following these steps.
- Install PyTorch with CUDA:

```shell
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
```

- Install Qwen3-TTS:

```shell
pip install qwen3-tts
```

- Launch the demo interface:

```shell
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --no-flash-attn --ip 127.0.0.1 --port 8000
```
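Before installing the CUDA wheel, a quick stdlib-only sanity check can confirm an NVIDIA driver is even visible on your machine. This only looks for the `nvidia-smi` binary on the PATH; it says nothing about CUDA toolkit or driver versions:

```python
import shutil

# Is the NVIDIA driver utility on PATH? A rough proxy for a usable GPU.
has_nvidia = shutil.which("nvidia-smi") is not None
print("NVIDIA driver detected:", has_nvidia)
```

If this prints `False`, the `cu128` wheel above will still install, but the demo will not be able to use the GPU.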