Dia is a 1.6B parameter text-to-speech model designed to generate natural-sounding dialogue directly from transcripts. Developed by Nari Labs, it supports multi-speaker conversations, emotive expression, and non-verbal cues like laughter, coughing, or sighs.
🔧 Key Capabilities:
- Dialogue synthesis using [S1] and [S2] tags for multi-speaker text (see the sample transcript after this list).
- Recognizes a wide range of non-verbal sounds, e.g., (laughs), (sighs), (claps).
- Optional voice cloning using an audio prompt and transcript.
- Emotion and tone control via conditioning on reference audio (guide coming soon).
- Easily runs via a Gradio app or Python library.
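To make the tag format concrete, here is a minimal example transcript; the wording is purely illustrative, but it uses only the speaker tags and non-verbal cues listed above:
[S1] Did you hear the news? (laughs)
[S2] No, tell me!
[S1] Dia just went open source. (sighs) Finally.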
Observations:
- Handles at least three speakers with distinct voices.
- Adjusting the speed factor and using punctuation for emotive timing noticeably affects output realism.
- Useful for indie games, podcasts, or cutscenes where realistic character dialogue is needed.
- Still early stage, but a very promising open-source alternative to paid TTS like ElevenLabs.
Here’s a clean, step-by-step guide for installing and running Dia-1.6B locally:
Dia is a Python-based open-source project. You can either install it via pip or run it directly using the source code.
Option 1: Quick Install via pip (Recommended)
pip install git+https://github.com/nari-labs/dia.git
Then to run the demo UI:
python -m dia.app
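To quickly confirm the pip install worked before launching the UI, an import check like this (purely illustrative, not part of the official instructions) should succeed without errors:
python -c "from dia.model import Dia; print('Dia import OK')"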
Option 2: Clone and Run the Gradio App
# Clone the repo
git clone https://github.com/nari-labs/dia.git
cd dia
If you have uv installed (fast setup):
uv run app.py
If you don’t have uv, use a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .
python app.py
Optional: Use in Python Code
import soundfile as sf
from dia.model import Dia
# Load the pretrained model (weights are downloaded on the first run)
model = Dia.from_pretrained("nari-labs/Dia-1.6B")
# [S1]/[S2] mark the speakers; (laughs) is a non-verbal cue
text = "[S1] Welcome to Dia. [S2] It’s time to generate some fun dialogue. (laughs)"
output = model.generate(text)
# Save the generated audio at 44.1 kHz
sf.write("output.mp3", output, 44100)
Optional: enable faster generation with torch.compile() if your hardware supports it (requires PyTorch 2.0+):
output = model.generate(text, use_torch_compile=True)
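The optional voice cloning mentioned in the capabilities list follows the same pattern: prepend the transcript of a short reference clip to your new text and pass the clip along as an audio prompt. The snippet below is only a sketch; the audio_prompt argument name and the reference file path are assumptions, so check the repo's voice-clone example for the exact interface.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip, followed by the new dialogue to generate.
clone_transcript = "[S1] This is a transcript of my reference recording."
new_text = " [S1] And this is new dialogue in the cloned voice. (laughs)"

# 'audio_prompt' and 'reference.mp3' are assumptions for illustration;
# the exact argument name may differ between versions of the library.
output = model.generate(clone_transcript + new_text, audio_prompt="reference.mp3")
sf.write("cloned.mp3", output, 44100)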
Tips for Best Results
- Keep input text moderate in length: roughly 5–20 seconds' worth of speech works best.
- Alternate [S1] and [S2] tags.
- Avoid overusing non-verbal tags like (laughs) or (coughs).
- For a consistent voice, use audio prompting or fix the random seed (see the sketch after this list).
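A minimal sketch of the seed-fixing tip: Dia samples stochastically, so pinning the usual RNG seeds before each generation should make runs repeatable on the same hardware and settings. This is plain PyTorch/NumPy seeding, nothing Dia-specific:
import random
import numpy as np
import torch

def fix_seed(seed: int = 42) -> None:
    # Pin the common RNGs so repeated generations start from the same state.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

fix_seed(42)
# output = model.generate(text)  # same seed + same text -> a repeatable voice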
Notes:
- Make sure you have PyTorch 2.0+ and CUDA 12.6 for best GPU support.
- First run will download model weights and Descript Audio Codec.
- Needs 12–13 GB VRAM for full model.
- Only supports English text generation.
- See the GitHub repo for more details.