How to Install Dia-1.6B Text-to-Speech Model Locally

Dia is a 1.6B parameter text-to-speech model designed to generate natural-sounding dialogue directly from transcripts. Developed by Nari Labs, it supports multi-speaker conversations, emotive expression, and non-verbal cues like laughter, coughing, or sighs.

🔧 Key Capabilities:

  • Dialogue synthesis using [S1] and [S2] tags for multi-speaker text (see the example after this list).
  • Recognizes a wide range of non-verbal cues, e.g. (laughs), (sighs), (claps).
  • Optional voice cloning using an audio prompt and transcript.
  • Emotion and tone control via conditioning on reference audio (guide coming soon).
  • Easily runs via a Gradio app or Python library.
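
For instance, a multi-speaker transcript mixing speaker tags and non-verbal cues might look like this (an illustrative input, not taken from the official docs):

[S1] Did you hear the news this morning? (gasps)
[S2] I did! I honestly couldn't believe it. (laughs)
[S1] (sighs) Well, at least the weekend is almost here.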

Observations:

  • Handles at least three speakers with distinct voices.
  • Punctuation, emotive cues, and the speed factor setting all affect how realistic the output sounds.
  • Useful for indie games, podcasts, or cutscenes where realistic character dialogue is needed.
  • Still early stage, but a very promising open-source alternative to paid TTS services such as ElevenLabs.

Here’s a clean, step-by-step guide for installing and running Dia-1.6B locally:

Dia is a Python-based open-source project. You can either install it via pip or run it directly using the source code.


Option 1: Quick Install via pip (Recommended)

pip install git+https://github.com/nari-labs/dia.git

Then to run the demo UI:

python -m dia.app
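
If the install succeeded, the core class should import without errors; a quick sanity check:

python -c "from dia.model import Dia; print('Dia import OK')"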

Option 2: Clone and Run the Gradio App

# Clone the repo
git clone https://github.com/nari-labs/dia.git
cd dia

If you have uv installed (fast setup):

uv run app.py

If you don’t have uv, use a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .
python app.py
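
Either way, once the Gradio app starts it typically prints a local URL (for example, http://127.0.0.1:7860) that you can open in your browser to use the demo UI.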

Optional: Use in Python Code

import soundfile as sf
from dia.model import Dia

# First run downloads the 1.6B checkpoint from Hugging Face
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] mark alternating speakers; parentheses mark non-verbal cues
text = "[S1] Welcome to Dia. [S2] It’s time to generate some fun dialogue. (laughs)"
output = model.generate(text)

# Save the generated audio at a 44.1 kHz sample rate
sf.write("output.mp3", output, 44100)

Optional: pass use_torch_compile=True for faster generation if your hardware supports torch.compile() (requires PyTorch 2.0+):

output = model.generate(text, use_torch_compile=True)
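
The repository also ships a voice-cloning example. Below is a minimal sketch of the idea, continuing from the snippet above and assuming generate() accepts an audio prompt argument; the exact parameter name may differ, so check the voice-cloning example script in the repo before relying on it:

# Voice cloning: condition generation on a reference clip plus its transcript.
# NOTE: the audio_prompt parameter name and the file paths below are
# assumptions; verify against the repo's voice-cloning example.
clone_transcript = "[S1] This is the exact transcript of the reference audio."
new_text = "[S1] And this is a new line spoken in the cloned voice."

output = model.generate(
    clone_transcript + new_text,
    audio_prompt="reference.mp3",  # hypothetical path to the reference clip
)
sf.write("cloned.mp3", output, 44100)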

Tips for Best Results

  • Keep input text moderate in length (roughly 5–20 seconds' worth of speech); very short or very long inputs reduce quality.
  • Alternate [S1] and [S2] tags rather than repeating the same speaker tag back-to-back.
  • Avoid overusing non-verbal tags like (laughs) or (coughs); a few go a long way.
  • For a consistent voice, use an audio prompt or fix the random seed (see the sketch after this list).
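
A minimal seeding sketch, assuming Dia draws its randomness from the standard Python, NumPy, and PyTorch global generators:

import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG Dia might touch so repeated runs sample the
    # same voice (assumption: all randomness flows through these).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)
# Reusing model and text from the example above:
output = model.generate(text)  # same seed + same text -> same voice, in principle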

Notes:

  • Make sure you have PyTorch 2.0+ and CUDA 12.6 for best GPU support.
  • First run will download model weights and Descript Audio Codec.
  • The full model needs roughly 12–13 GB of VRAM (see the check after this list).
  • Only English text generation is currently supported.
  • See the GitHub repository (https://github.com/nari-labs/dia) for more details.
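
To check whether your GPU clears the VRAM bar mentioned above, a quick PyTorch query:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected.")  # Dia currently targets GPU inference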
