How to Generate Conversational Speech Locally Using Sesame AI

AI voice generation has moved far beyond robotic tones. With the Conversational Speech Model (CSM) from Sesame AI, you can now generate realistic, expressive, multi-speaker conversations from text — entirely offline.

In this guide, I’ll show you how to use the open-source csm repo on GitHub to generate human-like audio with rich context and emotion. We’ll cover setup, running the basic demo, and using the core Python API to integrate this into your own projects.


What is CSM?

CSM is a speech generation model that converts text (and optionally audio context) into RVQ audio codes, which are then decoded into waveform audio with the lightweight Mimi codec. It’s built on top of a LLaMA-based language backbone, enabling deep contextual understanding and expressive delivery.

Prerequisites

To run CSM locally, here’s what you’ll need:

  • Python 3.10 (the virtual environment setup below assumes it)
  • A CUDA-capable GPU is recommended; Apple Silicon (MPS) and CPU also work, but generation on CPU is much slower
  • A Hugging Face account; run huggingface-cli login to authenticate so the model weights can be downloaded

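If you haven’t used the Hugging Face CLI before, authentication looks roughly like this (the install step is only needed if huggingface_hub isn’t already on your system):

pip install -U huggingface_hub
huggingface-cli login   # paste an access token from huggingface.co/settings/tokens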

Installation

Let’s set everything up from GitHub:

git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Also disable lazy Torch compilation for Mimi (this step is important):

export NO_TORCH_COMPILE=1

On Windows? The default triton package won’t install there; use pip install triton-windows instead.


Quick Demo: Generate a 2-Person Dialogue

CSM includes a built-in script to generate a short conversation between two speakers:

python run_csm.py

It will synthesize audio of a back-and-forth interaction and save the result as a .wav file. This is great to verify your setup is working.
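
Once it finishes, you can sanity-check the result from Python (the filename below is an assumption; use whichever path run_csm.py reports it saved):

import torchaudio

waveform, sample_rate = torchaudio.load("full_conversation.wav")  # path printed by run_csm.py
print(f"Generated {waveform.shape[-1] / sample_rate:.1f}s of audio at {sample_rate} Hz")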


Custom Usage with Python: Generate Speech from Text

Now let’s use the CSM Python API to generate speech from your own text.

Step 1: Load the Model

from generator import load_csm_1b
import torch

# Choose device: MPS (Mac), CUDA (GPU), or CPU fallback
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
generator = load_csm_1b(device=device)

This loads the 1B parameter CSM model and initializes it on your selected hardware.
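
If you want to confirm which backend was picked and what sample rate the model works at (the attribute we’ll pass to torchaudio.save below), a quick check:

print(f"Running on {device}")
print(generator.sample_rate)   # the rate Mimi decodes to; used when saving .wav files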


Step 2: Generate Audio from Text

import torchaudio

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,                 # Speaker ID (can be any int)
    context=[],                # No context provided here
    max_audio_length_ms=10000  # Max duration in ms
)

torchaudio.save("hello.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Output: A file hello.wav with expressive, natural-sounding speech saying your input.
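
From here it’s easy to batch several lines with the same API; a minimal sketch (the sentences and speaker IDs are just placeholders):

lines = [
    ("Welcome to the demo.", 0),
    ("Thanks, happy to be here.", 1),
]

for i, (text, speaker) in enumerate(lines):
    audio = generator.generate(
        text=text,
        speaker=speaker,
        context=[],
        max_audio_length_ms=10000,
    )
    torchaudio.save(f"line_{i}.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)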


Contextual Speech Generation (Multi-Speaker Dialogues)

CSM performs best when given audio context of a prior conversation. This helps it generate a more realistic, emotionally appropriate response.

Here’s how you do it:

Step 1: Prepare Example Audio Files

You’ll need 4 short .wav files (e.g., recorded clips of a conversation) like:

utterance_0.wav
utterance_1.wav
utterance_2.wav
utterance_3.wav

Each file should be 1–5 seconds long, matching the example transcripts.
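
If all you have is one longer recording of the exchange, a rough sketch like this can slice it into the four clips (the source filename and time offsets are placeholders you’d adjust to your own recording):

import torchaudio

recording, sr = torchaudio.load("conversation.wav")          # hypothetical longer recording
clips = [(0.0, 2.0), (2.0, 4.5), (4.5, 6.0), (6.0, 8.5)]     # (start, end) in seconds

for i, (start, end) in enumerate(clips):
    segment = recording[:, int(start * sr):int(end * sr)]
    torchaudio.save(f"utterance_{i}.wav", segment, sr)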


Step 2: Define Context Segments

from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load the clip and resample it to the generator's expected sample rate
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    return torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )

Now create segments from this data:

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

Step 3: Generate Contextual Response

audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10000,
)

torchaudio.save("response.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

🎧 The resulting response.wav will match the tone, flow, and emotion of the previous dialogue — like a real conversation.
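
You can keep the exchange going by feeding each generated turn back in as context; a minimal sketch building on the objects above (the reply text is just a placeholder):

# Append the turn we just generated as a new context segment, then answer it
segments.append(Segment(text="Me too, this is some cool stuff huh?", speaker=1, audio=audio))

next_audio = generator.generate(
    text="Absolutely, the expressiveness is what stands out to me.",
    speaker=0,
    context=segments,
    max_audio_length_ms=10000,
)
torchaudio.save("next_turn.wav", next_audio.unsqueeze(0).cpu(), generator.sample_rate)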


🛠️ Troubleshooting Tips

  • Tensor shape mismatch? Try reducing max_audio_length_ms or trimming your input clips.
  • Wrong voice quality? Make sure every clip is resampled to generator.sample_rate (see the helper sketch below).
  • Slow on CPU? Use a CUDA GPU if available; generation is much faster.
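
A small helper along these lines covers the first two points (clean_clip is a hypothetical name, not part of the repo):

def clean_clip(path, max_seconds=5.0):
    # Load, resample to the generator's rate, and trim overly long clips
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate)
    max_samples = int(max_seconds * generator.sample_rate)
    return wav[:max_samples]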

🤖 Use Cases & Future Potential

With just a 1B parameter model, CSM already delivers:

  • Emotionally expressive voice agents
  • Voice cloning + synthesis pipelines
  • Local, real-time dialogue generation
  • Fine-tuning potential for custom speakers

Expect even more from larger models (like the one behind Sesame’s hosted demo), which are likely coming soon.


Resources

  • CSM on GitHub: https://github.com/SesameAILabs/csm
  • CSM-1B weights on Hugging Face: https://huggingface.co/sesame/csm-1b

Final Thoughts

CSM is one of the most impressive open-source conversational voice models available today. Whether you’re building a virtual assistant, dubbing characters, or just exploring voice AI, this tool gives you complete, local control.

And it’s just getting started.

