
Training Data ElevenLabs' Mati Staniszewski: How Voice Becomes the Interface for Everything
May 8, 2026
Mati Staniszewski, co-founder and CEO of ElevenLabs, builds audio AI for text-to-speech, speech-to-text, and voice agents. He recounts the company's origins in solving dubbing and accessibility problems. He explains why audio was overlooked, ElevenLabs' early monetization and scaling choices, breakthroughs in emotionality and voice cloning, and how voice will become the primary interface for agents, robots, and next-generation computing.
Episode notes
Monetize Early To Preserve Independence
- Monetize early to stay financially independent and fund model development instead of raising indefinitely.
- ElevenLabs launched product revenue quickly and then raised when ambitions required large model training budgets.
Building The Full Audio Stack From Contextual TTS To Music
- ElevenLabs built a full audio stack: text-to-speech with contextual emotion, speech-to-text, translation/dubbing, real-time streaming, orchestration, and music generation.
- They prioritized combining models into voice agents and expanded into music to capture emotional nuance.
First Wow Moments Came From Voice Likeness And Laughter
- Early internal milestones included cloning Mati's accented voice and getting the model to laugh, both of which made interactions feel human.
- Viral demos (e.g., Javier Milei, Matthew McConaughey in Spanish/Portuguese) showcased cross-language, familiar-voice dubbing impact.

