Voice AI’s Moment
Voice AI is finally having its moment. We believe “2025 Is the Year of Voice AI Agents”. After years of sci-fi dreams and early voice assistants, recent advances have aligned to make voice a viable, even preferred, interface. The combination of ultra-low latency, human-like quality, and drastically lower costs has unlocked true voice-native product development. Founders and developers can now build real-time voice agents and assistants that feel natural to users, a leap from the stiff, scripted experiences of Siri or Alexa. In other words, the core pieces of the voice AI stack have matured in unison, setting the stage for an explosion of new voice-first products.
NFX recently published a great piece on Voice AI, and this article aims to share some of our observations in the space as well.
Infrastructure: Voice Tech as a Service
A new wave of voice AI infrastructure is empowering rapid iteration for application builders. Instead of stitching together speech recognition, synthesis, and telephony from scratch, developers can tap off-the-shelf cloud APIs that handle it all. More startups (like Vapi, Gabber, and Retell AI) are offering end-to-end voice AI engines, from real-time transcription and response generation to expressive speech output, accessible with a few API calls. Vapi’s platform has already powered tens of millions of AI-driven calls, with over 150,000 developers onboard and $8M in annual revenue. Retell AI, another rising player, offers a full-stack voice agent platform that enables developers to launch real-time conversational bots with just a few lines of code. Despite a lean team of seven, they’re generating $7.2M in revenue. Crucially, these platforms deliver low latency and high fidelity. Conversational AI now responds in under 300ms, making interactions feel instant and human-like. Anything noticeably slower, even a lag of a second or two, feels “too slow” to today’s users, so infrastructure providers have optimized for speed and even support barge-in (interruptible speech) for natural dialogues.
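The loop these platforms abstract away is, at its core, a speech-to-text, language-model, text-to-speech turn pipeline with barge-in handling. A minimal sketch is below; every component is a stub for illustration (real platforms stream audio continuously and enforce the sub-300ms budget at the transport layer):

```python
import time

def transcribe(audio: str) -> str:        # stand-in for streaming STT
    return audio

def generate_reply(text: str) -> str:     # stand-in for an LLM call
    return f"You said: {text}"

def synthesize(text: str) -> str:         # stand-in for streaming TTS
    return f"<speech:{text}>"

def handle_turn(audio: str, interrupted: bool = False):
    """One conversational turn; barge-in cancels the agent's response."""
    start = time.monotonic()
    reply = generate_reply(transcribe(audio))
    if interrupted:          # the user started speaking again mid-turn
        return None          # drop the response instead of talking over them
    speech = synthesize(reply)
    latency_ms = (time.monotonic() - start) * 1000
    # production systems would alert or degrade gracefully past ~300ms
    return speech
```

The key design point is that interruption is checked before synthesis: a barge-in discards the pending reply rather than queuing it, which is what makes a dialogue feel natural rather than walkie-talkie-like.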
This infrastructure layer has also driven costs down dramatically. Today’s services for streaming AI voices still cost $5–$13 per hour of generated speech, yet companies like Gabber are working to provide hyper-realistic, emotive voice cloning at $1 per hour. The price of AI voice has plummeted to the point that usage at scale is finally affordable for startups and indie builders alike. With this robust infrastructure layer, we’re seeing a surge of experimentation similar to the early mobile app boom; we anticipate costs dropping below $1 per hour this year, and as the developer experience improves, it will become even easier for app builders to create voice-first applications.
Foundation Models: Expressiveness at a Tipping Point
Behind the scenes, foundational voice models have reached a tipping point in expressiveness and realism. New generative speech models can produce voices nearly indistinguishable from human speech, complete with natural intonation, emotion, and even imperfection. Pioneers like ElevenLabs have shown that AI voices can read text in multiple languages with convincing tone and personality. And startups like Sesame have pushed the envelope further: their Conversational Speech Model (recently open-sourced) has crossed the uncanny valley of speech to incorporate lifelike pauses and chuckles, according to early testers. It’s designed to run in real-time with just 1B parameters, making it lightweight enough for edge deployment, though that smaller size can come with occasional hallucinations or loss of accuracy in trickier conversations. Startups like Cartesia are also pushing boundaries, building next-gen high-speed, high-fidelity voice foundation models. Their Sonic TTS model family can generate natural voices in as little as 40 milliseconds and even clone a voice from just three seconds of audio. With over 50,000 users and growing enterprise adoption, Cartesia is showing that fast, expressive voice AI is quickly moving from demo to default.
Not only are these voice models high-quality, they’re also API-accessible and adaptable. Developers can plug a model like ElevenLabs’ into their apps via simple SDKs, or fine-tune open models like Sesame’s for specific characters or use cases. Critically, modern voice AI isn’t limited to pre-scripted phrases; it can generate any response on the fly, guided by LLMs for intelligence. By pairing speech models with powerful LLMs, voice systems can understand free-form input and craft responses without needing hard-coded dialogue trees. This plug-and-play intelligence means a voice assistant can be stood up with general knowledge and conversational ability from day one.
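The difference between a hard-coded dialogue tree and LLM-backed generation can be illustrated with a toy comparison. Here `llm()` is a stub standing in for any chat-completion API; the point is that the scripted path can only answer intents it was built for, while the LLM-backed path accepts arbitrary free-form input:

```python
# Old approach: a fixed intent-to-response table.
DIALOGUE_TREE = {
    "billing": "Please hold while I transfer you to billing.",
    "hours": "We are open 9am to 5pm, Monday through Friday.",
}

def scripted_reply(intent: str) -> str:
    # Anything outside the tree falls through to a canned apology.
    return DIALOGUE_TREE.get(intent, "Sorry, I didn't understand that.")

def llm(prompt: str) -> str:
    # Stub: in practice this would call a hosted LLM, and the returned
    # text would be handed straight to a TTS model for speech output.
    return f"(generated answer to: {prompt})"

def voice_native_reply(utterance: str) -> str:
    # New approach: any utterance produces a tailored response.
    return llm(utterance)
```

A query like “can I pay my bill in installments?” dead-ends in the scripted path but gets a tailored answer in the LLM-backed one, which is why dialogue trees are disappearing from modern voice stacks.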
Applications: Voice-First Products Come of Age
With robust infrastructure and lifelike models in place, a surge of voice-first applications is finally coming to market. These early products illustrate that voice AI is working in the real world, not just in research demos. A prime area of traction is enterprise "AI agents" that handle high-volume, repetitive calls. In customer service and sales, for instance, voice AI is already delivering clear ROI. Numeo AI’s voice agent in logistics is even outperforming human employees, negotiating freight deals faster and more effectively by leveraging instant data access and no emotional bias. When you can replace a $40k/year call center rep with an AI system that costs ~$4k/year and never sleeps, the value proposition writes itself. It’s no surprise we’re seeing rapid adoption in call centers and contact-intensive industries where voice AI delivers 10x better unit economics on day one.
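As a back-of-envelope check on those unit economics: the $40k and ~$4k annual figures come from above, while the working-hours assumptions (a ~2,000-hour human work year vs. round-the-clock AI availability) are ours for illustration:

```python
HUMAN_ANNUAL_COST = 40_000   # $/year, figure from the text
AI_ANNUAL_COST = 4_000       # $/year, figure from the text
HUMAN_HOURS = 2_000          # assumed: ~40h/week, 50 weeks
AI_HOURS = 24 * 365          # "never sleeps": available all year

cost_ratio = HUMAN_ANNUAL_COST / AI_ANNUAL_COST       # 10x on raw cost
human_per_hour = HUMAN_ANNUAL_COST / HUMAN_HOURS      # $ per staffed hour
ai_per_hour = AI_ANNUAL_COST / AI_HOURS               # $ per available hour

print(f"raw cost ratio: {cost_ratio:.0f}x")
print(f"per-hour: ${human_per_hour:.2f} (human) vs ${ai_per_hour:.2f} (AI)")
```

On raw cost the gap is 10x, but on cost per available hour it widens to roughly 40x, which is what makes the call-center math so lopsided.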
Beyond these ROI-driven use cases, voice AI is unlocking entirely new product categories. Our portfolio company Autograph, which is building a personal storytelling platform powered by generative AI, is one example. Autograph uses weekly voice recordings of a person to create a voice-cloned, conversational version of them: essentially an AI that captures your life stories and can speak to your family long after you’re gone. This kind of emotionally rich, voice-native experience simply wasn’t possible until now. It exemplifies how voice AI can create delight and intimacy, not just efficiency.
Similarly, we’re on the cusp of voice-guided education and therapy apps, immersive storytelling games with AI narration, and personal companions that converse with empathy. Early hits like Character.ai showed the demand for conversational agents that users form a connection with, even in text; adding authentic voice to the mix amplifies that effect. OpenAudio’s TTS platform, FishAudio, also offers voice cloning capabilities that enable users to generate natural-sounding speech in multiple languages. Its applications span content creation, customer support, education, and entertainment, providing tools for realistic voice synthesis and personalized audio experiences. Since launch, the company has grown revenue from zero to approximately $4 million, with current monthly revenue reaching the $5 million range. At the same time, monthly active users have surged from 50,000 in early January to around 400,000 today.
Looking Ahead
The voice AI stack has matured to a point where voice-first products can thrive. We’re witnessing the transition of voice AI from a long-held promise to a practical platform for solving problems and creating value. Real-time voice agents and assistants are moving from hype to habit.
For founders and investors, this convergence opens up a wide frontier of opportunities. Much like the smartphone revolution unlocked new startups, the voice AI revolution will give rise to companies that reimagine software with conversation at the center. Latency, quality, and cost have aligned in 2025, and voice is poised to become an essential modality across tech. The winners in this next era will be those who recognize that computers with a voice can unlock interactions (and markets) that screens never could. Now that voice AI is finally working, the race is on to build the defining applications of the voice-first future.