The text box was just the prototype

In a single 48-hour stretch, Apple revealed plans to open Siri to rival chatbots, Mistral shipped an open-source voice model rivaling ElevenLabs, Google launched studio-quality AI music generation, and IBM embedded voice AI into enterprise agents. The chat window got AI onto our screens — voice and audio are how it escapes into the rest of our lives. The companies that dominate the next phase of AI won't be the ones with the best text completions; they'll be the ones that sound the most human.



Every major interaction you've had with AI so far happened in a text box. You typed, it typed back. That was the prototype.

In a single 48-hour stretch this week, four separate announcements made it clear the industry has moved on. Apple revealed plans to turn Siri into a switchable front end for competing AI services. Mistral shipped an open-source voice model that clones speakers from three seconds of audio. Google launched a music generation model that understands song structure. And IBM embedded ElevenLabs' voice stack into enterprise AI agents. The text box got AI onto your screen. Voice and audio are how it gets everywhere else.

MacRumors reported that iOS 27 will let users swap ChatGPT, Gemini, Claude, and others into Siri through a new Extensions system, expected to launch at WWDC on 8 June. This is the distribution story that matters most. Apple isn't building a better chatbot. It's building the marketplace where 2.2 billion devices choose between chatbots, and the battleground is voice, not text. When you ask Siri a question, you don't want a paragraph. You want a spoken answer that sounds like it came from someone who understood you.

Which is exactly where Mistral is pointing. TechCrunch reported that the company released Voxtral TTS, a 4-billion-parameter text-to-speech model with 90ms time-to-first-audio. That's fast enough for real-time conversation on a laptop. It clones voices from three seconds of audio, supports nine languages, and runs on consumer hardware. The open weights are on Hugging Face under a CC BY-NC 4.0 licence. At $0.016 per 1,000 characters via API, Mistral is pricing voice as a commodity before the incumbents have finished pricing it as a premium.
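To put that pricing in perspective, here's a back-of-the-envelope calculator. The rate comes from the article; the function name and the example character count are my own illustration, not anything from Mistral's docs:

```python
# Rough cost estimate for Voxtral TTS API usage, using the
# $0.016 per 1,000 characters rate quoted above.
RATE_PER_1K_CHARS = 0.016  # USD per 1,000 characters (quoted API price)

def tts_cost_usd(num_chars: int) -> float:
    """Estimated API cost in USD for synthesising num_chars of text."""
    return num_chars / 1_000 * RATE_PER_1K_CHARS

# A full-length novel runs roughly 500,000 characters of text.
print(f"Full audiobook narration: ${tts_cost_usd(500_000):.2f}")  # $8.00
```

Roughly eight dollars to narrate an entire book is what "pricing voice as a commodity" looks like in practice.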

Google's move is different in kind but identical in direction. The company announced Lyria 3 Pro, a music generation model that understands track structure (intros, verses, choruses, bridges) and accepts multimodal input, including images, to influence mood. Every output carries a SynthID watermark. This isn't about competing with Spotify. It's about making audio generation a native capability of the platform, available through the Gemini API and Vertex AI. When audio becomes an API call, every product team can add it.

The enterprise side is moving just as fast. IBM announced that ElevenLabs' voice capabilities are now embedded in watsonx Orchestrate, giving AI agents access to over 10,000 voices across 70 languages with PCI compliance, HIPAA support, and zero-retention mode. Banks and healthcare providers are the early adopters. Voice AI just got its SOC 2 moment.

What the pattern means

I think the companies that win the next phase of AI won't be the ones with the best text completions. They'll be the ones that sound the most human. Text was the minimum viable interface, good enough to prove the technology worked, limited enough to keep AI contained to a browser tab. Voice breaks that containment. It puts AI in your car, your earbuds, your phone calls, your enterprise call centre.

The economics follow. Apple is building the distribution layer. Mistral is commoditising the voice layer with open weights. Google is extending generation from text to structured audio. IBM and ElevenLabs are wiring voice into regulated enterprise workflows. Four companies, four layers of the stack, all betting that the future of AI interaction is spoken, not typed.

For builders, the question is whether your product still assumes a text box. If it does, you're building for the prototype era.


Read the original on MacRumors
