Google released Gemini 3.1 Flash TTS, a new text-to-speech model designed to be highly controllable rather than merely natural-sounding. It is available today in Vertex AI and Google AI Studio and is aimed at developers building voice interfaces, audiobooks, games, and character-driven agents.
The differentiator is directability. You can prompt the model with explicit emotional and performance directions, such as 'pause dramatically', 'whisper the next line', 'laugh between words', or 'speak in a panic', and the voice follows them. The model exposes controls for pacing, prosody, breath, affect, and tonal shifts within a single utterance, rather than treating a paragraph as one flat read.
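To make per-line direction concrete, here is a minimal sketch of how a script with performance directions might be assembled before it is sent to the model. It builds a prompt string only and calls no API. The `Line` type, the `build_directed_script` helper, and the square-bracket notation are assumptions made for illustration, not a documented Gemini prompt format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    """One spoken line plus an optional performance direction."""
    text: str
    direction: Optional[str] = None  # e.g. "whisper the next line"


def build_directed_script(lines: list[Line]) -> str:
    """Join lines into one prompt string.

    Directed lines are prefixed with the direction in square brackets.
    The bracket notation is a hypothetical convention for this sketch.
    """
    parts = []
    for line in lines:
        text = line.text.strip()
        if line.direction:
            parts.append(f"[{line.direction.strip()}] {text}")
        else:
            parts.append(text)
    return "\n".join(parts)


script = build_directed_script([
    Line("The door was already open."),
    Line("Someone has been here.", direction="whisper the next line"),
    Line("Run!", direction="speak in a panic"),
])
print(script)
```

Keeping the direction next to each line, rather than in one global style setting, mirrors the idea of shifting tone within a single utterance.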
Practical implications:
Narration: audiobook producers can shape a character's read line by line without recording multiple takes.
Games and agents: NPCs and voice agents can react emotionally in real time, not just pick from a few preset voices.
Accessibility: screen readers can match the tone of the source content, for example calm for a long document and urgent for a security alert.
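The accessibility case above could be sketched as a small lookup from content type to a spoken direction. The content kinds, the direction wording, and the bracket prefix are all hypothetical choices for this example. A real screen reader would need its own classification of content and whatever prompt format the model actually accepts.

```python
# Hypothetical mapping from content kind to a performance direction.
TONE_BY_KIND = {
    "long_document": "speak calmly at an even pace",
    "security_alert": "speak urgently",
}

# Fallback used when the content kind is not recognized.
DEFAULT_DIRECTION = "speak in a neutral tone"


def direction_for(kind: str) -> str:
    """Return the direction for a content kind, or the neutral default."""
    return TONE_BY_KIND.get(kind, DEFAULT_DIRECTION)


def with_tone(kind: str, text: str) -> str:
    """Prefix text with its direction (bracket notation is an assumption)."""
    return f"[{direction_for(kind)}] {text}"


print(with_tone("security_alert", "Your password was changed."))
print(with_tone("long_document", "Chapter one."))
```

An explicit default keeps unrecognized content readable instead of failing or guessing an emotional tone.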
Latency is the other selling point. Flash is the cost- and speed-optimized tier of the Gemini family, so Google is pitching 3.1 Flash TTS as viable for interactive, sub-second voice experiences rather than only for batch audio generation.
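For interactive use, the number that usually matters is how long the first chunk of audio takes to arrive, not the total generation time. The sketch below measures that for any iterable of audio byte chunks. `fake_stream` is a stand-in that sleeps and yields placeholder bytes; it is not a Gemini client, and a real streaming response would have to be adapted to the same shape.

```python
import time
from typing import Iterable, Iterator


def time_to_first_chunk(chunks: Iterable[bytes]) -> tuple[float, float, int]:
    """Return (seconds until first chunk, total seconds, total bytes)."""
    start = time.perf_counter()
    first = None
    total_bytes = 0
    for chunk in chunks:
        if first is None:
            # Record the delay once, when the first chunk arrives.
            first = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("stream produced no audio chunks")
    return first, total, total_bytes


def fake_stream(n_chunks: int = 5, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320  # placeholder bytes, not real audio


first, total, size = time_to_first_chunk(fake_stream())
print(f"first chunk after {first:.3f}s, {size} bytes in {total:.3f}s")
```

Measuring against the stream rather than the finished file is what lets a sub-second claim be checked for a given network and region.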
Combined with Gemini's existing live audio and real-time conversation features, 3.1 Flash TTS gives developers a much more expressive voice layer than the previous generation of TTS models offered.