Monday, 18 May 2026
AI Daily
The Big Labs · Friday, 08 May 2026 · 3 min read

OpenAI Releases Three Realtime Voice Models with GPT-5 Reasoning

OpenAI shipped GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper on May 7, bringing GPT-5-class reasoning into live voice and translating across 70 languages in real time.

[Image: OpenAI GPT-Realtime-2 voice model launch announcement graphic. Source: sqmagazine.co.uk]

OpenAI released a trio of real-time audio models on May 7, pushing GPT-5-class reasoning into live speech for the first time and making multi-language voice translation available directly through its developer API.

Three Models, Three Distinct Jobs

The release packages distinct capabilities into separate models rather than a single omnibus voice product, a decision that signals OpenAI's intent to let developers compose voice agents from purpose-built components.

GPT-Realtime-2 is the centerpiece. Unlike its predecessor, GPT-Realtime-1.5, it pairs speech understanding with the same reasoning tier that underlies GPT-5's text capabilities. That means the model can handle multi-step requests, execute tool calls, manage interruptions mid-sentence, and maintain coherence across a 128,000-token context window — long enough to sustain hour-long calls without losing track of earlier turns. OpenAI described the goal as building voice interfaces that "can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds," a framing that positions the model for enterprise voice agents rather than simple call-and-response assistants.

GPT-Realtime-Translate tackles live multilingual communication. It accepts speech input in more than 70 languages and produces output in 13 target languages that cover the bulk of global commercial communication, keeping pace with the conversation rather than introducing the perceptible lag that has historically made machine translation awkward in live settings. The target markets are obvious: international customer support, cross-border sales, global events, and creator platforms distributing content to non-native audiences.

GPT-Realtime-Whisper handles streaming transcription. Where conventional speech-to-text systems wait for a speaker to finish a phrase before processing, this model runs inference continuously as speech arrives, producing captions, meeting notes, or live assistant prompts that update in real time. At $0.017 per minute it is the lowest-cost model in the trio, designed for high-volume transcription pipelines.
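The difference between phrase-level and streaming transcription can be illustrated with a toy sketch. It does not call any OpenAI API: the function names are hypothetical, and short text fragments stand in for decoded audio chunks, purely to show when output becomes available in each approach.

```python
from typing import Iterable, Iterator


def phrase_level(chunks: Iterable[str]) -> Iterator[str]:
    """Conventional behavior: buffer the whole phrase, then emit once."""
    buffered = list(chunks)  # nothing is emitted until the speaker finishes
    yield " ".join(buffered)


def streaming(chunks: Iterable[str]) -> Iterator[str]:
    """Streaming behavior: emit an updated partial transcript per chunk."""
    partial = []
    for chunk in chunks:
        partial.append(chunk)
        yield " ".join(partial)  # caption refreshes as speech arrives


if __name__ == "__main__":
    # Hypothetical stand-in for audio chunks that have already been decoded.
    audio = ["book", "a", "table", "for", "two"]
    for caption in streaming(audio):
        print(caption)
```

In the streaming version a caption is available after the first chunk, while the phrase-level version produces nothing until the input ends.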

Pricing and API Access

All three models are available immediately through OpenAI's Realtime API, with a Playground interface for rapid prototyping. GPT-Realtime-2 is billed by token consumption: $32 per million audio input tokens, dropping to $0.40 per million for cached input, and $64 per million output tokens. The translation and transcription models carry per-minute pricing of $0.034 and $0.017 per minute, respectively.

The token-based billing for GPT-Realtime-2 is significant because cost tracks the tokens a conversation actually consumes rather than raw audio duration, giving enterprise buyers pricing that scales with the work done. Two calls of the same length can cost different amounts depending on how much audio the model processes and generates, whereas fixed-rate per-minute billing charges them identically.
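For a rough sense of scale, the rates quoted in this section can be turned into a small cost estimator. This is a minimal sketch, not an official calculator: the function names and the sample token counts are hypothetical, the cached-input rate is assumed to be per million tokens, and the prices are a snapshot of the figures reported here.

```python
# Prices in USD as quoted in this article; treat them as a snapshot.
AUDIO_INPUT_PER_M = 32.00   # GPT-Realtime-2, per million audio input tokens
CACHED_INPUT_PER_M = 0.40   # GPT-Realtime-2, per million cached input tokens (assumed unit)
OUTPUT_PER_M = 64.00        # GPT-Realtime-2, per million output tokens
TRANSLATE_PER_MIN = 0.034   # GPT-Realtime-Translate, per minute
WHISPER_PER_MIN = 0.017     # GPT-Realtime-Whisper, per minute


def realtime2_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Token-based cost: each token bucket is billed at its own per-million rate."""
    total = (
        input_tokens * AUDIO_INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Duration-based cost: depends only on how long the audio runs."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    # Hypothetical token counts for one call; real usage will vary.
    print(round(realtime2_cost(10_000, 20_000, 5_000), 4))
    print(round(per_minute_cost(60, WHISPER_PER_MIN), 4))
    print(round(per_minute_cost(60, TRANSLATE_PER_MIN), 4))
```

Under these assumptions, an hour of transcription costs about $1.02 and an hour of translation about $2.04, while the GPT-Realtime-2 figure depends entirely on the token counts supplied.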

Why This Matters for Voice AI

The voice interface space has long lagged the rapid capability improvements seen in text models. Early conversational AI products, including earlier versions of ChatGPT's voice mode, tended to work best on bounded, simple queries. Injecting GPT-5-class reasoning into the audio pipeline removes a structural ceiling that has constrained the category.

For developers, the practical impact is a narrowing gap between what a voice agent can accomplish and what a text agent can. Tasks like booking an appointment while checking calendar constraints, walking through a technical support tree, or fielding complex financial questions — all of which require maintaining context and performing multi-step inference — become tractable for voice-native interfaces without routing audio through a text intermediary.

The three-model structure also addresses a persistent tension in voice AI product design: a single model optimized for general reasoning tends to be over-engineered and over-priced for use cases that only need fast transcription. Separating the tiers lets developers pay for capability matching actual requirements.

What to Watch

OpenAI noted that GPT-Realtime-2 includes built-in guardrails to prevent spam, fraud, and abuse — a response to concerns that increasingly capable voice models could be weaponized for automated phishing or social-engineering calls. How those guardrails interact with legitimate red-teaming and security-research workflows will be worth monitoring, particularly given the same day's separate announcement of GPT-5.5-Cyber for vetted security teams.

The translation model's 70-input to 13-output language asymmetry is also an implicit product roadmap signal. Expanding the output language count is the obvious next increment, and the competitive pressure from Google — whose I/O conference on May 19 is widely expected to include Gemini voice announcements — provides a clear external deadline.

Tags: OpenAI, voice AI, GPT-Realtime-2, API, speech, translation
