OpenAI Adds Real-Time Reasoning to Voice AI With GPT-Realtime-2
OpenAI released three new voice models this week: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Here's what's technically new, how pricing works, and what developers can build with them.
Voice AI has been fast for years. It's been getting decent at following instructions. What it hasn't been able to do, until now, is actually reason during a live conversation.
On May 7, 2026, OpenAI announced three new models for its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Each targets a different slice of the voice AI stack, but the model everyone is paying attention to is GPT-Realtime-2, which brings GPT-5-class reasoning into audio-in, audio-out interactions.
That's a meaningful shift. Previous realtime voice models were essentially fast pattern matchers that used separate transcription, reasoning, and synthesis steps under the hood, even when the latency was low enough to feel continuous. GPT-Realtime-2 handles reasoning within the audio loop itself. The model doesn't transcribe your speech, send it to a language model, then synthesize a response. It processes audio directly, which reduces compounding errors from the transcription step and allows it to carry context and reasoning across longer sessions.
Here's what's new, what each model actually does, and what the pricing means for anyone thinking about building voice-driven products.
What Is New in the GPT-Realtime API
The most significant technical upgrade in GPT-Realtime-2 is the reasoning integration. It's built on GPT-5-class reasoning, which means it can handle harder requests than its predecessor without breaking character or falling back to generic responses.
In practice, this surfaces in several specific ways. Preambles let agents verbalize what they're doing while executing tool calls. Instead of going silent for three seconds while the model looks up order status or checks availability, the agent can say something like "let me pull that up for you" before the result comes back. That sounds minor, but in voice interactions, silence is a significant usability problem. Users interpret dead air as a connection failure or a confused system. Preambles keep the conversation feeling continuous without requiring any special engineering from the developer.
Parallel tool calls let the model make multiple backend requests simultaneously and narrate what's happening. If a customer service agent needs to check both account status and recent order history to answer a question, GPT-Realtime-2 can execute those queries in parallel and summarize the combined results without requiring sequential chaining logic. This matters for latency in agentic voice applications, where multi-step tool use was previously a bottleneck.
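The application side of this is mostly ordinary async fan-out. Here's a minimal Python sketch of executing every tool call the model emits in a single turn concurrently; the tool names, payloads, and the shape of the collected calls are illustrative assumptions, not part of OpenAI's API.

```python
import asyncio
import json

# Stand-in backend lookups. The names and payloads are illustrative
# assumptions, not part of any OpenAI API.
async def check_account_status(customer_id: str) -> dict:
    await asyncio.sleep(0.3)  # simulate backend latency
    return {"customer_id": customer_id, "status": "active"}

async def fetch_recent_orders(customer_id: str) -> dict:
    await asyncio.sleep(0.5)
    return {"customer_id": customer_id, "orders": ["A1001", "A1002"]}

# Maps tool names (as declared in the session config) to local handlers.
TOOL_HANDLERS = {
    "check_account_status": check_account_status,
    "fetch_recent_orders": fetch_recent_orders,
}

async def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Run every tool call from one model turn concurrently.

    `tool_calls` is assumed to be a list of {"name", "arguments"} dicts
    collected from the Realtime event stream for a single response.
    """
    tasks = [
        TOOL_HANDLERS[call["name"]](**json.loads(call["arguments"]))
        for call in tool_calls
    ]
    # gather() bounds total latency by the slowest lookup rather than
    # the sum of both, which is the point of parallel tool calls.
    results = await asyncio.gather(*tasks)
    return [{"tool": c["name"], "output": r} for c, r in zip(tool_calls, results)]

# Example: the model requested both lookups in the same turn.
calls = [
    {"name": "check_account_status", "arguments": '{"customer_id": "c_42"}'},
    {"name": "fetch_recent_orders", "arguments": '{"customer_id": "c_42"}'},
]
print(asyncio.run(run_tool_calls(calls)))  # finishes in ~0.5s, not ~0.8s
```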
Recovery behavior is more graceful than in previous versions. When the model can't complete a request, it surfaces the failure in a way that sounds natural rather than freezing or producing generic error language. Tone adjustment allows developers to specify vocal register: calm for support interactions, more upbeat for confirmations. Previous realtime voice models had limited ability to hold a consistent tone across a conversation.
The context window expanded from 32K tokens to 128K tokens. For voice AI, that number matters more than it might seem. A 32K context window is enough for short customer service calls but starts to constrain performance in longer sessions, complex agentic workflows where the model needs to hold state across many turns, or applications that inject large document contexts before the conversation starts.
Reasoning effort is now configurable with five tiers: minimal, low (the default), medium, high, and xhigh. Minimal is appropriate for simple lookups and fast confirmations. The xhigh setting is what you'd use for complex multi-step reasoning, at the cost of slightly higher latency and token spend.
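Configuring the tier would presumably happen at session setup. This is a minimal sketch, with the event shape modeled on OpenAI's existing Realtime API; the `reasoning_effort` field name and `gpt-realtime-2` model id are assumptions based on the tiers described above, not documented parameters.

```python
import json

# Hypothetical session configuration. "session.update" follows the
# existing Realtime API; "reasoning_effort" and the model id are
# assumed names, not documented parameters.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",
        "reasoning_effort": "low",  # minimal | low | medium | high | xhigh
        "instructions": "You are a calm, concise support agent.",
    },
}

# Sent over the Realtime WebSocket connection, e.g.:
# await ws.send(json.dumps(session_update))
print(json.dumps(session_update, indent=2))
```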
GPT-Realtime-Translate is a dedicated live translation model. It supports speech input in over 70 languages and output in 13, and it maintains conversational pace while translating, meaning it doesn't introduce the multi-second lag that has historically made real-time translation awkward in live interactions. The 70-to-13 asymmetry reflects a real technical distinction: listening comprehension scales with training data coverage, while voice output requires careful calibration of pronunciation, prosody, and idiom. The 13 output languages reflect where OpenAI is confident in production quality rather than experimental coverage.
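A translation session would likely need only the output side configured, since input coverage is broad. The sketch below is hypothetical throughout: the model id follows the announcement's naming and the `output_language` field is an assumed parameter.

```python
import json

# Hypothetical translation-session config; "output_language" is an
# assumed field name, constrained to the 13 production output languages.
translate_session = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-translate",
        "output_language": "es",
    },
}
# await ws.send(json.dumps(translate_session))
```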
GPT-Realtime-Whisper is a streaming speech-to-text model. Unlike traditional transcription, which produces a complete transcript after audio input ends, GPT-Realtime-Whisper transcribes incrementally as the speaker talks. This makes it suitable for real-time captioning at live events, meeting note generation while the meeting is happening, live broadcast subtitling, and accessibility applications that require on-screen text to appear as someone speaks rather than after they finish a sentence. At $0.017 per minute, it's the lowest-cost entry point in the new model family.
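Consuming incremental transcripts is an ordinary event loop over a WebSocket. The endpoint, model id, and event names below are assumptions modeled on the transcription events in OpenAI's existing Realtime API; verify them against current docs before depending on them.

```python
import asyncio
import json
import websockets  # pip install websockets

# Assumed endpoint and model id, modeled on the existing Realtime API.
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"

async def live_captions(api_key: str) -> None:
    headers = {"Authorization": f"Bearer {api_key}"}
    # Note: older websockets versions spell this kwarg "extra_headers".
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Captured audio would be streamed in alongside this loop, e.g.:
        # await ws.send(json.dumps(
        #     {"type": "input_audio_buffer.append", "audio": "<base64 pcm16>"}))
        async for raw in ws:
            event = json.loads(raw)
            # Print each partial transcript as it arrives instead of
            # waiting for the utterance to end.
            if event.get("type", "").endswith("transcription.delta"):
                print(event.get("delta", ""), end="", flush=True)

# asyncio.run(live_captions("sk-..."))
```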
Performance, Pricing, and Architecture
OpenAI released two benchmark comparisons against GPT-Realtime-1.5: on Big Bench Audio, GPT-Realtime-2 scores 15.2% higher at maximum effort levels; on Audio MultiChallenge, it scores 13.8% higher. Benchmarks are useful directionally but don't always translate cleanly to production performance. The Zillow case study included in the announcement is more informative for production-grade evaluation.
Zillow uses realtime voice AI for some of its highest-friction customer interactions, specifically the adversarial scenarios where customers are frustrated, questions are complex, or the system needs to reason across multiple data points before responding. On Zillow's hardest adversarial benchmark, GPT-Realtime-2 achieved a 95% call-success rate, up from 69% on the previous model. That's a 26-point improvement on the interactions that matter most for customer experience, and a gain of that size on adversarial benchmarks tells you something specific: the reasoning improvement isn't just showing up on controlled tests. It's showing up in the exact situations where prior models broke down.
GPT-Realtime-2 uses token-based pricing: $32 per million audio-input tokens, $0.40 per million cached input tokens, and $64 per million audio-output tokens. For reference, one minute of audio is roughly 1,500 to 2,000 tokens depending on speaking pace. A ten-minute customer service call might use around 15,000 to 20,000 tokens combined for input and output, putting the cost at roughly $0.50 to $1.00 per call at standard token pricing. Cached input at $0.40 per million tokens applies to context that gets reused across multiple calls, which is relevant for applications that inject system prompts, knowledge bases, or customer profiles before each session. GPT-Realtime-Translate is priced at $0.034 per minute and GPT-Realtime-Whisper at $0.017 per minute.
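Those per-call figures are easy to sanity-check. The sketch below recomputes the ten-minute estimate from the quoted token prices; the 50/50 input/output split is an assumption for a balanced back-and-forth, not a measured figure.

```python
# Back-of-envelope check of the per-call estimate quoted above.
INPUT_PER_M = 32.00            # USD per 1M audio-input tokens
OUTPUT_PER_M = 64.00           # USD per 1M audio-output tokens
TOKENS_PER_MIN = (1500, 2000)  # rough tokens per minute of audio

def call_cost(minutes: float, output_share: float = 0.5) -> tuple[float, float]:
    """Low/high cost estimate for one call.

    output_share is the fraction of tokens that are model output; 0.5 is
    an assumed split for a balanced conversation.
    """
    costs = []
    for tpm in TOKENS_PER_MIN:
        total = minutes * tpm
        costs.append(
            total * (1 - output_share) * INPUT_PER_M / 1e6
            + total * output_share * OUTPUT_PER_M / 1e6
        )
    return tuple(costs)

print(call_cost(10))  # (0.72, 0.96) -- consistent with the $0.50-$1.00 range
```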
When comparing GPT-Realtime-2 against alternatives, the key variable is what you're paying for on calls where the model has to reason across tools or complex context. Previous models were less expensive per token but often needed multiple turns to complete what a reasoning model handles in one. Whether GPT-Realtime-2's per-call cost is favorable depends on how often your application hits the kinds of multi-step, contextually complex interactions where GPT-5-class reasoning makes a material difference.
For developers moving from text-based agents to voice-based agents, GPT-Realtime-2 introduces some architecture shifts worth thinking through before building. The 128K context window changes how you handle session state. In text-agent applications, developers often manage context windows aggressively, summarizing earlier conversation history to stay under limits. For most customer service applications, a full session transcript will now fit within context. You can inject customer history, product knowledge bases, and policy documents before the session starts without worrying about context running out mid-call.
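Front-loading that context is a matter of sending conversation items before any audio flows. The sketch uses the `conversation.item.create` event shape from OpenAI's existing Realtime API; whether the new models accept large documents exactly this way is an assumption worth verifying.

```python
import json

# Build system-role context items to send before the session's audio
# starts. The event shape follows the existing Realtime API; injecting
# full documents this way is an assumption, not documented behavior.
def context_items(customer_profile: str, policy_doc: str) -> list[dict]:
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "system",
                "content": [{"type": "input_text", "text": text}],
            },
        }
        for text in (customer_profile, policy_doc)
    ]

# With 128K tokens of context, both items can be sizable documents:
# for event in context_items(profile, policy):
#     await ws.send(json.dumps(event))
```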
Parallel tool calls change how you structure your tool definitions. If your application previously required chained tool calls, where the output of one lookup feeds the input of the next, some of those chains can now be parallelized. Reasoning tiers give you a cost-optimization lever that didn't exist in the previous generation. A session that starts with simple identity verification at minimal effort, then moves into a complex billing dispute at high or xhigh effort, can now use different reasoning configurations within the same session. For high-volume applications, this distinction matters economically. Session-level tone configuration is most valuable for applications with distinct interaction modes, like a financial advisory tool that handles both routine confirmations and stressful account reviews.
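Switching tiers mid-call would presumably reuse the same session-update mechanism sketched earlier; the `reasoning_effort` field remains a hypothetical name.

```python
import json

# Escalate or dial back reasoning effort mid-session. "session.update"
# can be sent mid-conversation in the existing Realtime API;
# "reasoning_effort" is an assumed field name.
def set_effort(tier: str) -> str:
    assert tier in {"minimal", "low", "medium", "high", "xhigh"}
    return json.dumps({"type": "session.update",
                       "session": {"reasoning_effort": tier}})

# Identity verification: cheap and fast.
# await ws.send(set_effort("minimal"))
# Later, the caller raises a complex billing dispute:
# await ws.send(set_effort("xhigh"))
```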
Where This Leaves Voice AI Products
Customer service at scale is the most obvious application. Voice-driven support agents that can look up orders, process returns, handle billing questions, and escalate when appropriate have existed for years, but the combination of 128K context, parallel tool calls, and GPT-5-class reasoning makes them substantially more reliable on complex interactions. The Zillow benchmark improvement is a preview of what teams building similar applications can expect when upgrading from earlier models.
Voice-controlled enterprise tools are another clear use case. Enterprise software is increasingly agent-first. Developers using platforms like those described in the OpenAI GPT-5.5 API expansion can now extend those workflows into voice interfaces where employees navigate systems, trigger actions, and get contextual answers by speaking rather than typing. The 128K context window makes it practical to inject large knowledge bases or user-specific data before the session starts.
Multilingual applications that previously required translation middleware can now route through a single API. A global customer service operation serving customers in 70 input languages can deploy GPT-Realtime-Translate as a single-vendor solution rather than stitching together transcription, translation, and synthesis from different providers. Education platforms that need interactive tutoring in voice form, accessibility tools for users who can't interact through text, media production workflows that require live captions, and creator tools that help people record and narrate content in real time all fit the new capability set.
OpenAI stated that GPT-Realtime-2 ships with guardrails designed to prevent misuse for spam and fraud. The system includes mechanisms that can halt conversations when violations of harmful content guidelines are detected during the interaction. Voice AI that can execute tool calls, access customer data, and carry on extended conversations with real people requires trust that the model won't be used in ways that damage users or the organizations deploying it. OpenAI's framing of explicit spam and fraud prevention suggests the guardrails are designed with the actual commercial use cases in mind.
Voice AI as a product category has been promising large returns for years. Practical limitations kept most deployments in the narrow, high-volume, low-complexity range: order status checks, appointment confirmations, simple FAQs. GPT-Realtime-2 moves the technical ceiling higher. The Zillow result (69% to 95% on adversarial benchmarks) is the most concrete signal of what that higher ceiling means in production. It doesn't mean voice AI becomes perfect. It means the scenarios where it previously broke down reliably are now more often handled well.
For anyone building in this space, the practical question is where their application sits on the complexity distribution. If most calls are simple and high-volume, the upgrade path from GPT-Realtime-1.5 is worth evaluating but not urgent. If a significant fraction of calls involve complex reasoning, multi-system tool calls, or extended sessions where context matters, the performance jump is substantial enough to warrant immediate testing. The addition of GPT-Realtime-Translate and GPT-Realtime-Whisper under the same API umbrella also changes the architecture of multilingual voice applications. Previously, combining transcription, translation, and voice response required integrating multiple providers. All three models are now available through OpenAI's Realtime API with consistent pricing, authentication, and latency expectations.
The 128K context window and reasoning tier system suggest OpenAI is designing GPT-Realtime-2 for applications that are increasingly hard to distinguish from human agent interactions. The next natural extensions are even longer context windows, finer control over reasoning effort at the turn level rather than the session level, and tighter integration between voice and multimodal inputs. The launch of GPT-Realtime-Translate with 70 input languages but only 13 output languages also points to an area of ongoing development. Expanding output language support requires investment in voice quality evaluation that is harder to scale quickly than comprehension coverage.
For a broader view of how current AI models compare across conversation, reasoning, and tool-use capabilities, the Claude vs GPT vs Gemini model comparison on AIntelligenceHub covers the current state of the model landscape across the major providers. The TechCrunch coverage of the launch includes additional developer context and positioning from OpenAI on the use cases they're prioritizing. GPT-Realtime-2 and its companion models are available now through the OpenAI Realtime API.