OpenAI introduces three audio models for real-time developer applications
OpenAI aims to move beyond simple transcription with its latest API
OpenAI has introduced three new audio models on its developer platform, aimed at making voice-based software agents more conversational and efficient.
In an announcement on Thursday, the ChatGPT maker unveiled GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper.
This launch signifies a strategic move beyond simple transcription toward multimodal agents that can listen, translate, and act during live conversations.
The models are currently available for testing within the OpenAI developer playground, targeting creators of virtual assistants and customer service platforms.
The primary model, GPT-Realtime-2, is engineered to manage complex requests, handle interruptions, and maintain context over extended sessions.
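For developers trying the playground release, the natural integration path would be OpenAI's existing Realtime WebSocket interface. The sketch below is a minimal illustration rather than documented usage: the model identifier "gpt-realtime-2" and the session fields are assumptions based on this announcement.

    # Minimal sketch: opening a realtime voice session, assuming the new
    # models are served through OpenAI's existing Realtime WebSocket
    # endpoint. The model id "gpt-realtime-2" is an assumption.
    import asyncio
    import json
    import os

    import websockets  # pip install websockets (>= 13 for additional_headers)


    async def main() -> None:
        url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
        headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
        async with websockets.connect(url, additional_headers=headers) as ws:
            # Ask for speech in / speech out; per the announcement,
            # interruption handling and long-session context are
            # managed on the server side.
            await ws.send(json.dumps({
                "type": "session.update",
                "session": {"modalities": ["audio", "text"]},
            }))
            # Print event types as they stream back (audio deltas,
            # transcripts, and so on).
            async for message in ws:
                print(json.loads(message).get("type"))


    asyncio.run(main())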
Meanwhile, the translation model supports more than 70 input languages and produces spoken output in 13 of them, targeting educational and customer-support settings.
For live transcription, GPT-Realtime-Whisper generates captions and meeting notes as the speaker talks.

Major firms, including Zillow, Priceline, and Deutsche Telekom, are already testing these systems, which promise improved accuracy and significantly lower latency during live interactions.
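To make the captioning flow concrete, the self-contained sketch below assembles rolling captions from incremental transcription events. The event names and shapes mirror OpenAI's existing Realtime conventions and are assumptions for GPT-Realtime-Whisper, not published specifications.

    # Sketch of turning incremental transcription events into captions.
    # The event shape (streamed "delta" fragments followed by a
    # "completed" event per utterance) follows OpenAI's existing
    # Realtime conventions and is assumed here for GPT-Realtime-Whisper.
    from typing import Iterable


    def render_captions(events: Iterable[dict]) -> Iterable[str]:
        """Yield a caption line each time an utterance is finalized."""
        buffer = []
        for event in events:
            if event["type"].endswith("transcription.delta"):
                buffer.append(event["delta"])  # partial text fragment
            elif event["type"].endswith("transcription.completed"):
                yield "".join(buffer) or event.get("transcript", "")
                buffer.clear()  # start the next utterance


    # Example with fabricated events standing in for a live stream:
    stream = [
        {"type": "conversation.item.input_audio_transcription.delta", "delta": "Hello, "},
        {"type": "conversation.item.input_audio_transcription.delta", "delta": "everyone."},
        {"type": "conversation.item.input_audio_transcription.completed", "transcript": "Hello, everyone."},
    ]
    for line in render_captions(stream):
        print(line)  # -> "Hello, everyone."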
OpenAI has also published pricing for the new suite, with GPT-Realtime-2 starting at $32 per million audio input tokens.
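To put the per-token price in concrete terms, a back-of-envelope estimate helps. The calculation below assumes, purely for illustration, about ten audio input tokens per second; OpenAI has not published a token rate for GPT-Realtime-2, so actual costs may differ.

    # Back-of-envelope cost estimate at the announced $32 per 1M audio
    # input tokens. The tokens-per-second rate is an illustrative
    # assumption, not a published figure.
    PRICE_PER_MILLION_INPUT_TOKENS = 32.00  # USD, from the announcement
    ASSUMED_TOKENS_PER_SECOND = 10          # assumption for illustration


    def input_cost_per_minute() -> float:
        tokens_per_minute = ASSUMED_TOKENS_PER_SECOND * 60
        return tokens_per_minute * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000


    print(f"~${input_cost_per_minute():.4f} per minute of audio input")  # ~$0.0192

Under these assumptions, an hour-long session would cost on the order of a dollar in audio input alone, before any output or text tokens are counted.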
This rollout comes as competition in the AI voice sector intensifies among major technology firms. OpenAI continues to focus on refining its multimodal capabilities to ensure more natural speech patterns in its software.
These developments are expected to influence the next generation of voice-enabled applications across various industries.
Following this release, the company is likely to explore further integrations of these models into its broader ecosystem later this year.