
OpenAI launches new speech models via API

OpenAI is introducing new speech-to-text and text-to-speech models via its API. These enable developers to build voice agents that are better attuned to tone and expression.

In recent months, OpenAI has launched several new tools, including Operator, Deep Research, the Computer-Using Agent and the Responses API. Those releases focused mainly on text-based agents. Now Neowin reports that OpenAI has developed new speech-to-text and text-to-speech audio models that are available via the API. These models allow developers to build more powerful, customizable and expressive voice agents.

OpenAI’s new audio models, gpt-4o-transcribe and gpt-4o-mini-transcribe, show significant improvements in word error rate, language recognition and accuracy compared to the company’s existing Whisper models. The company attributes this progress to reinforcement learning and extensive midtraining on diverse, high-quality audio datasets.

Better understanding of nuances

OpenAI claims that these new audio models are better able to understand nuances in speech. They also make fewer errors in speech recognition and provide more reliable transcriptions. This also applies to input audio with accents, background noises or varying speaking speeds.

The gpt-4o-mini-tts model is the latest text-to-speech model and offers improved controllability. Developers can now give the model instructions on how the text should be pronounced. For the time being, however, this model is limited to artificial, preset voices.

Pricing announced

The pricing of the models is as follows:

gpt-4o-transcribe: $6 per million audio input tokens, $2.50 per million text input tokens and $10 per million text output tokens
gpt-4o-mini-transcribe: $3, $1.25 and $5 per million tokens respectively
gpt-4o-mini-tts: $0.60 per million text input tokens and $12 per million audio output tokens

This amounts to the following estimated costs per minute:

gpt-4o-transcribe: approximately 0.6 cents per minute
gpt-4o-mini-transcribe: approximately 0.3 cents per minute  
gpt-4o-mini-tts: approximately 1.5 cents per minute
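The per-minute estimates above make rough budgeting straightforward. A small sketch, using only the cent-per-minute figures quoted in the list:

```python
# Rough cost estimate per model, based on the per-minute figures
# quoted above (in US cents per minute of audio).
CENTS_PER_MINUTE = {
    "gpt-4o-transcribe": 0.6,
    "gpt-4o-mini-transcribe": 0.3,
    "gpt-4o-mini-tts": 1.5,
}

def estimated_cost_usd(model: str, minutes: float) -> float:
    """Approximate cost in dollars for the given number of audio minutes."""
    return CENTS_PER_MINUTE[model] * minutes / 100

# Transcribing one hour with gpt-4o-transcribe:
print(round(estimated_cost_usd("gpt-4o-transcribe", 60), 2))       # → 0.36
# The mini variant halves that:
print(round(estimated_cost_usd("gpt-4o-mini-transcribe", 60), 2))  # → 0.18
```

Actual bills depend on token counts rather than wall-clock minutes, so these figures are estimates, not guarantees.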

The OpenAI team indicated that it intends to continue investing in the intelligence and accuracy of its audio models. The company also wants to explore ways for developers to bring their own custom voices to create even more personalized experiences, in ways that are in line with OpenAI’s safety standards.

These new audio models are now available to all developers via the API. OpenAI also announced an integration with the Agents SDK, which makes building voice agents easier. For low-latency speech-to-speech experiences, OpenAI recommends the Realtime API.