Estimate the cost of using gpt-realtime-1.5 and gpt-realtime-mini on Azure and OpenAI based on your expected conversation volume, duration, and turn count.
gpt-realtime supports low-latency, “speech in, speech out” conversational interactions. Unlike traditional speech pipelines that chain speech-to-text → LLM → text-to-speech, gpt-realtime processes audio natively — producing faster, more natural voice interactions with a single API call.
You can connect to the Realtime API via WebRTC, WebSocket, or SIP to send audio input and receive audio responses in real time.
The latest version, gpt-realtime-1.5, is available on both OpenAI and Azure OpenAI. A smaller, more affordable variant, gpt-realtime-mini, is also available for cost-sensitive voice applications. Both models are well-suited for building voice assistants, real-time translation systems, interactive customer support agents, telephony bots, and any application where users expect a natural, spoken conversation with an AI.
The Realtime API supports three connection protocols. In most cases, WebRTC is the recommended choice for real-time audio streaming thanks to its lower latency, built-in media handling, error correction, and peer-to-peer communication.
| Protocol | Best for | Latency | Complexity |
|---|---|---|---|
| WebRTC | Client-side apps (web, mobile) | Lowest (~50-100 ms) | Higher |
| WebSocket | Server-to-server, batch processing | Moderate (~100-300 ms) | Lower |
| SIP | Telephony integration | Varies | Highest |
SIP (Session Initiation Protocol) lets you route inbound VoIP calls directly into an AI-powered session, making it ideal for telephony integration and contact center automation.
Both gpt-realtime-1.5 and gpt-realtime-mini are billed per token across two modalities, audio and text, each with separate input, cached input, and output rates.
Audio input is tokenized at 10 tokens/second; audio output at 20 tokens/second. A small number of text tokens accompanies each audio response (~3 tokens/second of assistant speech).
The total cost of a conversation depends on its duration, number of turns, and the balance between user input and assistant output in each turn.
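As a rough illustration of the tokenization rates quoted above, the sketch below converts speech durations into token counts for a single turn. The helper name `turn_tokens` and the 18-second durations are hypothetical choices for the example; only the per-second rates come from the text, and the text-token rate is approximate.

```python
# Tokenization rates quoted in the text above.
AUDIO_INPUT_TOKENS_PER_SEC = 10
AUDIO_OUTPUT_TOKENS_PER_SEC = 20
TEXT_TOKENS_PER_SEC_OF_SPEECH = 3  # approximate, per second of assistant speech


def turn_tokens(user_seconds: float, assistant_seconds: float) -> dict:
    """Estimate the new tokens produced by one turn (hypothetical helper)."""
    return {
        "audio_input": round(user_seconds * AUDIO_INPUT_TOKENS_PER_SEC),
        "audio_output": round(assistant_seconds * AUDIO_OUTPUT_TOKENS_PER_SEC),
        "text_output": round(assistant_seconds * TEXT_TOKENS_PER_SEC_OF_SPEECH),
    }


# Example: the user speaks for 18 s and the assistant replies for 18 s.
print(turn_tokens(user_seconds=18, assistant_seconds=18))
# {'audio_input': 180, 'audio_output': 360, 'text_output': 54}
```

Because output audio is tokenized at twice the input rate, a second of assistant speech produces twice as many audio tokens as a second of user speech.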
The Realtime API is ideal when your application requires spoken, interactive exchanges with sub-second latency.
For text-only workloads, batch processing, or scenarios where request latency is less critical, standard GPT models like GPT-4.1 are more cost-effective.
| Category | Price / M Tokens | Tokens / Month (in M) | Cost / Month |
|---|---|---|---|
| Audio Input | $32.00 | 0.00 | $0.07 |
| Audio Input (cached) | $0.40 | 0.00 | $0.0016 |
| Audio Output | $64.00 | 0.00 | $0.12 |
| Text Input | $4.00 | 0.00 | $0.0029 |
| Text Input (cached) | $0.40 | 0.00 | $0.0009 |
| Text Output | $16.00 | 0.00 | $0.0043 |
Per Conversation: $0.20
Estimated Monthly Total: $0.20
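The arithmetic behind the table can be sketched as follows. The prices are copied from the table above. The per-conversation token counts are an assumption: they are reverse-engineered to be consistent with the costs shown (a five-turn conversation with about 18 s of user speech and 18 s of assistant speech per turn, and one conversation per month) and are not stated by the source. Counts this small round to 0.00 when expressed in millions, which is likely why the "Tokens / Month" column shows zeros. The names `PRICES_PER_M`, `TOKENS_PER_CONVERSATION`, and `monthly_cost` are hypothetical.

```python
# USD per 1M tokens, copied from the table above.
PRICES_PER_M = {
    "audio_input": 32.00,
    "audio_input_cached": 0.40,
    "audio_output": 64.00,
    "text_input": 4.00,
    "text_input_cached": 0.40,
    "text_output": 16.00,
}

# ASSUMPTION: token counts per conversation, inferred from the costs shown.
TOKENS_PER_CONVERSATION = {
    "audio_input": 2_340,
    "audio_input_cached": 3_960,
    "audio_output": 1_800,
    "text_input": 716,
    "text_input_cached": 2_324,
    "text_output": 270,
}


def monthly_cost(tokens_per_conversation, conversations_per_month=1):
    """Return (cost per category, total) in USD for one month."""
    costs = {
        category: tokens * conversations_per_month / 1_000_000 * PRICES_PER_M[category]
        for category, tokens in tokens_per_conversation.items()
    }
    return costs, sum(costs.values())


costs, total = monthly_cost(TOKENS_PER_CONVERSATION)
for category, cost in costs.items():
    print(f"{category:<20} ${cost:.4f}")
print(f"{'total':<20} ${total:.2f}")  # total rounds to $0.20
```

Under these assumed counts, audio output is the largest single cost category, followed by uncached audio input.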
Each turn sends the full conversation history as input. Previous messages are cached.

| Turn | Input tokens | Text tokens | Audio tokens |
|---|---|---|---|
| 1 | 680 | 500 | 180 |
| 2 | 1,274 | 554 | 720 |
| 3 | 1,868 | 608 | 1,260 |
| 4 | 2,462 | 662 | 1,800 |
| 5 | 3,056 | 716 | 2,340 |
The Realtime API charges per token across audio and text modalities. Costs accumulate with each turn in a conversation because the entire conversation history is sent as input for every response. Prompt caching reduces cost for previously seen tokens.
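A minimal sketch of how input grows from turn to turn is shown below. It assumes a 500-token text prompt, 18 s of user speech and 18 s of assistant speech per turn, and that everything sent on the previous turn is a cache hit on the next one. These defaults are assumptions chosen because they are consistent with the per-turn input figures shown on this page; real caching behavior may differ. The function name `input_per_turn` is hypothetical.

```python
def input_per_turn(turns, system_text=500, user_seconds=18, assistant_seconds=18):
    """List the input tokens sent on each turn of a conversation.

    Rates: 10 audio tokens/s for user speech, 20 audio tokens/s and about
    3 text tokens/s for assistant speech.
    ASSUMPTION: the whole input of the previous turn is served from cache.
    """
    rows = []
    text_history = system_text  # text tokens currently in the history
    audio_history = 0           # audio tokens currently in the history
    previous_total = 0          # size of the previous turn's input
    for turn in range(1, turns + 1):
        audio_history += user_seconds * 10  # new user audio joins the input
        total = text_history + audio_history
        rows.append({
            "turn": turn,
            "text": text_history,
            "audio": audio_history,
            "total": total,
            "cached": previous_total,
            "uncached": total - previous_total,
        })
        # The assistant's reply becomes part of the history for the next turn.
        audio_history += assistant_seconds * 20
        text_history += assistant_seconds * 3
        previous_total = total
    return rows


for row in input_per_turn(5):
    print(row)
```

With these defaults, the total input grows by 594 tokens per turn (680 on the first turn, 3,056 on the fifth), while the uncached portion stays constant after the first turn. That is why caching matters more as conversations get longer.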
Prices per 1M tokens.

gpt-realtime-1.5:

| Modality | Input | Cached Input | Output |
|---|---|---|---|
| Audio | $32.00 | $0.40 | $64.00 |
| Text | $4.00 | $0.40 | $16.00 |
| Image | $5.00 | $0.50 | n/a |

gpt-realtime-mini:

| Modality | Input | Cached Input | Output |
|---|---|---|---|
| Audio | $10.00 | $0.30 | $20.00 |
| Text | $0.60 | $0.06 | $2.40 |
| Image | $0.80 | $0.08 | n/a |
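Per-token prices are easier to compare once converted to a price per minute of speech. The sketch below does that conversion for uncached audio, using the tokenization rates quoted earlier (10 tokens/second for input, 20 tokens/second for output). It assumes the first price table applies to gpt-realtime-1.5 and the second to gpt-realtime-mini. The function name `audio_price_per_minute` is hypothetical.

```python
# Audio tokens produced by one minute of speech, from the rates quoted earlier.
AUDIO_TOKENS_PER_MINUTE = {"input": 10 * 60, "output": 20 * 60}

# USD per 1M audio tokens, copied from the tables above.
# ASSUMPTION: the first table is gpt-realtime-1.5, the second gpt-realtime-mini.
AUDIO_PRICES_PER_M = {
    "gpt-realtime-1.5": {"input": 32.00, "output": 64.00},
    "gpt-realtime-mini": {"input": 10.00, "output": 20.00},
}


def audio_price_per_minute(model: str) -> dict:
    """Return the uncached audio price in USD per minute of speech."""
    prices = AUDIO_PRICES_PER_M[model]
    return {
        direction: AUDIO_TOKENS_PER_MINUTE[direction] * prices[direction] / 1_000_000
        for direction in ("input", "output")
    }


for model in AUDIO_PRICES_PER_M:
    per_minute = audio_price_per_minute(model)
    print(f"{model}: input ${per_minute['input']:.4f}/min, "
          f"output ${per_minute['output']:.4f}/min")
```

These figures cover only newly spoken audio. They exclude the cost of resending conversation history on each turn, so a real conversation costs more per minute than the sum of these two rates.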
This calculator provides rough estimates based on simplified assumptions. Actual costs depend on conversation dynamics, voice activity detection behavior, caching efficiency, and token overhead.