What is gpt-realtime and how does it work?

gpt-realtime supports low-latency, 'speech in, speech out' conversational interactions. You can connect to the Realtime API via WebRTC, WebSocket, or SIP to send audio input and receive audio responses in real time — without chaining separate speech-to-text and text-to-speech steps. It powers use cases like voice assistants, live translation, telephony integration, and interactive agents.

How much does gpt-realtime-1.5 cost?

gpt-realtime-1.5 pricing is per 1 million tokens. Audio tokens cost $32 input, $0.40 cached input, and $64 output. Text tokens cost $4 input, $0.40 cached input, and $16 output. Costs depend on conversation length, number of turns, and the ratio of audio to text tokens.

How much does gpt-realtime-mini cost?

gpt-realtime-mini is a lower-cost realtime model priced per 1 million tokens. Audio tokens cost $10 input, $0.30 cached input, and $20 output. Text tokens cost $0.60 input, $0.06 cached input, and $2.40 output. It also supports image input at $0.80 per 1M tokens ($0.08 cached). It is significantly cheaper than gpt-realtime-1.5 while still supporting real-time audio conversations.

What is the difference between gpt-realtime-1.5 and gpt-realtime-mini?

gpt-realtime-mini is a smaller, more affordable variant of the realtime model family. Its audio rates are about 3.2x lower than gpt-realtime-1.5 ($10 vs. $32 input, $20 vs. $64 output per 1M tokens), and its text rates are about 6.7x lower ($0.60 vs. $4 input, $2.40 vs. $16 output). It is best suited for cost-sensitive voice applications where lower latency and reduced spend matter more than maximum model capability.

What are cached tokens and how do they reduce cost?

Cached tokens are input tokens from earlier turns in the same conversation that the model has already processed. Instead of being re-billed at the full input rate, they are charged at a much lower cached rate: $0.40 per 1M tokens for both audio and text on gpt-realtime-1.5, and $0.30 (audio) or $0.06 (text) per 1M tokens on gpt-realtime-mini. As a conversation accumulates turns, a growing share of each turn's input is cached, which substantially reduces cost compared with billing the full history at the uncached rate.

How are audio tokens counted in the Realtime API?

Audio tokens are generated at a fixed rate based on audio duration. Audio input is tokenized at 10 tokens per second (1 token per 100 ms), while audio output is tokenized at 20 tokens per second (1 token per 50 ms). For example, 30 seconds of user speech produces approximately 300 input audio tokens, and 30 seconds of assistant speech produces approximately 600 output audio tokens.
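As a minimal sketch of the fixed-rate arithmetic above, token counts follow directly from speaking time. The function and constant names are illustrative and are not part of any API.

```python
# Tokenization rates quoted in the answer above (names are illustrative).
AUDIO_INPUT_TOKENS_PER_SECOND = 10   # 1 token per 100 ms
AUDIO_OUTPUT_TOKENS_PER_SECOND = 20  # 1 token per 50 ms


def audio_tokens(user_seconds: float, assistant_seconds: float) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for the given speaking times."""
    input_tokens = round(user_seconds * AUDIO_INPUT_TOKENS_PER_SECOND)
    output_tokens = round(assistant_seconds * AUDIO_OUTPUT_TOKENS_PER_SECOND)
    return input_tokens, output_tokens


# 30 s of user speech and 30 s of assistant speech
print(audio_tokens(30, 30))  # (300, 600)
```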

Is gpt-realtime available on Azure?

Yes. Both gpt-realtime-1.5 and gpt-realtime-mini are available on Azure OpenAI as standard (pay-as-you-go) deployments. They follow the same pricing structure as OpenAI's direct offering. You can deploy them through the Azure AI Foundry portal and connect via WebRTC, WebSocket, or SIP.

Should I use WebRTC, WebSocket, or SIP with the Realtime API?

Use WebRTC for client-side apps (web and mobile) — it offers the lowest latency (~50-100 ms), built-in audio/video codec support, error correction for unreliable networks, and peer-to-peer communication. Use WebSocket for server-to-server or batch processing scenarios with moderate latency (~100-300 ms) and lower complexity. Use SIP (Session Initiation Protocol) for telephony integration, routing inbound VoIP calls directly into an AI-powered session.

When should I use gpt-realtime instead of a standard GPT model?

Use gpt-realtime when you need real-time, low-latency audio conversations — such as voice assistants, live customer support bots, real-time translation, or interactive voice agents. For text-only workloads, batch processing, or scenarios where latency is less critical, standard GPT models like GPT-4.1 are more cost-effective.

How does the input/output ratio affect realtime API costs?

The input/output ratio determines how much of each conversation turn is user input versus assistant output. A higher output ratio means the assistant speaks more per turn, generating more of the expensive output tokens: on gpt-realtime-1.5, output audio costs $64 per 1M tokens and output text costs $16 per 1M tokens. Adjusting this ratio in the calculator helps model different conversation styles, from brief Q&A to longer assistant monologues.
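To make the effect of the ratio concrete, here is a rough sketch that prices only the fresh audio of a single turn at the gpt-realtime-1.5 rates quoted above. It deliberately ignores re-sent history and text tokens, and the function name and the 36-second turn length are assumptions chosen for illustration.

```python
def fresh_audio_cost(turn_seconds: float, user_share: float,
                     input_price: float = 32.0, output_price: float = 64.0) -> float:
    """USD cost of one turn's new audio, ignoring history and text tokens.

    user_share is the fraction of the turn spent on user speech (0.0 to 1.0).
    Prices are USD per 1M tokens (gpt-realtime-1.5 audio rates by default).
    """
    user_seconds = turn_seconds * user_share
    assistant_seconds = turn_seconds * (1.0 - user_share)
    input_tokens = user_seconds * 10        # 10 audio input tokens per second
    output_tokens = assistant_seconds * 20  # 20 audio output tokens per second
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Compare a user-heavy, balanced, and assistant-heavy 36-second turn.
for share in (0.8, 0.5, 0.2):
    print(f"user share {share:.0%}: ${fresh_audio_cost(36, share):.4f}")
```

Because output audio is both tokenized twice as fast and priced twice as high per token, each second of assistant speech costs four times as much as a second of user speech in this simplified view.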

GPT Realtime Pricing Calculator — Estimate gpt-realtime-1.5 & gpt-realtime-mini Costs

Conversation Parameters

Enter your expected usage.

Model

Conversations per Month

Avg. Duration (minutes)

Avg. Turns per Conversation

System Prompt Length (tokens)

User Input / Assistant Output Ratio

50% User Input (18s) / 50% Assistant Output (18s)

Estimated Cost

Based on 1 conversation/month, 3 min each, 5 turns.

Category	Price / M Tokens	Cost / Month
Audio Input	$32.00	$0.07
Audio Input (cached)	$0.40	$0.0016
Audio Output	$64.00	$0.12
Text Input	$4.00	$0.0029
Text Input (cached)	$0.40	$0.0009
Text Output	$16.00	$0.0043

Per Conversation

$0.20

Estimated Monthly Total

$0.20

Tokens Consumed per Turn

Each turn sends the full conversation history as input. Messages that were already part of the input in an earlier turn are billed at the cached rate; the previous assistant reply is uncached the first time it is sent back. In this example, the instructions are 500 text tokens, each user message is 180 audio tokens (18s), and each assistant message is 360 audio tokens plus 54 text tokens (18s).

Turn	Input Tokens	Text Input	Audio Input	Cached Input	Uncached Input	Output
1	680	500	180	0	680	360 audio + 54 text
2	1,274	554	720	680	594	360 audio + 54 text
3	1,868	608	1,260	1,274	594	360 audio + 54 text
4	2,462	662	1,800	1,868	594	360 audio + 54 text
5	3,056	716	2,340	2,462	594	360 audio + 54 text
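The per-turn input totals above can be reproduced with a short loop. This is a sketch under the page's default settings (500 instruction tokens, 180 audio tokens per user message, 360 audio plus 54 text tokens per assistant message); the function name is illustrative.

```python
def input_tokens_per_turn(turns: int = 5, instructions: int = 500,
                          user_audio: int = 180, assistant_audio: int = 360,
                          assistant_text: int = 54) -> list[tuple[int, int, int, int]]:
    """Return (turn, text_input, audio_input, total_input) for each turn."""
    rows = []
    for turn in range(1, turns + 1):
        previous_replies = turn - 1  # assistant messages already in the history
        text = instructions + previous_replies * assistant_text
        audio = turn * user_audio + previous_replies * assistant_audio
        rows.append((turn, text, audio, text + audio))
    return rows


for turn, text, audio, total in input_tokens_per_turn():
    print(f"Turn {turn}: {total} input tokens ({text} text + {audio} audio)")
```

Input grows by 594 tokens per turn (one assistant reply plus one new user message), which is why the full history, not just the newest message, drives the bill in long conversations.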

How the Calculation Works

The Realtime API charges per token across audio and text modalities. Costs accumulate with each turn in a conversation because the entire conversation history is sent as input for every response. Prompt caching reduces cost for previously seen tokens.

Assumptions

Speaking time is split 50/50 between user and assistant per turn.
Audio input tokens: 10 tokens/second (1 token per 100 ms).
Audio output tokens: 20 tokens/second (1 token per 50 ms).
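Input that was already sent in an earlier turn is assumed to be fully cached; the previous assistant reply is billed as uncached input the first time it is sent back.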
Text output is estimated at ~3 tokens/second of assistant speaking time.
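The assumptions above are enough to sketch the whole estimate in code. The version below is an approximation of the calculator's logic, not its actual source: the class, function, and field names are hypothetical, and the caching rule (everything sent in the previous turn is cached, while the newest assistant reply and user message are uncached) is inferred from the per-turn breakdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per 1M tokens."""
    audio_in: float
    audio_cached: float
    audio_out: float
    text_in: float
    text_cached: float
    text_out: float


GPT_REALTIME_1_5 = Prices(32.00, 0.40, 64.00, 4.00, 0.40, 16.00)
GPT_REALTIME_MINI = Prices(10.00, 0.30, 20.00, 0.60, 0.06, 2.40)


def estimate_conversation(prices: Prices, minutes: float = 3.0, turns: int = 5,
                          system_prompt_tokens: int = 500,
                          user_share: float = 0.5) -> dict[str, float]:
    """Return the USD cost per billing category for one conversation."""
    turn_seconds = minutes * 60 / turns
    assistant_seconds = turn_seconds * (1 - user_share)
    user_audio = turn_seconds * user_share * 10  # 10 tokens/s in
    assistant_audio = assistant_seconds * 20     # 20 tokens/s out
    assistant_text = assistant_seconds * 3       # ~3 text tokens/s

    keys = ["audio_in", "audio_cached", "audio_out",
            "text_in", "text_cached", "text_out"]
    tokens = dict.fromkeys(keys, 0.0)
    for turn in range(1, turns + 1):
        if turn == 1:
            # Nothing is cached yet: instructions plus the first user message.
            tokens["text_in"] += system_prompt_tokens
            tokens["audio_in"] += user_audio
        else:
            # Cached: everything that was already input in the previous turn.
            tokens["text_cached"] += system_prompt_tokens + (turn - 2) * assistant_text
            tokens["audio_cached"] += (turn - 1) * user_audio + (turn - 2) * assistant_audio
            # Uncached: the previous assistant reply plus the new user message.
            tokens["text_in"] += assistant_text
            tokens["audio_in"] += assistant_audio + user_audio
        tokens["audio_out"] += assistant_audio
        tokens["text_out"] += assistant_text

    return {key: tokens[key] * getattr(prices, key) / 1_000_000 for key in keys}


costs = estimate_conversation(GPT_REALTIME_1_5)
print(f"Per conversation: ${sum(costs.values()):.2f}")  # Per conversation: $0.20
```

With the default inputs (3 minutes, 5 turns, 500-token system prompt, 50/50 split), this reproduces the per-category figures in the Estimated Cost table for gpt-realtime-1.5. Multiplying the per-conversation total by the number of conversations per month gives the monthly estimate.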

Pricing (Azure)

Prices per 1M tokens.

gpt-realtime-1.5

Modality	Input	Cached Input	Output
Audio	$32.00	$0.40	$64.00
Text	$4.00	$0.40	$16.00
Image	$5.00	$0.50	—

gpt-realtime-mini

Modality	Input	Cached Input	Output
Audio	$10.00	$0.30	$20.00
Text	$0.60	$0.06	$2.40
Image	$0.80	$0.08	—
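For a quick comparison of the two price tables, the sketch below computes how many times cheaper gpt-realtime-mini is for each audio and text rate. The dictionary layout and function name are illustrative; the prices are copied from the tables above.

```python
# USD per 1M tokens as (input, cached input, output), from the tables above.
PRICES = {
    "gpt-realtime-1.5": {"audio": (32.00, 0.40, 64.00), "text": (4.00, 0.40, 16.00)},
    "gpt-realtime-mini": {"audio": (10.00, 0.30, 20.00), "text": (0.60, 0.06, 2.40)},
}
COLUMNS = {"input": 0, "cached": 1, "output": 2}


def price_ratio(modality: str, column: str) -> float:
    """How many times more gpt-realtime-1.5 costs than gpt-realtime-mini."""
    index = COLUMNS[column]
    full = PRICES["gpt-realtime-1.5"][modality][index]
    mini = PRICES["gpt-realtime-mini"][modality][index]
    return full / mini


for modality in ("audio", "text"):
    for column in COLUMNS:
        print(f"{modality} {column}: {price_ratio(modality, column):.1f}x")
```

Uncached audio comes out at about 3.2x and text at about 6.7x, while cached audio differs by only about 1.3x ($0.40 vs. $0.30), so the saving from switching models is smallest on the cached portion of long conversations.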

Disclaimer

This calculator provides rough estimates based on simplified assumptions. Actual costs depend on conversation dynamics, voice activity detection behavior, caching efficiency, and token overhead.

What Is gpt-realtime?

gpt-realtime supports low-latency, “speech in, speech out” conversational interactions. Unlike traditional speech pipelines that chain speech-to-text → LLM → text-to-speech, gpt-realtime processes audio natively — producing faster, more natural voice interactions with a single API call.

You can connect to the Realtime API via WebRTC, WebSocket, or SIP to send audio input and receive audio responses in real time.

The latest version, gpt-realtime-1.5, is available on both OpenAI and Azure OpenAI. A smaller, more affordable variant, gpt-realtime-mini, is also available for cost-sensitive voice applications. Both models are well-suited for building voice assistants, real-time translation systems, interactive customer support agents, telephony bots, and any application where users expect a natural, spoken conversation with an AI.

Connecting to the Realtime API: WebRTC vs WebSocket vs SIP

The Realtime API supports three connection protocols. In most cases, WebRTC is the recommended choice for real-time audio streaming thanks to its lower latency, built-in media handling, error correction, and peer-to-peer communication.

Protocol	Best for	Latency	Complexity
WebRTC	Client-side apps (web, mobile)	Lowest (~50-100 ms)	Higher
WebSocket	Server-to-server, batch processing	Moderate (~100-300 ms)	Lower
SIP	Telephony integration	Varies	Highest

SIP (Session Initiation Protocol) lets you route inbound VoIP calls directly into an AI-powered session, making it ideal for telephony integration and contact center automation.

How Realtime API Pricing Works

Both gpt-realtime-1.5 and gpt-realtime-mini are billed per token. Audio and text each have separate input, cached input, and output rates; image input is priced separately with input and cached input rates only (see the pricing tables above):

gpt-realtime-1.5 Pricing

Audio tokens: $32 per 1M input tokens, $0.40 per 1M cached input tokens, $64 per 1M output tokens.
Text tokens: $4 per 1M input tokens, $0.40 per 1M cached input tokens, $16 per 1M output tokens.

gpt-realtime-mini Pricing

Audio tokens: $10 per 1M input tokens, $0.30 per 1M cached input tokens, $20 per 1M output tokens.
Text tokens: $0.60 per 1M input tokens, $0.06 per 1M cached input tokens, $2.40 per 1M output tokens.

Audio input is tokenized at 10 tokens/second; audio output at 20 tokens/second. A small number of text tokens accompanies each audio response (~3 tokens/second of assistant speech).

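Cached tokens: In multi-turn conversations, tokens from earlier turns are cached and re-billed at much lower rates. On gpt-realtime-1.5, cached audio input is 80x cheaper than uncached audio input ($0.40 vs. $32 per 1M tokens) and cached text input is 10x cheaper ($0.40 vs. $4). More turns mean a larger share of input is billed at the cached rate, which lowers the average cost per input token.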
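One way to see the effect of caching is to compute a blended input rate. The sketch below does this for the audio input of the calculator's default 5-turn example on gpt-realtime-1.5, where the per-turn breakdown adds up to 2,340 uncached and 3,960 cached audio input tokens. The function name is illustrative.

```python
def blended_rate(uncached_tokens: float, cached_tokens: float,
                 full_price: float, cached_price: float) -> float:
    """Average USD per 1M input tokens across cached and uncached input."""
    total_tokens = uncached_tokens + cached_tokens
    total_cost = uncached_tokens * full_price + cached_tokens * cached_price
    return total_cost / total_tokens


# Audio input totals for the default example (5 turns, 18 s per speaker per turn).
rate = blended_rate(uncached_tokens=2_340, cached_tokens=3_960,
                    full_price=32.00, cached_price=0.40)
print(f"Blended audio input rate: ${rate:.2f} per 1M tokens")
```

In this example, roughly 63% of the audio input is cached, so the blended rate lands near $12 per 1M tokens instead of the $32 uncached rate.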

The total cost of a conversation depends on its duration, number of turns, and the balance between user input and assistant output in each turn.

When to Use the Realtime API

The Realtime API is ideal when your application requires spoken, interactive exchanges with sub-second latency. Common use cases include:

Voice assistants: Build conversational agents that listen and respond in natural speech without noticeable delay.
Live translation & interpretation: Translate spoken language in near real time for meetings, calls, or customer service.
Interactive voice bots: Customer support bots, scheduling assistants, and IVR replacements that feel like talking to a human.
Telephony & contact centers: Route inbound VoIP calls via SIP directly into AI-powered sessions for automated phone support.
Accessibility tools: Real-time audio descriptions, read-aloud interfaces, and voice-driven navigation for users who prefer spoken interaction.

For text-only workloads, batch processing, or scenarios where request latency is less critical, standard GPT models like GPT-4.1 are more cost-effective.
