What are Azure OpenAI Provisioned Throughput Units (PTUs)?

Provisioned Throughput Units (PTUs) are generic units of model processing capacity on Azure OpenAI. They let you reserve dedicated throughput for your deployments, ensuring consistent performance regardless of overall service load. The PTU itself is model-independent, although the throughput each PTU delivers depends on the model, and PTUs can be allocated across Global, Data Zone, or Regional deployment types.

How do I calculate the number of PTUs I need?

To calculate PTU requirements, you need three inputs: requests per minute (RPM), average input tokens per request, and average output tokens per request. Input TPM is RPM × average input tokens, and Output TPM is RPM × average output tokens. The effective tokens per minute is then Input TPM + (Output TPM × Output Ratio), where the Output Ratio is the number of input tokens that one output token counts as. Dividing this value by the model's Input TPM per PTU gives the raw PTU count, which is rounded up to the deployment increment and must meet the minimum deployment size.
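
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function name and parameters are hypothetical, the per-model values (Input TPM per PTU, minimum size, increment) must come from the current Azure documentation, and the sketch assumes valid deployment sizes are multiples of the increment.

```python
import math


def estimate_ptus(rpm, avg_input_tokens, avg_output_tokens,
                  output_ratio, input_tpm_per_ptu, min_ptus, increment):
    """Estimate the PTU count for one provisioned deployment (illustrative)."""
    input_tpm = rpm * avg_input_tokens
    output_tpm = rpm * avg_output_tokens
    # Output tokens are weighted by the model's output-to-input ratio.
    effective_tpm = input_tpm + output_tpm * output_ratio
    raw_ptus = effective_tpm / input_tpm_per_ptu
    # Round up to the next multiple of the deployment increment
    # (assumption: valid sizes are multiples of the increment).
    rounded = math.ceil(raw_ptus / increment) * increment
    # The result can never be smaller than the minimum deployment size.
    return max(rounded, min_ptus)
```

With made-up values of 100 RPM, 1,000 input tokens, 200 output tokens, a ratio of 4, 2,500 Input TPM per PTU, a minimum of 15, and an increment of 5, the effective TPM is 180,000, the raw count is 72, and `estimate_ptus(100, 1000, 200, 4, 2500, 15, 5)` returns 75.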

What is the difference between Global, Data Zone, and Regional provisioned deployments?

Global Provisioned (GlobalProvisionedManaged) routes traffic across all Azure regions for highest availability and lowest minimum PTU. Data Zone Provisioned (DataZoneProvisionedManaged) keeps traffic within a geographic data zone such as the EU or US. Regional Provisioned (ProvisionedManaged) keeps all traffic in a single Azure region for data residency. Each type has different minimum PTU requirements and scale increments.

What is the output-to-input token ratio for PTU calculations?

Output tokens require more processing than input tokens. For GPT-5 models, one output token counts as 8 input tokens. For GPT-4.1, GPT-4o, o3, o4-mini, and most other recent models, one output token counts as 4 input tokens. Older models like GPT-4 may use a different ratio.
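
To see how much the ratio matters, the sketch below converts one request's tokens into input-equivalent tokens. The ratios are the ones stated above; the dictionary keys and the function name are illustrative labels, not official identifiers, and families not listed here raise an error rather than guessing a ratio.

```python
# Output-to-input ratios as stated in the text above.
# Keys are illustrative family labels, not official model identifiers.
OUTPUT_RATIO = {
    "gpt-5": 8,
    "gpt-4.1": 4,
    "gpt-4o": 4,
    "o3": 4,
    "o4-mini": 4,
}


def input_equivalent_tokens(model_family, input_tokens, output_tokens):
    """Return the request's size expressed in input-equivalent tokens."""
    ratio = OUTPUT_RATIO[model_family]  # KeyError for families not listed
    return input_tokens + output_tokens * ratio
```

A request with 500 input tokens and 300 output tokens counts as 2,900 input-equivalent tokens for the GPT-5 family (500 + 300 × 8) but only 1,700 for GPT-4.1 (500 + 300 × 4), so output-heavy workloads need noticeably more PTUs on the higher ratio.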

When should I use provisioned throughput instead of pay-as-you-go?

Consider provisioned throughput deployments when you have well-defined, predictable throughput and latency requirements — typically for production applications with known traffic patterns. Provisioned throughput is also recommended for real-time or latency-sensitive applications where consistent model processing time is important.

Does PTU quota guarantee capacity availability?

No. Quota limits the maximum number of PTUs that can be deployed in a subscription and region, but it does not guarantee capacity. Capacity is allocated at deployment time and is held for as long as the deployment exists. If service capacity is not available at the time of deployment, the deployment will fail. You can use the capacity API or the Microsoft Foundry portal to check real-time capacity availability.

What happens when my provisioned deployment exceeds capacity?

When utilization reaches 100%, the API returns a 429 HTTP status code with a retry-after-ms header indicating when to retry. This fast-fail response lets you redirect requests to another deployment, fall back to a standard (pay-as-you-go) deployment, or implement client-side retry logic. The 429 response continues until utilization drops below 100%.
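
One way to combine client-side retry with a fallback deployment is sketched below. It is deliberately independent of any HTTP library: `send_primary` and `send_fallback` are hypothetical callables that return a `(status_code, headers, body)` tuple, with `headers` assumed to be a dict using lowercase header names. The retry limits are arbitrary example values.

```python
import time


def call_with_fallback(send_primary, send_fallback,
                       max_retries=2, max_wait_ms=2000, sleep=time.sleep):
    """Try the provisioned deployment first; fall back when it keeps returning 429.

    send_primary / send_fallback are hypothetical callables returning
    (status_code, headers, body); headers uses lowercase keys (assumption).
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send_primary()
        if status != 429:
            return status, body
        # The service tells us how long to wait, in milliseconds.
        wait_ms = int(headers.get("retry-after-ms", 0))
        # Give up on the primary if retries are exhausted or the wait is too long.
        if attempt == max_retries or wait_ms > max_wait_ms:
            break
        sleep(wait_ms / 1000)
    # Redirect to the fallback, for example a standard (pay-as-you-go) deployment.
    status, _, body = send_fallback()
    return status, body
```

Capping the wait with `max_wait_ms` keeps latency-sensitive callers from sleeping through a long back-off when a fallback deployment could answer immediately.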

Which models support provisioned throughput on Azure?

Provisioned throughput is available for Azure OpenAI models including GPT-5.4, GPT-5.2, GPT-5.1, GPT-5, GPT-5 mini, GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o1, o3, o3-mini, and o4-mini. It also supports Azure DeepSeek models (DeepSeek-R1, DeepSeek-V3-0324, DeepSeek-R1-0528), Meta Llama (Llama-3.3-70B-Instruct), and Fireworks models (FW-GPT-OSS-120B, FW-Kimi-K2.5, FW-DeepSeek-V3.2, FW-MiniMax-M2.5) via Global Provisioned deployments.

PTU Calculator - Azure OpenAI Provisioned Throughput Estimator

What Are Provisioned Throughput Units (PTUs)?

Provisioned Throughput Units (PTUs) are generic units of model processing capacity that you purchase to power provisioned deployments on Microsoft Foundry. Unlike pay-as-you-go (standard) deployments where you pay per token, PTU deployments give you a reserved block of compute capacity that is allocated exclusively to your workloads — whether you use it or not.

PTU quota is managed per subscription and per region. Each quota defines the maximum number of PTUs that can be assigned to deployments in that subscription and region. Importantly, quota does not guarantee capacity — capacity is allocated at deployment time and held as long as the deployment exists. If capacity is unavailable when you create a deployment, the deployment will fail.

PTU reservations can be shared across a growing portfolio of models sold directly by Azure, including Azure OpenAI models (GPT-5.4, GPT-5.2, GPT-5.1, GPT-5, GPT-4.1, o3, o4-mini, and more), Azure DeepSeek models (DeepSeek-R1, DeepSeek-V3-0324, DeepSeek-R1-0528), Meta Llama (Llama-3.3-70B-Instruct), and Fireworks models (FW-GPT-OSS-120B, FW-Kimi-K2.5, FW-DeepSeek-V3.2, FW-MiniMax-M2.5). For example, if you have a 500 PTU reservation and use 300 PTUs for Azure OpenAI models, the remaining 200 PTUs can be used for DeepSeek-R1 and automatically receive the same reservation discount.
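
The reservation example can be checked with a small helper. The function name and return shape are hypothetical, and the sketch only counts PTUs: how PTUs deployed beyond the reservation are billed is not covered here.

```python
def split_reservation(reserved_ptus, deployments):
    """Return (covered, uncovered, unused) PTUs for a shared reservation.

    deployments maps a deployment label to the PTUs it uses (illustrative).
    """
    deployed = sum(deployments.values())
    covered = min(deployed, reserved_ptus)   # PTUs that fall under the reservation
    uncovered = deployed - covered           # PTUs deployed beyond the reservation
    unused = reserved_ptus - covered         # reserved PTUs with no deployment
    return covered, uncovered, unused
```

For the scenario above, `split_reservation(500, {"azure-openai": 300, "DeepSeek-R1": 200})` returns `(500, 0, 0)`: every deployed PTU is covered and nothing in the reservation is left idle.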

When to Use Provisioned Throughput

Choose provisioned throughput deployments when your application has well-defined, predictable throughput requirements — typically production workloads with known traffic patterns. Key scenarios include:

Latency-sensitive applications: PTU deployments deliver consistent model processing times because capacity is pre-allocated, unlike standard deployments which may experience variable latency under load.
High-throughput production workloads: If you process a large, steady volume of requests, PTUs often provide cost savings compared to per-token pricing.
Predictable capacity needs: When you can estimate your RPM, input tokens, and output tokens with reasonable accuracy, you can size the deployment with this calculator.

For exploratory workloads, variable traffic, or low-volume usage, standard (pay-as-you-go) deployments are usually more cost-effective.

Deployment Types Explained

When creating a provisioned deployment in Microsoft Foundry, you choose from three deployment types:

Global Provisioned Throughput (GlobalProvisionedManaged) — Routes traffic across all Azure regions for the highest availability and typically the lowest minimum PTU requirement. Best for workloads without strict data residency constraints.
Data Zone Provisioned Throughput (DataZoneProvisionedManaged) — Keeps all data processing within a geographic data zone (e.g., EU or US). Balances availability with data residency compliance.
Regional Provisioned Throughput (ProvisionedManaged) — Restricts all traffic to a single Azure region. Required when regulatory or compliance needs demand that data stays in one specific region. Typically has the highest minimum PTU deployment size.

New models are typically onboarded with Global Provisioned first, with Data Zone and Regional options following later. PTU quota and any reservations must match the region and deployment type (Global, Data Zone, or Regional) you intend to use.
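
The choice between the three types can be summarized as a lookup keyed on the data residency requirement. The deployment type names are the ones listed above; the residency labels and the function name are made up for this sketch, and a real decision would also weigh model availability, minimum PTU sizes, and quota for the chosen type.

```python
# Maps a residency requirement (illustrative labels) to the deployment type
# named in the text above.
DEPLOYMENT_TYPES = {
    "none": "GlobalProvisionedManaged",         # no strict residency constraint
    "data_zone": "DataZoneProvisionedManaged",  # stay within a data zone, e.g. EU or US
    "region": "ProvisionedManaged",             # stay within a single Azure region
}


def choose_deployment_type(residency):
    """Pick a provisioned deployment type from a residency requirement."""
    try:
        return DEPLOYMENT_TYPES[residency]
    except KeyError:
        raise ValueError(f"unknown residency requirement: {residency!r}") from None
```

For instance, `choose_deployment_type("data_zone")` returns `"DataZoneProvisionedManaged"`, while an unrecognized label raises `ValueError` instead of silently defaulting to Global.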

Frequently Asked Questions