Cost Governance

| Category | Subcategory | Recommendation | Service | Priority | Reference |
|---|---|---|---|---|---|
| Cost Governance | Cost Management | Verify PTU cost savings vs. pay-as-you-go pricing for Azure OpenAI and OpenAI models. | Azure OpenAI | 🔴 High | link |
| Cost Governance | Cost Management | Ensure the most cost-effective model that fits the task is in use; reserve more expensive models for use cases that demand them. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Cost Management | Allocate provisioned quotas for each model based on expected workloads to prevent unnecessary costs. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Cost Management | Use the right deployment type; global deployments offer lower cost-per-token pricing on certain GPT models. | Azure OpenAI | 🔴 High | link |
| Cost Governance | Cost Management | Choose the right hosting infrastructure for your solution's needs, e.g. managed endpoints, AKS, or Azure App Service. | NA | 🟡 Medium | |
| Cost Governance | Cost Management | Define and enforce a policy to automatically shut down Azure AI Foundry and Azure Machine Learning compute instances. | Azure AI Foundry | 🔵 Low | link |
| Cost Governance | Cost Management | Configure 'Actual' and 'Forecasted' budget alerts. | Azure Cost Management | 🟡 Medium | link |
| Cost Governance | Token Optimization | Use prompt compression tools such as LLMLingua or gptrim. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Token Optimization | Use tiktoken to understand token sizes when optimizing token usage in conversational scenarios. | Azure OpenAI | 🔴 High | link |
| Cost Governance | Cost Familiarization | Understand the cost difference between base models and fine-tuned models, and the token step sizes for each. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Batch Processing | Batch requests where possible to minimize per-call overhead, which can reduce overall costs. Ensure you optimize the batch size. | Azure OpenAI | 🔴 High | link |
| Cost Governance | Cost Monitoring | Set up a cost tracking system that monitors model usage, and use that information to inform model choices and prompt sizes. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Token Limit | Set a maximum limit on the number of tokens per model response (max_tokens) and on the number of completions to generate. Optimize the limit so it is large enough for a valid response. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Costing Model | Evaluate billing models: PAYG vs. PTU. Start with PAYG and consider PTU once production usage is predictable, since it offers dedicated memory and compute, reserved capacity, and consistent maximum latency for the specified model version. | Azure OpenAI | 🔴 High | link |
| Cost Governance | Quota Management | Adopt quota management practices. Use dynamic quota when your application can opportunistically use extra capacity, or when the application itself drives the rate at which the Azure OpenAI API is called. | Azure OpenAI | 🔴 High | link |
| Cost Governance | Cost Estimation | Develop your cost model, taking prompt sizes into account. Understanding prompt input and response sizes, and how text translates into tokens, helps you build a viable cost model. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Model Selection | Consider model pricing and capabilities when choosing models. Start with less costly models for simpler tasks such as text generation or completion; for complex tasks such as language translation or content understanding, consider more advanced models. Optimize costs while still achieving the required application performance. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Usage Optimization | Use Azure OpenAI price breakpoints, such as fine-tuning, and model breakpoints, such as image generation, to your advantage. Fine-tuning is charged per hour, so use as much of each hour as available to improve results without slipping into the next billing period. The cost of generating 100 images is the same as the cost of generating 1 image. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Usage Optimization | Remove unused fine-tuned models when they are no longer consumed to avoid an ongoing hosting fee. | Azure OpenAI | 🟡 Medium | link |
| Cost Governance | Token Optimization | Create concise prompts that still provide enough context for the model to generate a useful response, and optimize the response-length limit. | Azure OpenAI | 🟡 Medium | link |
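The token-optimization and token-limit rows above can be combined into a pre-flight check before a request is sent. This is a minimal sketch: the 4-characters-per-token heuristic is a rough approximation for English text (use the tiktoken library with your model's encoding for accurate counts), and the 8,192-token context window is an assumed value, not a property of any specific model.

```python
# Rough pre-flight token check before sending a prompt.
# NOTE: ~4 characters per token is a crude heuristic for English text;
# use tiktoken with the model's encoding for accurate counts.

def approx_tokens(text: str) -> int:
    """Crude estimate of how many tokens `text` will consume."""
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_response_tokens: int,
                 context_window: int = 8192) -> bool:
    """Check that the prompt plus the reserved response budget (the value
    you would pass as max_tokens) fits inside the assumed context window."""
    return approx_tokens(prompt) + max_response_tokens <= context_window
```

Running this check before each call makes it obvious when a prompt needs trimming or compression rather than discovering it through a truncated or rejected response.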
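The batch-processing recommendation can be illustrated with a small client-side helper that groups inputs into fixed-size chunks, so each API call carries several items instead of one. The default batch size of 16 is an arbitrary illustrative choice; tune it against your model's input limits and your latency targets.

```python
from typing import Iterator, List

def batched(items: List[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield successive chunks of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded chunk would then be submitted as a single request, amortizing the per-call overhead across the items in the chunk.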
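The cost-estimation guidance above amounts to simple arithmetic: input and output token counts multiplied by per-1K-token rates. The sketch below uses placeholder prices, not current Azure OpenAI list prices; substitute the published rates for your model and deployment type.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the pay-as-you-go cost of one completion call."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Hypothetical rates: $0.0025 per 1K input tokens, $0.01 per 1K output tokens.
cost = request_cost(1500, 500, input_price_per_1k=0.0025, output_price_per_1k=0.01)
```

Multiplying this per-request figure by the expected request volume gives a monthly estimate that can be compared against a PTU reservation when deciding between the two billing models.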