What is Amazon Bedrock?

Amazon Bedrock is a fully managed AI service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Bedrock simplifies AI deployment by providing scalable, pay-as-you-go access to compute and model customization options, letting users tailor models to specific business needs and use cases such as text generation, image processing, and custom model training. Beyond building and running models, Bedrock includes model evaluation, knowledge bases, responsible AI guardrails, agentic capabilities, and prompt management.

In this guide, we’ll take you through the various factors that impact Bedrock pricing, compare the best and latest models (think Meta Llama versus Anthropic cage match), and offer strategies for getting more out of every dollar you spend on Bedrock.

How does Amazon Bedrock pricing work?

Bedrock pricing can be fairly complex, but there are four key factors driving your costs: compute, model, storage, and data transfer.

Costs are based on the compute power required to run AI models, the storage needed for datasets and custom models, and the volume of data transfer during operations. Additionally, different foundation models have their own pricing structures, so selecting the appropriate model is essential for optimizing cost savings.


What is the Amazon Bedrock pricing model?

Currently, there are five Amazon Bedrock pricing models (and two additional paid services).

On-Demand Pricing:

This model is the most flexible, charging users only for the resources they use, with no long-term commitments. Charges are based on the number of input tokens processed and output tokens generated for text models, on each generated image for image models, and on input tokens for embeddings models. (A token, typically a few characters, is the basic unit of text a model processes when reading your input and generating a response; in English, one token averages roughly four to five characters.)

On-Demand pricing also supports cross-region model inference, allowing users to handle traffic spikes by utilizing AWS’s global infrastructure without additional cross-region charges. The price is calculated based on the region where you made the request (the source region).

This model is ideal for businesses with variable AI workload demands that cannot predict usage volume in advance.
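To see how On-Demand billing adds up in practice, here’s a minimal sketch using boto3’s Converse API, which reports exact input and output token counts with every response. The model ID and per-1,000-token rates below are placeholders; look up the current rates for your model and region on the Bedrock pricing page.

```python
import boto3

# Runtime client for invoking models; the request's region sets the On-Demand rate
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder On-Demand rates in USD per 1,000 tokens
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our Q3 report in three bullets."}]}],
    inferenceConfig={"maxTokens": 256},  # capping output tokens also caps output cost
)

# The response's usage block reports exact token counts, so cost is computable per call
usage = response["usage"]
cost = (usage["inputTokens"] / 1000) * PRICE_PER_1K_INPUT + (
    usage["outputTokens"] / 1000
) * PRICE_PER_1K_OUTPUT
print(f"{usage['inputTokens']} in / {usage['outputTokens']} out -> ${cost:.6f}")
```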

Provisioned Throughput:

Designed for large, consistent inference workloads requiring guaranteed throughput, this model allows users to purchase capacity ahead of time, measured in model units. Charges are incurred hourly, with options for 1-month or 6-month commitments. This model suits use cases with predictable, high-volume workloads and is the only option for accessing custom models.

Currently, Amazon Titan, Anthropic, Cohere, Meta Llama, and Stability AI models offer Provisioned Throughput pricing, ranging from $21.18 per hour per model unit with a 1-month commitment (Meta Llama) to $49.86 per hour per model unit with a 1-month commitment (Stability AI).
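As a quick sizing exercise using the Meta Llama figure above (your model’s actual rate will differ), here’s what a 1-month commitment looks like when billed around the clock:

```python
# Provisioned Throughput bills every hour of the commitment, used or not
HOURLY_RATE = 21.18    # USD per hour per model unit (1-month commitment, from above)
MODEL_UNITS = 2
HOURS_PER_MONTH = 730  # average hours in a month

monthly_cost = HOURLY_RATE * MODEL_UNITS * HOURS_PER_MONTH
print(f"~${monthly_cost:,.2f}/month for {MODEL_UNITS} model units")  # ~$30,922.80/month
```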

Model Customization:

Amazon Bedrock enables the customization of foundation models to suit specific business needs and contexts. You can fine-tune models using labeled data or enhance them through continued pretraining on unlabeled data. Customization costs have two parts: training, billed by the number of tokens processed (the size of your training data multiplied by the number of training epochs, where an epoch is one complete pass through the training dataset), and model storage, billed per month per model. For text-generation models, serving the customized model requires a Provisioned Throughput plan, where you’re initially provided one model unit with no commitment term.
[Image: AWS services recommended for generative AI model customization (image source: AWS)]
Charges apply based on hours of use. If you need more throughput than the first model unit provides, 1-month or 6-month commitments are available, allowing for more intensive usage at predictable costs.
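Here’s a rough back-of-the-envelope sketch of the fine-tuning math described above; the per-token training rate and monthly storage rate are placeholders, since actual customization prices vary by model:

```python
# Placeholder rates -- check the Bedrock pricing page for your model's actual rates
TRAINING_TOKENS = 5_000_000        # tokens in the labeled training set
EPOCHS = 3                         # complete passes through the training data
PRICE_PER_1K_TRAINING_TOKENS = 0.008
STORAGE_PER_MODEL_PER_MONTH = 1.95

# Tokens processed = training set size x number of epochs
training_cost = (TRAINING_TOKENS * EPOCHS / 1000) * PRICE_PER_1K_TRAINING_TOKENS
print(f"Training: ${training_cost:,.2f} one-time, plus ${STORAGE_PER_MODEL_PER_MONTH}/month storage")
# Serving the tuned model is billed separately via Provisioned Throughput hours
```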

Batch Mode:

This mode allows users to process large batches of prompts in a single go, significantly reducing costs. Responses are processed and stored in an Amazon S3 bucket, accessible at any time. Batch mode is priced at 50% less than On-Demand rates for selected foundation models, making it a cost-effective option for large-scale inference tasks.
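Batch jobs are submitted through the Bedrock control plane rather than the runtime API. Here’s a minimal sketch using boto3’s create_model_invocation_job; the bucket names, IAM role ARN, and model ID are placeholders, and the input is a JSONL file with one model request per line:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # control-plane client

response = bedrock.create_model_invocation_job(
    jobName="nightly-summaries",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",  # role with S3 access
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-output/"}},
)

# Poll get_model_invocation_job with this ARN to track job status
print(response["jobArn"])
```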

Model Evaluation:

Amazon Bedrock offers flexible model evaluation options with no minimum usage commitments. Model evaluation is crucial for organizations that want to rigorously test and optimize their AI models before full deployment.

For automatic evaluations, users only pay for the model inference used, and algorithmic scores are provided at no additional cost. In contrast, human-based evaluations involve charges for model inference plus a fee of $0.21 per completed human task, where a task is defined as a human worker evaluating a single prompt and its responses.

These charges apply uniformly across all AWS Regions and are billed under the Amazon SageMaker line item, irrespective of the number of models or evaluation metrics used. There’s no separate fee for the workforce, as users supply their own evaluators. For more tailored needs, AWS provides customized pricing in private engagements through their expert evaluations team.
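The human-task math is straightforward. A quick sketch, assuming a hypothetical evaluation of 500 prompts rated by two workers each:

```python
# $0.21 per completed human task (from the rate above); inference is billed separately
PROMPTS = 500
WORKERS_PER_PROMPT = 2     # e.g., two evaluators rate each prompt's responses
HUMAN_TASK_RATE = 0.21     # USD per completed task

task_cost = PROMPTS * WORKERS_PER_PROMPT * HUMAN_TASK_RATE
print(f"Human task charges: ${task_cost:,.2f}")  # 500 x 2 x $0.21 = $210.00
```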

Custom Model Import

Custom Model Import in Amazon Bedrock lets you bring your existing model customizations into Bedrock’s fully managed environment, just like its hosted foundation models. You can import custom weights for supported architectures at no cost and serve your custom model using On-Demand mode without needing any control plane actions.

You are charged only for model inference, based on the number of model copies needed to handle your inference volume and the duration each copy is active, billed in 5-minute increments. A model copy is a single instance of your imported model ready to serve inference requests. The price per model copy per minute depends on factors such as the architecture, context length, AWS Region, and compute unit version (hardware generation), and is tiered by model copy size.
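The 5-minute increment matters for short-lived copies. A minimal sketch of the rounding, with a placeholder per-minute rate:

```python
import math

PRICE_PER_COPY_MINUTE = 0.10  # placeholder; actual rates are tiered by copy size
active_seconds = 7 * 60 + 12  # a model copy that was active for 7m 12s

# Billing rounds up to the next 5-minute (300-second) increment
billed_minutes = math.ceil(active_seconds / 300) * 5
print(f"Billed {billed_minutes} min -> ${billed_minutes * PRICE_PER_COPY_MINUTE:.2f}")
# 7m 12s rounds up to 10 billed minutes
```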

Amazon Bedrock Guardrails

Amazon Bedrock Guardrails provides additional customizable safeguards on top of the native protections of FMs, blocking harmful content and filtering out hallucinated responses in RAG and summarization workloads. It also lets customers tailor safety, privacy, and truthfulness protections to their individual solutions.
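Once a guardrail exists (created in the console or via the CreateGuardrail API), it can be attached to inference calls. A minimal sketch using the Converse API, with a placeholder guardrail identifier and version:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Tell me about our refund policy."}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-EXAMPLE12345",  # placeholder guardrail ID
        "guardrailVersion": "1",
    },
)

# stopReason is "guardrail_intervened" when the guardrail blocks or rewrites output
print(response["stopReason"])
```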

How to Reduce AWS Bedrock Costs: Best Practices

Some strategies you can use to reduce your AWS Bedrock costs include:

Optimize Prompts to Reduce Token Usage

Since Bedrock’s pricing for text models is based on the number of input and output tokens, minimizing token usage can lead to significant cost savings. Projecting and measuring tokens can be a challenge due to complexities such as unpredictable user behavior (length and complexity of inputs), unpredictable model outputs, hidden tokens, and contextual expansion. However, you can use libraries like Hugging Face Tokenizers or OpenAI’s tiktoken to estimate the number of tokens in your prompts and expected responses before sending them to the model.
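For example, a quick estimate with tiktoken. Note that tiktoken ships OpenAI’s tokenizers, so counts for Bedrock-hosted models are approximations (each model family tokenizes differently), but they’re close enough for budgeting:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is a general-purpose tokenizer; treat counts as rough estimates
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached incident report in two sentences."
print(f"~{len(enc.encode(prompt))} input tokens")
```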

Some tips include:

  • Craft concise and clear prompts by eliminating unnecessary words or phrases.
  • Use precise language to reduce ambiguity, which can also improve model responses.
  • Set maximum token limits for the model’s output to control the length of responses.

Utilize Batch Mode for Large-Scale Inference

Batch Mode processing is priced at 50% less than On-Demand rates for selected models, making it ideal for large volumes of data that don’t require immediate responses.

  • Aggregate inference tasks and submit them as batch jobs.
  • Schedule batch processing during off-peak hours to maximize resource availability.
  • Store and manage batch responses in Amazon S3 for easy access and analysis.

Implement Efficient Data Preprocessing

Reducing the amount of data processed not only enhances model performance but also lowers operational costs. This is crucial when building an end-to-end GenAI pipeline.

  • Cleanse data to remove irrelevant or duplicate information before processing.
  • Use data compression techniques where applicable to minimize data size.
  • Normalize data formats to ensure consistency and reduce processing overhead.

Leverage Provisioned Throughput for Predictable Workloads

For applications with consistent usage patterns, Provisioned Throughput is ideal.

  • Analyze your application’s usage metrics to determine predictability.
  • Choose a 1-month or 6-month commitment based on your forecasted needs.
  • Monitor throughput utilization to adjust provisioned capacity as necessary.

Implement Best Practices for Storage & Data Transfer

Optimizing your data storage and data transfer strategies can lead to significant cost reductions in your AWS Bedrock usage.

  • Use appropriate storage classes like Amazon S3 Standard-Infrequent Access or S3 Glacier for infrequently accessed data, and set up S3 Lifecycle policies to automatically transition data to lower-cost storage tiers (see the sketch after this list).
  • Keep transfers within the same region whenever possible and compress data before transfer.
  • Use VPC Endpoints or PrivateLink when transferring data between services within AWS.
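As referenced above, here’s a minimal lifecycle-policy sketch with boto3; the bucket name, prefix, and transition windows are placeholders to adapt to your own retention needs:

```python
import boto3

s3 = boto3.client("s3")

# Move batch outputs to Standard-IA after 30 days and Glacier after 90
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bedrock-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-batch-output",
                "Status": "Enabled",
                "Filter": {"Prefix": "batch-output/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```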

Monitor Resource Usage and Set Up Cost Alerts

Continuous monitoring helps identify inefficiencies and prevents unexpected expenses. You can use a solution like nOps to track your resource utilization, analyze spending patterns, and identify cost drivers.
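If you want a native AWS backstop alongside a tool like nOps, an AWS Budgets alert is a simple starting point. A minimal sketch with a placeholder account ID, limit, and email (this alerts on overall spend; cost filters can narrow it to Bedrock):

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account ID
    Budget={
        "BudgetName": "bedrock-monthly",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            # Email when actual spend crosses 80% of the budget
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "team@example.com"}],
        }
    ],
)
```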

nOps was recently ranked #1 with five stars in G2’s cloud cost management category, and we optimize $1.5+ billion in cloud spend for our customers.

To find out how to get complete cost visibility, allocate 100% of AWS costs, and start eliminating cloud waste, book a demo with one of our FinOps experts.

Select the Right Foundation Model

Different foundation models vary in capabilities and costs; choosing the most appropriate model is key for cost optimization.

  • Evaluate the specific requirements of your application (e.g., complexity, response time).
  • Test multiple models to compare performance against cost.
  • Opt for smaller or less complex models if they meet your application’s needs adequately.

Understanding the pricing differences between model providers on Bedrock

Let’s take a look at how pricing varies across the model providers currently hosted on Amazon Bedrock (or skip to the end of this section for a table summarizing the takeaways). Please note that pricing is updated frequently and varies by region and model type.

AI21 Labs

AI21 Labs is an Israeli company specializing in Natural Language Processing (NLP). It’s known for its text generation models tailored for applications requiring complex content generation, such as automated writing assistants and content summarization tools.

AI21 Labs offers On-Demand pricing for their models, charging per 1,000 input and output tokens.

[Table: AI21 Labs On-Demand pricing per 1,000 input and output tokens for Jamba 1.5 Large, Jamba 1.5 Mini, Jurassic-2 Mid, Jurassic-2 Ultra, and Jamba-Instruct, in US dollars]

Amazon Titan

Amazon Titan provides high-performance AI models that serve a broad spectrum of complex machine learning applications, including image recognition, natural language processing, and predictive analytics. Designed for high accuracy and processing speed, Titan models are suitable for enterprises that demand dependable and scalable AI solutions for essential business functions.

Amazon Titan offers both On-Demand and Provisioned Throughput pricing:

Text Models:

  • Amazon Titan Text Express: $0.0008 per 1,000 input tokens and $0.0016 per 1,000 output tokens.
  • Amazon Titan Text Embeddings V2: $0.00011 per 1,000 input tokens.

Multi-Modal Models:

  • Image Generation Models: Prices vary based on image size and quality.

Amazon Titan also provides options for model customization and Provisioned Throughput for applications requiring guaranteed performance; you can consult the complete list here.

Anthropic

Anthropic is an AI startup that specializes in developing AI models that prioritize safety and interpretability, focusing on ethical AI development. The company designs its models to minimize unpredictable behaviors and enhance the clarity of decision-making processes. Anthropic has developed the Claude family of large language models (LLMs) as a competitor to OpenAI’s ChatGPT and Google’s Gemini.

Anthropic offers On-Demand pricing across various AWS regions. For example:

[Table: Anthropic On-Demand pricing per 1,000 input and output tokens for Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku, Claude 2.1, Claude 2.0, and Claude Instant, in US dollars, for the US East (N. Virginia) and US West (Oregon) regions; Claude 3 Opus is currently available only in US West (Oregon)]

Cohere

Cohere, a Canadian multinational company, specializes in LLMs designed to comprehend and generate human-like text, supporting diverse applications from chatbots to intricate document analysis.

Cohere provides On-Demand pricing tailored to different model capabilities:

  • Command Model: $0.0015 per 1,000 input tokens and $0.0020 per 1,000 output tokens.
  • Command-Light Model: $0.0003 per 1,000 input tokens and $0.0006 per 1,000 output tokens.
  • Embedding Models:
    • Embed – English: $0.0001 per 1,000 input tokens.
    • Embed – Multilingual: $0.0001 per 1,000 input tokens.

For businesses requiring custom solutions, Cohere offers model customization and Provisioned Throughput pricing options.


Meta Llama

Meta’s Llama models are customized for integration with social media platforms and digital marketing tools. These models excel at generating engaging content, analyzing user sentiment, and automating interactions on digital platforms.

Meta Llama offers On-Demand pricing based on model size:

  • Llama 3.2 Instruct (1B): $0.0001 per 1,000 input and output tokens.
  • Llama 3.2 Instruct (11B): $0.00035 per 1,000 input and output tokens.
  • Llama 3.2 Instruct (90B): $0.002 per 1,000 input and output tokens.

Provisioned Throughput pricing is available for businesses needing guaranteed performance and scalability.

Mistral AI

Mistral AI is a French company specializing in AI products. The company focuses on producing open source large language models, emphasizing the foundational importance of free and open-source software, and positioning itself as an alternative to proprietary models.

It offers On-Demand pricing:

[Table: Mistral AI On-Demand pricing per 1,000 input and output tokens for Mistral 7B, Mixtral 8x7B, Mistral Small (24.02), and Mistral Large (24.02), in US dollars]

Stability AI

Stability AI is an artificial intelligence company, best known for its text-to-image model Stable Diffusion. It’s notable for its development of creative AI models used in the arts and design fields for the creation of unique visual content, music, and interactive media experiences.

Stability AI charges on a per-image basis:

[Table: Stability AI On-Demand per-image pricing]
Provisioned Throughput pricing is also available for higher volume requirements. Currently, model customization (fine-tuning) is not supported for Stability AI models on Amazon Bedrock.

About nOps

The nOps all-in-one suite makes it easy to get better performance at lower costs on AWS.

Key features include:

Automated Cost Allocation: automatically allocate all of your AWS costs, regardless of where you are in your tagging journey.

Autoscaling Optimization: nOps continually reconfigures your preferred autoscaler (Cluster Autoscaler or Karpenter) to keep your workloads optimized at all times for minimal engineering effort.

Spot Savings: Automatically run your workloads on the optimal blend of On-Demand, Savings Plans and Spot instances, with automated instance selection & real-time instance reconsideration.

Join our customers using nOps to understand your cloud costs and leverage automation with complete confidence by booking a demo with one of our AWS experts.