Slash Your LLM API Costs by 90% with this Gemini Trick: Mastering Context Caching
Uncover a powerful way to slash your LLM API costs by up to 90% with Google's context caching. Learn how to implement this technique and leverage it for in-context learning, boosting efficiency and savings.
May 8, 2025

Discover a powerful technique that can slash your LLM API costs by up to 90%: context caching. This post explores how to use it to optimize your LLM usage, cut overhead, and, for smaller document sets, replace more expensive setups like retrieval-augmented generation. Learn how to implement context caching with Google's Gemini API and unlock significant cost savings for your projects.
Slashing LLM API Cost by Up to 90% with Context Caching
How Context Caching Works
Cost Savings with Context Caching
Implementing Context Caching with Google
Using Context Caching for In-Context Learning
Conclusion
Slashing LLM API Cost by Up to 90% with Context Caching
Context caching is a powerful technique that can significantly reduce the cost of using large language models (LLMs) through APIs. By caching a large document or your system instructions once, you avoid re-sending and re-paying full price for those same tokens on every subsequent call, leading to cost savings of up to 90%.
Major API providers like Google, Anthropic, and OpenAI have implemented context caching, with Google being the first to introduce it. Google's implementation is particularly flexible, allowing you to control the caching duration and easily update or delete the cached content.
Context caching is especially beneficial when working with large documents, such as PDFs or video transcripts, where the user will be interacting with the same content repeatedly. By caching the content once, you can provide the user with a seamless experience while drastically reducing the API costs.
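As a concrete illustration of that workflow, here is a minimal sketch, assuming the google-genai Python SDK; the file name, model, and TTL are placeholders rather than values from the original example.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key (GEMINI_API_KEY) from the environment

# Upload the large document once (the file name is a placeholder).
document = client.files.upload(file="annual_report.pdf")

# Cache it so follow-up questions don't re-send the whole PDF at full price.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="Answer questions using only the attached report.",
        contents=[document],
        ttl="1800s",  # keep the cache for 30 minutes
    ),
)

# Each user question now pays full price only for the question itself.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the key findings of the report.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)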
In addition to cost savings, context caching can also serve as an alternative to retrieval-augmented generation (RAG) for smaller sets of documents, avoiding the overhead of vector stores, indexing, and storage.
Context caching can be used for in-context learning, where you can add information about a new library or package, and the language model can then use that cached content in subsequent calls. This can be particularly useful for tasks like generating code or providing summaries of technical documentation.
To get started with context caching, you'll need to install the necessary packages, such as the Google Generative AI package, and then create a cache for your content. Once the cache is created, you can use it in your subsequent API requests, and the language model will process the cached content along with the user's input.
By leveraging context caching, you can significantly reduce your LLM API costs while maintaining the benefits of powerful language models. This technique is a valuable tool in the arsenal of any developer working with large language models.
How Context Caching Works
Context caching is a technique that can significantly reduce the cost of using large language models (LLMs) by cutting down the number of tokens that have to be re-sent and re-billed on every request. Here's how it works:
- Caching the Context: When you have a large document or set of documents that you need to use repeatedly, you can cache their content once. That cached content is then used as part of the context for subsequent requests to the LLM.
- Fewer Billed Tokens per Request: Because the cached context is stored on the provider's side, each request only sends the new user input, and the cached tokens are billed at a steep discount, which translates to lower costs.
- Faster Response Times: Since the LLM doesn't need to re-process as much content, response times can be faster when using a cached context.
- Flexible Caching Duration: The caching duration (time-to-live) is configurable, allowing you to control how long the cached content remains valid. This is useful when the content changes over time.
- Potential Replacement for Retrieval-Augmented Generation: For smaller sets of documents, context caching can serve as an alternative to retrieval-augmented generation (RAG), avoiding the overhead of vector stores and indexing.
- In-Context Learning: Context caching can also be used for in-context learning, where you add information about a new library or topic to the cached context, allowing the LLM to use that information in subsequent calls.
Overall, context caching is a powerful technique that can significantly reduce the cost of using LLMs, especially when working with large documents or a consistent set of information. By leveraging this approach, you can optimize your LLM usage and maximize the value you get from these powerful AI models.
Cost Savings with Context Caching
Context caching is an effective technique that can significantly reduce the cost of using large language models (LLMs) through APIs. By caching the context or content that is repeatedly used in subsequent requests, you can avoid the need to send the same information to the model multiple times, resulting in substantial cost savings.
Cached tokens are billed at up to a 75% discount compared to non-cached tokens, and this applies to both text and multimodal tokens. The savings are particularly significant for Gemini 2.5 Pro, whose pricing tier changes once a prompt exceeds 200,000 tokens.
There is, however, a storage cost for keeping a cache alive, measured in tokens stored per hour. For Gemini 2.5 Flash, storage is $4.50 per million tokens per hour, so it's important to manage how long you keep cached content around.
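To make the trade-off concrete, here is a rough back-of-the-envelope sketch. The input price is a placeholder you should replace with the current rate for your model; the 75% discount and the $4.50 per million tokens per hour storage figure follow the numbers above.
# Rough cost comparison for a 100k-token document queried 50 times in one hour.
INPUT_PRICE_PER_M = 0.30        # assumed $/1M input tokens (placeholder rate)
CACHED_DISCOUNT = 0.75          # cached tokens are billed at a 75% discount
STORAGE_PER_M_PER_HOUR = 4.50   # $/1M cached tokens per hour of storage

doc_tokens, requests, hours = 100_000, 50, 1

without_cache = requests * doc_tokens / 1e6 * INPUT_PRICE_PER_M
with_cache = (requests * doc_tokens / 1e6 * INPUT_PRICE_PER_M * (1 - CACHED_DISCOUNT)
              + doc_tokens / 1e6 * STORAGE_PER_M_PER_HOUR * hours)

print(f"without caching: ${without_cache:.2f}")
print(f"with caching:    ${with_cache:.2f}")
# The longer the cache sits idle, the more the storage term eats into the savings,
# which is why managing the cache duration matters.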
Overall, context caching is a powerful tool that can help you slash your LLM API costs by up to 90%, making it a valuable technique to leverage, especially when working with large documents or files that users interact with repeatedly.
Implementing Context Caching with Google
Context caching is a powerful technique that can significantly reduce the cost of using large language models (LLMs) by up to 90%. It works by caching the context or background information that the model needs to generate a response, reducing the number of tokens that need to be processed on each request.
Google's implementation of context caching is particularly flexible, allowing you to control the caching duration and easily update or delete cached content as needed. Here's how to get started:
- Install the Google Gen AI SDK: Start by installing the Python package used to talk to Google's Gemini API (the examples below assume the google-genai SDK).
!pip install -U google-genai
- Create a cache: Use the client.caches.create() method to store your context in a cache. You can include system instructions, documents, or any other content you want the model to have access to.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key (GEMINI_API_KEY) from the environment

# In practice the cached contents should be a sizeable document; caches have a minimum token count.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="Your system instruction",
        contents=["Your cached content"],
        ttl="3600s",  # keep the cached content for one hour
    ),
)
- Use the cached content: When making a request to the model, pass the cache's name in the request config so the cached content is included in the context (a snippet for verifying cache hits follows these steps).
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Your user prompt",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
- Manage your caches: You can list all available caches, update the caching duration, and delete caches as needed.
# List all caches
for c in client.caches.list():
    print(c.name, c.expire_time)

# Update the cache's time-to-live (expire 300 seconds from now)
client.caches.update(name=cache.name, config=types.UpdateCachedContentConfig(ttl="300s"))

# Delete the cache
client.caches.delete(name=cache.name)
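To confirm that requests are actually hitting the cache, you can inspect the response's usage metadata. The field names below are those exposed by the google-genai SDK; treat this as a quick sanity check rather than a billing report.
usage = response.usage_metadata
print("total prompt tokens:", usage.prompt_token_count)
print("served from cache:  ", usage.cached_content_token_count)
# A large cached_content_token_count relative to prompt_token_count means
# most of the prompt is being billed at the discounted cached rate.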
By leveraging context caching, you can significantly reduce the cost of using LLMs, especially when working with large documents or datasets. This technique can also be used as an alternative to retrieval-augmented generation (RAG) in certain scenarios, avoiding the overhead of vector stores and indexing.
Using Context Caching for In-Context Learning
Context caching can be a powerful technique for enabling in-context learning with large language models (LLMs). By caching relevant information, such as documentation or background knowledge, you can provide the LLM with additional context to draw upon during subsequent interactions.
In the example provided, the author demonstrates how to use context caching with the Gemini API to enable in-context learning for building MCP (Model Context Protocol) servers. The key steps are:
- Ingest the contents of the MCP Python SDK's GitHub repository using the gitingest library, which converts the repository into a single LLM-ready markdown file.
- Create a cache of the repository contents using the client.caches.create() method of the Google Gen AI SDK.
- When the user prompts the LLM to "build a simple MCP server for reading and writing local files under temp MCP", the model can leverage the cached repository information to provide a detailed, step-by-step implementation. A sketch of this workflow follows the list.
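Here is a minimal sketch of that workflow, assuming the gitingest package and the google-genai SDK; the repository URL, model, and TTL are illustrative choices rather than the author's exact setup.
from gitingest import ingest
from google import genai
from google.genai import types

# Turn the repository into a single LLM-ready text digest.
summary, tree, content = ingest("https://github.com/modelcontextprotocol/python-sdk")

client = genai.Client()

# Cache the repository digest so every coding prompt can reference it cheaply.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a coding assistant. Use the cached repository as your reference.",
        contents=[content],
        ttl="3600s",
    ),
)

# Subsequent prompts reuse the cached repository as in-context documentation.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Build a simple MCP server for reading and writing local files under temp MCP.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)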
By caching the relevant documentation and background knowledge, the LLM can draw upon this context to generate more informed and useful responses, without the overhead of additional API calls or vector store lookups. This approach can be particularly beneficial when working with smaller sets of documents, where the cost savings from reduced API usage can be significant.
The example also demonstrates how to manage the cache, including updating the time-to-live (TTL) and deleting the cache when necessary, to ensure efficient usage and cost control.
Overall, this section highlights the practical application of context caching for enabling in-context learning, which can be a valuable technique for developers working with LLMs and aiming to optimize their API usage and costs.
Conclusion
Context caching is a powerful technique that can significantly reduce the cost of using large language models (LLMs) through APIs. By caching context that is reused across requests, developers avoid sending and paying full price for the same tokens each time, with cached tokens billed at up to a 75% discount.
The implementation of context caching by major providers like Google, Anthropic, and OpenAI offers developers more control and flexibility in managing their LLM usage. Google's approach, in particular, allows for fine-tuning the cache duration and easily updating or deleting cached content as needed.
Beyond cost savings, context caching can also serve as an alternative to retrieval-augmented generation (RAG) for smaller document sets, avoiding the overhead of vector stores and indexing. Additionally, it enables in-context learning, where new information can be added to the cache and utilized in subsequent interactions.
Overall, understanding and leveraging context caching is a crucial skill for developers working with LLMs, as it can dramatically improve the efficiency and cost-effectiveness of their applications.
FAQ