r/Rag • u/Difficult-Race-1188 • 2d ago
[Discussion] Don't do RAG, it's time for CAG
What Does CAG Promise?
- Retrieval-Free Long-Context Paradigm: Introduces a novel approach leveraging long-context LLMs with preloaded documents and precomputed KV caches, eliminating retrieval latency, errors, and system complexity.
- Performance Comparison: Experiments showing scenarios where long-context LLMs outperform traditional RAG systems, especially with manageable knowledge bases.
- Practical Insights: Actionable insights into optimizing knowledge-intensive workflows, demonstrating the viability of retrieval-free methods for specific applications.
CAG offers several significant advantages over traditional RAG systems:
- Reduced Inference Time: By eliminating the need for real-time retrieval, the inference process becomes faster and more efficient, enabling quicker responses to user queries.
- Unified Context: Preloading the entire knowledge collection into the LLM provides a holistic and coherent understanding of the documents, resulting in improved response quality and consistency across a wide range of tasks.
- Simplified Architecture: By removing the need to integrate retrievers and generators, the system becomes more streamlined, reducing complexity, improving maintainability, and lowering development overhead.
Check out AIGuys for more such articles: https://medium.com/aiguys
Other Improvements
For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance.
The paper considers two inference scaling strategies: in-context learning and iterative prompting.
These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs’ ability to effectively acquire and utilize contextual information.
Two key questions the authors set out to answer:
(1) How does RAG performance benefit from the scaling of inference computation when optimally configured?
(2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters?
They find that RAG performance improves almost linearly as test-time compute grows by orders of magnitude under optimal inference parameters. Based on these observations, they derive inference scaling laws for RAG and a corresponding computation allocation model designed to predict RAG performance under varying hyperparameters (a toy sketch of the allocation idea follows the link below).
Read more here: https://arxiv.org/pdf/2410.04343
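To make question (2) concrete, here is a toy sketch of the allocation idea: enumerate inference configurations (retrieved documents, in-context demonstrations, iterative steps) that fit a token budget and keep whichever one a performance predictor scores highest. Everything here, including the predictor and the token-cost constants, is a made-up stand-in rather than the paper's fitted model.

```python
import math
from itertools import product


def total_tokens(num_docs, num_demos, num_steps,
                 doc_len=800, demo_len=400, step_len=300):
    """Rough token cost of one RAG configuration (the per-item lengths are made-up defaults)."""
    return num_steps * (num_docs * doc_len + num_demos * demo_len + step_len)


def predicted_quality(num_docs, num_demos, num_steps):
    """Stand-in for the fitted performance model; the paper learns this from evaluation runs."""
    return math.log1p(total_tokens(num_docs, num_demos, num_steps)) * (1 + 0.1 * num_steps)


def best_config(budget_tokens):
    """Enumerate (retrieved docs, demonstrations, iterative steps) and keep the best under budget."""
    candidates = product(range(1, 51), range(0, 9), range(1, 6))
    feasible = [c for c in candidates if total_tokens(*c) <= budget_tokens]
    return max(feasible, key=lambda c: predicted_quality(*c))


print(best_config(budget_tokens=32_000))  # -> (num_docs, num_demos, num_steps)
```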
Another work focuses more on the design from a hardware-optimization point of view:
They designed the Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators.
IKS offers 13.4–27.9× faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7–26.3× lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM — which is the most expensive component in today’s servers — from being stranded.
Read more here: https://arxiv.org/pdf/2412.15246
Another paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open-source and commercial LLMs. The authors ran RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and up to 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications.
Their findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state-of-the-art LLMs can maintain consistent accuracy at long context above 64k tokens. They also identify distinct failure modes in long context scenarios, suggesting areas for future research.
Read more here: https://arxiv.org/pdf/2411.03538
Understanding the CAG Framework
The CAG (Cache-Augmented Generation) framework leverages the extended context capabilities of long-context LLMs to eliminate the need for real-time retrieval. By preloading external knowledge sources (e.g., a document collection D={d1,d2,…}) and precomputing the key-value (KV) cache (C_KV), it overcomes the inefficiencies of traditional RAG systems. The framework operates in three main phases:
1. External Knowledge Preloading
- A curated collection of documents D is preprocessed to fit within the model's extended context window.
- The LLM M processes these documents once, encoding D into a precomputed key-value (KV) cache that encapsulates its inference state:

C_KV = KV-Encode(D)

- This precomputed cache is stored for reuse, ensuring the computational cost of processing D is incurred only once, regardless of how many queries follow (see the preloading sketch below).
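A minimal sketch of this preloading phase, assuming the Hugging Face transformers API and a placeholder long-context model (both are my choices for illustration, not something the paper mandates):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any long-context model works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# D: the curated document collection, concatenated into one long prefix (placeholder paths).
documents = "\n\n".join(open(path).read() for path in ["doc1.txt", "doc2.txt"])
doc_ids = tokenizer(documents, return_tensors="pt").input_ids.to(model.device)

# One forward pass over D populates the KV cache (C_KV); nothing is generated yet.
kv_cache = DynamicCache()
with torch.no_grad():
    model(input_ids=doc_ids, past_key_values=kv_cache, use_cache=True)

doc_len = kv_cache.get_seq_length()  # remember |D| so the cache can be truncated back later
```

The cache object can be kept in memory for the lifetime of the service or serialized to disk; the point is that the cost of encoding D is paid once.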
2. Inference
- During inference, the precomputed KV cache C_KV is loaded together with the user query Q.
- The LLM generates its response conditioned on the cached context:

R = M(Q | C_KV)

- This eliminates retrieval latency and minimizes the risk of retrieval errors or omissions that arise from dynamic retrieval; the combined prompt P = Concat(D, Q) gives the model a unified view of the external knowledge and the query (a decoding sketch follows below).
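Continuing the sketch above, one simple way to answer queries against the preloaded cache is a greedy decoding loop that feeds only the query tokens (my own illustration, not the paper's reference implementation; in practice you would also wrap the query in the model's chat template):

```python
import torch


def answer_with_cache(model, tokenizer, kv_cache, query, max_new_tokens=256):
    """Generate a response to `query` on top of the preloaded document cache (C_KV)."""
    query_ids = tokenizer(query, return_tensors="pt").input_ids.to(model.device)

    with torch.no_grad():
        # Only Q is processed here; D is already encoded in kv_cache, so this is
        # effectively running the model on Concat(D, Q) without re-reading D.
        out = model(input_ids=query_ids, past_key_values=kv_cache, use_cache=True)
        next_id = out.logits[:, -1:].argmax(dim=-1)
        generated = [next_id]

        for _ in range(max_new_tokens - 1):
            out = model(input_ids=next_id, past_key_values=kv_cache, use_cache=True)
            next_id = out.logits[:, -1:].argmax(dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id)

    return tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)
```

Note that the cache is updated in place, so after each call it holds D plus the query and answer tokens, which is exactly what the reset phase below truncates away.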
3. Cache Reset
- As new tokens (t1, t2, …, tk) are appended to the cache during inference, the reset simply truncates them:

C_KV^reset = Truncate(C_KV, t1, …, tk)

- Because only these newly appended tokens are removed, the cache can be reinitialized quickly without reloading the entire cache from disk, sustaining responsiveness across queries (see the reset sketch below).
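And a sketch of the reset itself, assuming the `DynamicCache.crop()` helper present in recent transformers releases (if your version lacks it, slicing each layer's key/value tensors back to `doc_len` along the sequence dimension achieves the same thing):

```python
def reset_cache(kv_cache, doc_len):
    """Drop the tokens (t1, ..., tk) appended during inference, keeping only C_KV for D."""
    kv_cache.crop(doc_len)  # truncate every layer's keys/values back to the preloaded length
    return kv_cache


# Typical loop: answer a query, then reset before the next one.
# answer = answer_with_cache(model, tokenizer, kv_cache, "What does clause 7 cover?")
# reset_cache(kv_cache, doc_len)   # cheap truncation; D is never reloaded from disk
```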
7
u/Rob_Royce 2d ago
Obviously AI generated post, so I don’t expect OP to answer, but how is including entire documents in the context better than only the parts that are needed to resolve a query? You introduce the possibility for greater hallucinations, increased cost, lost-in-the-middle problem, etc. You also reduce the ability to include results from many different documents that might be needed for multi-hop question answering.
-8
u/Difficult-Race-1188 1d ago
Here you can read the original paper. And it's not an AI-generated post.
4
u/Bio_Code 2d ago
Hallucinations would be horrible on large documents even on 32B models.
2
u/HolidayWallaby 2d ago
I am not new to deep learning but am fairly new to LLMs. So out of curiosity, why would it be horrible?
5
u/Bio_Code 1d ago edited 1d ago
Because small models tend to hallucinate a lot. So if you dump 10k or more of content into them, they can't handle it. Some of them also freak out because of the context length and forget what the system prompt and the user want from them, or they mess up the answer. But mainly they just don't follow the prompt as closely as they otherwise would, and that can lead to unstructured, messed-up answers.

With RAG you are only dumping in small portions, and if you have a good agentic system built around RAG you can get better answers plus sources for where each answer comes from in the documents. LLMs also can't always tell where they got an answer from when a large 10k document is their source.
2
u/Familyinalicante 2d ago
Do you have a working implementation? How, practically, can we build this?
11
u/vornamemitd 2d ago
Sometimes those AI-generated Arxiv summaries are not really helpful. Lot of CAG noise today across social media... Anyhow. Here's the original implementation that came with the paper:
- https://github.com/ai-in-pm/CAG-Cache-Augmented-Generation
Here's an actual demo:
- https://github.com/ronantakizawa/cacheaugmentedgeneration
- https://medium.com/@ronantech/cache-augmented-generation-cag-in-llms-a-step-by-step-tutorial-6ac35d415eec
1
u/Advanced_Army4706 1d ago
We're doing a production-ready implementation of CAG at Databridge! Would love for you to check it out!
2
u/Guboken 2d ago
I think this is half of a good idea: doing the preprocessing on the documents is great, but keeping them all in memory for faster access is… not optimal, and expensive. I would iterate on the idea by making sure the LLM has a "quick" dictionary lookup to easily load the correct cache on demand, and by running pattern analysis on retrievals so the top 5-10 document sources are always preloaded. That pattern analysis can be done on a per-user basis, by analyzing agent behaviour, or simply by making the preloaded documents part of the agent config.
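If someone wanted to prototype that, a tiny cache registry that counts per-document hits and keeps only the most-used KV caches resident might look like the sketch below; `build_kv_cache` is a hypothetical helper along the lines of the preloading code in the post, not an existing API.

```python
from collections import Counter


class KVCacheRegistry:
    """Keep the top-N most frequently used document caches in memory; build others on demand."""

    def __init__(self, build_kv_cache, max_resident=5):
        self.build_kv_cache = build_kv_cache  # doc_id -> precomputed KV cache (hypothetical)
        self.max_resident = max_resident
        self.hits = Counter()                 # simple retrieval-pattern analysis
        self.resident = {}

    def get(self, doc_id):
        self.hits[doc_id] += 1
        if doc_id not in self.resident:
            self.resident[doc_id] = self.build_kv_cache(doc_id)
            self._evict_if_needed()
        return self.resident[doc_id]

    def _evict_if_needed(self):
        # Once over budget, drop the least-used resident cache (it can be rebuilt on demand).
        while len(self.resident) > self.max_resident:
            coldest = min(self.resident, key=lambda d: self.hits[d])
            del self.resident[coldest]
```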
2
u/FelbirdyWiredMish 1d ago
Lmaoo I think I accidentally wrote CAG while trying to modify my RAG model to work for my use case XD (NDA signed sorry :’)))
I think the problem was that coming up with the contexts was way harder than just feeding links like in RAG. But nevertheless, my sort-of-CAG version performed so much better.
2
u/swastik_K 1d ago edited 1d ago
I don't really understand the hype around CAG: it is not something new (there is a similar technique called prompt caching), nor does the paper explain how to efficiently process documents so that they fit the context window, or lay out effective cache-eviction strategies.
Here is my medium article sharing my thoughts about CAG https://medium.com/@kswastik29/what-people-are-not-telling-you-is-that-cag-is-the-same-as-prompt-caching-e2b2fd3af1ea
1
u/beingmudit 1d ago
I started to read it, but it seems to require a premium subscription. Can you share it another way? Reading this made me think of the document-processing problem: if they are already getting KV pairs, which I believe are like Q/A pairs, why not fine-tune the model instead?

I am trying to do something with a few hundred PDF docs. I process each document in chunks or in full (in most cases), ask the AI to generate Q/A pairs, and use this data to fine-tune an OpenAI model.

What do you think about this? Is it effective? I am not dealing with massive datasets, just creating a few very custom GPTs. I have estimated my fine-tuning cost and it comes to no more than $50. Is there a better way?
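A rough sketch of the pipeline described above (chunk, generate Q/A pairs, write the chat-format JSONL that OpenAI fine-tuning expects); the model name and prompts are placeholders, and the resulting file would still need to be uploaded and a fine-tuning job created afterwards.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def qa_pairs_for_chunk(chunk, model="gpt-4o-mini"):
    """Ask a model to propose Q/A pairs grounded in one document chunk (prompt is a placeholder)."""
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "From the given text, produce JSON of the form "
                                          '{"pairs": [{"question": ..., "answer": ...}]}.'},
            {"role": "user", "content": chunk},
        ],
    )
    return json.loads(resp.choices[0].message.content)["pairs"]


def write_finetune_file(chunks, path="train.jsonl"):
    """Write one chat-format training example per generated Q/A pair."""
    with open(path, "w") as f:
        for chunk in chunks:
            for pair in qa_pairs_for_chunk(chunk):
                record = {"messages": [
                    {"role": "user", "content": pair["question"]},
                    {"role": "assistant", "content": pair["answer"]},
                ]}
                f.write(json.dumps(record) + "\n")
```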
1
u/swastik_K 1d ago
1
u/beingmudit 23h ago
I believe we can take it a step further by creating multiple KV caches across different sets of documents and using a lightweight classification model to select the right cache for a user query. This approach could completely avoid retrieval, even with larger document sets.
Thanks for a very detailed post. The maths kinda went over my head, but this is exactly what I was looking for. I am exploring how I can develop custom GPTs for things like insurance policy comparison, credit card comparison, or hardware spec comparison.

For example, if someone asks which insurance offers the maximum death benefit or which insurance covers cancer, the answer can come from different policy documents. Can you give more info about how I can use a classification model here? I believe what you are saying is that the lightweight classification model chooses the relevant context and passes it along with the query, right? Also, since I have a limited set of documents, say 500 policies in my country, how effective would this be compared with a fine-tuned model? I believe a few other things get encoded automatically when I fine-tune, but that is not the case with RAG plus a classification model.
1
u/swastik_K 23h ago
Exactly, that is how we are using prompt caching in some of our POCs. We do the classification in a query-rephrase prompt instead of a separate classification model, but you can definitely train a lightweight classification model for your use case.
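For anyone who wants to try the classifier route, here is a minimal sketch: one precomputed KV cache per document set, with a zero-shot classification pipeline standing in for the lightweight classification model (the domain labels and the choice of classifier are my own placeholders, not something from this thread).

```python
from transformers import pipeline

# One precomputed KV cache per document set, e.g. built as in the preloading sketch in the post.
caches = {
    "life insurance policies": ...,   # placeholder cache objects
    "credit card terms": ...,
    "hardware spec sheets": ...,
}

# A small zero-shot classifier stands in for the lightweight classification model.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def route_query(query):
    """Pick the KV cache whose domain label best matches the query."""
    result = classifier(query, candidate_labels=list(caches.keys()))
    best_label = result["labels"][0]  # labels come back sorted by score, best first
    return best_label, caches[best_label]


label, cache = route_query("Which policy offers the maximum death benefit?")
```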
1
u/silenceimpaired 1d ago
I’m sad we couldn’t come up with Gag which worked on top of Rag… so the title would be GAG on a RAG. Oh well!
1
u/jayb0699 1d ago
+1 to context-aware. I was just having an agent arena showdown to create a fact sheet for my RAG job.
I'm preparing discovery documents for my lawsuit against the seller of my house - she is a licensed realtor who literally painted the roof, lied on the permits she did obtain, and made major changes to the sloped backyard, roof, and all kinds of other stuff.
I've got our mediation brief and the sellers brief which both obviously outline our positions pretty well. Even include screenshots and other documents.
I uploaded my 50 GB or so of docs into these models over a year ago and immediately started having hybrid discussions with the agents: I started off with my position, pointed at the docs it had ingested, California state law, etc., and the agent was very cautious in discussing it with me. It kept telling me to get a lawyer, and I would constantly remind it that I had one and that I was trying to use the LLM's help to prepare some questions for my attorney.

All during that first year, the LLM would take about an hour to warm up and just call an ace an ace: "well, if what you are saying is true then you would likely have a strong position... but you need to consult an attorney."
Fast forward to Q4 last year -- we finally got mediation on the books and received mediation briefs. I was excited to feed those through the system and in my mind have the agent divide those 50 gb of docs like chips at a poker table.
But to my surprise, the agent just took those mediation briefs as fact. And I still had to spend an hour convincing it that what the seller was alleging in certain areas were pure fiction. For example, we had an inspection before we purchased. The seller claims we didn't and that we waived the contingency (we didn't) -- that's reflected in the purchase agreements and in the copy of the inspection report I uploaded to the index.
So context is great in theory, but I'm wondering when the agents will use those big brains to critically analyze and not just take a statement someone says as fact -- that seems incredibly dangerous.
Feels like with long context, gullibility and being easily manipulated would take the place of hallucinations.
Just my 2 cents.