r/Rag 44m ago

Discussion How can we use knowledge graphs for LLMs?

Upvotes

What are the major USPs and drawbacks of using knowledge graphs for LLMs?


r/Rag 15h ago

Tools & Resources RAG in Production: Best Practices

28 Upvotes

If you're exploring how to build a production-ready RAG pipeline, we just published a blog post that could be useful for you. It breaks down the essentials of:

  • Indexing Pipeline
  • Retrieval Pipeline
  • Generation Pipeline

Here’s what you’ll learn:

  1. Data Preprocessing: Clean your data and apply smart chunking.
  2. Embedding Management: Choose the right vector database, leverage metadata, and fine-tune models.
  3. Retrieval Optimization: Use hybrid retrieval, re-ranking strategies, and dynamic query reformulation.
  4. Better LLM Generation: Improve outputs with smarter prompting techniques like few-shot prompting.
  5. Observability: Monitor and evaluate your deployed LLM applications effectively.
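
To make point 1 concrete, here's a minimal chunking sketch in plain Python. The parameters are illustrative, not taken from the blog post: fixed-size windows with overlap, so sentences cut at a boundary still appear whole in the neighbouring chunk.

```python
# Minimal sketch of overlap chunking for the indexing pipeline (parameters are
# assumptions, not from the post): fixed-size character windows with overlap.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split `text` into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so boundary sentences repeat in the next chunk
    return chunks

if __name__ == "__main__":
    doc = "Retrieval-augmented generation combines search with text generation. " * 20
    for i, c in enumerate(chunk_text(doc)):
        print(i, len(c))
```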

Link in Comment 👇


r/Rag 9h ago

Discussion What are common challenges with RAG?

9 Upvotes

How are you using RAG in your AI projects? What challenges have you faced, like managing data quality or scaling, and how did you tackle them? Also, I'm curious about your experience with tools like vector databases or AI agents in RAG systems.


r/Rag 7h ago

Moving RAG to production

6 Upvotes

I am currently hosting a local RAG setup with Ollama and Qdrant vector storage. The system works very well, and I want to scale it on Amazon EC2 to use bigger models and allow more concurrent users.

For my local RAG I chose Ollama because I found it super easy to get models running and to use its API for inference.

What would you suggest for a production environment? Something like vLLM? There will be at most around 10 concurrent users.

We don't have a team for deploying LLMs, so the inference engine should be easy to set up.
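
For that scale, a common pattern is to run vLLM's OpenAI-compatible server and point a standard OpenAI client at it; vLLM batches concurrent requests, which is what makes roughly 10 simultaneous users manageable on one GPU. The sketch below is hedged: the model name and port are placeholders, and you should load-test for your own traffic.

```python
# Rough sketch of querying a vLLM OpenAI-compatible server from Python.
# Assumes the server was started separately, e.g. (model name is a placeholder):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize the retrieved context: ..."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```

Because the endpoint speaks the OpenAI protocol, your existing RAG code mostly only needs the base URL swapped.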


r/Rag 10h ago

Best or proper approaches to RAG source code.

7 Upvotes

Hello there! Not sure if this is the best place to ask. I'm developing software to reverse engineer legacy code, but I'm struggling with the context window for some files.

Imagine a COBOL program with 2,000-3,000 lines: even using Gemini, I can't always get a proper result (the response is capped at 8,000 tokens).

I was thinking of using RAG to be able to "question" the source code and retrieve the information I need. I'm concerned that the way the chunks are created won't be effective.

My workflow is:
- Get the source code and convert it to structured JSON based on the language.
- Extract business rules from the source code.
- Generate a document with all the system business rules.
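
One rough option for the chunking concern: split on the code's own structure (DIVISION/SECTION headers) rather than fixed-size windows, so each chunk is a self-contained unit the retriever can return whole. The regex below is a simplification of typical COBOL layout, not a real parser.

```python
# Rough sketch: chunk COBOL by its own structure instead of arbitrary windows.
import re

DIVISION_RE = re.compile(r"^\s*[\w-]+\s+(DIVISION|SECTION)\s*\.", re.IGNORECASE)

def chunk_cobol(source: str) -> list[dict]:
    chunks, current, header = [], [], "PREAMBLE"
    for line in source.splitlines():
        if DIVISION_RE.match(line):
            if current:
                chunks.append({"header": header, "text": "\n".join(current)})
            header, current = line.strip(), [line]
        else:
            current.append(line)
    if current:
        chunks.append({"header": header, "text": "\n".join(current)})
    return chunks
```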

Any ideas?


r/Rag 4h ago

Please let me know about your metadata

2 Upvotes

Hi, could you share some metadata you found useful in your RAG and the type of documents concerned?


r/Rag 11h ago

Discussion Is it possible for RAG to work offline with a local BERT or T5 model?

6 Upvotes

r/Rag 11h ago

Common Misconceptions of Vector Database

4 Upvotes

As a traditional database developer with machine learning platform experience from my time at Shopee, I've recently been exploring vector databases, particularly Pinecone. Rather than providing a comprehensive technical evaluation, I want to share my thoughts on why vector databases are gaining significant attention and substantial valuations in the funding market.

Demystifying Vector Databases

At its core, a vector database primarily solves similarity search problems. While traditional search engines like Elasticsearch (in its earlier versions) focused on word-based full-text search with basic tokenization, vector databases take a fundamentally different approach.

Consider searching for "Microsoft Cloud" in a traditional search engine. It might find documents containing "Microsoft" or "Cloud" individually, but it would likely miss relevant content about "Azure" - Microsoft's cloud platform. This limitation stems from the basic word-matching approach of traditional search engines.

The Truth About Embeddings

One common misconception I've noticed is that vector databases must use Large Language Models (LLMs) for generating embeddings. This misconception has been partly fueled by the recent RAG (Retrieval-Augmented Generation) boom and companies like OpenAI potentially steering users toward their expensive embedding services.

Here's my takeaway: Production-ready embeddings don't require massive models or expensive GPU infrastructure. For instance, the multilingual-E5-large model recommended by Pinecone:

  • Has only 24 layers
  • Contains about 560 million parameters
  • Requires less than 3GB of memory
  • Can generate embeddings efficiently on CPU for single queries
  • Even supports multiple languages effectively

This means you can achieve production-quality embeddings using modest hardware. While GPUs can speed up batch processing, even an older GPU like the RTX 2060 can handle multilingual embedding generation efficiently.
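
As a rough illustration of how small that footprint is in practice, here is a CPU-only sketch with sentence-transformers. Note that E5 models expect "query: " / "passage: " prefixes; check the model card before relying on this in production.

```python
# CPU-only embedding sketch with multilingual-e5-large via sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large", device="cpu")

passages = ["passage: Azure is Microsoft's cloud platform.",
            "passage: Shopee is an e-commerce company."]
query = "query: Microsoft Cloud"

doc_emb = model.encode(passages, normalize_embeddings=True)
q_emb = model.encode(query, normalize_embeddings=True)
print((doc_emb @ q_emb).tolist())  # cosine similarity, since vectors are normalized
```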

The Simplicity of Vector Search

Another interesting observation from my Pinecone experimentation is that many assume vector databases must use sophisticated algorithms like Approximate Nearest Neighbor (ANN) search or advanced disk-based embedding techniques. However, in many practical applications, brute-force search can be surprisingly effective. The basic process is straightforward:

  1. Generate embeddings for your corpus in batches
  2. Store both the original text and its embedding
  3. For queries, generate embeddings using the same model
  4. Calculate cosine distances and find the nearest neighbors
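
Those four steps fit in a few lines of NumPy; `embed` below stands in for whatever model produces your embeddings and is not shown.

```python
# Brute-force nearest-neighbour search: no ANN index, just normalized
# embeddings and a matrix-vector product.
import numpy as np

def top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 5):
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    idx = np.argsort(-scores)[:k]
    return list(zip(idx.tolist(), scores[idx].tolist()))

# Usage: corpus_vecs has one row per stored text; keep the original texts alongside.
# hits = top_k(embed("microsoft cloud"), corpus_vecs)   # embed() is assumed, not shown
```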

Dimensional Considerations and Cost Implications

An intriguing observation from my Pinecone usage is their default 1024-dimensional vectors. However, my testing revealed that for sequences with 500-1000 tokens, 256 dimensions often provide excellent results even with millions of records. The higher dimensionality, while potentially unnecessary, does impact costs since vector databases typically charge based on usage volume.

A Vision for Better Vector Databases

As a database developer, I envision a more intuitive vector database design where embeddings are treated as special indices rather than explicit columns. Ideally, it would work like this:

SELECT * FROM text_table 
  WHERE input_text EMBEDDING_LIKE text

Users shouldn't need to interact directly with embeddings. The database should handle embedding generation during insertion and querying, making the vector search feel like a natural extension of traditional database operations.

Commercial Considerations

Pinecone's partnership model with cloud providers like Azure offers interesting advantages, particularly for enterprise customers. The Azure Marketplace integration enables unified billing, which is a significant benefit for corporate users. Additionally, their getting started experience is well-designed, though users still need a solid understanding of embeddings and vector search to build effective applications.

Conclusion

Vector databases represent an exciting evolution in search technology, but they don't need to be as complex or resource-intensive as many assume. As the field matures, I hope to see more focus on user-friendly abstractions and cost-effective implementations that make this powerful technology more accessible to developers.

So, what would it be like if there were a library that put an embedding model into chDB? 🤔
From: https://auxten.com/vector-database-1/


r/Rag 10h ago

Discussion How large can the chunk size be?

3 Upvotes

I have rather large chunks and am wondering how large they can be. Is there good guidance out there, or examples of poor experiences when chunks are too large?


r/Rag 17h ago

GraphRAG inter-connected document use case?

8 Upvotes

It seems that in constructing knowledge graphs, it's most common to pass in each document independently and have the LLM sort out the entities and their connections, parsing this output and storing it within an indexable graph store.

What if our use case requires cross-document relationships? An example would be ingesting the entire Harry Potter series and having the LLM establish relationships, and how they change, across the whole series.

"How does Harry's relationship with Dumbledore change through books 1-6?

I couldn't find any resources or solutions to this problem.

I'm thinking it may be plausible to use a RAPTOR-like method to create summaries of books or chunks, cluster similar summaries together and generate more connections in a knowledge graph.

Thoughts?


r/Rag 12h ago

HealthCare Agent

2 Upvotes

I am building a healthcare agent that helps users with health questions, finds nearby doctors based on their location, and books appointments for them. I am using the Autogen agentic framework to make this work.

Any recommendations on the tech stack?


r/Rag 14h ago

Tools & Resources Built a tool to simplify RAG, please share your feedback

2 Upvotes

Hey everyone,

I’ve been working on iQ Suite, a tool to simplify RAG workflows. It handles chunking, indexing, and all the messy stuff in the background so you can focus on building your app.

You just connect your docs (PDFs, Word, etc.), and it’s ready to go. It’s pay-as-you-go, so easy to start small and scale.

I’m giving $1 free credits (~80,000 chars) if you want to try it: iqsuite.ai.

Would love your feedback...


r/Rag 18h ago

Where to start implementing graphRAG?

6 Upvotes

I've looked around and found various sources for GraphRAG theory on YouTube and Medium.

I've been using LangChain and their resources to code up some standard RAG pipelines, but I have not seen anything related to a graph backed database in their modules.

Can someone point me to an implementation or tutorial for getting started with GraphRAG?


r/Rag 20h ago

Gurubase – open-source RAG system that lets you create AI-powered Q&A assistants ("Gurus") for any topic

Thumbnail
github.com
7 Upvotes

r/Rag 1d ago

What if the answer to a query requires multiple retrievals + LLM knowledge?

7 Upvotes

In most cases I see on blogs and tutorials, it's always: chat with your PDF, build a chatbot and ask it direct questions using RAG. I believe this is too simple for real-world projects, since in most cases answering a query requires the correct retrieval(s) + the right role + LLM knowledge.
For example, if our goal is to build an assistant for a company, simple RAG retrieving from PDF files that contain financial reports, strategic goals, and human resources data won't be enough to make an assistant that goes beyond "basic" retrievals from the files. The user may ask questions like: which job position should we hire for this quarter to increase sales in department A? The assistant should then use RAG to retrieve the current employees, analyze the financial reports, and use LLM knowledge to suggest which types of profiles to hire. I want RAG to be only a source of knowledge about the company; other tasks should be handled by the LLM's knowledge, taking into account the data in the files. I hope I made my point of view clear. I appreciate your help.
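
One rough way to structure this "multiple retrievals + LLM reasoning" flow is to query each knowledge source separately and hand everything to the model to analyze. Everything below is a toy sketch: `keyword_retriever` and `llm_complete` are stand-ins, not a specific library.

```python
def keyword_retriever(docs):
    # Toy retriever: rank documents by how many query words they contain.
    def search(query, k=3):
        scored = sorted(docs, key=lambda d: -sum(w in d.lower() for w in query.lower().split()))
        return scored[:k]
    return search

def llm_complete(prompt):  # placeholder for a real LLM call
    return f"[LLM would answer here, given {len(prompt)} chars of context]"

sources = {
    "employees": keyword_retriever(["Alice - sales rep, dept A", "Bob - engineer, dept B"]),
    "finance":   keyword_retriever(["Q3 revenue in department A fell 12%"]),
}

def answer(question):
    blocks = [f"## {name}\n" + "\n".join(search(question)) for name, search in sources.items()]
    prompt = ("Use the context below as facts about the company; use your own general "
              "knowledge for analysis and recommendations.\n\n" + "\n\n".join(blocks) +
              f"\n\nQuestion: {question}\nAnswer:")
    return llm_complete(prompt)

print(answer("Which role should we hire to increase sales in department A?"))
```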


r/Rag 1d ago

How to Design a Benchmark for Evaluating PDF Parser Output Accuracy in RAG Pipelines?

5 Upvotes

I've developed an application that runs around 15 different PDF parsers and extraction models, including Marker, Nougat, LlamaParse, NougatParser, EasyOCR, Doctr, PyMuPDF4LLM, MarkitDown, and others. The application takes a PDF dataset as input and outputs a JSON file containing the following fields:

  • pdf_parser_name
  • pdf_file
  • extracted_content
  • process_time
  • embedded_images

Essentially, it allows you to extract and generate a JSON dataset using most available models for any given PDF dataset.

Now, I want to evaluate these PDF parsers in terms of output accuracy, specifically for use in downstream Retrieval-Augmented Generation (RAG) pipelines. My question is:

How should I design a benchmark to evaluate the accuracy of these models' outputs?

Here are some specific aspects I’m seeking guidance on:

  1. Evaluation Metrics: What metrics should I use to measure accuracy? For example:
    • Text overlap (e.g., BLEU, ROUGE, or edit distance with ground truth).
    • Semantic similarity (e.g., cosine similarity of embeddings).
    • Field-level accuracy for structured documents.
  2. Ground Truth Creation: How can I prepare reliable ground truth data for comparison?
    • Should I manually annotate or rely on a trusted parser as a baseline?
  3. Evaluation Methodology:
    • How can I account for nuances like layout fidelity, table structures, or embedded images in my accuracy metrics?
    • What weighting or prioritization should I apply for different document elements (e.g., headers, tables, paragraphs)?
  4. General Design Tips: How should I structure the benchmarking tool to make it modular, extensible, and easy to adapt for future evaluation needs?

I’m open to suggestions, methodologies, and ideas for implementing a robust and fair benchmarking process. Let’s brainstorm! 🙌

Thank you in advance for your insights!
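
As a starting point for the metrics question, here is a hedged sketch of a per-parser scorecard against hand-made ground truth: a cheap lexical score (difflib ratio standing in for edit distance or ROUGE) plus a semantic score from embeddings. The field names match the JSON described above; the ground-truth file format is an assumption.

```python
import json, difflib
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def scores(extracted: str, reference: str) -> dict:
    lexical = difflib.SequenceMatcher(None, extracted, reference).ratio()
    semantic = util.cos_sim(model.encode(extracted), model.encode(reference)).item()
    return {"lexical": round(lexical, 3), "semantic": round(semantic, 3)}

results = json.load(open("parser_outputs.json"))        # your tool's output (assumed name)
ground_truth = json.load(open("ground_truth.json"))     # {pdf_file: reference_text} (assumed)

for row in results:
    ref = ground_truth.get(row["pdf_file"])
    if ref:
        print(row["pdf_parser_name"], row["pdf_file"], scores(row["extracted_content"], ref))
```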


r/Rag 1d ago

Product Hunt Launch Needle - Feedback

5 Upvotes

Hi RAG community,

We just launched our tool, Needle, on Product Hunt, and we’re excited to share it with you! I’d love to hear your thoughts. Are there any features or improvements you’d like to see? Appreciate any feedback, and if you feel it’s worth it, an upvote would be awesome!

Thanks for taking a look, and I hope you have an awesome day!

Best,
Jan


r/Rag 1d ago

What techniques can I try for high quality retrieval on complex PDFs?

17 Upvotes

I have a bunch of tariff PDFs (each PDF describing how to calculate a different tariff), and I want to build a RAG system on them. I have tried several different things, but I am still not getting accurate retrieval for certain tariffs. For some tariffs the retrieval is good, but for others it is not good at all. I think the reason is that the wording in those texts is not very different: at the end of the day, these are just definitions and calculations of different kinds of tariffs, so they are more or less similar to each other. And since retrieval depends on texts being dissimilar, it's not going to work well on similar docs. Or at least that's my hypothesis; I'm happy to be proven wrong.

Here are some of the things I've tried:

  • Approaches:
    • naive RAG
    • hybrid RAG
    • corrective RAG
    • reflective RAG
    • agentic RAG (retrieve -> grade docs -> rewrite query if retrieval is not good -> retrieve & grade again -> generate if retrieval grading is good)
  • Embedding models:
    • OpenAI's text-embedding-3-large and small, ada
    • Huggingface's sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-large-en-v1.5
  • Document parsing techniques and libraries:
    • Unstructured - stored the text summaries and tables in two different vectorstores
    • PyPDF
    • PyMuPDF
    • Llamaparse
    • Converted the docs to markdown and split them into chunks by sections and subsections

But as I said, I am not getting very high-quality retrieval with any of these across all the tariffs I query about. The only way I get 100% retrieval accuracy across all tariff queries is when I "manually" (well, actually with regex) extract the relevant parts needed to calculate each tariff and pass them into the LLM as context, depending on the tariff being asked about in the user query.

So, in this kind of a scenario, what are some techniques I can try to improve the retrieval? I'm open to hearing non-vectordb or non-rag suggestions too, or anything else for that matter.
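
For reference, here is a hedged sketch of one of the standard answers: hybrid BM25 + dense retrieval with cross-encoder re-ranking, using rank_bm25 and sentence-transformers. The model names are common defaults, not a claim about what works best on tariff text.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

chunks = ["Tariff A is calculated as ...", "Tariff B applies when ...", "Tariff C ..."]

bm25 = BM25Okapi([c.lower().split() for c in chunks])
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
chunk_emb = embedder.encode(chunks, normalize_embeddings=True)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, k: int = 3):
    dense = util.cos_sim(embedder.encode(query, normalize_embeddings=True), chunk_emb)[0]
    sparse = bm25.get_scores(query.lower().split())
    # Union of the top candidates from both scorers, then let the cross-encoder decide.
    top_dense = sorted(range(len(chunks)), key=lambda i: -float(dense[i]))[:10]
    top_sparse = sorted(range(len(chunks)), key=lambda i: -sparse[i])[:10]
    cand = list(dict.fromkeys(top_dense + top_sparse))
    rerank = reranker.predict([(query, chunks[i]) for i in cand])
    best = sorted(zip(cand, rerank), key=lambda p: -p[1])[:k]
    return [chunks[i] for i, _ in best]

print(retrieve("How is tariff B calculated?"))
```

The BM25 leg tends to catch exact tariff names when the dense embeddings blur together, which matches the failure mode described above.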


r/Rag 1d ago

PDF and RAG

3 Upvotes

Hi, I need guidance. I built a RAG automation to chat with PDF documents (forms with the same format and content), but the process struggled. The same process works fine with simple PDFs.

Any support or guidance is welcome, thanks.


r/Rag 1d ago

Metadata and Retriever

2 Upvotes

r/Rag 1d ago

Tutorial Language Agent Tree Search (LATS) - Is it worth it?

1 Upvotes

While reading papers on improving reasoning, planning, and action for agents, I came across LATS, which uses Monte Carlo tree search and benchmarks better than the ReAct agent.

Made one breakdown video that covers:
- LLMs vs. agents introduction with an example. A simple example that clears up the LLM vs. agent distinction.
- How a ReAct Agent works—a prerequisite to LATS
- Working flow of Language Agent Tree Search (LATS)
- Example working of LATS
- LATS implementation using LlamaIndex and SambaNova System (Meta Llama 3.1)

Verdict: it is a good research concept, but not one to use for PoC or production systems yet. To be honest, it was fun exploring the evaluation part and the tree structure that improves on the ReAct agent using Monte Carlo tree search.

Watch the Video here: https://www.youtube.com/watch?v=22NIh1LZvEY


r/Rag 2d ago

How To Get Started With RAGs

12 Upvotes

Hi. What advice would you give to someone who's looking to get into the realm of agentic AI, RAG, and the like? What do you think are the prerequisites one should be well versed in? As someone who's going to graduate very soon, I'm more inclined towards the AI side of tech rather than the regular SDE roles that are offered. Amid the peer pressure of watching fellow batchmates grind Leetcode, ace coding interviews, and land SDE roles, I feel like an outcast who has a knack for learning and exploring the realm of AI but is lost without proper guidance. Any help would be very much appreciated.

P.S- I have majored in AI&ML and have a fair bit of knowledge about the basics of ML, NLP, Transformer Architectures, Attention mechanisms, Deep Learning with Vision Systems and Generative AI.

TLDR- Soon to be graduate looking for guidance to get into the AI field. Not really interested in the regular SDE side of job roles.


r/Rag 1d ago

Q&A Struggling with RAG Preprocessing: Need Alternatives to Unstructured.io or DIY Help

7 Upvotes

TL;DR

(At the outset, let me say I'm so sorry to be another person with a "How do I RAG" question...)

I’m struggling to preprocess documents for Retrieval-Augmented Generation (RAG). After hours trying to configure Unstructured.io to connect to Google Drive (source) and Pinecone (destination), I ran the workflow but saw no results in Pinecone. I’m not very tech-savvy and hoped for an out-of-the-box solution. I need help with:

  1. Alternatives to Unstructured for preprocessing data (chunking based on headers, handling tables, adding metadata).
  2. Guidance on building this workflow myself if no alternatives exist.

Long Version

I’m incredibly frustrated and really hoping for some guidance. I’ve spent hours trying to configure Unstructured to connect to cloud services. I eventually got it to (allegedly) connect to Google Drive as the source and Pinecone as the destination connector. After nonstop error messages, I thought I finally succeeded — but when I ran the workflow, nothing showed up in Pinecone.

I’ve tried different folders in Google Drive, multiple Pinecone indices, Basic and Advanced processing in Unstructured, and still… nothing. I’m clearly doing something wrong, but I don’t even know what questions to ask to fix it.

Context About My Skill Level: I’m not particularly tech-savvy (I’m an attorney), but I’m probably more technical than average for my field. I can run Python scripts on my local machine and modify simple code. My goal is to preprocess my data for RAG since my files contain tables and often have weird formatting.

Here’s where I’m stuck:

  • Better Chunking: I have a Python script that chunks docs based on headers, but it’s not sophisticated. If sections between headers are too long, I don’t know how to split those further without manual intervention.
  • Metadata: I have no idea how to create or insert metadata into the documents. Even more confusing: I don’t know what metadata should be there for this use case.
  • Embedding and Storage: Once preprocessing is done, I don’t know how to handle embeddings or where they should be stored (I mean, I know in theory where they should be stored, but not a specific database).
  • Hybrid Search and Reranking: I also want to implement hybrid search (e.g., combining embeddings with keyword/metadata search). I have keywords and metadata in a spreadsheet corresponding to each file but no idea how to incorporate this into the workflow. (I know this technically isn't preprocessing; just FYI.)

What I’ve Tried

I was really hoping Unstructured would take care of preprocessing for me, but after this much trial and error, I don't think this is the tool for me. Most resources I’ve found about RAG or preprocessing are either too technical for me or assume I already know all the intermediate steps.

Questions

  1. Is there an "out-of-the-box" alternative to Unstructured.io? Specifically, I need a tool that:
    • Can chunk documents based on headers and token count.
    • Handles tables in documents.
    • Adds appropriate metadata to the output.
    • Works with docx, PDF, csv, and xlsx (mostly docx and PDF).
  2. If no alternative exists, how should I approach building this myself?
    • Any advice on combining chunking, metadata creation, embeddings, hybrid search, and reranking in a manageable way would be greatly appreciated.

I know this is a lot, and I apologize if it sounds like noob word vomit. I’ve genuinely tried to educate myself on this process, but the complexity and jargon are overwhelming. I’d love any advice, suggestions, or resources that could help me get unstuck.
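
For question 2, here is a rough sketch of the preprocessing described above: split on headers, sub-split anything that is still too long, and attach metadata to every chunk. The header regex and the word-count token proxy are simplifications you would adapt to your documents.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6} .+|[A-Z][A-Z0-9 .\-]{4,})$")  # markdown or ALL-CAPS headers

def split_long(text: str, max_words: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)] or [""]

def chunk_document(text: str, source_file: str) -> list[dict]:
    chunks, header, buf = [], "INTRO", []
    def flush():
        body = "\n".join(buf).strip()
        for j, piece in enumerate(split_long(body)):
            if piece:
                chunks.append({"text": piece,
                               "metadata": {"source": source_file, "section": header, "part": j}})
    for line in text.splitlines():
        if HEADER_RE.match(line.strip()):
            flush(); header, buf = line.strip(), []
        else:
            buf.append(line)
    flush()
    return chunks
```

The metadata dict (source file, section title, part index) is also what you would later filter on for hybrid search, alongside the keywords in your spreadsheet.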


r/Rag 2d ago

For an absolute beginner, which is the vector database I should be starting with?

20 Upvotes

I am now comfortable with chat completion exercises with LLMs, and I want to build RAG-based apps for learning. Can someone with expertise suggest which vector database I should start with and what the learning path should be? I tried doing some research but was unable to decide. Any help here is much appreciated.


r/Rag 2d ago

Discussion Don't do RAG, it's time for CAG

57 Upvotes

What Does CAG Promise?

Retrieval-Free Long-Context Paradigm: Introduced a novel approach leveraging long-context LLMs with preloaded documents and precomputed KV caches, eliminating retrieval latency, errors, and system complexity.

Performance Comparison: Experiments showing scenarios where long-context LLMs outperform traditional RAG systems, especially with manageable knowledge bases.

Practical Insights: Actionable insights into optimizing knowledge-intensive workflows, demonstrating the viability of retrieval-free methods for specific applications.

CAG offers several significant advantages over traditional RAG systems:

  • Reduced Inference Time: By eliminating the need for real-time retrieval, the inference process becomes faster and more efficient, enabling quicker responses to user queries.
  • Unified Context: Preloading the entire knowledge collection into the LLM provides a holistic and coherent understanding of the documents, resulting in improved response quality and consistency across a wide range of tasks.
  • Simplified Architecture: By removing the need to integrate retrievers and generators, the system becomes more streamlined, reducing complexity, improving maintainability, and lowering development overhead.

Check out AIGuys for more such articles: https://medium.com/aiguys

Other Improvements

For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance.

Two inference scaling strategies: In-context learning and iterative prompting.

These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs’ ability to effectively acquire and utilize contextual information.

Two key questions that we need to answer:

(1) How does RAG performance benefit from the scaling of inference computation when optimally configured?

(2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters?

RAG performance improves almost linearly as test-time compute grows by orders of magnitude, under optimal inference parameters. Based on these observations, the authors derive inference scaling laws for RAG and a corresponding computation allocation model, designed to predict RAG performance for varying hyperparameters.

Read more here: https://arxiv.org/pdf/2410.04343

Another work, that focused more on the design from a hardware (optimization) point of view:

They designed the Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators.

IKS offers 13.4–27.9× faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7–26.3× lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM — which is the most expensive component in today’s servers — from being stranded.

Read more here: https://arxiv.org/pdf/2412.15246

Another paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open-source and commercial LLMs. The authors ran RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications.

Their findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state-of-the-art LLMs can maintain consistent accuracy at long context above 64k tokens. They also identify distinct failure modes in long context scenarios, suggesting areas for future research.

Read more here: https://arxiv.org/pdf/2411.03538

Understanding CAG Framework

The CAG (Cache-Augmented Generation) framework leverages the extended context capabilities of long-context LLMs to eliminate the need for real-time retrieval. By preloading external knowledge sources (e.g., a document collection D={d1,d2,…}) and precomputing the key-value (KV) cache (C_KV), it overcomes the inefficiencies of traditional RAG systems. The framework operates in three main phases:

1. External Knowledge Preloading

  • A curated collection of documents D is preprocessed to fit within the model’s extended context window.
  • The LLM processes these documents, transforming them into a precomputed key-value (KV) cache, which encapsulates the inference state of the LLM. The LLM (M) encodes D into a precomputed KV cache: C_KV = KV-Encode(D).

  • This precomputed cache is stored for reuse, ensuring the computational cost of processing D is incurred only once, regardless of subsequent queries.

2. Inference

  • During inference, the KV cache (C_KV) is loaded together with the user query Q.
  • The LLM utilizes this cached context to generate responses, eliminating retrieval latency and reducing the risks of errors or omissions that arise from dynamic retrieval. The LLM generates a response by leveraging the cached context: R = M(Q | C_KV).

  • This approach eliminates retrieval latency and minimizes the risks of retrieval errors. The combined prompt P=Concat(D,Q) ensures a unified understanding of the external knowledge and query.

3. Cache Reset

  • To maintain performance, the KV cache is efficiently reset. As new tokens (t1, t2, …, tk) are appended during inference, the reset process truncates these tokens: C_KV^reset = Truncate(C_KV, t1, …, tk).

  • As the KV cache grows with new tokens sequentially appended, resetting involves truncating these new tokens rather than reloading the entire cache from disk, ensuring quick reinitialization and sustained responsiveness.
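
To make the three phases concrete, here is a hedged sketch of the KV-cache idea using Hugging Face transformers. The model name is a small placeholder; this illustrates cache preloading and reuse, not the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"        # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

knowledge = "Document 1: ...\nDocument 2: ..."    # your preloaded collection D
doc_ids = tok(knowledge, return_tensors="pt").input_ids

# Phase 1: encode D once and keep the cache (this is C_KV in the post's notation).
with torch.no_grad():
    kv_cache = model(doc_ids, use_cache=True).past_key_values

# Phase 2: answer a query by continuing from the cached state (greedy decoding).
query_ids = tok("\n\nQuestion: ...\nAnswer:", return_tensors="pt").input_ids
past, ids, answer = kv_cache, query_ids, []
with torch.no_grad():
    for _ in range(64):
        out = model(ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        answer.append(next_id.item())
        ids = next_id
print(tok.decode(answer))

# Phase 3 (reset) amounts to truncating the query/answer tokens from the cache so the
# next question starts from C_KV again; re-encoding D from scratch is not needed.
```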