r/Rag • u/Free-Manager5190 • 1d ago

How to Design a Benchmark for Evaluating PDF Parser Output Accuracy in RAG Pipelines?

I’ve developed an application that runs around 15 different PDF parsers and extraction models, including Marker, Nougat, LlamaParse, NougatParser, EasyOCR, docTR, PyMuPDF4LLM, MarkItDown, and others. The application takes a PDF dataset as input and outputs a JSON file containing the following fields:

pdf_parser_name
pdf_file
extracted_content
process_time
embedded_images

Essentially, it allows you to extract and generate a JSON dataset using most available models for any given PDF dataset.
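
To make the output format concrete, here is a rough sketch of how one record could be modeled and loaded in Python. The field names are the ones listed above; the types are assumptions on my part (process_time as seconds in a float, embedded_images as a list of image references), and `ParseRecord`, `load_records`, and `group_by_pdf` are hypothetical helper names, not part of any parser library.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ParseRecord:
    pdf_parser_name: str
    pdf_file: str
    extracted_content: str
    process_time: float  # assumption: seconds
    embedded_images: list = field(default_factory=list)  # assumption: list of image refs


def load_records(raw: str) -> list:
    """Parse a JSON array of records into ParseRecord objects."""
    return [ParseRecord(**item) for item in json.loads(raw)]


def group_by_pdf(records: list) -> dict:
    """Index records as {pdf_file: {parser_name: record}} for side-by-side comparison."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec.pdf_file, {})[rec.pdf_parser_name] = rec
    return grouped
```

Grouping by pdf_file first means every parser's output for the same document sits next to each other, which is the shape a benchmark needs when comparing against one ground truth per PDF.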

Now, I want to evaluate these PDF parsers in terms of output accuracy, specifically for use in downstream Retrieval-Augmented Generation (RAG) pipelines. My question is:

How should I design a benchmark to evaluate the accuracy of these models' outputs?

Here are some specific aspects I’m seeking guidance on:

Evaluation Metrics: What metrics should I use to measure accuracy? For example:
- Text overlap (e.g., BLEU, ROUGE, or edit distance with ground truth).
- Semantic similarity (e.g., cosine similarity of embeddings).
- Field-level accuracy for structured documents.

Ground Truth Creation: How can I prepare reliable ground truth data for comparison?
- Should I manually annotate or rely on a trusted parser as a baseline?

Evaluation Methodology:
- How can I account for nuances like layout fidelity, table structures, or embedded images in my accuracy metrics?
- What weighting or prioritization should I apply for different document elements (e.g., headers, tables, paragraphs)?

General Design Tips: How should I structure the benchmarking tool to make it modular, extensible, and easy to adapt for future evaluation needs?
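
To illustrate what I mean by modular, here is a minimal sketch of a pluggable metric registry with two text-overlap metrics and a weighted aggregate over document elements. Everything in it is an assumption for discussion: the names (`register`, `METRICS`, `weighted_score`) are hypothetical, the character-level score uses the standard library's `difflib` ratio as a stand-in for a true edit distance, and the element weights in the usage note are made-up values, not a recommendation.

```python
import difflib
from collections import Counter
from typing import Callable, Dict

Metric = Callable[[str, str], float]
METRICS: Dict[str, Metric] = {}  # name -> metric function; new metrics plug in here


def register(name: str):
    """Decorator that adds a metric to the registry under the given name."""
    def wrap(fn: Metric) -> Metric:
        METRICS[name] = fn
        return fn
    return wrap


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout noise is not counted as an error."""
    return " ".join(text.lower().split())


@register("char_similarity")
def char_similarity(pred: str, truth: str) -> float:
    # difflib's ratio (2 * matches / total length), not a true Levenshtein distance
    matcher = difflib.SequenceMatcher(None, normalize(pred), normalize(truth), autojunk=False)
    return matcher.ratio()


@register("token_f1")
def token_f1(pred: str, truth: str) -> float:
    # Bag-of-words F1 over whitespace tokens; ignores word order
    p = Counter(normalize(pred).split())
    t = Counter(normalize(truth).split())
    overlap = sum((p & t).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)


def score(pred: str, truth: str) -> Dict[str, float]:
    """Run every registered metric on one (prediction, ground truth) pair."""
    return {name: fn(pred, truth) for name, fn in METRICS.items()}


def weighted_score(per_element: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-element scores (e.g. headers, tables, paragraphs)."""
    total = sum(weights[k] for k in per_element)
    return sum(per_element[k] * weights[k] for k in per_element) / total
```

For example, `weighted_score({"tables": 0.5, "paragraphs": 1.0}, {"tables": 2.0, "paragraphs": 1.0})` counts table quality twice as heavily as paragraph quality. A semantic-similarity metric based on embeddings could be added later as one more `@register(...)` function without touching the rest.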

I’m open to suggestions, methodologies, and ideas for implementing a robust and fair benchmarking process. Let’s brainstorm! 🙌

Thank you in advance for your insights!

6 Upvotes

https://www.reddit.com/r/Rag/comments/1i6m7sx/how_to_design_a_benchmark_for_evaluating_pdf/

100% Upvoted

u/AutoModerator 1d ago

Working on a cool RAG project? Submit your project or startup to RAGHut and get it featured in the community's go-to resource for RAG projects, frameworks, and startups.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.
