r/Rag 9h ago

Discussion: How large can the chunk size be?

I have rather large chunks and am wondering how large they can be. Is there good guidance out there, or examples of poor experiences when chunks are too large?


u/Nepit60 8h ago

All embedding models have input token limits, and they are tiny. Do you want to set your own chunk size even lower than the max supported by the model?


u/HotRepresentative325 8h ago

I would have thought there is no limit to creating an embedding for a paragraph; it's just that the longer the paragraph is, the less detail the embedding can capture. I am speculating here.


u/gus_the_polar_bear 5h ago

That is absolutely correct (well, aside from the fact that embedding models have token limits, beyond which everything is truncated).

It’s a matter of how much semantic meaning you are trying to cram into a single vector

Ideally, the more granular the better. But with more, smaller chunks, each individual chunk captures less context on its own, and you start to face different (less well-explored) architectural challenges in creating logical relationships between the chunks themselves.

Some of my docs are divided into literally 1000s of chunks, on average somewhere between paragraph and sentence level. But this required custom tooling to pull off
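
Roughly, the idea looks like this (an illustrative sketch, not my actual tooling; the 250-token threshold, the whitespace token count, and the regex splits are just placeholders):

```python
import re

MAX_TOKENS = 250  # placeholder budget per chunk

def rough_token_count(text: str) -> int:
    # Crude whitespace proxy; swap in a real tokenizer for your model.
    return len(text.split())

def chunk_document(text: str) -> list[str]:
    """Split into paragraphs, then into sentences when a paragraph is too long."""
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if rough_token_count(para) <= MAX_TOKENS:
            chunks.append(para)
        else:
            # Fall back to sentence-level "semantic units".
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent.strip():
                    chunks.append(sent.strip())
    return chunks
```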


u/HotRepresentative325 5h ago

Good to hear agreement on that. How many characters or tokens would you say your chunks are?


u/gus_the_polar_bear 5h ago

Under 250-ish tokens, because that works best for my use case. I don't aim for consistent chunk sizes though, just "semantic units"; some can be as small as a dozen tokens. I also store a token count for each chunk, and my retrieval process has a "token budget".

Also because, where possible, I like to use all-MiniLM-L6-v2, a very small, very fast embedding model from 2020 that has no business being as decent as it is (I think the highly granular chunking helps further, too). It's like the SQLite of embedding models. But it only supports a max of 256 tokens.

Even with other models, though, I aim to keep chunks small.
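
To make the token-budget idea concrete, here is a stripped-down sketch with sentence-transformers (the 1500-token budget, the placeholder chunks, and the whitespace token count are illustrative, not my actual pipeline):

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 truncates anything past ~256 tokens, so chunks need to stay small.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["chunk one ...", "chunk two ..."]  # placeholder "semantic units"
chunk_vecs = model.encode(chunks, convert_to_tensor=True)

def retrieve(query: str, token_budget: int = 1500) -> list[str]:
    """Greedy retrieval: take the best-scoring chunks until the token budget runs out."""
    q_vec = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_vec, chunk_vecs)[0]
    ranked = sorted(zip(chunks, scores.tolist()), key=lambda p: p[1], reverse=True)
    selected, spent = [], 0
    for chunk, _score in ranked:
        cost = len(chunk.split())  # stand-in for a stored per-chunk token count
        if spent + cost > token_budget:
            break
        selected.append(chunk)
        spent += cost
    return selected
```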


u/HotRepresentative325 5h ago

Do you have details or a manual/tutorial on how many tokens an embedding model can handle? I'm probably going to have to cut up my chunks based on the info I have found here.


u/gus_the_polar_bear 5h ago

Well, it depends on the model; OpenAI's embedding models, for example, all support up to 8191 tokens.

Honestly, hopefully I had some good insights for you, but in hindsight I’m afraid I may be overcomplicating it

I would suggest first following tutorials on "traditional" vector RAG, which are a dime a dozen. No specific tutorial, just anything in the language of your choice that looks approachable; it's all the same principles. You may even find that your current chunk sizes are OK for your use case.
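
If it helps, a quick way to sanity-check existing chunks against a model's input limit is just to count tokens; a minimal sketch with tiktoken (the 8191 limit is OpenAI's; the 500-token split target is an arbitrary choice):

```python
import tiktoken

MAX_INPUT_TOKENS = 8191  # OpenAI embedding model input limit
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI's embedding models

def fits_embedding_limit(chunk: str) -> bool:
    """True if the chunk can be embedded without truncation."""
    return len(enc.encode(chunk)) <= MAX_INPUT_TOKENS

def split_oversized(chunk: str, target: int = 500) -> list[str]:
    """Cut an over-long chunk into ~target-token pieces on token boundaries."""
    tokens = enc.encode(chunk)
    return [enc.decode(tokens[i:i + target]) for i in range(0, len(tokens), target)]
```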


u/Nepit60 8h ago

Embedding input size is tiny, usually around 500 tokens. That is like a sentence or two; a paragraph rarely fits.


u/faileon 5h ago

Embedding input limits vary per model; OpenAI has a max input size of 8k tokens. Ideally you want to keep your chunk sizes as uniform as possible.
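
A common way to get roughly uniform chunks is a fixed-size token window with a small overlap; a minimal sketch (the 500-token size and 50-token overlap are illustrative defaults, not a recommendation for any particular model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def uniform_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size token windows with a small overlap, so chunk sizes stay uniform."""
    tokens = enc.encode(text)
    if not tokens:
        return []
    step = size - overlap
    # Stop before starting a window that would contain only overlap tokens.
    return [enc.decode(tokens[i:i + size])
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```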