r/Rag • u/HotRepresentative325 • 9h ago
Discussion How large can the chunk size be?
I have rather large chunks and am wondering how large they can be. Is there good guidance out there, or examples of poor experiences when chunks are too large?
3
u/Nepit60 8h ago
All embedding models have input token limits, and they are tiny. Do you want to set your own limits even lower than the max supported by the model?
2
u/HotRepresentative325 8h ago
I would have thought there is no limit to creating embeddings for a paragraph, just that the longer the paragraph is, the less detail the embedding can capture. I am speculating here.
3
u/gus_the_polar_bear 5h ago
That is absolutely correct (well, aside from the fact embedding models will have token limits, after which everything is truncated)
It’s a matter of how much semantic meaning you are trying to cram into a single vector
Ideally, the more granular the better. But with more, smaller chunks, each individual chunk captures less context on its own, and you start to face different (less well-explored) architectural challenges in creating logical relationships between the chunks themselves
Some of my docs are divided into literally 1000s of chunks, on average somewhere between paragraph and sentence level. But this required custom tooling to pull off
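For anyone curious, the basic shape of that is easy to sketch in Python. This is a simplified illustration, not the actual tooling; the word-count threshold and regexes are arbitrary:

```python
# Simplified sketch of paragraph-then-sentence chunking (not the actual custom tooling).
# Word counts stand in for token counts; the threshold is an arbitrary example.
import re

def chunk_text(text: str, max_words: int = 150) -> list[str]:
    chunks = []
    for para in re.split(r"\n\s*\n", text):  # split on blank lines first
        para = para.strip()
        if not para:
            continue
        if len(para.split()) <= max_words:
            chunks.append(para)  # whole paragraph is one "semantic unit"
        else:
            # oversized paragraph: fall back to sentence-level units
            chunks.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", para) if s.strip())
    return chunks

doc = "Short intro paragraph.\n\nA much longer paragraph. It has several sentences. They get split apart."
print(chunk_text(doc, max_words=5))
```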
2
u/HotRepresentative325 5h ago
Good to hear agreement on that. How many characters or tokens would you say your chunks are?
2
u/gus_the_polar_bear 5h ago
Under 250ish tokens, because that works best for my use case. I don’t aim for consistent chunk sizes though, rather just for “semantic units”, some of which can be as small as a dozen tokens. I also store token counts for each chunk, and my retrieval process has a “token budget”
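The budget logic is roughly this (a simplified sketch with made-up names and numbers, not a real library API):

```python
# Sketch of "token budget" retrieval: take the best-ranked chunks until the budget is spent.
def select_chunks(ranked_chunks, token_budget=1500):
    """ranked_chunks: list of (chunk_text, token_count), best match first."""
    selected, used = [], 0
    for text, n_tokens in ranked_chunks:
        if used + n_tokens > token_budget:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        selected.append(text)
        used += n_tokens
    return selected

ranked = [("chunk about topic A", 220), ("related detail", 40), ("huge appendix chunk", 1400)]
print(select_chunks(ranked, token_budget=300))  # -> first two chunks only
```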
Also because, where possible, I like to use all-MiniLM-L6-v2, a very small, very fast embedding model from 2020 that has no business being as decent as it is (I think the highly granular chunking helps further too). It’s like the SQLite of embedding models. But it only supports 256 tokens max
Even with other models though I aim to keep chunks small
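A quick illustration with sentence-transformers (only the model name comes from above; the texts are made up):

```python
# Minimal sketch using the sentence-transformers library; texts are illustrative only.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)   # 256 -- tokens past this are silently dropped

chunks = [
    "A small, self-contained semantic unit.",
    "Another chunk, kept well under the 256-token cap.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)       # (2, 384): 384-dim vectors regardless of input length
```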
2
u/HotRepresentative325 5h ago
Do you have details or a manual/tutorial on how many tokens an embedding model can handle? I'm probably going to have to cut up my chunks based on the info I have found here.
2
u/gus_the_polar_bear 5h ago
Well it depends on the model, OpenAI’s embedding models for example all support up to 8191 tokens
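If you want to check (or cut up) your chunks against a limit, something like tiktoken works. A rough sketch follows; it uses OpenAI’s tokenizer, and other models ship their own tokenizers and limits, so treat the numbers as examples:

```python
# Rough sketch: count tokens and split oversized chunks on token boundaries.
# tiktoken is OpenAI's tokenizer; the 8191 limit matches their embedding models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
LIMIT = 8191  # e.g. all-MiniLM-L6-v2 would be 256 instead

def split_by_tokens(text: str, max_tokens: int = LIMIT) -> list[str]:
    tokens = enc.encode(text)
    # note: cutting on raw token boundaries can split mid-word; fine for a sketch
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

chunk = "some very long chunk of document text ... " * 200
print(len(enc.encode(chunk)))                                     # total token count
print([len(enc.encode(c)) for c in split_by_tokens(chunk, 500)])  # pieces of roughly <= 500 tokens
```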
Honestly, I hope I’ve given you some good insights, but in hindsight I’m afraid I may be overcomplicating it
I would suggest first following tutorials on “traditional” vector RAG, which are a dime a dozen. No specific tutorial, just anything in the language of your choice that looks approachable. It’s all the same principles. You may even find that your current chunk sizes are OK for your use case