Posted by u/bhavnicksm 10 months ago
Show HN: Chonkie – A Fast, Lightweight Text Chunking Library for RAG (github.com/bhavnicksm/chonkie)
I built Chonkie because I was tired of rewriting chunking code for RAG applications. Existing libraries were either too bloated (80MB+) or too basic, with no middle ground.

Core features:

- 21MB default install vs 80-171MB alternatives

- Up to 33x faster token chunking than the slowest alternative

- Supports multiple chunking strategies: token, word, sentence, and semantic

- Works with all major tokenizers (transformers, tokenizers, tiktoken)

- Zero external dependencies for basic functionality

Technical optimizations:

- Uses tiktoken with multi-threading for faster tokenization

- Implements aggressive caching and precomputation

- Running mean pooling for efficient semantic chunking (see the sketch after this list)

- Modular dependency system (install only what you need)
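
To make the running-mean idea concrete, here's a simplified sketch (illustrative only -- not Chonkie's actual internals; embed stands in for any sentence-embedding function):

    import numpy as np

    def semantic_chunks(sentences, embed, threshold=0.7):
        """Group sentences by comparing each new sentence against the
        running mean of the current chunk's embeddings."""
        if not sentences:
            return []
        chunks, current = [], [sentences[0]]
        running_sum = embed(sentences[0])       # running sum of embeddings
        for sent in sentences[1:]:
            e = embed(sent)
            mean = running_sum / len(current)   # O(1) running mean per step
            sim = (mean @ e) / (np.linalg.norm(mean) * np.linalg.norm(e))
            if sim < threshold:                 # similarity drop = topic shift
                chunks.append(" ".join(current))
                current, running_sum = [sent], e
            else:
                current.append(sent)
                running_sum = running_sum + e
        chunks.append(" ".join(current))
        return chunks

Keeping a running sum means each step recomputes the chunk mean in constant time instead of re-pooling the whole chunk.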

Benchmarks and code: https://github.com/bhavnicksm/chonkie

Looking for feedback on the architecture and performance optimizations. What other chunking strategies would be useful for RAG applications?

mattmein · 10 months ago
Also check out https://github.com/D-Star-AI/dsRAG/ for a bit more involved chunking strategy.
cadence- · 10 months ago
This looks pretty amazing. I will take it for a spin next week. I want to build a RAG system that answers questions about my new car: the manual is huge, and it's often hard to find answers in it, so I think this will be a big help to owners of the same car. Your library should make chunking that huge PDF easy.
andai · 10 months ago
How many tokens is the manual?
simonw · 10 months ago
Would it make sense for this to offer a chunking strategy that doesn't need a tokenizer at all? I love the goal to keep it small, but "tokenizers" is still a pretty huge dependency (and one that isn't currently compatible with Python 3.13).

I've been hoping to find an ultra light-weight chunking library that can do things like very simple regex-based sentence/paragraph/markdown-aware chunking with minimal additional dependencies.

parhamn · 10 months ago
Across a broad enough dataset, (char count / 4) is very close to the actual token count in English -- we verified this across millions of queries. We had to switch to an actual tokenizer for Chinese and other Unicode-heavy languages, as the simple formula misses the mark for context stuffing.

The more complicated stuff is the effective bin-packing problem that emerges depending on how many different contextual sources you have.
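
In code, the heuristic is just this (the divisor of 4 is our empirical English-only rule of thumb, not a universal constant):

    def estimate_tokens(text: str) -> int:
        # ~4 characters per token holds up well for English prose;
        # it breaks down for Chinese and other Unicode-heavy languages.
        return max(1, len(text) // 4)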

jimmySixDOF · 10 months ago
For a regex approach, take a look at the work from Jina.ai, who among other things have a chunker/tokenizer [1] that is now part of a bigger API service [2]. They also developed an interesting late-interaction (ColBERT-like) chunking system that fits certain use cases. But the regex is enough all by itself:

[1] https://gist.github.com/LukasKriesch/e75a0132e93ca989f8870c4...

[2] https://jina.ai/segmenter/
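
Not Jina's actual regex (see [1] for that), but a minimal stdlib-only sketch of the kind of dependency-free, regex-based splitting being asked for upthread:

    import re

    def simple_chunks(text, max_chars=1000):
        """Split on paragraph breaks, then naive sentence boundaries,
        greedily packing pieces up to max_chars. No dependencies."""
        pieces = []
        for para in re.split(r"\n\s*\n", text):              # paragraphs
            pieces.extend(re.split(r"(?<=[.!?])\s+", para))  # naive sentences
        chunks, buf = [], ""
        for p in pieces:
            if buf and len(buf) + len(p) + 1 > max_chars:
                chunks.append(buf)
                buf = p
            else:
                buf = f"{buf} {p}".strip()
        if buf:
            chunks.append(buf)
        return chunks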

andai · 10 months ago
I made a rudimentary semantic chunker in just a few lines of code.

I just removed one sentence at a time from the left until there was a jump in the embedding distance. Then repeated for the right side.
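
Roughly like this (a from-memory sketch; embed is whatever embedding function you have, and the 0.15 jump threshold is made up):

    import numpy as np

    def find_boundary(sentences, embed, jump=0.15):
        """Drop sentences from the left, re-embedding the remainder each
        time; a sudden jump in cosine distance marks a topic boundary.
        Run the same scan from the right for the other edge."""
        prev = embed(" ".join(sentences))
        for i in range(1, len(sentences)):
            cur = embed(" ".join(sentences[i:]))
            dist = 1 - (prev @ cur) / (np.linalg.norm(prev) * np.linalg.norm(cur))
            if dist > jump:
                return i      # boundary sits between sentence i-1 and i
            prev = cur
        return len(sentences)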

bhavnicksm · 10 months ago
Thank you so much for giving Chonkie a chance! Just to note: Chonkie is still in beta (currently v0.1.2), with a bunch of things planned for it. This is an initial working version that seemed promising enough to present.

I hope that you will stick with Chonkie for the journey of making the 'perfect' chunking library!

Thanks again!

mixeden · 10 months ago
> Token Chunking: 33x faster than the slowest alternative

1) what

rkharsan64 · 10 months ago
There are only 3 competitors in that particular benchmark, and the speedup compared to the 2nd-fastest is only 1.06x.

Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...

bhavnicksm · 10 months ago
Token chunking is limited mostly by the tokenizer and less by the chunking algorithm. Tiktoken tokenizers seem to do better with warm-up, which Chonkie defaults to -- and which is what the 2nd-fastest one is using as well.

Algorithmically, there's not much difference in token chunking between Chonkie and LangChain or any other token chunker you might want to use (except LlamaIndex; I don't know what mess they made to end up 33x slower).

If all you want is token chunking (which I don't fully recommend), then rather than Chonkie or LangChain, just write your own for production :) At the very least, don't install 80MiB packages for token chunking; Chonkie is 4x smaller than they are.

That's just my honest response... And these benchmarks are just the beginning: future optimizations to semantic chunking should push its speed-up over the current 2nd-fastest (2.5x right now) even higher.
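
To make "write your own" concrete, here's a bare-bones sketch (using tiktoken's public API; the chunk size and overlap are arbitrary defaults, not recommendations):

    import tiktoken

    def token_chunks(text, chunk_size=512, overlap=64,
                     encoding_name="cl100k_base"):
        """Fixed-size token chunking with overlap -- no framework needed."""
        enc = tiktoken.get_encoding(encoding_name)
        tokens = enc.encode(text)
        step = chunk_size - overlap
        return [enc.decode(tokens[i:i + chunk_size])
                for i in range(0, len(tokens), step)]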

melony · 10 months ago
How does it compare with NLTK's chunking library? I have found that it works very well for sentence segmentation.
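
For reference, the NLTK usage I mean is roughly this (sent_tokenize needs the punkt model downloaded once; newer NLTK versions may want "punkt_tab" instead):

    import nltk
    nltk.download("punkt", quiet=True)   # one-time model download
    from nltk.tokenize import sent_tokenize

    sents = sent_tokenize("Dr. Smith went to Washington. He arrived at 3 p.m.")
    # ['Dr. Smith went to Washington.', 'He arrived at 3 p.m.']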
petesergeant · 10 months ago
> What other chunking strategies would be useful for RAG applications?

I’m using o1-preview for chunking, creating summary subdocuments.
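
Roughly like this (a sketch; the prompt wording here is illustrative, not my exact prompt):

    from openai import OpenAI

    client = OpenAI()

    def summary_subdocument(section_text: str) -> str:
        """Ask the model to turn one section into a standalone
        summary subdocument for retrieval."""
        resp = client.chat.completions.create(
            model="o1-preview",
            messages=[{
                "role": "user",
                "content": "Summarize this section into a standalone "
                           "subdocument for retrieval:\n\n" + section_text,
            }],
        )
        return resp.choices[0].message.content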

bhavnicksm · 10 months ago
That's pretty cool! I believe a recent research paper, LumberChunker, evaluated that approach and found it pretty decent as well.

Thanks for responding, I'll try to make it easier to use something like that in Chonkie in the future!

petesergeant · 10 months ago
Ah, that's an interesting paper, and a slightly different approach to what I'm doing, but possibly a superior one. Thanks!
vlovich123 · 10 months ago
Out of curiosity where does the 21 MiB come from? The codebase clone is 1.2 MiB and the src folder is only 68 KiB.
ekianjo · 10 months ago
Dependencies in the venv?
Dowwie · 10 months ago
When would you ever want anything other than semantic chunking? Cutting chunks to fixed lengths is fast, but it arbitrarily lumps together potentially dissimilar information.
samlinnfer · 10 months ago
How does it work for code? (Chunking code, that is.)
nostrebored · 10 months ago
Poorly, just like it does for text.

Chunking is easily where all of these problems die beyond PoC scale.

I’ve talked to multiple code generation companies in the past week — most are stuck with BM25 and taking in whole files.
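
By "stuck with BM25 over whole files" I mean roughly this kind of setup (a sketch using the rank_bm25 package, with deliberately naive whitespace tokenization):

    from rank_bm25 import BM25Okapi

    files = {"main.py": "...", "utils.py": "..."}    # path -> source text
    corpus = [src.lower().split() for src in files.values()]
    bm25 = BM25Okapi(corpus)

    query = "parse config file".lower().split()
    scores = bm25.get_scores(query)                  # one score per file
    best = max(zip(files, scores), key=lambda x: x[1])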

potatoman22 · 10 months ago
What do they use BM25 for? RAG?
bhavnicksm · 10 months ago
Right now, we haven't worked on adding support for code -- things like comment markers (#, //) contain punctuation that adversely affects chunking, along with indentation and other issues.

But, it's on the roadmap, so please hold on!