Right now, some things are hard-coded to be Cloudflare-compatible. If you're willing to dig into the code a little, you can deploy this without Cloudflare.
In future releases, I'll make it possible to host it on a VPS and ship a Dockerfile along with it, so that should help a little.
Thanks for checking the project out!
But Cloudflare is not self-hosting!
I couldn't find the right words to describe this in comparison to something like GitHub Gist. I suppose "own-your-data" fits, since the generated D1 database is completely yours.
Happy to change the branding to be more reflective of this!
I hope you'll stick with Chonkie for the journey of making the 'perfect' chunking library!
Thanks again!
I have a particular max token length in mind, and I have a tokenizer like tiktoken. Given a string, I want to quickly find the longest truncation of that string that comes in at or under the target max token length.
Does chonkie handle this?
Is that what you meant?
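Concretely, something like this sketch with tiktoken (the function name and defaults here are mine, not from any library):

    import tiktoken

    def truncate_to_max_tokens(text, max_tokens, encoding_name="cl100k_base"):
        enc = tiktoken.get_encoding(encoding_name)
        tokens = enc.encode(text)
        if len(tokens) <= max_tokens:
            return text
        # Decoding a prefix of the token list gives the longest truncation
        # that fits the budget; tiktoken replaces any broken trailing bytes.
        return enc.decode(tokens[:max_tokens])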
Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...
Algorithmically, there's not much difference in TokenChunking between Chonkie and LangChain, or any other TokenChunking implementation you might want to use (except LlamaIndex; I don't know what mess they made to end up with an algorithm 33x slower).
If all you want is TokenChunking (which I don't fully recommend), you can do better than Chonkie or LangChain: just write your own for production :) (see the sketch below). At the very least, don't install 80MiB packages just for TokenChunking; Chonkie is 4x smaller than they are.
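Something along these lines is all it takes (a rough sketch with tiktoken; the names and defaults are illustrative, not Chonkie's API):

    import tiktoken

    # Hand-rolled fixed-size token chunker with overlap.
    def token_chunks(text, chunk_size=512, overlap=64):
        assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        step = chunk_size - overlap
        return [enc.decode(tokens[i:i + chunk_size])
                for i in range(0, len(tokens), step)]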
That's just my honest response... And these benchmarks are just the beginning; future optimizations to SemanticChunking should push its speed-up beyond the current second-place 2.5x.
I’m using o1-preview for chunking, creating summary subdocuments.
Thanks for responding, I'll try to make it easier to use something like that in Chonkie in the future!