Documind is an open-source tool that turns documents into structured data using AI.
What it does:
- Extracts specific data from PDFs based on your custom schema - Returns clean, structured JSON that's ready to use - Works with just a PDF link + your schema definition
Just run npm install documind to get started.
1) Install tools like Ghostscript, GraphicsMagick, and LibreOffice with a JS script. 2) Convert document pages to Base64 PNGs and send them to OpenAI for data extraction. 3) Use Supabase for unclear reasons.
Some issues with this approach:
* OpenAI may retain and use your data for training, raising privacy concerns [1].
* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. Example: a tool like Parsr [2] provides a Dockerized pdf-to-json solution, complete with OCR support and an HTTP api.
* GPT-4 vision seems like a costly, error-prone, and unreliable solution, not really suited for extracting data from sensitive docs like invoices, without review.
* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. Although these tools do require some plumbing... probably LLMs can really help with that!
While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.
---
1: https://platform.openai.com/docs/models#how-we-use-your-data
2: https://github.com/axa-group/Parsr
If you inspect the source code, it's a verbatim copy. They literally just renamed the ZeroxOutput to DocumindOutput [2][3]
[1] https://github.com/getomni-ai/zerox
[2] https://github.com/DocumindHQ/documind/blob/main/core/src/ty...
[3] https://github.com/getomni-ai/zerox/blob/main/node-zerox/src...
It’s a pretty unethical behavior if what you describe is the full story and as a user of many open source projects how can one be aware of this type of behavior?
If there's any additional thing I can do, please let me know so I would make all amendements immediately.
I think both sides here can learn from this, copyright notices are technically not required but when some text references them it is very useful. The original author should have added one. The user of the code could also have asked about the copyright. If this were to go to court having the original license not making sense could create more questions than it should.
tl;dr: add a copyright line at the top of the file when you’re using the MIT license.
If you're looking for an all-in-one solution, little plug for our new platform that does the above and also allows you to create custom 'patterns' that get picked up via semantic search. Uses open-source models by default, can deploy into your internal network. www.datafog.ai. In beta now and onboarding manually. Shoot me an email if you'd like to learn more!
"Traditional methods (PDF parsers with OCR support) are cheaper, more reliable"
Not sure on the reliability - the ones I'm using all fail at structured data. You want a table extracted from a PDF, LLMs are your friend. (Recommendations welcome)
Documind is using https://api.openai.com/v1/chat/completions, check the docs at the end of the long API table [1]:
> * Chat Completions:
> Image inputs via the gpt-4o, gpt-4o-mini, chatgpt-4o-latest, or gpt-4-turbo models (or previously gpt-4-vision-preview) are not eligible for zero retention."
--
1: https://platform.openai.com/docs/models#how-we-use-your-data
Deleted Comment
https://news.ycombinator.com/item?id=42178413
You may wanna get ahead of this because the evidence is fairly damning. Failing to even give credit to the original project is a pretty gross move.
I made sure to copy and past the MIT license in Zerox exactly as it was into the folder of the code that uses it. I also included it in the main license file as well. If there's anything I could do to make corrections please let me know so I'd change that ASAP.
People are getting upset because this is not a nice thing to do. Attribution is significant. No one would care if you replaced all the names with the new ones in a fork because they would see commits that do that.
In my experience your much better of starting with a Azure Doc Intelligence or AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at it. From there you can use an LLM to interrogate and structure the data to your hearts delight.
Do they work for Bills of Lading yet? When I tested a sample of these bills a few years back (2022 I think), the results were not good at all. But I honestly wouldn't be surprised if they'd massively improved lately.
Otherwise it seems like a prompt building tool, or am I missing something here?
I see someone opened an issue for it so will fix now.
However, if you process, say, 1 million documents, you could sample and review a small percentage of them manually (a power calculation would help here). Assuming your random sample models the "distribution" (which may be tough to define/summarize) of the 1 million documents, you could then extrapolate your accuracy onto the larger set of documents without having to review each and every one.
What I've noticed, that on scanned documents, where stamp-text and handwriting is just as important as printed text, Gemini was way better compared to chat gpt.
Of course, my prompts might have been an issue, but gemini with very brief and generic queries made significantly better results.
Alas, i am let down. It is an open-source tool creating the prompt for the OpenAI API and i can't go and send customer data to them.
I'm aware of https://github.com/clovaai/donut so i hoped this would be more like that.
https://github.com/DocumindHQ/documind/blob/d91121739df03867...
Deleted Comment
The MIT license has just 2 conditions. They are pretty easy to read, and the fist one is:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
By replacing the license, you violate this very simple agreement.
I’ve also added a direct note acknowledging and linking back to the zerox project.