Document ingestion and the launch of Gemini 2.0 caused a lot of buzz this week. As a team building in this space, this is something we researched thoroughly. Here's our take: ingestion is a multistep pipeline, and maintaining confidence in nondeterministic LLM outputs over millions of pages is a hard problem.
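To see why scale makes this hard, here is a toy back-of-the-envelope sketch (the per-page accuracy figure is an assumption for illustration, not a measured number): even a very high per-page success rate compounds badly over a large corpus.

```python
# Hypothetical illustration: error compounding at corpus scale.
# The 99.9% per-page accuracy is an assumed number, not from the post.
per_page_accuracy = 0.999
pages = 1_000_000

# Expected number of pages with at least one error
expected_bad_pages = pages * (1 - per_page_accuracy)

# Probability that the entire corpus comes out clean
prob_all_clean = per_page_accuracy ** pages

print(f"expected bad pages: {expected_bad_pages:.0f}")
print(f"P(whole corpus error-free): {prob_all_clean:.2e}")
```

Roughly a thousand silently wrong pages out of a million, with effectively zero chance the corpus is clean, which is why per-output confidence signals matter so much in these pipelines.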
ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").
Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.
I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with ridiculously fluctuating (and largely undetectable) error rates, and they need a lot of other tools to keep them above board.
So are human beings. Meaning we've been working around this issue since forever; we're not suddenly caught up in a new thing here.
The correct (or at least humanly expected) process would be to identify the presence of a mangled word, determine what its missing suffixes could have been, and, if some candidate is a clear contextual winner (e.g. "fried chicken", not "dried chicken"), use that.
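That expected process can be sketched in a few lines. This is a toy illustration only: the candidate lists and context scores are made-up numbers standing in for what a real system would get from a dictionary and a corpus language model.

```python
# Toy contextual OCR correction: enumerate candidate completions for a
# mangled token, and accept one only if it is a clear winner in context.
CANDIDATES = {"?ried": ["fried", "dried", "tried"]}

# Assumed co-occurrence scores (made-up numbers): how plausible each
# candidate is next to the following word.
CONTEXT_SCORE = {
    ("fried", "chicken"): 0.90,
    ("dried", "chicken"): 0.05,
    ("tried", "chicken"): 0.05,
}

def correct(mangled: str, next_word: str, margin: float = 0.5) -> str:
    """Return the best candidate only if it beats the runner-up by `margin`."""
    scored = sorted(
        (CONTEXT_SCORE.get((c, next_word), 0.0), c)
        for c in CANDIDATES.get(mangled, [])
    )
    if not scored:
        return mangled  # no candidates known: leave the token alone
    if len(scored) >= 2 and scored[-1][0] - scored[-2][0] < margin:
        return mangled  # no clear contextual winner: leave the token alone
    return scored[-1][1]

print(correct("?ried", "chicken"))   # clear winner in this context
print(correct("?ried", "apricots"))  # ambiguous: token left untouched
```

The key design point is the margin check: when no candidate clearly wins, the mangled token is preserved rather than silently "fixed", which is exactly the conservatism the LLM behavior described below lacks.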
However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things, because it's a license to rewrite anything to look more like its training data.
[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p
Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".
Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.
If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.
You then conflate the usefulness of some guessing skill with logic and statistics.
How this relates to an LLM is that the priors are baked into the LLM so statistics is all that is required to make an educated guess about the contents of a poorly written grocery list. The truthfulness of this guess is contingent on events outside of the scope of the LLM.
How often a guess is right, that is, the scalar value you attach to the statistical outcome of an event, is very important. If your claim is that LLMs are wrong 50% of the time, then you need to update your priors based on some actual experience.
To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.
See an old post I made on what you need to get above-SOTA OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414
Odd timing, too, given the Flash 2.0 release and its performance on this problem.
https://arxiv.org/abs/2311.06242
https://huggingface.co/blog/finetune-florence2
https://blog.roboflow.com/florence-2-ocr/
https://www.assemblyai.com/blog/florence-2-how-it-works-how-...
I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.
In addition to the architectural issues mentioned in OP's article that are faced by most SOTA LLMs, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many document OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.
I played with OCR post-correction algorithms and invented a method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments were disappointing. Any pointers (papers, software) and collaboration suggestions welcome.
(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)
I cannot find the pricing page.
> We have hundreds of examples like this queued up, so let us know if you want some more!
Link to it then, let people verify.
I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. Gpt-4o is far less accurate.
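Claims like "99%+ accuracy" are usually made precise as a character error rate (CER): edit distance between the model's output and a ground-truth transcription, divided by the reference length. A minimal self-contained sketch (the example strings are illustrative, not the commenter's actual data):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER: edits needed to turn hypothesis into reference, per ref char."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# One plausible-looking substitution in a 13-character reference:
print(char_error_rate("fried chicken", "dried chicken"))
```

Note that a single-character swap like this scores well under CER while completely changing the meaning, which is why field-level or value-level accuracy is often the better metric for financial tables.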
[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...
- the quality of the original paper documents, and
- the language
I have non-English documents for which I'd love to have 99% accuracy!
I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.
I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:
(edit: line wrap messes it all up... still I think my version is better ;-)

Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.
Ingesting PDFs and why Gemini 2.0 changes everything
https://news.ycombinator.com/item?id=42952605
> This week, there was a viral blog about Gemini 2.0 being used for complex PDF parsing, leading many to the same hypothesis we had nearly a year ago at this point. Data ingestion is a multistep pipeline, and maintaining confidence from these nondeterministic outputs over millions of pages is a problem.
https://news.ycombinator.com/item?id=42966958#42966959
The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.
Nice post and response to the previous one.
It’s important to remember that the use cases for VLMs and document parsers are often different. VLMs definitely take a different approach than layout detection and OCR, but they’re not mutually exclusive. VLMs are adaptable with prompting, e.g. "please pull out the entries related to CapEx and summarize the contributions." Layout parsers and OCR are often used for indexing and document automation. Each will have its own place in an enterprise stack.
Except for a very special kind of bug:
https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
>Xerox scanners/photocopiers randomly alter numbers in scanned documents