Posted by u/ritvikpandey21 a year ago
Why LLMs still have problems with OCR (runpulse.com/blog/why-llm...)
Document ingestion and the launch of Gemini 2.0 caused a lot of buzz this week. As a team building in this space, this is something we have researched thoroughly. Here's our take: ingestion is a multistep pipeline, and maintaining confidence in nondeterministic LLM outputs over millions of pages is a hard problem.
michaelbuckbee · a year ago
I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.

llm_trw · a year ago
This is great until it hallucinates rows in a report with company assets that don't exist - why wouldn't a mining company own some excavation equipment? - and pollutes all future searches with fake data right from the start.

I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

osigurdson · a year ago
It does seem that companies are able to get reliability in narrow problem domains via prompts, evals and fine tuning.
ritvikpandey21 · a year ago
we completely agree - mechanistic interpretability might help keep these language models in check, but it’s going to be very difficult to run this on closed source frontier models. im excited to see where that field progresses
dyauspitr · a year ago
You can literally ask it to mark characters that are not clear instead of inferring them.
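As a sketch of what that prompt-level mitigation could look like: the prompt wording and the "[?]" marker below are illustrative choices, not anything the models require, and only the flag-counting helper actually runs locally.

```python
# Hypothetical prompt asking the model to flag, not fill, illegible text.
UNCERTAIN = "[?]"

TRANSCRIBE_PROMPT = (
    "Transcribe this image verbatim. If a character is cut off, smudged, "
    f"or otherwise unclear, write {UNCERTAIN} in its place. "
    "Do not guess or complete words."
)

def flagged_spans(transcript: str) -> int:
    """Count how many uncertainty markers the model emitted, for human review."""
    return transcript.count(UNCERTAIN)

# e.g. the cropped grocery list from the parent comment might come back as:
transcript = "[?]our\nmilk\neggs\n[?]read"
print(flagged_spans(transcript))  # 2 items to review by hand
```

The point is that the guess gets surfaced instead of silently baked into the output.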
paulsutter · a year ago
Unit tests work remarkably well
gcanyon · a year ago
> They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

So are human beings. Meaning we've been working around this issue since forever; we're not suddenly caught up in a new thing here.

Terr_ · a year ago
I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached. [0]

The correct (or at least humanly-expected) process would be to identify the presence of a mangled word, determine what its missing letters could have been, and if some candidate is a clear contextual winner (e.g. "fried chicken" not "dried chicken") use that.

However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things, because it's a license to rewrite anything to look more like its training data.

[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p

TeMPOraL · a year ago
> I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached.

Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".

Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.

williamcotton · a year ago
On your epistemology, if you correctly guess the outcome of a random event then the statement, even if contingent on an event that did not yet occur, is still true. The same goes for every incorrect guess.

If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.

You then conflate the usefulness of some guessing skill with logic and statistics.

How this relates to an LLM is that the priors are baked into the LLM so statistics is all that is required to make an educated guess about the contents of a poorly written grocery list. The truthfulness of this guess is contingent on events outside of the scope of the LLM.

How often the guess is right (the scalar value you attach to the statistical outcome of an event) is very important. If your claim is that LLMs are wrong 50% of the time, then you need to update your priors based on some actual experience.
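The coin-toss point is easy to make concrete: guessing a fair coin at random hits the 50% base rate exactly, i.e. the guesses add no information beyond the prior. A quick simulation (sample size and seed are arbitrary):

```python
import random

rng = random.Random(0)
N = 100_000

# "Predict" fair coin tosses by guessing uniformly at random.
tosses = [rng.random() < 0.5 for _ in range(N)]
guesses = [rng.random() < 0.5 for _ in range(N)]
accuracy = sum(t == g for t, g in zip(tosses, guesses)) / N

print(f"guessing accuracy: {accuracy:.3f}")  # hovers around 0.5
```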

kaonwarb · a year ago
To consider: do we overestimate what we know about how we humans reach an answer? (Humans are very capable of intuitively reading scrambled text, for example, as long as the beginning and ending of each word remains correct.)
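That claim about scrambled text is easy to try yourself; a tiny demo that shuffles only each word's interior letters (the sentence and seed are arbitrary):

```python
import random

def scramble_interior(word: str, rng: random.Random) -> str:
    """Shuffle a word's interior letters, keeping the first and last fixed."""
    if len(word) <= 3:
        return word
    interior = list(word[1:-1])
    rng.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]

rng = random.Random(42)
sentence = "humans intuitively decode scrambled text"
scrambled = " ".join(scramble_interior(w, rng) for w in sentence.split())
print(scrambled)  # still largely readable despite the shuffled interiors
```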
afro88 · a year ago
The correct way to handle it is to ask the user if it's not clear, like a real assistant would
nodamage · a year ago
I once did something similar with a recipe from a cookbook where the recipe started at the bottom of one page and continued onto the next page. It correctly identified the first few ingredients present in the photo of the first page but then proceeded to hallucinate another half-dozen or so ingredients in order to generate a complete recipe.
ritvikpandey21 · a year ago
yup, this is a pretty common occurrence when using LLMs for data extraction. For personal use (loading a receipt), it's great that the LLM filled in info. For production systems which need high-quality, near-100% extraction accuracy, inferring results is a failure. Think medical record parsing, financial data, etc. These hallucinations occur quite frequently, and we haven't found a way to minimize them through prompt engineering.
llm_trw · a year ago
It's not possible with current gen models.

To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.

See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414

amelius · a year ago
Maybe ask it to return the bounding box of every glyph.
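With a traditional engine this is straightforward: assuming pytesseract is installed, `pytesseract.image_to_boxes(image)` returns per-glyph boxes in Tesseract's box format ("char left bottom right top page", origin at the image's bottom-left), which you could cross-check against an LLM transcription. The sample string below is a made-up stand-in for real engine output:

```python
def parse_boxes(boxes_output: str) -> list[tuple[str, int, int, int, int]]:
    """Parse Tesseract box output into (char, left, bottom, right, top) tuples."""
    glyphs = []
    for line in boxes_output.strip().splitlines():
        char, left, bottom, right, top, _page = line.split()
        glyphs.append((char, int(left), int(bottom), int(right), int(top)))
    return glyphs

# Illustrative box output for three glyphs on a scanned list:
sample = "f 10 40 18 60 0\nl 20 40 24 60 0\no 26 40 34 58 0"
print(parse_boxes(sample)[0])  # ('f', 10, 40, 18, 60)
```

A glyph the LLM "transcribed" but no box covers is a candidate hallucination.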
jmartin2683 · a year ago
My experiences have been the same… that is to say nothing like what is reported here. This is more pitch than info.

Odd timing, too, given the Flash 2.0 release and its performance on this problem.

Buttons840 · a year ago
Failed as a machine, succeeded as an intelligence. Intelligences don't make good machines though.
davidhs · a year ago
I recently took a picture of the ingredients list on my multivitamin and fed it into ChatGPT o1-pro (at least that was the config.) and it made up ingredients and messed up the quantities.
daveguy · a year ago
Sounds like a xerox.
coder543 · a year ago
I'm somewhat surprised neither this article nor the previous one mention anything about the Florence-2 model series. I had thought that Florence-2 was not just surprisingly capable for this kind of work, but also easily fine-tunable for a particular kind of document, when you expect to process a lot of instances of that document and want to further optimize accuracy. It's extremely small (0.23B and 0.77B parameters), so it's easy to run, easy to fine-tune, and probably unlikely to overthink things.

https://arxiv.org/abs/2311.06242

https://huggingface.co/blog/finetune-florence2

https://blog.roboflow.com/florence-2-ocr/

https://www.assemblyai.com/blog/florence-2-how-it-works-how-...

I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.

In addition to the architectural issues mentioned in OP's article that are faced by most SOTA LLMs, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many document OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.

jll29 · a year ago
In case any scientist actually working on adaptive OCR is reading this, I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality, compute demands are not an issue, I've got an army of A100s available.

I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.

zeograd · a year ago
I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.

(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)
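A rough way to reproduce that kind of field-count comparison, assuming you already have each engine's raw text (with PaddleOCR, e.g., via `PaddleOCR(lang="en").ocr(image_path)`). The field names and the sample outputs below are illustrative, not from a real label:

```python
def fields_recovered(ocr_text: str, expected: list[str]) -> int:
    """Count how many expected label fields appear in an engine's raw output."""
    haystack = ocr_text.lower()
    return sum(field.lower() in haystack for field in expected)

expected = ["tracking number", "recipient", "postcode", "weight"]

# Hypothetical outputs on the same degraded scan:
tesseract_out = "reclplent: J. Doe\nwelght 2kg"
paddle_out = ("tracking number 12345\nrecipient: J. Doe\n"
              "postcode 90210\nweight 2kg")

print(fields_recovered(tesseract_out, expected))  # 0
print(fields_recovered(paddle_out, expected))     # 4
```

Exact substring matching is deliberately strict here; a fuzzy matcher would be kinder to near-miss transcriptions.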

moffkalast · a year ago
When I was doing OCR for some screenshots last year I managed to get it done with tesseract, but just barely. When looking for alternatives later on I found something called Surya on github which people claim does a lot better and looks quite promising. I've had it bookmarked for testing forever but I haven't gotten around to actually doing it. Maybe worth a try I guess?
ianhawes · a year ago
Surya is on par with cloud vision offerings.
ritvikpandey21 · a year ago
would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, and i’d be very curious to give these a run! in general, i’m happy to give some general advice on algos/models to fine-tune for this task
sumedh · a year ago
Are you targeting business or consumers?

I cannot find the pricing page.

patcon · a year ago
Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)
ahoka · a year ago
I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.
pbhjpbhj · a year ago
If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?
mdbmdb · a year ago
I would love to get access to that archive!
jeswin · a year ago
If Pulse (which is a competing product, the premise of which is threatened by both closed and open models) wants to dispute the post earlier this week, it should provide samples which fail in Claude and Gemini. The image [1] in the post is low-resolution and fuzzy. Claude's user manual specifically says: "Images uploaded on Claude.ai can be up to 30MB, and up to 8000x8000 pixels. We recommend avoiding small or low resolution images where possible."

> We have hundreds of examples like this queued up, so let us know if you want some more!

Link to it then, let people verify.

I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. Gpt-4o is far less accurate.

[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...

sgc · a year ago
99%+ is terrible in the OCR world. 99.8%+ on first pass, and 99.99%+ (1/10k characters error) at the end of the process - which includes human reviewers in the loop - is ok, but the goal is higher fidelity than that. If we are throwing billions at the problem, I would expect at least another 9 on that.
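For scale, here is what those accuracy tiers mean in expected character errors per page, assuming a dense printed page of roughly 1,800 characters (the page size is an assumption, not from the thread):

```python
CHARS_PER_PAGE = 1_800  # rough figure for a dense printed page

errors_per_page = {acc: CHARS_PER_PAGE * (1 - acc)
                   for acc in (0.99, 0.998, 0.9999)}

for acc, errs in errors_per_page.items():
    print(f"{acc:.2%} accurate -> ~{errs:.1f} character errors/page")
# 99.00% -> ~18.0, 99.80% -> ~3.6, 99.99% -> ~0.2
```

At millions of pages, even the 1/10k tier still leaves hundreds of errors per million characters to catch downstream.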
rahimnathwani · a year ago

  99.8%+ on first pass
Even with the best OCR, and high resolution scans, you might not get this due to:

- the quality of the original paper documents, and

- the language

I have non-English documents for which I'd love to have 99% accuracy!

davedx · a year ago
Ha hi Jeswin! I was itching to reply to this post too, I wonder why…
jeswin · a year ago
Dave! Our sample sizes were large enough, and tables complex enough to opine on this.

I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.

bambax · a year ago
Using that image and the following prompt on Gemini 2.0 Flash "please do ocr of the attached file and output ascii following the layout of the original as faithfully as possible" outputs something that isn't bad but not perfect:

  PI tno Name             Time            3.5 km   18 C   (cont.)
  MEN B (39)                                                  3(34)         4(52)         5(53)         6(54)         7(55)         
  8(40)         9(57)
                                                               12(60)        13(61)        14(62)        15(63)        16(47)        
  17(48)       18(100)
                                                  1(51)         2(33)
                                                  10(58)        11(59)
The first column is offset vertically which mixes up information and is wrong.

I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:

  PI      tno Name                       Time
  
  MEN   B (39)                                                         3.5 km       18 C          (cont.)
                                                   1 (51)                  2 (33)                 3 (34)                  4 (52)                  5   (53)                  6 (54)                  7 (55)                  8 (40)                 9 (57)
                                                  10 (58)                 11 (59)                12 (60)                 13 (61)                 14 (62)                 15 (63)                16 (47)                 17 (48)                18 (100)
                                                  Finish
  
  13     425  Peter  Hodkinson          11:40       0:48   +0: 06 (21)      1:29  +0: 13 (28)      1:58   +0: 13 (24)      2:44   +0: 18 (23)      3:38   +0: 20 (19)     4:28    +0: 22 (18)     5:05   +0: 23 (17)      5:36   +0: 26 (17)      6:19   +0: 29 (19)
              Great  Britain                        0:48   +0: 06 (21)      0:41  +0: 09 (30)      0:29   +0: 01 (4)       0:46   +0: 07 (22)      0:54   +0: 02 (5)      0:50    +0: 03 (7)      0:37   +0: 02 (10)      0:31   +0: 03 (11)      0:43   +0: 05 (20)
                                                    6:47   +0: 28 (17)     7:02   +0: 29 (17)      8:21   +0: 38 (16)      8:41   +0: 39 (16)      9:00   +0: 41 (16)     9:13    +0: 42 (16)     9:43   +0: 42 (16)     10:36   +0: 43 (14)     11:32   +0: 41 (13)
                                                    0:28   +0: 02 (8)      0:15   +0: 01 (4)       1:19   +0: 11 (16)      0:20   +0: 03 (15)      0:19   +0: 02 (4)      0:13   +0: 02 (11)      0:30   +0: 01 (2)       0:53   +0: 01 (3)       0:56    0:00  (1)
                                                   11:40   +0: 40 (13)
                                                    0:08   +0: 00 (8)

(edit: line wrap messes it all up... still I think my version is better ;-)

jeswin · a year ago
I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.

Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.
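Before paying for that second model pass, a cheap local check helps: does the first model's output even parse as JSON and contain the requested structure? The "rows" of "label"/"value" pairs below is an illustrative shape, not a format either model mandates:

```python
import json

def validate_extraction(raw: str) -> tuple[bool, str]:
    """Cheap structural check on an LLM's JSON extraction output."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"not JSON: {e}"
    if not isinstance(doc.get("rows"), list):
        return False, "missing 'rows' list"
    for i, row in enumerate(doc["rows"]):
        if not isinstance(row, dict) or not {"label", "value"} <= row.keys():
            return False, f"row {i} missing label/value"
    return True, "ok"

ok, msg = validate_extraction('{"rows": [{"label": "Revenue", "value": "1.2M"}]}')
print(ok, msg)  # True ok
```

Only outputs that pass this gate need the more expensive second-model review.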

password4321 · a year ago
As opposed to the discussion 2 days ago with 400+ comments:

Ingesting PDFs and why Gemini 2.0 changes everything

https://news.ycombinator.com/item?id=42952605

h0l0cube · a year ago
FTA:

> This week, there was a viral blog about Gemini 2.0 being used for complex PDF parsing, leading many to the same hypothesis we had nearly a year ago at this point. Data ingestion is a multistep pipeline, and maintaining confidence from these nondeterministic outputs over millions of pages is a problem.

password4321 · a year ago
Yes and per the poster's opening comment:

https://news.ycombinator.com/item?id=42966958#42966959

_ea1k · a year ago
That's what I thought too, but apparently the title is pure, absolute, rage-inducing clickbait.

The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.

dang · a year ago
I assume you mean the title of the current thread? I've attempted to make it less baity now.
mehulashah · a year ago
(CEO of Aryn here: https://aryn.ai)

Nice post and response to the previous one.

It’s important to remember that the use cases for VLMs and document parsers are often different. VLMs definitely take a different approach than layout detection and OCR. They’re not mutually exclusive. VLMs are adaptable with prompting, e.g. "please pull out the entries related to CapEx and summarize the contributions". Layout parsers and OCR are often used for indexing and document automation. Each will have its own place in an enterprise stack.

snthd · a year ago
>Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong.

Except for a very special kind of bug:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

>Xerox scanners/photocopiers randomly alter numbers in scanned documents
