Well, it's amazing how things have changed since then. Not only have models gotten a lot better, but the latest "low tier" offerings from OpenAI (GPT-4o mini) and Anthropic (Claude 3 Haiku) are incredibly cheap and incredibly fast. So cheap and fast, in fact, that you can now break the document up into little chunks and submit them to the API concurrently, where each chunk goes through a multi-stage process (the output of the first stage is passed into another prompt for the next stage), and assemble it all in a shockingly short amount of time, for basically a rounding error in terms of cost.
My original project had all sorts of complex stuff for detecting hallucinations and incorrect, spurious additions to the text (like "Here is the corrected text" preambles). But the newer models are already good enough to eliminate most of that stuff, and you can get very impressive results with the multi-stage approach. In this case, the first pass asks the model to correct OCR errors and remove line breaks in the middle of a word, things like that. The next stage takes that as the input and asks the model to do things like reformat the text using markdown, suppress page numbers and repeated page headers, etc. Anyway, I think the samples (which take only a minute or two to generate) show the power of the approach:
Original PDF: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
Raw OCR Output: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
LLM-Corrected Markdown Output: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
One interesting thing I found was that almost all my attempts to fix/improve things using "classical" methods like regex and other rule-based approaches made everything worse and more brittle. The real improvements came from adjusting the prompts to make things clearer for the model, and from not asking the model to do too much in a single pass (like fixing OCR mistakes AND converting to markdown format).
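For anyone curious what the chunked, two-stage setup looks like in practice, here is a minimal sketch; the prompts, model name, and chunking are illustrative stand-ins rather than the project's actual code:

```python
# Minimal sketch of the chunked, two-stage idea. Prompts and model name are
# illustrative assumptions, not the project's real ones.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask(system_prompt: str, text: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

async def process_chunk(chunk: str) -> str:
    # Stage 1: fix OCR errors and rejoin words split across line breaks.
    corrected = await ask("Fix OCR errors and rejoin words broken across line "
                          "breaks. Return only the corrected text.", chunk)
    # Stage 2: take stage 1's output and reformat it as markdown, dropping
    # page numbers and repeated page headers.
    return await ask("Reformat the text as markdown. Remove page numbers and "
                     "repeated page headers. Return only the markdown.", corrected)

async def process_document(chunks: list[str]) -> str:
    # All chunks run through both stages concurrently, then get reassembled.
    pieces = await asyncio.gather(*(process_chunk(c) for c in chunks))
    return "\n\n".join(pieces)
```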
Anyway, this project is very handy if you have some old scanned books you want to read from Archive.org or Google Books on a Kindle or other ereader device and want things to be re-flowable and clear. It's still not perfect, but I bet within the next year the models will improve enough that it gets closer to 100%. Hope you like it!
1. Segment the document: identify which parts are text, which are images, which are formulas, which are tables, etc.
2. For text, do OCR + LLM. You can use an LLM to score how expected the predicted text is, and if it is way off, fall back to a ViT or something for the OCR (see the sketch after this list).
3. For tables, you can get a ViT/CNN to identify the cells and recover positional information, and then use OCR + LLM to recover the contents of the cells.
4. For formulas (and formulas in tables), just use a ViT/CNN.
5. For images, you can get a captioning ViT/CNN to caption the photo, if that's desired.
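A rough sketch of the scoring idea in step 2, using a small local causal LM as a stand-in for the scorer; the model, threshold, and fallback are all illustrative assumptions:

```python
# Sketch of the "is this OCR output plausible?" check. GPT-2 is only a
# stand-in for whatever scoring model you have on hand; the threshold is
# illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Score how 'expected' a piece of OCR output looks to a language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

ocr_text = "Tbe quick hrown f0x jumps ovcr the lazy dog."
if perplexity(ocr_text) > 100.0:
    # The text looks badly garbled: fall back to a vision model (ViT-based OCR)
    # on the original image region instead of trusting the OCR engine.
    pass
```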
I prefer to do all of this in one step with an LLM, using a good prompt and a few few-shot examples.
With so many passes over the images, the cost and time will be high, and ViT models are slower.
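A minimal sketch of what that one-pass approach could look like, assuming a multimodal model with image input; the model name and prompt are illustrative, and real few-shot examples would be added as extra messages:

```python
# One-pass sketch: send the page image straight to a multimodal model and ask
# for clean markdown. Model name and prompt are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def page_to_markdown(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Transcribe the page to clean markdown. Use $...$ for formulas, "
                "markdown tables for tables, and a short caption in place of "
                "each image.")},
            # Few-shot examples would go here as additional user/assistant turns.
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content
```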
What I want to do is read handwritten documents from the 18th century, and I feel like the multistep approach hits a hard ceiling there. Transkribus is multistep, but the line detection model is just terrible. Things that should be easy, such as printed schemas, utterly confuse it. You simply need to be smart about context to a much higher degree than you do for OCR of typewritten text.
In this case, the model can already do the OCR, and it becomes an order of magnitude cheaper every year.
Mathpix Markdown, however, is awesome, and I ask LLMs to use it to denote formulas, since LaTeX is tricky to render in HTML because of delimiters not matching. I don't know LaTeX well, so I haven't gone deeper on it.
Tessy (Tesseract) and an LLM is a good pipeline; it's likely what produced SCHNELL, and soon the reverse of this configuration will be used for testing and checking while the LLM does the bulk of the transcription via vision modality adaptation. The fun part is that multilingual models will be able to read and translate, opening up new work for scholars searching through digitized works. I have already had success in this area with no development at all, and after we get our next SOTA vision models I am expecting a massive jump in quality. I expect English vision model adapters using the LLaVA architecture to show up first; this may put some other Latin-script languages into the readable category depending on the adapted model, but we could see a leapfrog of scripts becoming readable all at once. LLaVA-Phi3 already seems able to transcribe tiny pieces of Hebrew with relative consistency. It also has horrible hallucinations, so there is very much an unknown limiting factor here currently. I was planning some segmentation experiments, but schnell knocked that out of my hands like a bar of soap in a prison shower, so I will wait for a distilled captioning SOTA before I re-evaluate this area.
Exciting times!
edit: I think the parent just doesn't know about Phi Vision; it appears to be a better model.
> In 2013, various substitutions (including replacing "6" with "8") were reported to happen on many Xerox Workcentre photocopier and printer machines. Numbers printed on scanned (but not OCR-ed) documents had potentially been altered. This has been demonstrated on construction blueprints and some tables of numbers; the potential impact of such substitution errors in documents such as medical prescriptions was briefly mentioned.
> In Germany the Federal Office for Information Security has issued a technical guideline that says the JBIG2 encoding "MUST NOT be used" for "replacement scanning".
I think the issue is that even if your compression explicitly notes that it's lossy, or if your OCR explicitly states that it uses an LLM to fix up errors, if the output looks like it could have been created by a non-lossy algorithm, users will just assume that it was. So in some sense it's better to have obvious OCR errors when there's any uncertainty.
Let's say the text is "The laptop costs $1,000 (one thousand dollars)." but the image is blurry.
Normal compression will give you an image where "$1,000" is blurry. JBIG2 can give you an image where "$1,000" has been replaced by a perfectly-clear "$7,000."
Normal OCR will give you some nonsense like "The laptop costs $7,000 (one 1housand dollars)". The LLM can "fix this up" to something more plausible like "The laptop costs $2,000 (two thousand dollars)."
I personally use PaddlePaddle and get much better results, which I then correct with LLMs.
With PPOCRv3 I wrote a custom Python implementation to cut books at the word level by playing with whitespace thresholds. It works great for the kind of typesetting generally found in books, with predictable whitespace between words. This is all needed because PPOCRv3 is restricted to 320 x 240 pixels, if I recall correctly, and produces garbage if you downsample a big image and run a single pass.
Later on I converted the Python code to C to work with the Rockchip RK3399Pro NPU. It works wonderfully. I first used PaddleOCR2Pytorch to convert the models for the rknn-api, then wrote the C implementation that cuts words on top of the rknn-api.
But with PPOCRv4 I think this isn't even needed; it's a newer architecture, and I don't think it is bound by the pixel-size restriction. That is, it should work "out of the box," so to speak. One caveat: PPOCRv3 detection always worked better for me, and the PPOCRv4 detection model gave me big headaches.
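For illustration only (this is not the commenter's actual code), word-level cutting by whitespace thresholds can be as simple as scanning the column-ink profile of a binarized text-line image for long blank runs; the threshold value is an assumption:

```python
# Sketch of word-level cutting via whitespace thresholds: binarize a cropped
# text-line image, count ink per pixel column, and split wherever a run of
# blank columns exceeds a threshold.
import cv2
import numpy as np

def cut_words(line_img, gap_threshold=12):
    gray = cv2.cvtColor(line_img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_column = binary.sum(axis=0)   # amount of "ink" in each pixel column
    blank = ink_per_column == 0

    words, start, gap = [], None, 0
    for x, is_blank in enumerate(blank):
        if is_blank:
            gap += 1
            if start is not None and gap >= gap_threshold:
                # The word ended just before this run of blank columns began.
                words.append(line_img[:, start:x - gap + 1])
                start = None
        else:
            if start is None:
                start = x
            gap = 0
    if start is not None:
        words.append(line_img[:, start:])
    return words  # each crop can then be fed to the recognizer separately
```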
Imagine you are trying to read a lease contract. The two areas where the LLM may be useless are numbers and names (names of people or places/addresses). There's no way for the LLM to accurately know what the rent should be, or to know the name of a specific person.
Where it's most useful to me personally is when I want to read some old book from the 1800s about the history of the Royal Navy [0] or something like that which is going to look really bad on my Kindle Oasis as a PDF, and the OCR version available from Archive.org is totally unreadable because there are 50 typos on each page. The ability to get a nice Markdown file that I can turn into an epub and read natively is really nice, and now cheap and fast.
[0] https://archive.org/details/royalnavyhistory02clowuoft/page/...
If it gets 90% of the work done and you only have to fix some numbers and names, it still saves you time, doesn't it?
If there are 30 fields on a document at 90% accuracy, each field would still need to be validated by a human because you can't trust that it is correct. So the O(n) human step of checking each field is still there, and for fields that are long, pseudo-random-looking strings (think account numbers, numbers on invoices and receipts, instrumentation measurement values, etc.) there are almost no time savings, because the mental effort to input something like 015729042 is about the same as the effort to verify it is correct.
At 100% accuracy you remove that need altogether.
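A quick back-of-the-envelope check, using the numbers from the comment above, shows why the full manual pass survives:

```python
# Why "90% per field" still forces a full manual pass over a 30-field document.
fields, per_field_accuracy = 30, 0.90

p_document_clean = per_field_accuracy ** fields        # ~0.042
expected_errors  = fields * (1 - per_field_accuracy)   # 3 wrong fields on average

print(f"P(all 30 fields correct) = {p_document_clean:.3f}")
print(f"Expected wrong fields    = {expected_errors:.1f}")
# Since you can't tell which ~3 fields are wrong, all 30 still need checking.
```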
The important question is which parts are inaccurate. If it's messing up names and numbers but is 99.9% accurate for everything else, you can just go back and check all the names and numbers at the end. But if the whole thing is only 90% accurate, you now either recheck the whole document or you risk a 'must' turning into a 'may' in a critical place that undermines the whole document.
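One cheap way to do that final pass over numbers, sketched here purely as an illustration (not something from the project): diff the numeric tokens in the raw OCR text against the LLM-corrected text and flag anything the model changed.

```python
# Flag numbers that differ between the raw OCR text and the LLM output, so
# only those spots need a human look. Regex and example values are illustrative.
import re

def changed_numbers(raw_ocr: str, llm_output: str):
    raw_nums = re.findall(r"\d[\d,.]*", raw_ocr)
    llm_nums = re.findall(r"\d[\d,.]*", llm_output)
    # Anything present in one version but not the other deserves manual review.
    return set(raw_nums) ^ set(llm_nums)

print(changed_numbers("The rent is $1,000 per month",
                      "The rent is $7,000 per month"))   # {'1,000', '7,000'}
```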
One of the benefits of this project is that it doesn’t seem to matter that much that there are mistakes in the OCR output as long as you’re dealing with words, where the meaning would be clear to a smart human trying to make sense of it and knowing that there are probable OCR errors. For numbers it’s another story, though.
I did more or less the same thing, trying to solve the same problem. I ended up biting the bullet and using Amazon Textract. The OCR is much better than Tesseract, and the layout tool is quite reliable for getting linear text out of 2-column documents (which is critical for my use case).
I would be very happy to find something as reliable that would work on a workstation without relying on anyone’s cloud.
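For reference, the Textract route mentioned above can be a very small amount of code; this is a hedged sketch assuming the boto3 client, with error handling, pagination, and proper reading-order reconstruction from the layout blocks omitted:

```python
# Minimal sketch of calling Textract with the LAYOUT feature and pulling out
# line text. Real multi-column handling would walk the LAYOUT_* block
# relationships to get reading order.
import boto3

textract = boto3.client("textract")

def extract_lines(image_path: str):
    with open(image_path, "rb") as f:
        resp = textract.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["LAYOUT"],   # layout blocks help linearize multi-column pages
        )
    return [b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE"]
```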
For our use case, PaddleOCR + LLM has been quite a nice combo.
This is spot on, and it's the same as how humans behave. If you give a human too many instructions at once, they won't follow all of them accurately.
I spend a lot of time thinking about LLMs + documents, and in my opinion, as the models get better, OCR is soon going to be a fully solved problem. The challenge then becomes explaining the ambiguity and intricacies of complex documents to AI models in an effective way, less so the OCR capability itself.
disclaimer: I run a LLM document processing company called Extend (https://www.extend.app/).
https://github.com/orasik/parsevision