Well, it's amazing how things have changed since then. Not only have models gotten a lot better, but the latest "low tier" offerings from OpenAI (GPT-4o mini) and Anthropic (Claude 3 Haiku) are incredibly cheap and incredibly fast. So cheap and fast, in fact, that you can now break the document up into little chunks and submit them to the API concurrently, where each chunk goes through a multi-stage process (the output of the first stage is passed into another prompt for the next stage), and assemble it all in a shockingly short amount of time, for basically a rounding error in terms of cost.
My original project had all sorts of complex stuff for detecting hallucinations and incorrect, spurious additions to the text (like "Here is the corrected text" preambles). But the newer models are already good enough to eliminate most of that stuff, and you can get very impressive results with the multi-stage approach. In this case, the first pass asks the model to correct OCR errors and remove line breaks in the middle of a word, things like that. The next stage takes that as the input and asks the model to do things like reformat the text using markdown, suppress page numbers and repeated page headers, etc. Anyway, I think the samples (which take only a minute or two to generate) show the power of the approach:
Original PDF: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
Raw OCR Output: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
LLM-Corrected Markdown Output: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
One interesting thing I found was that almost all my attempts to fix/improve things using "classical" methods like regex and other rule-based approaches made everything worse and more brittle. The real improvements came from adjusting the prompts to make things clearer for the model, and from not asking the model to do too much in a single pass (like fixing OCR mistakes AND converting to markdown format).
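For anyone curious what the chunked, two-stage setup looks like in practice, here is a minimal sketch; the prompts, model name, and chunking are illustrative stand-ins rather than the project's actual code:

```python
# Minimal sketch of the chunked, two-stage idea. Prompts and model name are
# illustrative assumptions, not the project's real ones.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask(system_prompt: str, text: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

async def process_chunk(chunk: str) -> str:
    # Stage 1: fix OCR errors and rejoin words split across line breaks.
    corrected = await ask("Fix OCR errors and rejoin words broken across line "
                          "breaks. Return only the corrected text.", chunk)
    # Stage 2: take stage 1's output and reformat it as markdown, dropping
    # page numbers and repeated page headers.
    return await ask("Reformat the text as markdown. Remove page numbers and "
                     "repeated page headers. Return only the markdown.", corrected)

async def process_document(chunks: list[str]) -> str:
    # All chunks run through both stages concurrently, then get reassembled.
    pieces = await asyncio.gather(*(process_chunk(c) for c in chunks))
    return "\n\n".join(pieces)
```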
Anyway, this project is very handy if you have some old scanned books you want to read from Archive.org or Google Books on a Kindle or other ereader device and want things to be re-flowable and clear. It's still not perfect, but I bet within the next year the models will improve enough that it gets closer to 100%. Hope you like it!
1. Segment the document: identify which parts are text, which are images, which are formulas, which are tables, etc.
2. For text, do OCR + LLM. You can use an LLM to score how expected the predicted text is, and if it is way off, fall back to a ViT or something for the OCR (see the sketch after this list).
3. For tables, you can get a ViT/CNN to identify the cells and recover positional information, and then use OCR + LLM to recover the contents of the cells.
4. For formulas (and formulas in tables), just use a ViT/CNN.
5. For images, you can get a captioning ViT/CNN to caption the photo, if that's desired.
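A rough sketch of the scoring idea in step 2, using a small local causal LM as a stand-in for the scorer; the model, threshold, and fallback are all illustrative assumptions:

```python
# Sketch of the "is this OCR output plausible?" check. GPT-2 is only a
# stand-in for whatever scoring model you have on hand; the threshold is
# illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Score how 'expected' a piece of OCR output looks to a language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

ocr_text = "Tbe quick hrown f0x jumps ovcr the lazy dog."
if perplexity(ocr_text) > 100.0:
    # The text looks badly garbled: fall back to a vision model (ViT-based OCR)
    # on the original image region instead of trusting the OCR engine.
    pass
```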
I prefer to do all of this in one step with an LLM, using a good prompt and a few few-shot examples.
With so many passes over the images, the cost and time will be high, and ViT models are slower.
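A minimal sketch of what that one-pass approach could look like, assuming a multimodal model with image input; the model name and prompt are illustrative, and real few-shot examples would be added as extra messages:

```python
# One-pass sketch: send the page image straight to a multimodal model and ask
# for clean markdown. Model name and prompt are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def page_to_markdown(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Transcribe the page to clean markdown. Use $...$ for formulas, "
                "markdown tables for tables, and a short caption in place of "
                "each image.")},
            # Few-shot examples would go here as additional user/assistant turns.
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content
```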
What I want to do is read handwritten documents from the 18th century, and I feel like the multistep approach hits a hard ceiling there. Transkribus is multistep, but the line detection model is just terrible. Things that should be easy, such as printed schemas, utterly confuse it. You simply need to be smart about context to a much higher degree than you do for OCR of typewritten text.
In this case, the model can already do the OCR, and it becomes an order of magnitude cheaper every year.
Mathpix Markdown, however, is awesome, and I ask LLMs to use it to denote formulas, since LaTeX is tricky to render in HTML because of delimiters not matching. I don't know LaTeX well, so I haven't gone deeper on it.
Tessy (Tesseract) and an LLM is a good pipeline; it's likely what produced SCHNELL, and soon the reverse of this configuration will be used for testing and checking while the LLM does the bulk of the transcription via vision modality adaptation. The fun part is that multilingual models will be able to read and translate, opening up new work for scholars searching through digitized works. I have already had success in this area with no development at all, and after we get our next SOTA vision models I am expecting a massive jump in quality. I expect English vision model adapters using the LLaVA architecture to show up first; this may put some other Latin-script languages into the readable category depending on the adapted model, but we could see a leapfrog of scripts becoming readable all at once. LLaVA-Phi3 already seems able to transcribe tiny pieces of Hebrew with relative consistency. It also has horrible hallucinations, so there is very much an unknown limiting factor here currently. I was planning some segmentation experiments, but schnell knocked that out of my hands like a bar of soap in a prison shower, so I will wait for a distilled captioning SOTA before I re-evaluate this area.
Exciting times!
edit: I think the parent just doesn't know about Phi Vision; it appears to be a better model.
> In 2013, various substitutions (including replacing "6" with "8") were reported to happen on many Xerox Workcentre photocopier and printer machines. Numbers printed on scanned (but not OCR-ed) documents had potentially been altered. This has been demonstrated on construction blueprints and some tables of numbers; the potential impact of such substitution errors in documents such as medical prescriptions was briefly mentioned.
> In Germany the Federal Office for Information Security has issued a technical guideline that says the JBIG2 encoding "MUST NOT be used" for "replacement scanning".
I think the issue is that even if your compression explicitly notes that it's lossy, or if your OCR explicitly states that it uses an LLM to fix up errors, if the output looks like it could have been created by a non-lossy algorithm, users will just assume that it was. So in some sense it's better to have obvious OCR errors when there's any uncertainty.
Let's say the text is "The laptop costs $1,000 (one thousand dollars)." but the image is blurry.
Normal compression will give you an image where "$1,000" is blurry. JBIG2 can give you an image where "$1,000" has been replaced by a perfectly-clear "$7,000."
Normal OCR will give you some nonsense like "The laptop costs $7,000 (one 1housand dollars)". The LLM can "fix this up" to something more plausible like "The laptop costs $2,000 (two thousand dollars)."
I personally use PaddlePaddle and get much better results, which I then correct with LLMs.
With PPOCRv3 I wrote a custom Python implementation to cut books at the word level by playing with whitespace thresholds. It works great for the kind of typesetting generally found in books, with predictable whitespace between words. This is all needed because PPOCRv3 is restricted to 320 x 240 pixels, if I recall correctly, and produces garbage if you downsample a big image and run a single pass.
Later on I converted the Python code to C to work with the Rockchip RK3399Pro NPU. It works wonderfully. I first used PaddleOCR2Pytorch to convert the models for the rknn-api, then wrote the C implementation that cuts words on top of the rknn-api.
But with PPOCRv4 I think this isn't even needed; it's a newer architecture, and I don't think it is bound by the pixel-size restriction. That is, it should work "out of the box," so to speak. One caveat: PPOCRv3 detection always worked better for me, and the PPOCRv4 detection model gave me big headaches.
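For illustration only (this is not the commenter's actual code), word-level cutting by whitespace thresholds can be as simple as scanning the column-ink profile of a binarized text-line image for long blank runs; the threshold value is an assumption:

```python
# Sketch of word-level cutting via whitespace thresholds: binarize a cropped
# text-line image, count ink per pixel column, and split wherever a run of
# blank columns exceeds a threshold.
import cv2
import numpy as np

def cut_words(line_img, gap_threshold=12):
    gray = cv2.cvtColor(line_img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_column = binary.sum(axis=0)   # amount of "ink" in each pixel column
    blank = ink_per_column == 0

    words, start, gap = [], None, 0
    for x, is_blank in enumerate(blank):
        if is_blank:
            gap += 1
            if start is not None and gap >= gap_threshold:
                # The word ended just before this run of blank columns began.
                words.append(line_img[:, start:x - gap + 1])
                start = None
        else:
            if start is None:
                start = x
            gap = 0
    if start is not None:
        words.append(line_img[:, start:])
    return words  # each crop can then be fed to the recognizer separately
```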
Imagine you are trying to read a lease contract. The two areas where the LLM may be useless are numbers and names (names of people or places/addresses). There's no way for the LLM to accurately know what the rent should be, or to know the name of a specific person.
Where it's most useful to me personally is when I want to read some old book from the 1800s about the history of the Royal Navy [0] or something like that which is going to look really bad on my Kindle Oasis as a PDF, and the OCR version available from Archive.org is totally unreadable because there are 50 typos on each page. The ability to get a nice Markdown file that I can turn into an epub and read natively is really nice, and now cheap and fast.
[0] https://archive.org/details/royalnavyhistory02clowuoft/page/...
If it gets 90% of the work done and you only have to fix some numbers and names, it still saves you time, doesn't it?
If there are 30 fields on a document at 90% accuracy, each field would still need to be validated by a human because you can't trust that it is correct. So the O(n) human step of checking each field is still there, and for fields that are long, pseudo-random-looking strings (think account numbers, numbers on invoices and receipts, instrumentation measurement values, etc.) there are almost no time savings, because the mental effort to input something like 015729042 is about the same as the effort to verify it is correct.
At 100% accuracy you remove that need altogether.
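A quick back-of-the-envelope check, using the numbers from the comment above, shows why the full manual pass survives:

```python
# Why "90% per field" still forces a full manual pass over a 30-field document.
fields, per_field_accuracy = 30, 0.90

p_document_clean = per_field_accuracy ** fields        # ~0.042
expected_errors  = fields * (1 - per_field_accuracy)   # 3 wrong fields on average

print(f"P(all 30 fields correct) = {p_document_clean:.3f}")
print(f"Expected wrong fields    = {expected_errors:.1f}")
# Since you can't tell which ~3 fields are wrong, all 30 still need checking.
```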
The important question is which parts are inaccurate. If it's messing up names and numbers but is 99.9% accurate for everything else, you can just go back and check all the names and numbers at the end. But if the whole thing is only 90% accurate, you now either recheck the whole document or you risk a 'must' turning into a 'may' in a critical place that undermines the whole document.
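One cheap way to do that final pass over numbers, sketched here purely as an illustration (not something from the project): diff the numeric tokens in the raw OCR text against the LLM-corrected text and flag anything the model changed.

```python
# Flag numbers that differ between the raw OCR text and the LLM output, so
# only those spots need a human look. Regex and example values are illustrative.
import re

def changed_numbers(raw_ocr: str, llm_output: str):
    raw_nums = re.findall(r"\d[\d,.]*", raw_ocr)
    llm_nums = re.findall(r"\d[\d,.]*", llm_output)
    # Anything present in one version but not the other deserves manual review.
    return set(raw_nums) ^ set(llm_nums)

print(changed_numbers("The rent is $1,000 per month",
                      "The rent is $7,000 per month"))   # {'1,000', '7,000'}
```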
One of the benefits of this project is that it doesn’t seem to matter that much that there are mistakes in the OCR output as long as you’re dealing with words, where the meaning would be clear to a smart human trying to make sense of it and knowing that there are probable OCR errors. For numbers it’s another story, though.
I did more or less the same thing, trying to solve the same problem. I ended up biting the bullet and using Amazon Textract. The OCR is much better than Tesseract, and the layout tool is quite reliable for getting linear text out of 2-column documents (which is critical for my use case).
I would be very happy to find something as reliable that would work on a workstation without relying on anyone’s cloud.
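For reference, the Textract route mentioned above can be a very small amount of code; this is a hedged sketch assuming the boto3 client, with error handling, pagination, and proper reading-order reconstruction from the layout blocks omitted:

```python
# Minimal sketch of calling Textract with the LAYOUT feature and pulling out
# line text. Real multi-column handling would walk the LAYOUT_* block
# relationships to get reading order.
import boto3

textract = boto3.client("textract")

def extract_lines(image_path: str):
    with open(image_path, "rb") as f:
        resp = textract.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["LAYOUT"],   # layout blocks help linearize multi-column pages
        )
    return [b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE"]
```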
For our use case, PaddleOCR + LLM has been quite a nice combo.
This is spot on, and it's the same as how humans behave. If you give a human too many instructions at once, they won't follow all of them accurately.
I spend a lot of time thinking about LLMs + documents, and in my opinion, as the models get better, OCR is soon going to be a fully solved problem. The challenge then becomes explaining the ambiguity and intricacies of complex documents to AI models in an effective way, less so the OCR capability itself.
disclaimer: I run a LLM document processing company called Extend (https://www.extend.app/).
https://github.com/orasik/parsevision