vintermann · a year ago
The big complicated segmentation pipeline is a legacy from the time you had to do that, a few years ago. It's error-prone, and even at its best it robs the model of valuable context. You need that context if you want to take the step to handwriting. If you go to a group of human experts for help deciphering historical handwriting, the first thing they will tell you is that they need the whole document for context, not just the line or word you're interested in.

We need to do end-to-end text recognition. Not "character recognition"; it's not the characters we care about. Evaluating models with CER is also a bad idea. It frustrates me so much that text recognition is remaking all the mistakes of machine translation from 15+ years ago.
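The CER complaint above can be made concrete with a toy comparison. A minimal sketch (the example strings are invented): a few character substitutions yield a flattering character error rate while the word error rate shows that half the words are unusable.

```python
# Toy illustration: character error rate (CER) vs word error rate (WER).
# A handful of wrong characters barely moves CER but can ruin many words.

def levenshtein(a, b):
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / len(ref_words)

ref = "the probate court assigned the estate"
hyp = "tke probate ccurt assigned the estale"  # 3 wrong characters
print(round(cer(ref, hyp), 3))  # ~0.081: looks almost fine
print(wer(ref, hyp))            # 0.5: half the words are wrong
```

This is why a low CER can mask output that is useless for downstream tasks like name extraction in genealogy records.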

liotier · a year ago
> We need to do end to end text recognition. Not "character recognition", it's not the characters we care about.

Arbitrary nonsensical text requires character recognition. Sure, even a license plate bears some semantics bounding expectations of what text it contains, but text that has no coherence might remain an application domain for character rather than text recognition.

einpoklum · a year ago
> Arbitrary nonsensical text require character recognition.

Are you sure? I mean, if it's printed text in a non-connected script, where characters repeat themselves (nearly) identically, then ok, but if you're looking at handwriting - couldn't one argue that it's _words_ that get recognized? And that's ignoring the question of textual context, i.e. recognizing based on what you know the rest of the sentence to be.

modeless · a year ago
VLMs seem to render traditional OCR systems obsolete. I'm hearing lately that Gemini does a really good job on tasks involving OCR. https://news.ycombinator.com/item?id=42952605

Of course there are new models coming out every month. It's feeling like the 90s when you could just wait a year and your computer got twice as fast. Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.

bayindirh · a year ago
The problem with doing OCR with LLMs is hallucination. It creates character replacements like Xerox's old flawed compression algorithm. At least, that's my experience with Gemini 2.0 Flash. It was a screenshot of a webpage, too.

Graybeards like Tesseract have moved to neural-network-based pipelines, and they're reinventing and improving themselves.

I was planning to train Tesseract on my own handwriting, but if OCR4All can handle that, I'll be happy.

aidenn0 · a year ago
Tesseract wildly outperforms any VLM I've tried (as of November 2024) for clean scans of machine-printed text. True, this is the best case for Tesseract, but by "wildly outperforms" I mean: given a page that Tesseract had a few errors on, the VLM misread the text everywhere that Tesseract did, plus more.

On top of that, the linked article suggests that Gemini 2.0 can't give meaningful bounding boxes for the text it OCRs, which further limits the places in which it can be used.

I strongly suspect that traditional OCR systems will become obsolete, but we aren't there yet.

gopher_space · a year ago
I just wrapped up a test project[0] based on a comment from that post! My takeaway was that there are a lot of steps in the process you can farm out to cheaper, faster ML models.

For example, the slowest part of my pipeline is picture description, since I need an LLM for that (and my project needs to run on low-end equipment). Locally I can spin up a tiny LLM and get one-word descriptions in a minute, but anything larger takes like 30. I might be able to only send the sections I don't have the hardware to process.

It was a good intro to ML models incorporating vision, and video is "just" another image pipeline, so it's been easy to look at e.g. facial recognition groupings like any document section.

[0] https://github.com/jnday/ocr_lol

tcascais · a year ago
I just used Gemini as an OCR a couple of hours ago because all the OCR apps I tried on Android failed at the task lol. Wild seeing this comment right after waking up
cnity · a year ago
For self hosting check out Qwen-VL: https://github.com/QwenLM/Qwen-VL
vintermann · a year ago
Yes, I agree general purpose is the way to go, but I'm still waiting. Gemini was the best the last time I tried, but for all the ways I've tried to prompt it, it cannot transcribe (or correctly understand the content of) e.g. the probate documents I try to decipher for my genealogy research.
dhon_ · a year ago
I've seen Gemini Flash 2 mention "in the OCR text" when responding to VQA tasks, which makes me wonder whether they have a traditional OCR process mixed into the pipeline.
exikyut · a year ago
> Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.

Maybe this is what the age of desktop AGI looks like.

chgs · a year ago
Wouldn’t an AI make assumptions and fix mistakes?

For example instead of

> The speiling standards were awful

It would produce

> The spelling standards were awful

yndoendo · a year ago
The issue with that is that some writing isn't word-based. People use acronyms and jargon: temporal, personalized, industry-specific, and global. At the beginning of the year, there were some HN posts about moving from dictionary-word to character encoding for LLMs, because of the highly varied nature of writing.

I've even used symbols with different meanings in shorthand when sketching out an idea.

I see it the same way as laws: their word definitions are anchored in time by the common dictionaries of the era. Grammar, spelling, and meanings all change through time. LLMs would require time-scoped information to properly parse content from 1400 versus 1900. An LLM would be trying to take meaning out of the content rather than retaining the work as written.

Character-based OCR ignores the rules, spelling, and meaning of words and provides what is most likely there. This retains any spelling and grammar errors, whether true or false positives, judged by the rules of their day.
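The risk of word-level "fixing" can be sketched with a toy corrector (the dictionary and example words are invented for illustration, and the corrector stands in for LLM-style normalization): it repairs a genuine OCR slip, but it also silently "fixes" a legitimate archaic spelling into its modern form.

```python
# Toy sketch: a dictionary-based corrector that snaps every unknown word
# to a nearby modern dictionary entry. It repairs true OCR errors -- and
# silently destroys archaic spellings that were correct in the source.

MODERN_WORDS = {"the", "spelling", "standards", "were", "awful", "physic"}

def edit1(word):
    """All strings one edit away from word (Norvig-style candidates)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    subs = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + subs + inserts + transposes)

def correct(word):
    if word in MODERN_WORDS:
        return word
    candidates = edit1(word) & MODERN_WORDS
    return min(candidates) if candidates else word  # deterministic pick

# A real OCR slip gets fixed:
print(correct("speiling"))   # -> "spelling"
# ...but a legitimate 17th-century spelling is also "fixed":
print(correct("physick"))    # -> "physic": the historical form is lost
```

A character-faithful transcription would have kept "physick" exactly as written, which is the point being made above.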

registeredcorn · a year ago
Could you dumb this down a bit (a lot) for dimmer readers, like myself? The way I am understanding the problem you are getting at is something like:

> The way person_1 in 1850 wrote a lowercase letter "l" will look consistently like a lowercase letter "l" throughout a document.

> The way person_2 in 1550 wrote a lowercase letter "l" may look more like an uppercase "K" in some parts, and more of a lowercase "l" in others, and the number "0" in other areas, depending on the context of the sentence within that document.

I don't get why you would need to see the entire document in order to gauge some of the details of those things. Does it have something to do with how language has changed over the centuries, or is it something more obvious that we can relate to fairly easily today? From my naive position, I feel like if I see a bunch of letters in modern English (assuming they are legible) I know what they are and what they mean, even if I just see them as individual characters. My assumption is that you are saying that there is something deeper in terms of linguistic context / linguistic evolution that I'm not aware of. What is that..."X factor"?

I will say, if nothing else, I can understand certain physical considerations. For example:

A person who is right-handed and writing near the right edge of a page may start to slant, because of the physical issue of the paper being high and the hand losing its grip. By comparison, someone who is left-handed might have very smudged letters because their hand naturally presses against fresh ink, or alternatively have very "light" letters because they hover their hand over the paper while the ink dries.

In those sorts of physical considerations, I can understand why it would matter to be able to see the entire page, because the manner in which they write could change depending on where they were in the page...but wouldn't the individual characters still look approximately the same? That's the bit I'm not understanding.

vintermann · a year ago
The lower case "e" in gothic cursive often looks like a lower case "r". If you see one of these: ſ maybe you think "ah, I know that one, that's an S!" and yes, it is, but some scribes, when writing a capital H, make something that looks a LOT like it. You need context to disambiguate. Think of it as a cryptogram: if you see a certain squiggle in a context where it's clearly an "r", you can assume that the other squiggles that look like that are "r"s too. Familiarity with a scribe's hand is often necessary to disambiguate squiggles, especially in words such as proper names, where linguistic context doesn't help you a lot. And it's often the proper names which are the most interesting part of a document.

But yes, writers can change style too. Mercifully, just like we sometimes use all caps for surnames, some writers would use antiqua-style handwriting (i.e. what we use today) for proper names in a document which is otherwise all gothic-style handwriting. But this certainly doesn't happen consistently enough that you can rely on it, and some writers have such messy handwriting that even then, you need context to know what they're doing.
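The "cryptogram" idea above can be sketched in a few lines. This is a toy illustration (the vocabulary and the ambiguous words are invented): an ambiguous squiggle, written here as `?`, is resolved by choosing the single reading that makes the most tokens into known words, then reusing that reading everywhere the squiggle appears.

```python
# Toy sketch of cryptogram-style disambiguation: commit to one reading
# of an ambiguous glyph ('?') across the WHOLE document, picking the
# letter that maximises dictionary hits over all tokens at once.

VOCAB = {"here", "rests", "herr", "pedersen", "the", "eldest", "son"}

def best_reading(tokens, candidates):
    """Pick one letter for '?' that turns the most tokens into
    known words (shared-context disambiguation)."""
    best, best_hits = None, -1
    for letter in candidates:
        words = [t.replace("?", letter) for t in tokens]
        hits = sum(w in VOCAB for w in words)
        if hits > best_hits:
            best, best_hits = letter, hits
    return best

# The same squiggle appears in three words; per-word guessing between
# gothic 'e' and 'r' could pick differently each time. Shared context
# forces a consistent answer.
tokens = ["h?re", "?ldest", "p?dersen"]
letter = best_reading(tokens, "er")
print(letter)                                   # 'e' makes all three real words
print([t.replace("?", letter) for t in tokens])
```

A line-by-line pipeline never sees enough of the document to apply this kind of constraint, which is the argument for whole-document context.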

cyanydeez · a year ago
The problem is that paying experts to properly train a model is expensive, doubly so when you want larger context.

It's almost like we need a shared commons to benefit society, but we're surrounded by hoarders who think they can just strip-mine society to automatically bootstrap intelligence.

Surprise: garbage CEOs in, garbage intelligence out.

abrichr · a year ago
From https://www.ocr4all.org/guide/user-guide/introduction :

> OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software.

Looks like it's built on https://github.com/Calamari-OCR/calamari

seu · a year ago
Looks like a great project, and I don't want to nitpick, but...

From https://www.ocr4all.org/about/ocr4all :

> Due to its comprehensible and intuitive handling OCR4all explicitly addresses the needs of non-technical users.

From https://www.ocr4all.org/guide/setup-guide/quickstart :

> Quickstart: Open a terminal of your choice and enter the following command if you're running Linux

(followed by a 6-line docker command)

How is that addressing the needs of non-technical users?

7bit · a year ago
Any end-user application that requires Docker is not an end-user application. It does not matter if the end user knows how to use Docker or not. End-user applications should be delivered as SaaS/WebUI or as a local binary (GUI or CLI). Period.
pbhjpbhj · a year ago
Application installation isn't a user level task. The application being ready for a user to use, and being easy to install are separate. You get your IT literate helper to install for you, then, if the program is easy for users to use you're golden.
lupusreal · a year ago
That's a very corporate mentality. Outside of an organizational context, installing applications certainly is a normal user level task. And for those users that have somebody help them, that somebody is usually just a younger person who's comfortable clicking 'Next' to get through an installer but certainly has no devops experience.
lupusreal · a year ago
"Silicate chemistry is second nature to us geochemists, so it's easy to forget that the average person probably only knows the formulas for olivine and one or two feldspars."
einpoklum · a year ago
s/non-technical users/technical users who are into docker and don't mind filling their computers with large files for no good reason/
lionkor · a year ago
Those are called "tech enthusiasts"
fny · a year ago
A little secret: Apple’s Vision Framework has an absurdly fast text recognition library with accuracy that beats Tesseract. It consumes almost any image format you can think of including PDFs.

I wrote a simple CLI tool and more featured Python wrapper for it: https://github.com/fny/swiftocr

Moto7451 · a year ago
This has been one of my favorite features Apple added. When I'm in a call and someone shares a page I need the link to, rather than interrupt the speaker and ask them to share the link, it's often faster to screengrab the URL and let Apple OCR the address and take me to the page/post it in chat.
jjice · a year ago
After getting an iPhone and exploring some of their API documentation after being really impressed with system provided features, I'm blown away by the stuff that's available. My app experience on iOS vs Android is night and day. The vision features alone have been insane, but their text recognition is just fantastic. Any image and even my god awful handwriting gets picked up without issue.

That said, I do love me a free and open source option for this kind of thing. I can't use it much since I'm not using Apple products for my desktop computing. Good on Apple though - they're providing some serious software value.

JeremyNT · a year ago
I can't comment on what Apple is doing here, but Google has an equivalent called "lens" which works really well and I use it in the way you suggest here.
ted_dunning · a year ago
The Google Photos app does really good OCR as well, actually.
eigenvalue · a year ago
I basically wrapped this in a simple iOS app that can take a PDF, turn it into images, and applies the native OCR to the images. It works shockingly well:

https://apps.apple.com/us/app/super-pdf-ocr/id6479674248

I probably should have just made it a free app so it would have gotten very popular, but oh well.

syntaxing · a year ago
How does it work with tables and diagrams? I have scanned pages with mixed media, some of which are diagrams. I want to be able to extract the text but also get coordinates for where the diagrams are in the image.
acheong08 · a year ago
I wonder if it's possible to reverse engineer that, rip it out, and put it on Linux. Would love to have that feature without having to use Apple hardware
maCDzP · a year ago
This seems to run locally?
criddell · a year ago
Just about everything beats Tesseract.
mometsi · a year ago
> How is this different from tesseract and friends?

The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.

amelius · a year ago
I didn't have good results with Tesseract, so I hope this is really different ;)

I was surprised that even scraped screen text did not work 100% flawlessly in tesseract. Maybe it was not made for that, but still, I had a lot of problems with high resolution photos also. I did not try scanned documents, though.

Moto7451 · a year ago
I have never had to handle handwriting professionally, but I have had great success with Tesseract in the past. I'm sure it's no longer the best free/cheap option, but with a little bit of image pre-processing to ensure the text pops from the background and isn't unnecessarily large (i.e. that 1200dpi scan is overkill) you can have a pretty nice pipeline with good results.

In the mid 2010s I put Tesseract, OCRad (which is decidedly not state of the art), and aspell into a pretty effective text processing pipeline to transform resumes into structured documents. The commercial solutions we looked at (at the time) were a little slower and about as good. If the spellcheck came back with too low of a success rate I ran the document through OCRad which, while simplistic, sometimes did a better job.

I expect the results today with more modern projects to be much better so I probably wouldn’t go that path again. However as all of it runs nicely on slow hardware, it likely still has a place on low power/hobby grade IoT boards and other niches.
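The spellcheck-gated fallback described above can be sketched in a few lines. This is a hedged illustration, not the commenter's actual code: the OCR engines are stubbed out (in a real pipeline they would shell out to tesseract and ocrad), and aspell is replaced by a tiny dictionary-hit-rate check.

```python
# Sketch of a spellcheck-gated OCR fallback: run the primary engine,
# and if too few words pass the dictionary check, retry with a second
# engine and keep whichever result scores higher. Engines are stubs.

KNOWN = {"senior", "engineer", "with", "ten", "years", "of", "experience"}

def dictionary_rate(text):
    """Fraction of words found in the dictionary (aspell stand-in)."""
    words = text.lower().split()
    return sum(w in KNOWN for w in words) / max(len(words), 1)

def ocr_pipeline(page, primary_ocr, fallback_ocr, threshold=0.8):
    text = primary_ocr(page)
    if dictionary_rate(text) >= threshold:
        return text
    alt = fallback_ocr(page)
    return alt if dictionary_rate(alt) > dictionary_rate(text) else text

# Stub engines standing in for tesseract and ocrad on a noisy page:
tesseract_stub = lambda page: "sen1or eng1neer with ten years of exper1ence"
ocrad_stub = lambda page: "senior engineer with ten years of experience"

print(ocr_pipeline("resume.png", tesseract_stub, ocrad_stub))
```

The threshold is the knob: lower it and the pipeline accepts the first engine's output more often, trading accuracy for speed.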

spigottoday · a year ago
I have a typewritten manuscript that is interspersed with handwritten editing. Tesseract worked fine until the handwritten parts, then garbage. Is there a local solution that anyone can recommend? I have a 16GB Lenovo laptop and access to a workstation with an RTX 4070 Ti 16GB card. Thanks.
bonefolder · a year ago
Tangentially related, but does someone know a resource for high-quality scans of documents in blackletter/Fraktur typesetting? I'm trying to make documents look Fraktur-y in LaTeX and would like any and all documents I can lay my hands on.
jjuliano · a year ago
If you are interested, I also made an AI assisted OCR API - https://github.com/kdeps/examples

It combines Tesseract (for images) and Poppler-utils (for PDFs). A local open-source LLM extracts document segments intelligently.

It can also be extended to use one or multiple Vision LLM models easily.

And finally, it outputs the entire AI agent API into a Dockerized container.

Krasnol · a year ago
> Designed with usability in mind
>
> Create complex OCR workflows through the UI without the need of interacting with code or command line interfaces.

[...] https://www.ocr4all.org/guide/setup-guide/windows

------------------

I'm sorry. I suppose this is great, but an .exe file is designed for usability. A Docker container may be nice for techy people, but it is not "4all" this way. I do understand that the usability starts after you've gone through all the command-line parts, but those are extra steps compared to other OCR programs, which work out of the box.

eigenvalue · a year ago
I think the current sweet-spot for speed/efficiency/accuracy is to use Tesseract in combination with an LLM to fix any errors and to improve formatting, as in my open source project which has been shared before as a Show HN:

https://github.com/Dicklesworthstone/llm_aided_ocr

This process also makes it extremely easy to tweak/customize simply by editing the English language prompt texts to prioritize aspects specific to your set of input documents.
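A minimal sketch of that Tesseract-then-LLM shape (not the project's actual code): OCR output is chunked on paragraph boundaries, each chunk is wrapped in an editable correction prompt, and the model call itself is stubbed out here since no particular API can be assumed.

```python
# Sketch of an OCR-then-LLM correction pipeline: chunk the raw OCR text,
# wrap each chunk in a plain-English prompt (this is the part you tweak
# per document set), and send each prompt to an LLM. The LLM is a stub.

PROMPT_TEMPLATE = """You are correcting OCR output. Fix obvious character
errors and formatting, but do NOT rephrase or modernise the text.
Domain hint: {hint}

OCR text:
{chunk}"""

def chunk_text(text, max_chars=1000):
    """Split on paragraph boundaries so each call keeps local context."""
    paras, chunks, cur = text.split("\n\n"), [], ""
    for p in paras:
        if len(cur) + len(p) > max_chars and cur:
            chunks.append(cur.strip())
            cur = ""
        cur += p + "\n\n"
    if cur.strip():
        chunks.append(cur.strip())
    return chunks

def correct_document(ocr_text, llm, hint="19th-century probate records"):
    prompts = [PROMPT_TEMPLATE.format(hint=hint, chunk=c)
               for c in chunk_text(ocr_text)]
    return "\n\n".join(llm(p) for p in prompts)

# Stub LLM for demonstration; swap in a real completion call here.
fake_llm = lambda prompt: prompt.rsplit("OCR text:\n", 1)[1].replace("1", "i")
print(correct_document("F1rst paragraph.\n\nSecond l1ne.", fake_llm))
```

Customizing behavior then really is just editing `PROMPT_TEMPLATE`, which matches the "tweak the English prompt texts" claim above.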

TheNovaBomb · a year ago
What kind of accuracy have you reached with this Tesseract+LLM pipeline? I imagine there would be a hard limit on how much the LLM can improve the OCR-extracted text from Tesseract, since Tesseract is far from perfect itself.

Haven't seen many people mention it, but I have just been using the PaddleOCR library on its own, and it has been very good for me, often achieving better quality/accuracy than some of the best VLMs, and generally much better than other open-source OCR models I've tried, like Tesseract.

That being said, my use case is definitely focused primarily on digital text, so if you're working with handwritten text, take this with a grain of salt.

https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_e...

https://huggingface.co/spaces/echo840/ocrbench-leaderboard

sgc · a year ago
Have you used your project on classical languages like Latin / Ancient Greek / Hebrew etc? Will the LLM fall flat in those cases, or be able to help?
eigenvalue · a year ago
I haven’t, but I bet it would work pretty well, particularly if you tweaked the prompts to explain that it’s dealing with Ancient Greek or whatever and give a couple examples of how to handle things.