Excited to share Nanonets-OCR-s, a powerful and lightweight (3B-parameter) VLM that converts documents into clean, structured Markdown. This model is trained to understand document structure and content context (tables, equations, images, plots, watermarks, checkboxes, etc.).
Key Features:
LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.
Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.
Smart Checkbox & Radio Button Handling: Converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.
Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
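For anyone who wants to try it locally, here is a minimal inference sketch using the Hugging Face transformers "image-text-to-text" pipeline; the file name and the prompt wording below are placeholders and may differ from the exact prompt on the model card.

```python
from transformers import pipeline

# Nanonets-OCR-s is a 3B VLM, so it fits on a single consumer GPU.
# Requires a recent transformers (and accelerate for device_map="auto").
ocr = pipeline("image-text-to-text", model="nanonets/Nanonets-OCR-s", device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "page_1.png"},  # placeholder path to one scanned page
        {"type": "text", "text": (
            "Extract the text from this document as structured Markdown. "
            "Use LaTeX for equations, <img> tags for image descriptions, and "
            "<signature>/<watermark>/<page_number> tags where they apply."
        )},
    ],
}]

result = ocr(text=messages, max_new_tokens=4096, return_full_text=False)
print(result[0]["generated_text"])
```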
Could it be used (maybe with the help of a downstream LLM) to parse a photo/PDF of a restaurant menu into a JSON file conforming to a schema? Or would bigger, hosted multimodal LLMs work better in such a case?
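For what it's worth, a rough sketch of that two-step route: OCR the menu to Markdown first, then have a downstream LLM map the Markdown onto a schema and validate the result. The schema fields and the `call_llm` callable are hypothetical placeholders, not part of either model.

```python
from typing import Callable, Optional
from pydantic import BaseModel

class MenuItem(BaseModel):
    name: str
    description: Optional[str] = None
    price: Optional[float] = None  # None when the scan/OCR drops the price

class MenuSection(BaseModel):
    title: str
    items: list[MenuItem]

class Menu(BaseModel):
    restaurant: Optional[str] = None
    sections: list[MenuSection]

def extract_menu(ocr_markdown: str, call_llm: Callable[[str], str]) -> Menu:
    """Map OCR'd Markdown onto the Menu schema via any instruction-following LLM.

    `call_llm` is a placeholder for whichever API you use (hosted or local);
    it takes a prompt string and returns the model's raw text response.
    """
    prompt = (
        "Convert this restaurant menu (Markdown produced by OCR) into JSON that "
        f"validates against this JSON schema:\n{Menu.model_json_schema()}\n\n"
        f"Menu:\n{ocr_markdown}\n\nReturn only the JSON object."
    )
    # Validation catches malformed or hallucinated fields before they reach your app.
    return Menu.model_validate_json(call_llm(prompt))
```

Whether a big hosted multimodal model does better in one shot probably depends on how messy the photo is; keeping the OCR and the schema mapping as separate steps at least makes each half easy to debug.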
So it feels like this finally lets me do something I've wanted for some time: scan printed documents and generate structured PDFs (rather than PDFs that are just containers for page images).
Would any of this be able to handle magazine layouts? I've yet to find anything that can follow their fairly random layouts, with text at varying angles, etc.
Sometimes. I just fed the huggingface demo an image containing some rather improbable details [1] and it OCRed "Page 1000000000000" with one extra trailing zero.
Honestly I was expecting the opposite - a repetition penalty to kick in having repeated zero too many times, resulting in too few zeros - but apparently not. So you might want to steer clear of this model if your document has a trillion pages.
Other than that, it did a solid job - I've certainly seen worse attempts to OCR a table.
[1] https://imgur.com/a/8rJeHf8
It does work, but it is very slow on my older GPU (Nvidia 1080 8GB). I would say it's taking at least 5 minutes per page right now, but maybe more.
Edit: If anyone is interested in trying a PDF-to-Markdown conversion utility built with this that is hosted on Cloud Run (with GPU support), let me know. It should be done in about an hour or so, and I will post a link up here when it's done.
THE ANIMATE
AND THE INANIMATE
WILLIAM JAMES SIDIS
<img>A black-and-white illustration of a figure holding a book with the Latin phrase "ARTI et VERITATI" below it.</img>
BOSTON
RICHARD G. BADGER, PUBLISHER
THE GORHAM PRESS
Digitized by Google
I haven't seen ANY errors in what it has done, which is quite impressive.
Here, it's doing tables of contents (I used a slightly different copy of the PDF than I linked to):
Other than the fact that it is ridiculously slow, this seems to be quite good at doing what it says it does. https://github.com/kordless/gnosis-ocr
I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.
Now I need a catalog, archive, or historian function that archives and pulls the elements easily. Amazing work!
How does it compare to Datalab/Marker (https://github.com/datalab-to/marker)? We evaluated many PDF->MD converters and this one performed the best, though it is not perfect.
As anecdotal evidence, it serves my complex-enough purposes very well - mathematics and code interspersed together. One of my "litmus test" papers is this old paper on a Fortran inverse-Laplace transform algorithm [1] that intersperses inline and display equations and monospace code blocks while requiring OCR from scratch, and very few models currently do a satisfactory job. For example, in the following page transcribed by Marker (https://imgur.com/a/Q7UYIfW), the inline $\sigma_0$ is mangled as "<sup>s</sup> 0", and $f(t)$ is mangled as "f~~t*!". The current model gets them both correct.
I ran both with no setting specified, and with force_ocr, and I didn't see the issues either time.
It’s a shame all these models target markdown and not something with more structure and a specification. There are different flavors of Markdown and limited support for footnotes, references, figures, etc.
Actually, we have trained the model to convert to Markdown and do semantic tagging at the same time. E.g., equations will be extracted as LaTeX, and images (plots, figures, and so on) will be described within `<img>` tags. Same with `<signature>`, `<watermark>`, and `<page_number>`.
Also, we extract complex tables as HTML tables instead of Markdown.
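For downstream use, those tagged spans are straightforward to separate from the plain Markdown. A small post-processing sketch (the tag list simply mirrors the tags described above; this is not an official Nanonets utility):

```python
import re

# Semantic tags the model is described as emitting alongside the Markdown.
SEMANTIC_TAGS = ("img", "signature", "watermark", "page_number")

def collect_semantic_tags(markdown: str) -> dict[str, list[str]]:
    """Return the inner text of every tagged span, grouped by tag name."""
    return {
        tag: re.findall(rf"<{tag}>(.*?)</{tag}>", markdown, flags=re.DOTALL)
        for tag in SEMANTIC_TAGS
    }

def strip_semantic_tags(markdown: str) -> str:
    """Remove watermark/page-number/signature spans and unwrap <img> tags,
    keeping the image descriptions inline since they are useful LLM context."""
    for tag in ("watermark", "page_number", "signature"):
        markdown = re.sub(rf"<{tag}>.*?</{tag}>", "", markdown, flags=re.DOTALL)
    return re.sub(r"</?img>", "", markdown)
```

Complex tables can be pulled out the same way with an HTML parser, since they are emitted as literal <table> markup rather than Markdown.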
I was more excited to hear about "structured Markdown" than the LLM OCR model, but the extent of it just seems to be tagging certain elements. It's useful in the LLM context but not as much outside of it.
I have a Shipibo (an indigenous Peruvian language) to Spanish dictionary that I've been trying to translate into a Shipibo-to-English dictionary using a couple of different LLMs, but I keep struggling with the formatting (two columns, strange line breaks, and the mix of Shipibo and Spanish in the definitions makes it difficult to grok). All that, plus it's pretty poorly scanned. May need to give this a try.