PixelPanda · 8 months ago
Full disclaimer: I work at Nanonets

Excited to share Nanonets-OCR-s, a powerful and lightweight (3B-parameter) VLM that converts documents into clean, structured Markdown. The model is trained to understand document structure and content context (tables, equations, images, plots, watermarks, checkboxes, etc.). Key features (a minimal inference sketch follows the links below):

LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.

Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.

Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.

Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.

Smart Checkbox & Radio Button Handling: Converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.

Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.

Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s

Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
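
Since the base model is Qwen2.5-VL-3B, it runs through the standard Hugging Face transformers image-text-to-text interface. A minimal inference sketch for a single page image (the prompt wording and generation settings below are illustrative; see the model card for the canonical usage):

  # Sketch: run Nanonets-OCR-s on one page image with transformers.
  # Prompt text and settings are illustrative, not the model card's canonical ones.
  from PIL import Image
  from transformers import AutoProcessor, AutoModelForImageTextToText

  model_id = "nanonets/Nanonets-OCR-s"
  model = AutoModelForImageTextToText.from_pretrained(
      model_id, torch_dtype="auto", device_map="auto"
  )
  processor = AutoProcessor.from_pretrained(model_id)

  image = Image.open("page_1.png")
  messages = [{
      "role": "user",
      "content": [
          {"type": "image"},
          {"type": "text", "text": "Convert this document page to structured Markdown."},
      ],
  }]
  prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
  inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=4096)
  new_tokens = out[:, inputs["input_ids"].shape[1]:]  # decode only the generated part
  print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])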

RicoElectrico · 8 months ago
Could it be used (maybe with the help of a downstream LLM) to parse a photo/PDF of a restaurant menu into a JSON file conforming to a schema? Or would bigger, hosted multimodal LLMs work better in such a case?
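Something like this two-stage sketch is what I have in mind: hand the OCR'd Markdown to a downstream LLM and validate the result against a schema (`call_llm` here is a hypothetical stand-in for whatever hosted model ends up being used):

  # Sketch of the downstream stage: OCR'd Markdown menu -> schema-validated JSON.
  # `call_llm` is a hypothetical stand-in for an actual LLM client.
  import json
  from pydantic import BaseModel

  class MenuItem(BaseModel):
      name: str
      description: str | None = None
      price: float

  class Menu(BaseModel):
      sections: dict[str, list[MenuItem]]

  def menu_from_markdown(markdown: str, call_llm) -> Menu:
      prompt = (
          "Convert this restaurant menu to JSON matching this schema:\n"
          + json.dumps(Menu.model_json_schema())
          + "\n\nMenu:\n" + markdown + "\n\nReturn only JSON."
      )
      return Menu.model_validate_json(call_llm(prompt))  # raises if off-schema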
arkh · 8 months ago
So it feels like this could finally let me do one thing I've wanted for some time: scan printed documents and generate structured PDFs (and not PDFs as mere picture containers).
wisdomseaker · 8 months ago
Would any of this be able to handle magazine layouts? I've yet to find anything that can follow their fairly random layouts, with text at varying angles, etc.
uselesswords · 8 months ago
Have you found that accuracy improves or scales with larger models? Or are the improvements, if any, marginal compared to the 3B model?
gibsonf1 · 8 months ago
Does it hallucinate, given the LLM underneath?
michaelt · 8 months ago
Sometimes. I just fed the huggingface demo an image containing some rather improbable details [1] and it OCRed "Page 1000000000000" with one extra trailing zero.

Honestly I was expecting the opposite: a repetition penalty kicking in after repeating zero too many times, resulting in too few zeros. But apparently not. So you might want to steer clear of this model if your document has a trillion pages.
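
For the curious, that penalty is an explicit generation-time knob in transformers; an illustrative one-liner (the value is arbitrary, and `model`/`inputs` are as in any standard transformers setup):

  # Illustrative: repetition_penalty > 1.0 down-weights already-emitted tokens.
  out = model.generate(**inputs, max_new_tokens=4096, repetition_penalty=1.3)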

Other than that, it did a solid job - I've certainly seen worse attempts to OCR a table.

[1] https://imgur.com/a/8rJeHf8

nattaylor · 8 months ago
The base model is Qwen2.5-VL-3B and the announcement says a limitation is "Model can suffer from hallucination"
generalizations · 8 months ago
Does it have a way to extract the images themselves, or is that still a separate process later?
j45 · 8 months ago
If you are after extracting images from PDFs, there are plenty of tools that do that just fine without LLMs.
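For example, PyMuPDF pulls embedded images straight out of the PDF's object table; a minimal sketch (filenames are hypothetical):

  # Minimal sketch: extract embedded images from a PDF with PyMuPDF, no model needed.
  import pathlib
  import fitz  # PyMuPDF

  doc = fitz.open("scan.pdf")
  out_dir = pathlib.Path("images")
  out_dir.mkdir(exist_ok=True)
  for page_num, page in enumerate(doc):
      for i, img in enumerate(page.get_images(full=True)):
          xref = img[0]                    # object id of the embedded image
          info = doc.extract_image(xref)   # raw bytes plus original extension
          (out_dir / f"p{page_num}_{i}.{info['ext']}").write_bytes(info["image"])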
kordlessagain · 8 months ago
I created a Powershell script to run this locally on any PDF: https://gist.github.com/kordless/652234bf0b32b02e39cef32c71e...

It does work, but it is very slow on my older GPU (an NVIDIA GTX 1080 with 8 GB). I would say it's taking at least 5 minutes per page right now, possibly more.

Edit: If anyone is interested in trying a PDF-to-Markdown conversion utility built on this and hosted on Cloud Run (with GPU support), let me know. It should be done in about an hour or so, and I will post a link up here when it's done.

kordlessagain · 8 months ago
Reporting back on this, here's some sample output from https://www.sidis.net/animate.pdf:

  THE ANIMATE
  AND THE INANIMATE

  WILLIAM JAMES SIDIS

  <img>A black-and-white illustration of a figure holding a book with the Latin phrase "ARTI et VERITATI" below it.</img>

  BOSTON

  RICHARD G. BADGER, PUBLISHER

  THE GORHAM PRESS

  Digitized by Google

I haven't seen ANY errors in what it has done, which is quite impressive.

Here it's doing the table of contents (I used a slightly different copy of the PDF than the one I linked to):

  <table>
    <tr>
      <td>Chapter</td>
      <td>Page</td>
    </tr>
    <tr>
      <td>PREFACE</td>
      <td>3</td>
    </tr>
    <tr>
      <td>I. THE REVERSE UNIVERSE</td>
      <td>9</td>
    </tr>
    <tr>
      <td>II. REVERSIBLE LAWS</td>
      <td>14</td>
    </tr>
Other than the fact that it is ridiculously slow, this seems to be quite good at doing what it says it does.

2pointsomone · 8 months ago
Very very interested!
kordlessagain · 8 months ago
Ok, I have it built but things came up and I'm testing this morning (probably still broken but the code is all there):

https://github.com/kordless/gnosis-ocr

el_don_almighty · 8 months ago
I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.

Now I need a catalog, archive, or historian function that archives and pulls the elements easily. Amazing work!

pxc · 8 months ago
Can't you just start with unoconv or pandoc, then maybe use an LLM to clean up after converting to plain text?
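For the Office-format side, pandoc alone covers a lot of it; a sketch, assuming pandoc is on PATH (note pandoc reads .docx but not .pptx as input, so slides still need another route):

  # Sketch: batch-convert DOCX to Markdown with pandoc, keeping embedded media
  # on disk so individual elements can be repurposed. Assumes pandoc is installed.
  import pathlib
  import subprocess

  for doc in pathlib.Path("archive").rglob("*.docx"):
      subprocess.run(
          ["pandoc", str(doc), "-t", "gfm",
           "-o", str(doc.with_suffix(".md")),
           f"--extract-media={doc.with_suffix('')}_media"],
          check=True,
      )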
toledocavani · 8 months ago
Which decade? DOCX and PPTX are just zipped XML; seems pretty standard to me.
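You can look inside with nothing but the standard library (the filename and slide path here are hypothetical):

  # A .pptx (or .docx) is a ZIP of XML parts; the stdlib is enough to inspect it.
  import zipfile

  with zipfile.ZipFile("deck.pptx") as z:
      print(z.namelist()[:10])   # e.g. ppt/slides/slide1.xml, ppt/media/image1.png
      slide_xml = z.read("ppt/slides/slide1.xml").decode("utf-8")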
mvac · 8 months ago
How does it compare to Datalab/Marker https://github.com/datalab-to/marker ? We evaluated many PDF->MD converters and this one performed the best, though it is not perfect.
nxobject · 8 months ago
As anecdotal evidence, it serves my complex-enough purposes (mathematics and code interspersed together) very well. One of my "litmus test" papers is an old paper on a Fortran inverse Laplace transform algorithm [1] that intersperses inline and display equations with monospace code blocks, while requiring OCR from scratch, and very few models currently do a satisfactory job. For example, in the following page transcribed by Marker,

https://imgur.com/a/Q7UYIfW

the inline $\sigma_0$ is mangled as "<sup>s</sup> 0", and $f(t)$ is mangled as "f~~t*!". The current model gets them both correct.

vikp · 8 months ago
Hi, author of marker here - I tried your image, and I don't see the issues you're describing with the newest version of marker (1.7.5).

I ran both with no setting specified, and with force_ocr, and I didn't see the issues either time.

wittjeff · 8 months ago
I am just getting started with my own cross-comparison, would appreciate your list of considered candidates if you have it handy.
ks2048 · 8 months ago
It’s a shame all these models target markdown and not something with more structure and a specification. There are different flavors of Markdown and limited support for footnotes, references, figures, etc.
souvik3333 · 8 months ago
Actually, we trained the model to convert to Markdown and do semantic tagging at the same time. E.g., equations are extracted as LaTeX, and images (plots, figures, and so on) are described within `<img>` tags. Same with `<signature>`, `<watermark>`, and `<page_number>`.

Also, we extract complex tables as HTML instead of Markdown.
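
Downstream, those tags are straightforward to pull back out of the Markdown; a sketch (tag set from above, the regex parsing itself is just one obvious approach):

  # Sketch: recover the semantic spans from the model's tagged Markdown output.
  # Tag names are the ones listed above; the regex approach is an assumption.
  import re

  TAGS = ("img", "signature", "watermark", "page_number")

  def semantic_spans(markdown: str) -> dict[str, list[str]]:
      return {
          tag: re.findall(rf"<{tag}>(.*?)</{tag}>", markdown, flags=re.DOTALL)
          for tag in TAGS
      }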

mgr86 · 8 months ago
Have you considered XML? TEI, for example, is very robust and mature for marking up documents.
jtbayly · 8 months ago
What happens to footnotes?
starkparker · 8 months ago
I was more excited to hear about "structured Markdown" than about the LLM OCR model, but the extent of it just seems to be tagging certain elements. That's useful in the LLM context, but not as much outside of it.
agoose77 · 8 months ago
Feel free to check out MyST Markdown, which very much aims to specify "structured Markdown": https://mystmd.org
temp0826 · 8 months ago
I have a Shipibo (indigenous Peruvian language) to Spanish dictionary that I've been trying to translate into a Shipibo-to-English dictionary using a couple of different LLMs, but I keep struggling with the formatting (two columns, strange line breaks, and both Shipibo and Spanish mixed in the definitions make it difficult to grok). All that, plus it's pretty poorly scanned. May need to give this a try.
ZQ-Dev8 · 8 months ago
How's this compare with docling (https://github.com/docling-project/docling)?
constantinum · 8 months ago
It would be interesting to know how it compares with Llamaparse, LLMWhisperer, Marker, Reducto
prats226 · 8 months ago
Unfortunately my Reducto account was disabled right after this launch. But I'll be uploading benchmarks for the rest at https://idp-leaderboard.org/