Mistral OCR - Readit News

I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker .

Across 375 samples with LLM as a judge, mistral scores 4.32, and marker 4.41 . Marker can inference between 20 and 120 pages per second on an H100.

You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... .

The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon.

Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

lolinder · a year ago

> with LLM as a judge

For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.

Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.

I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?

[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...

themanmaran · a year ago

We also ran an OCR benchmark with LLM as judge using structured outputs. You can check out the full methodology on the repo [1]. But the general idea is:

- Every document has ground truth text, a JSON schema, and the ground truth JSON.

- Run OCR on each document and pass the result to GPT-4o along with the JSON Schema

- Compare the predicted JSON against the ground truth JSON for accuracy.

In our benchmark, the ground truth text => gpt-4o was 99.7%+ accuracy. Meaning whenever gpt-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, that means the inaccuracies are isolated to OCR errors.

https://github.com/getomni-ai/benchmark

vikp · a year ago

Benchmarking is hard for markdown because of the slight formatting variations between different providers. With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.

I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which doesn't make it fair.

There are a few different benchmark types in the marker repo:

  - Heuristic (edit distance by block with an ordering score)
  - LLM judging against a rubric
  - LLM win rate (compare two samples from different providers)

None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.

I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.

arthurcolle · a year ago

You can use structured outputs, or something like my https://arthurcolle--dynamic-schema.modal.run/

to extract real data from unstructured text (like that producted from an LLM) to make benchmarks slightly easier if you have a schema

carlgreene · a year ago

Thank you for your work on Marker. It is the best OCR for PDFs I’ve found. The markdown conversion can get wonky with tables, but it still does better than anything else I’ve tried

vikp · a year ago

Thanks for sharing! I'm training some models now that will hopefully improve this and more :)

netdevphoenix · a year ago

LLM as a judge?

Isn't that a potential issue? You are assuming the LLM judge is reliable. What evidence do you have to assure yourself or/and others that it is reasonable assumption

bfors · a year ago

Perhaps they already evaluated their LLM judge model (with another LLM)

Deleted Comment

ntkris · a year ago

This is awesome. Have you seen / heard of any benchmarks where the data is actually a structured JSON vs. markdown?

ChrisRob · a year ago

Thanks for the tip. Marker solved a table conversion without LLM that docling wasn't able to solve.

codelion · a year ago

Really interesting benchmark, thanks for sharing! It's good to see some real-world comparisons. The hallucinations issue is definitely a key concern with LLM-based OCR, and it's important to quantify that risk. Looking forward to seeing the full benchmark results.

DeathArrow · a year ago

>Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?

boxed · a year ago

Why wouldn't hallucinations be agreed upon if they have roughly the same training data?

stavros · a year ago

I like the licensing options! Hopefully they make enough money to fund development.

It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image:

https://i.imgur.com/jcwW5AG.jpeg

For the blocks in the center, it outputs:

> Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 août 1607 , 3 mai 1693 ; ép. 1○, le 26 septembre 1644, Diane - Henriette de Budos de Portes, morte le 2 décembre 1670; 2○, le 17 octobre 1672, Charlotte de l'Aubespine, morte le 6 octobre 1725.

This is perfect! But then the next one:

> Louis, commandeur de Malte, Louis de Fay Laurent bre 1644, Diane - Henriette de Budos de Portes, de Cressonsac. du Chastelet, mortilhomme aux gardes, 2 juin 1679.

This is really bad because

1/ a portion of the text of the previous bloc is repeated

2/ a portion of the next bloc is imported here where it shouldn't be ("Cressonsac"), and of the right most bloc ("Chastelet")

3/ but worst of all, a whole word is invented, "mortilhomme" that appears nowhere in the original. (The word doesn't exist in French so in that case it would be easier to spot; but the risk is when words are invented, that do exist and "feel right" in the context.)

(Correct text for the second bloc should be:

> Louis, commandeur de Malte, capitaine aux gardes, 2 juin 1679.)

layer8 · a year ago

> This is perfect!

Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”. These are https://fr.wikipedia.org/wiki/Adverbe_ordinal#Premiers_adver....

There’s also extra spaces after the “1607” and around the hyphen in “Diane-Henriette”.

Lastly, U+2019 instead of U+0027 would be more appropriate for the apostrophe, all the more since in the image it looks like the former and not like the latter.

MatthiasPortzel · a year ago

Slightly unrelated, but I once used Apple’s built-in OCR feature LiveText to copy a short string out of an image. It appeared to work, but I later realized it had copied “M” as U+041C (Cyrillic Capital Letter Em), causing a regex to fail to match. OCR giving identical characters is only good enough until it’s not.

jorvi · a year ago

> Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”

Or degree symbol. Although it should be able to figure out which to use according to the context.

TeMPOraL · a year ago

This is "reasoning model" stuff even for humans :).

raffraffraff · a year ago

It feels like, after the OCR step there should be language and subject matter detection, with a final sweep with a spelling / grammar checker that has the right "dictionary" selected. (That, right there, is my naivety on the subject, but I would have thought that the type of problem you're describing isn't OCR but classical spelling and grammar checking?)

bambax · a year ago

Another test with a text in English, which is maybe more fair (although Mistral is a French company ;-). This image is from Parliamentary debates of the parliament of New Zealand in 1854-55:

https://i.imgur.com/1uVAWx9.png

Here's the output of the first paragraph, with mistakes in brackets:

> drafts would be laid on the table, and a long discussion would ensue; whereas a Committee would be able to frame a document which, with perhaps a few verbal emundations [emendations], would be adopted; the time of the House would thus be saved, and its business expected [expedited]. With regard to the question of the comparative advantages of The-day [Tuesday]* and Friday, he should vote for the amendment, on the principle that the wishes of members from a distance should be considered on all sensations [occasions] where a principle would not be compromised or the convenience of the House interfered with. He hoped the honourable member for the Town of Christchurch would adopt the suggestion he (Mr. Forssith [Forsaith]) had thrown out and said [add] to his motion the names of a Committee.*

Some mistakes are minor (emnundations/emendations or Forssith/Forsaith), but others are very bad, because they are unpredictable and don't correspond to any pattern, and therefore can be very hard to spot: sensations instead of occasions, or expected in lieu of expedited... That last one really changes the meaning of the sentence.

spudlyo · a year ago

I want to rejoice that OCR is now a "solved" problem, but I feel like hallucinations are just as problematic as the kind of stuff I have to put up with tesseract -- both require careful manual proofreading for an acceptable degree of confidence. I guess I'll have to try it and see for myself just how much better these solutions are for my public domain archive.org Latin language reader & textbook projects.

qingcharles · a year ago

It depends on your use-case. For mine, I'm mining millions of scanned PDF pages to get approximate short summaries of long documents. The occasional hallucination won't damage the project. I realize I'm an outlier, and I would obviously prefer a solution that was as accurate as possible.

eMPee584 · a year ago

possibly doing both & diffing the output to spot contested bits?

thomasfromcdnjs · a year ago

Does anyone know the correlation between our abilities to parse PDF's and the quality of our LLM's training datasets?

If a lot of scientific papers have been pdf's and hitherto had bad conversions to text/tokens, can we expect to see major gains in our training and therefore better outputs?

rossant · a year ago

Your example doesn't seem that difficult to me.

Dead Comment

Kokichi · a year ago

All it ever does is hallucinate

owenpalmer · a year ago

This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer.

Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.

It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.

Lots of potential here.

[0] https://docs.withorbit.com/

NalNezumi · a year ago

>a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.

A tangent but this exact issue is what I was frustrated for a long time with pdf reader and reading science papers. Then I found sioyek that pops up a small window when you hover over links (references and equations and figures) and it solved it.

Granted, the pdf file must be in right format, so OCR could make this experience better. Just saying the UI component of that already exist

https://sioyek.info/

PerryStyle · a year ago

Zotero's PDF viewer also does this now. Being able to annotate PDFs and having a reference manager has been a life saver.

Thanks for the link! Good to know someone is working on something similar.

generalizations · a year ago

Wait does this deal with images?

ezfe · a year ago

The output includes images from the input. You can see that on one of the examples where a logo is cropped out of the source and included in the result.

Asraelite · a year ago

I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF.

randomNumber7 · a year ago

I never thought driving a car is harder than editing a pdf.

pzo · a year ago

It's not about harder but about what error you can tolerate. Here if you have accuracy 99% for many applications it's enough. If you have 99% accuracy per trip of no crash during self driving then you gonna be dead within a year very likely.

For cars we need accuracy at least 99.99% and that's very hard.

toephu2 · a year ago

I've been able to edit PDFs (95%+ of them) accurately for the past 10 years...

Apofis · a year ago

Foxit PDF exists...

Beijinger · a year ago

Master PDF Editor?

raunakchowdhuri · a year ago

We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini

A high level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and has a tendency to hallucinate with OCR, table structure, and drop content.

shrisukhani · a year ago

Anecdotally, we also found Gemini Flash to be better.

hackernewds · a year ago

meanwhile, you're comparing it to the output of almost a trillion dollar company

stann · a year ago

The tagline boasts that it is "introducing the world’s best document understanding API". So, holding them to their marketing seems fair

HaZeust · a year ago

... And? We're judging it for the merits of the technology it purports to be, not the pockets of the people that bankroll them. Probably not fair - sure, but when I pick my OCR, I want to pick SOTA. These comparisons and announcements help me find those.

comparisons to more outputs coming soon!

kbyatnal · a year ago

We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

However IMO, there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!

Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)

dml2135 · a year ago

One problem I’ve encountered at my small startup in evaluating OCR technologies is precisely convincing stakeholders that the “human-in-the-loop” part is both unavoidable, and ultimately beneficial.

PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake-oil, and I try to convey that the OCR solution they want is possible, but if you are unwilling to pay the tuning cost, it’s going to flop out of the gate. At that point they lose interest and move on to other priorities.

Yup definitely, and this is exactly why I built my startup. I've heard this a bunch across startups & large enterprises that we work with. 100% automation is an impossible target, because even humans are not 100% perfect. So how we can expect LLMs to be?

But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).

golergka · a year ago

It really depends on their fault tolerance. I think there's a ton of useful applications where OCR would be 99.9%, 99%, and even 98% reliable. Skillful product manager can keep these limitations in mind and work around them.

jocoda · a year ago

... unavoidable "human in the loop" - depends imo.

From the comments here, it certainly seems that for general OCR it's not up to snuff yet. Luckily, I don't have great ambitions.

I can see this working for me with just a little care upfront preprocessing now that I know where it falls over. It casually skips portions of the document, and misses certain lines consistently. Knowing that I can do a bit massaging, and feed it what I know it likes, and then reassemble.

I found in testing that it failed consistently at certain parts, but where it worked, it worked extremely well in contrast to other methods/services that I've been using.

risyachka · a year ago

>> Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

-OR- they can just use these APIs, and considering that they have a client base - which would prefer to not rewrite integrations to get the same result - they can get rid of most code base, replace it with llm api and increase margins by 90% and enjoy good life.

esafak · a year ago

They're going to become commoditized unless they add value elsewhere. Good news for customers.

techwizrd · a year ago

The challenge I have is how to get bounding boxes for the OCR, for things like redaction/de-identification.

dontlikeyoueith · a year ago

AWS Textract works pretty well for this and is much cheaper than running LLMs.

yeah that's a fun challenge — what we've seen work well is a system that forces the LLM to generate citations for all extracted data, map that back to the original OCR content, and then generate bounding boxes that way. Tons of edge cases for sure that we've built a suite of heuristics for over time, but overall works really well.

yfontana · a year ago

I'm working on a projet that uses PaddleOCR to get bounding boxes. It's far from perfect, but it's open source and good enough for our requirements. And it can mostly handle a 150 MB single-page PDF (don't ask) without completely keeling over.

einpoklum · a year ago

An LLM with billions of parameters for extracting text from a PDF (which isn't even a rasterized image) really does not "solve OCR".

nextworddev · a year ago

Your customer includes Checkr? Impressive. Are they referencable?

wahnfrieden · a year ago

btw - what 'dark patterns' does portkey contain?

mvac · a year ago

Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those by MinerU/PDF-Extract-Kit [1].

Also the collab link in the article is broken, found a functional one [2] in the docs.

[1] https://github.com/opendatalab/MinerU [2] https://colab.research.google.com/github/mistralai/cookbook/...

I've been searching relentlessly for something like this! I wonder why it's been so hard to find... is it the Chinese?

In any case, thanks for sharing.

thelittleone · a year ago

Have you had a chance to compare results from MinerU vs LLM such a Gemini 2.0 or anthropic's native PDF tool?

Yes, i have. The problem with using just an LLM is that while it reads and understands text, but it cannot reproduce it accurately. Additionaly the textbooks I've mentioned have many diagrams and illustrations in them (e.g. books on anatomy or biochemistry). I don't really care about extracting text from them, I just need them extracted as images alongside the text, and no LLM does that.

shekhargulati · a year ago

Mistral OCR made multiple mistakes in extracting this [1] document. It is a two-page-long PDF in Arabic from the Saudi Central Bank. The following errors were observed:

- Referenced Vision 2030 as Vision 2.0. - Failed to extract the table; instead, it hallucinated and extracted the text in a different format. - Failed to extract the number and date of the circular.

I tested the same document with ChatGPT, Claude, Grok, and Gemini. Only Claude 3.7 extracted the complete document, while all others failed badly. You can read my analysis here [2].

1. https://rulebook.sama.gov.sa/sites/default/files/en_net_file... 2. https://shekhargulati.com/2025/03/05/claude-3-7-sonnet-is-go...