This is in the problem description of your pitch, and leads me to believe that tile.run has been solving this problem. Is that right?
> Coming Soon:
> - Improved accuracy
Can you expand on that?
I have a large need for this sort of tooling, but accuracy is my primary concern.
On the accuracy point: given our work so far, we believe we're best in class for document extraction. We've also set up a system of internal evaluations that lets us keep iterating and improving (hence the mention that we want to continue working on it).
I can only speak to our experience. Once you get under the hood, you find that this is a hard problem to solve.
There are also a lot of workflows that involve documents in every sector and every function. In other words, the opportunity is massive.
For our product, our customers are either internal engineering teams or folks building products that require document extraction but don’t want to invest time in it.
The tool is fully self-serve and lets you set up, upload, and access documents on your own.
We clearly need to call that out more, so we'll add it to the landing page.
Across 375 samples with an LLM as a judge, Mistral scores 4.32 and marker 4.41. Marker can run inference at 20 to 120 pages per second on an H100.
You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison...
The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... I'll run a full benchmark soon.
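For anyone unfamiliar with "LLM as a judge", here's a rough sketch of the idea: send the reference text and the OCR output to a model and ask for a 1-5 faithfulness score, then average over samples. The prompt, judge model, and function names below are my own illustrative assumptions, not the actual benchmark code linked above:

```python
# Minimal sketch of LLM-as-judge scoring for OCR output.
# Prompt wording, judge model, and data layout are illustrative
# assumptions, not the linked marker benchmark code.
import re
from statistics import mean

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate how faithfully the OCR output reproduces the
reference text, on a scale of 1 (unusable) to 5 (perfect).
Reply with a single number.

Reference:
{reference}

OCR output:
{candidate}"""


def judge_score(reference: str, candidate: str) -> float:
    """Ask the judge model for a 1-5 faithfulness score."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference,
                                           candidate=candidate),
        }],
        temperature=0,  # deterministic scoring
    )
    match = re.search(r"[1-5](?:\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else 0.0


def mean_score(samples: list[tuple[str, str]]) -> float:
    """Average judge score over (reference, ocr_output) pairs."""
    return mean(judge_score(ref, out) for ref, out in samples)
```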
Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.