Readit News
kapitalx commented on Ask HN: What are you working on? (September 2025)    · Posted by u/david927
kapitalx · 3 months ago
https://doctly.ai

We're building Doctly.ai - PDF Extraction with AI.

We started out with document conversions to Markdown but quickly realized that most use cases were for JSON conversion. We recently launched our "Extractor Studio" where you can have AI analyze a few sample variations of your documents and come up with a schema for you and publish it to an API endpoint.

We've built a technique on top of AI models that dramatically improves the run-to-run consistency of JSON output.

Check out the blog post here: https://medium.com/@abasiri/introducing-doctlys-extractor-st...

kapitalx commented on PDF to Text, a challenging problem   marginalia.nu/log/a_119_p... · Posted by u/ingve
bob1029 · 7 months ago
When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.

kapitalx · 7 months ago
This is approximately the approach we're also taking at https://doctly.ai. Add to that a "multiple experts" approach for analyzing the image (for our 'ultra' version), and we get really good results. We're constantly making it better.
kapitalx commented on Show HN: Qwen-2.5-32B is now the best open source OCR model   github.com/getomni-ai/ben... · Posted by u/themanmaran
azinman2 · 9 months ago
News update: OCR company touts new benchmark that shows its own products are the most performant.
kapitalx · 9 months ago
To be fair, they didn't include themselves at all in the graph.
kapitalx commented on Show HN: Qwen-2.5-32B is now the best open source OCR model   github.com/getomni-ai/ben... · Posted by u/themanmaran
kapitalx · 9 months ago
If you're limited to open source models, that's very true. But for larger models, and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate Markdown step) with our solution at https://doctly.ai.
kapitalx · 9 months ago
In addition, Gemini 2.5 Pro does really well with bounding boxes, but yeah, not open source :(
kapitalx commented on Show HN: Qwen-2.5-32B is now the best open source OCR model   github.com/getomni-ai/ben... · Posted by u/themanmaran
ks2048 · 9 months ago
I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.

For applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems unfeasible w/o bounding boxes to quickly check for errors.

kapitalx · 9 months ago
If you're limited to open source models, that's very true. But for larger models, and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate Markdown step) with our solution at https://doctly.ai.

u/kapitalx

Karma: 1077 · Cake day: September 9, 2009
About
Cofounder - Doctly.ai

Co-author of Principles of Chaos Engineering - https://principlesofchaos.org/
