Readit News
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
dang · 2 months ago
> happy to run additional documents if people want to share examples

I've got one! The pdf of this out-of-print book is terrible: https://archive.org/details/oneononeconversa0000simo. The text is unreadably faint, and the underlying text layer is full of errors, so copy-paste is almost useless. Can your software extract usable text?

(I'll email you a copy of the pdf for convenience since the internet archive's copy is behind their notorious lending wall)

ritvikpandey21 · 2 months ago
Results look pretty good (with the exception of one very faint page) - check it out here! https://platform.runpulse.com/dashboard/extractions/public/f...
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
think4coffee · 2 months ago
Congrats on the launch! You mention that you're SOTA on benchmarks. Can you share your research, or share which benchmark you used?
ritvikpandey21 · 2 months ago
thanks! we benchmark against all the major players (azure doc intelligence, aws textract, google doc ai, frontier llms, etc). we have some public news coming out soon on this front, but we have a very rigorous dataset using both public and synthetic data focusing on the hardest problems in the space (handwriting, tables, etc).
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
mritchie712 · 2 months ago
> Why LLMs Suck at OCR

I paste screenshots into claude code every day and it's incredible. As in, I can't believe how good it is. I send a screenshot of console logs, a UI and some HTML elements and it just "gets it".

So saying they "Suck" makes me not take your opinion seriously.

ritvikpandey21 · 2 months ago
yeah models are definitely improving, but we've found even the latest ones still hallucinate and infer text rather than doing pure transcription. we carry out very rigorous benchmarks against all of the frontier models. we think the differentiation is in accuracy on truly messy docs (nested tables, degraded scans, handwriting) and being able to deploy on-prem/vpc for regulated industries.
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
scottydelta · 2 months ago
AI models will eventually do this natively. This is one of the ways for models to continue to get better, by doing better OCR and by doing better context extraction.

I am already seeing this trend in the recent releases of the native models (such as Opus 4.5, Gemini 3, and especially Gemini 3 flash).

It's only going to get better from here.

Another thing to note: if I remember correctly, there are over 5 startups in the YC portfolio right now doing the same thing and going after a similar/overlapping target market.

ritvikpandey21 · 2 months ago
yeah models are definitely improving, but we've found even the latest ones still hallucinate and infer text rather than doing pure transcription. we carry out very rigorous benchmarks against all of the frontier models. we think the differentiation is in accuracy on truly messy docs (nested tables, degraded scans, handwriting) and being able to deploy on-prem/vpc for regulated industries.
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
lajr · 2 months ago
Hey, congratulations on the launch. Just noticed a discrepancy in the financial 10K example:

There is a section near the start where there are 4 options: Large accelerated filer, Non-accelerated filer, Accelerated filer, or Smaller reporting company.

In this section, "Large accelerated filer" is checked on the PDF, but "Non-accelerated filer" is checked on the Markdown.

ritvikpandey21 · 2 months ago
thanks for the flag! have pointed this out, will be pushing an update here shortly
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
mikert89 · 2 months ago
AI models will do all this natively
ritvikpandey21 · 2 months ago
we disagree! we've found llms by themselves aren't enough and suffer from pretty big failure modes like hallucination and inferring text rather than pure transcription. we wrote a blog about this [1]. the right approach so far seems to be a hybrid workflow that uses very specific parts of the language model architecture.

[1] https://www.runpulse.com/blog/why-llms-suck-at-ocr

ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
throw03172019 · 2 months ago
Congrats on the launch! We have been using this for a new feature we are building in our SaaS app. Its results were better than Datalab's in our tests, especially in the handwriting category.
ritvikpandey21 · 2 months ago
thanks! appreciate the kind words
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
rtaylorgarlock · 2 months ago
Has docling improved? I had a bit of a nightmare integrating a docling pipeline earlier this year. Docs said it was VLM-ready, which I spent lots of hours finding out was not true, just to find a relevant github issue which would've saved me a ton of hours :/ allegedly fixed, but wow that burned me bigtime.
ritvikpandey21 · 2 months ago
our team has tested docling pretty extensively, works well for simpler text-heavy docs without complex layouts, but the moment you introduce tables or multi-column stuff it doesn't maintain layout well.
ritvikpandey21 commented on Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction    · Posted by u/sidmanchkanti21
asdev · 2 months ago
How is this different from Extend (also YC)?
ritvikpandey21 · 2 months ago
we're more focused on the core extraction layer itself rather than workflow tooling. we train our own vision models for layout detection, ocr, and table parsing from scratch. the key thing for us is determinism and auditability, so outputs are reproducible run over run, which matters a lot for regulated enterprises.
ritvikpandey21 commented on     · Posted by u/ritvikpandey21
ritvikpandey21 · 4 months ago
DeepSeek AI just released DeepSeek-OCR, a new open-source model that aims to rethink text extraction through what it calls Context Optical Compression. The launch quickly caught attention on X and GitHub, with many celebrating another big step in open document AI.

At Pulse, we were curious how it performs on the kinds of messy, high-density documents that power real business workflows. So we ran DeepSeek-OCR through our standard evaluation suite: multi-page PDFs, handwritten forms, nested tables, and scanned statements. The results were promising in theory but inconsistent in practice.

u/ritvikpandey21

Karma: 182 · Cake day: April 16, 2024