nhirschfeld (u/nhirschfeld)

nhirschfeld commented on Show HN: Kreuzberg – Modern async Python library for document text extraction github.com/Goldziher/kreu... · Posted by u/nhirschfeld

ideashower · 10 months ago

Is there something like this for handwritten documents? I know newer models have been really good at handwriting transcription.

nhirschfeld · 10 months ago

You'll need to use a different OCR engine. Look at EasyOCR.

nhirschfeld commented on Show HN: Kreuzberg – Modern async Python library for document text extraction github.com/Goldziher/kreu... · Posted by u/nhirschfeld

maleldil · 10 months ago

The API is pretty nice and easy to get started with, but unfortunately I couldn't get good results parsing scientific paper PDFs (including with OCR). Are there plans to use other backends? Docling works alright, and LLMs like Gemini Flash are interesting too.

nhirschfeld · 10 months ago

Yes, there have already been several suggestions here for other backends.

You should try using a different PSM to see if you get better results.

If it's scientific texts specifically, look at GROBID.
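
For anyone who wants to try the PSM suggestion outside of Kreuzberg, here is a minimal sketch that calls the Tesseract CLI directly with a chosen page segmentation mode. The helper names (`build_tesseract_command`, `ocr_with_psm`) are made up for illustration and are not part of Kreuzberg; it assumes a `tesseract` binary is on your PATH.

```python
import subprocess

# A few of Tesseract's page segmentation modes (valid values are 0 to 13).
PSM_MODES = {
    3: "fully automatic page segmentation, no OSD (default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    11: "sparse text, in no particular order",
}


def build_tesseract_command(image_path: str, psm: int, lang: str = "eng") -> list[str]:
    """Build the CLI invocation; hypothetical helper, not a Kreuzberg API."""
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def ocr_with_psm(image_path: str, psm: int, lang: str = "eng") -> str:
    """Run tesseract with the given PSM and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For two-column papers it may be worth comparing the output of modes 3, 4 and 6 on the same page image; which one works best depends on the layout.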

nhirschfeld commented on Show HN: Kreuzberg – Modern async Python library for document text extraction github.com/Goldziher/kreu... · Posted by u/nhirschfeld

skavi · 10 months ago

in that case, what’s the deal with extract_bytes being async? i’m not incredibly familiar with python, but i’d expect a “byte string” to be in memory.

nhirschfeld · 10 months ago

You still need to write it to a file to process it via pandoc/tesseract etc.

There are alternative options to tesseract ofc.
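
To make the point about in-memory bytes concrete, here is a rough sketch of why such a function still ends up doing IO: the bytes are written to a temporary file, and an external tool is run on that file. This is not Kreuzberg's actual implementation; `run_tool_on_bytes` and `_write_temp` are hypothetical names, and the command is whatever CLI you would pass the file path to.

```python
import asyncio
import os
import tempfile


def _write_temp(data: bytes, suffix: str) -> str:
    """Blocking helper: persist the bytes and return the temp file path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path


async def run_tool_on_bytes(data: bytes, suffix: str, command: list[str]) -> str:
    """Write bytes to a temp file, run `command <path>`, return its stdout."""
    # The file write is blocking, so it goes to a worker thread.
    path = await asyncio.to_thread(_write_temp, data, suffix)
    try:
        # The external tool runs as a subprocess without blocking the event loop.
        proc = await asyncio.create_subprocess_exec(
            *command,
            path,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(stderr.decode(errors="replace"))
        return stdout.decode(errors="replace")
    finally:
        # Always clean up the temp file, even if the tool failed.
        await asyncio.to_thread(os.unlink, path)
```

Both the file write and the subprocess are IO, which is where the `await`s come from even though the input is already in memory.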

nhirschfeld commented on Show HN: Kreuzberg – Modern async Python library for document text extraction github.com/Goldziher/kreu... · Posted by u/nhirschfeld

diarrhea · 10 months ago

> It just litters perfectly reasonable python code with async/await

Yeah. As an API consumer I would not expect a PDF API to do IO, and hence to be async. Have the library be sans-io and the interfaces sync, and let callers in async code handle IO on their end, offloading to IO threads.

Async is also referred to as “best practice”, but it’s just a tool, for specific use cases. And I say that as an “async fan”!

That said, perhaps it’s easier nowadays to just do async by default, as you say. The real world is async anyway, so why not program closer to that reality.

nhirschfeld · 10 months ago

That's why Kreuzberg also exposes a sync API for you to consume.
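
As a rough illustration of the pattern discussed above (a sync interface, with async callers doing the offloading), here is a small sketch. `extract_text_sync` is a stand-in written for this example, not a Kreuzberg function; swap in whichever sync call you actually use.

```python
import asyncio


def extract_text_sync(data: bytes) -> str:
    """Placeholder for a blocking extraction call (hypothetical, for illustration)."""
    return data.decode("utf-8", errors="replace").strip()


async def handle_upload(data: bytes) -> str:
    """Async caller: offload the blocking call to a worker thread."""
    return await asyncio.to_thread(extract_text_sync, data)


async def handle_many(uploads: list[bytes]) -> list[str]:
    """Process several uploads concurrently; results keep the input order."""
    return await asyncio.gather(*(handle_upload(u) for u in uploads))
```

With this split, sync callers just call the function directly, and async callers decide for themselves how to schedule it.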

nhirschfeld commented on Show HN: Kreuzberg – Modern async Python library for document text extraction github.com/Goldziher/kreu... · Posted by u/nhirschfeld

dleeftink · 10 months ago

An oldie but goodie for layout extraction is Cermine by Dominika Tkaczyk and colleagues[0]. Java required.

[0]: http://cermine.ceon.pl/about.html

nhirschfeld · 10 months ago

Didn't know this!

nhirschfeld commented on Show HN: Kreuzberg – Modern async Python library for document text extraction github.com/Goldziher/kreu... · Posted by u/nhirschfeld

alex_suzuki · 10 months ago

Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR

Personally I’ve used Tesseract before but the results were underwhelming, so I’m curious how Paddle OCR performs in comparison.