nhirschfeld commented on Show HN: Kreuzberg – Modern async Python library for document text extraction   github.com/Goldziher/kreu... · Posted by u/nhirschfeld
ideashower · 10 months ago
Is there something like this for handwritten documents? I know newer models have been really good at handwriting transcription.
nhirschfeld · 10 months ago
You'll need to use a different OCR engine. Take a look at EasyOCR.
maleldil · 10 months ago
The API is pretty nice and easy to get started, but I couldn't get good results with parsing scientific paper PDFs, unfortunately (including OCR). Are there plans to use other backends? Docling works alright, and LLMs like Gemini Flash are interesting too.
nhirschfeld · 10 months ago
Yes, there have already been several suggestions here for other backends.

You should try a different PSM (Tesseract's page segmentation mode) to see if you get better results.

If it's scientific texts specifically, look at GROBID.
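
For anyone unfamiliar with the term: the PSM is passed to Tesseract as `--psm N`, with N from 0 to 13. A minimal sketch of trying a different mode, assuming you call Tesseract through the `pytesseract` wrapper rather than through Kreuzberg. The helper names are hypothetical and this is not Kreuzberg's API:

```python
from pathlib import Path


def tesseract_config(psm: int) -> str:
    """Build the Tesseract CLI flag for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def ocr_with_psm(image_path: Path, psm: int) -> str:
    """Run OCR on one image with the given PSM (hypothetical helper)."""
    # Imported lazily so the config helper works without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, config=tesseract_config(psm))
```

Mode 3 is Tesseract's default (fully automatic segmentation). For papers, mode 4 (a single column of text of variable sizes) or mode 6 (a single uniform block of text) may be worth comparing against it.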

skavi · 10 months ago
in that case, what’s the deal with extract_bytes being async? i’m not incredibly familiar with python, but i’d expect a “byte string” to be in memory.
nhirschfeld · 10 months ago
You still need to write it to a file to process it via pandoc, tesseract, etc.

There are alternatives to tesseract, of course.
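
To illustrate why in-memory bytes can still involve IO, here is a rough sketch of the pattern, not Kreuzberg's actual code. The bytes go to a temporary file, a file-based tool processes that file, and the file is cleaned up afterwards. `with_temp_file` and `file_size` are hypothetical names, and `file_size` only stands in for the real processing step:

```python
import asyncio
import os
import tempfile


def _write_temp_file(data: bytes, suffix: str) -> str:
    """Blocking helper: write the bytes to a new temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


async def with_temp_file(data: bytes, suffix: str, process):
    """Write `data` to disk off the event loop, then await `process(path)`."""
    path = await asyncio.to_thread(_write_temp_file, data, suffix)
    try:
        return await process(path)
    finally:
        # Remove the temp file even if processing raised.
        await asyncio.to_thread(os.remove, path)


async def file_size(path: str) -> int:
    """Placeholder for the real step (e.g. launching an external tool)."""
    return os.path.getsize(path)
```

In a real pipeline the `process` callback would presumably start pandoc or tesseract as a subprocess, for example with `asyncio.create_subprocess_exec`, which is where awaiting pays off.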

diarrhea · 10 months ago
> It just litters perfectly reasonable python code with async/await

Yeah. As an API consumer I would not expect a PDF API to do IO, and hence to be async. Make the library sans-IO and the interfaces sync, and let callers in async code handle IO on their end by offloading to IO threads.

Async is also referred to as “best practice”, but it’s just a tool, for specific use cases. And I say that as an “async fan”!

That said, perhaps it’s easier nowadays to just do async by default, as you say. The real world is async anyway, so why not program closer to that reality?

nhirschfeld · 10 months ago
That's why Kreuzberg also exposes a sync API for you to consume.
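
For the "sync interface, caller handles the offloading" approach discussed above, a minimal sketch looks like this. `extract_text_sync` is a hypothetical stand-in for any synchronous extractor, not Kreuzberg's real function:

```python
import asyncio


def extract_text_sync(data: bytes) -> str:
    """Stand-in for a synchronous extractor (hypothetical)."""
    return data.decode("utf-8", errors="replace").strip()


async def extract_many(documents: list[bytes]) -> list[str]:
    """Run the sync extractor in worker threads so the event loop stays free."""
    tasks = [asyncio.to_thread(extract_text_sync, doc) for doc in documents]
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)
```

The library stays sync in this layout, and the async caller decides how and where the blocking work runs.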
dleeftink · 10 months ago
An oldie but goodie for layout extraction is CERMINE by Dominika Tkaczyk and colleagues[0]. Java required.

[0]: http://cermine.ceon.pl/about.html

nhirschfeld · 10 months ago
Didn't know this!
alex_suzuki · 10 months ago
Any experience with Paddle OCR? https://github.com/PaddlePaddle/PaddleOCR

Personally I've used Tesseract before but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.

nhirschfeld · 10 months ago
I haven't; testing it out is on my to-do list for sure.

u/nhirschfeld

Karma: 97 · Cake day: December 30, 2021
About
Na'aman Hirschfeld: - linkedin: https://www.linkedin.com/in/nhirschfeld/ - email: nhirschfeld@gmail.com