You should try a different PSM (Tesseract's page segmentation mode) to see if you get better results.
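To make that concrete, here's a rough sketch of comparing a few modes side by side. It assumes the `pytesseract` wrapper and its `image_to_string(image, config=...)` call; the helper names `psm_config` and `compare_psms` and the default list of modes are just mine for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def psm_config(psm: int) -> str:
    """Build the Tesseract config string for one page segmentation mode."""
    # Tesseract documents modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def compare_psms(
    image: Any,
    psms: Iterable[int] = (3, 4, 6, 11),
    ocr: Optional[Callable[..., str]] = None,
) -> Dict[int, str]:
    """Run OCR once per mode and return the recognized text keyed by PSM."""
    if ocr is None:
        # Assumption: pytesseract and the tesseract binary are installed.
        import pytesseract

        ocr = pytesseract.image_to_string
    return {psm: ocr(image, config=psm_config(psm)) for psm in psms}
```

With a Pillow image loaded, `compare_psms(img)` gives you one result per mode to eyeball. Which mode wins depends a lot on the layout (single column, sparse text, tables), so it's worth checking on a few representative pages.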
If it's scientific texts specifically, look at GROBID.
There are alternatives to Tesseract, of course.
Yeah. As an API consumer I would not expect a PDF API to do IO, and hence to be async. Make the library sans-IO with sync interfaces, and let callers in async code handle IO on their end, offloading it to IO threads.
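Roughly what I mean, as a toy sketch rather than any real library's API: the "library" part is a plain sync function over bytes, and the async caller does the file read in a worker thread. `pdf_version` and `pdf_version_from_file` are made-up names, and reading the header version stands in for real parsing.

```python
import asyncio
from pathlib import Path


def pdf_version(data: bytes) -> str:
    """Sans-IO and sync: read the version from a PDF header such as b'%PDF-1.7'."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not start with a PDF header")
    # Only the first line matters; a small slice avoids splitting the whole file.
    first_line = data[:16].splitlines()[0]
    return first_line[len(b"%PDF-"):].decode("ascii").strip()


async def pdf_version_from_file(path: str) -> str:
    """Async caller: offload the blocking read to a thread, then parse in memory."""
    data = await asyncio.to_thread(Path(path).read_bytes)
    return pdf_version(data)
```

The library never touches the filesystem or the event loop, so the same function works from sync code, from asyncio (`asyncio.to_thread` needs Python 3.9+), or from any other async runtime.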
Async is also referred to as “best practice”, but it’s just a tool, for specific use cases. And I say that as an “async fan”!
That said, perhaps it’s easier nowadays to just do async by default, as you say. The real world is async anyway, so why not program closer to that reality?
Personally I’ve used Tesseract before but the results were underwhelming, so I’m curious how PaddleOCR performs in comparison.