Readit News
UglyToad commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
Animats · 21 days ago
Can you just ignore the index and read the entire file to find all the objects?
UglyToad · 20 days ago
Yes this is generally the fallback approach if finding the objects via the index (xref) fails. It is slightly slower but it's a one time cost, though I imagine it was a lot slower back when PDFs were first used on the machines of the time.
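
As a rough illustration of that fallback, here is a minimal sketch that ignores the xref entirely and scans the raw bytes for `N G obj` headers. The `BruteForceScan` class and `FindObjects` method are made-up names for this example, not PdfPig's API, and it assumes .NET 5 or later for `Encoding.Latin1`.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not taken from PdfPig.
public static class BruteForceScan
{
    // "<object number> <generation> obj", separated by PDF whitespace characters.
    private static readonly Regex ObjHeader = new Regex(
        @"([0-9]+)[\x00\t\n\f\r ]+([0-9]+)[\x00\t\n\f\r ]+obj\b",
        RegexOptions.Compiled);

    public static Dictionary<(int Number, int Generation), long> FindObjects(byte[] file)
    {
        // Latin-1 maps every byte to exactly one char, so a match index is a byte offset.
        string text = Encoding.Latin1.GetString(file);
        var offsets = new Dictionary<(int Number, int Generation), long>();

        foreach (Match match in ObjHeader.Matches(text))
        {
            // Skip digit runs too large to be a plausible object or generation number.
            if (!int.TryParse(match.Groups[1].Value, out int number) ||
                !int.TryParse(match.Groups[2].Value, out int generation))
            {
                continue;
            }

            // A later definition overwrites an earlier one, mirroring incremental updates.
            offsets[(number, generation)] = match.Index;
        }

        return offsets;
    }
}
```

A scan this naive will also match byte sequences inside stream data that happen to look like an object header, so a real recovery path needs extra validation on top of it.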
UglyToad commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
simonw · 21 days ago
I convert the PDF into an image per page, then dump those images into either an OCR program (if the PDF is a single column) or a vision-LLM (for double columns or more complex layouts).

Some vision LLMs can accept PDF inputs directly too, but you need to check that they're going to convert to images and process those rather than attempting and failing to extract the text some other way. I think OpenAI, Anthropic and Gemini all do the images-version of this now, thankfully.

UglyToad · 21 days ago
If you don't have a known set of PDF producers this is really the only way to safely consume PDF content. Type 3 fonts alone make pulling text content out unreliable or impossible, before even getting to PDFs containing images of scans.

I expect the current LLMs significantly improve upon the previous ways of doing this, e.g. Tesseract, when given an image input? Is there any test you're aware of for model capabilities when it comes to ingesting PDFs?

UglyToad commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
farkin88 · 21 days ago
Great rundown. One thing you didn't mention that I thought was interesting to note is incremental-save chains: the first startxref offset is fine, but the /Prev links that Acrobat appends on successive edits may point a few bytes short of the next xref. Most viewers (PDF.js, MuPDF, even Adobe Reader in "repair" mode) fall back to a brute-force scan for obj tokens and reconstruct a fresh table so they work fine while a spec-accurate parser explodes. Building a similar salvage path is pretty much necessary if you want to work with real-world documents that have been edited multiple times by different applications.
UglyToad · 21 days ago
You're right, this was a fairly common failure state seen in the sample set. The previous reference, or one in the reference chain, would point to an offset of 0 or outside the bounds of the file, or just be plain wrong.

What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code but felt like it should be 'simple' to rewrite more performantly. The new logic falls back to a brute-force scan of the entire file if a single xref table or stream is missed and just relies on those offsets in the recovery path.

However it is considerably slower than the code before it and it's hard to have confidence in the changes. I'm currently running through a 10,000 file test-set trying to identify edge-cases.

[0]: https://github.com/UglyToad/PdfPig/pull/1102
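
To make the failure modes above concrete, here is a minimal sketch of walking a `/Prev` chain and bailing out on the first suspicious link. It is not the code from the linked PR: `XrefChain`, `TryWalk` and the `readPrev` callback are hypothetical names, and the callback stands in for whatever actually parses the xref section at a given offset.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not taken from PdfPig.
public static class XrefChain
{
    // readPrev (assumed): returns the /Prev offset stored in the xref section at the
    // given offset, or null when that section is the last link in the chain.
    public static bool TryWalk(
        long startXref,
        long fileLength,
        Func<long, long?> readPrev,
        out List<long> offsets)
    {
        offsets = new List<long>();
        var seen = new HashSet<long>();
        long? current = startXref;

        while (current.HasValue)
        {
            long offset = current.Value;

            // Offset 0, an out-of-bounds offset, or a repeated offset (a cycle) all
            // mean the chain cannot be trusted.
            if (offset <= 0 || offset >= fileLength || !seen.Add(offset))
            {
                return false;
            }

            offsets.Add(offset);
            current = readPrev(offset);
        }

        return true;
    }
}
```

When `TryWalk` returns false the caller would switch to a brute-force scan of the whole file. A link that is in bounds but "just plain wrong" is not caught here; that has to be detected when parsing at the offset fails.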

UglyToad commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
JKCalhoun · 21 days ago
Yeah, PDF didn't anticipate streaming. That pesky trailer dictionary at the end means you have to wait for the file to fully load to parse it.

Having said that, I believe there are "streamable" PDFs where there is enough info up front to render the first page (but only the first page).

(But I have been out of the PDF loop for over a decade now so keep that in mind.)

UglyToad · 21 days ago
Yes, you're right there are Linearized PDFs which are organized to enable parsing and display of the first page(s) without having to download the full file. I skipped those from the summary for now because they have a whole chunk of an appendix to themselves.
UglyToad commented on Americans see their savings vanish in Synapse fintech crisis   cnbc.com/2024/11/22/synap... · Posted by u/hunter2_
simpaticoder · 9 months ago
If regulators don't act, then nothing will stop copycats from doing this again. The end result will be the loss of trust in new banks. The people that would benefit from this effect are established banks, so it may not be in the OG banks' interest to cooperate. I would be interested to hear a patio11 analysis of this situation.
UglyToad · 9 months ago
FWIW they are acting, these things just take a while, current phase of gathering comments ends December 2nd https://www.fdic.gov/news/press-releases/2024/fdic-proposes-...
UglyToad commented on GLP-1 for Everything   science.org/content/blog-... · Posted by u/etiam
SpicyLemonZest · 10 months ago
Again, I genuinely don't understand the point. There's a large and well-funded segment of the nutrition industry dedicated to solving the root causes - Weight Watchers alone has over a billion dollars in annual revenue. We just haven't invented a diet-based solution which works as well as GLP-1 agonists without requiring you to compromise on palatability and feel hungry all day.

It'll be great if we do, although I don't know of any promising research avenues and I lean towards the hypothesis that the average human metabolism is simply tuned to mild obesity under conditions of widespread food availability.

UglyToad · 10 months ago
The point, which seems to be routinely massively downvoted on here, is that both things can be true at once:

- these drugs are good and a paradigm shift in the treatment of obesity (and have other benefits)

- we must not lose sight of the need to address a thoroughly sick food industry that leaves so many people needing to use these. Junk food advertising, lack of subsidies for fresh vegetables, HFCS, food deserts, etc.

Chile is experimenting with banning junk food ads to children and is seeing some early behaviour changes.

The point which people seem to be wilfully missing is that we can have both these drugs and advocate for cracking down on a food system that deliberately poisons everyone in society. Having everyone be on this drug because we shrug and say "free market innit" while big corps continue to feed us crap is not a solution, obviously.

UglyToad commented on C# almost has implicit interfaces   clipperhouse.com/c-sharp-... · Posted by u/mwsherman
neonsunset · a year ago
If anything, there is little reason to use a named delegate over the Func nowadays too. The contract in this case is implied by you explicitly calling a constructor or a factory method so a type confusion, that Go has, cannot happen.
UglyToad · a year ago
The idea with the named delegate is that if you need some way to look up a user's email, you can declare:

    delegate Task<string> GetUserEmail(int userId);

This provides more guidance than taking in a:

    Func<int, Task<string>> getUserEmail

If you can annotate implementations of the delegate the tooling support becomes even nicer. Not all Funcs with the same shape have the same semantics, in my ideal C#-like language.

Edit: I completely forgot the main reason which is if using a DI container it can inject the named delegate for you correctly in the constructor. Versus only being able to register a single func shape per container.
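
A minimal sketch of the DI point, using a toy type-keyed registry in place of a real container (the `TinyContainer` class and the `GetUserDisplayName` delegate are invented for this example). Two named delegates with the same shape are distinct types and can both be registered, whereas two `Func<int, Task<string>>` registrations share one key.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Two delegates with the same shape but different meaning.
public delegate Task<string> GetUserEmail(int userId);
public delegate Task<string> GetUserDisplayName(int userId); // hypothetical second delegate

// Toy stand-in for a DI container: one registration per type.
public sealed class TinyContainer
{
    private readonly Dictionary<Type, object> registrations = new Dictionary<Type, object>();

    public int Count => registrations.Count;

    // Registering the same type again replaces the earlier registration.
    public void Register<T>(T instance) where T : class => registrations[typeof(T)] = instance;

    public T Resolve<T>() where T : class => (T)registrations[typeof(T)];
}
```

With this registry, `Register<GetUserEmail>(...)` and `Register<GetUserDisplayName>(...)` coexist, while a second `Register<Func<int, Task<string>>>(...)` silently replaces the first. Real containers differ in how they treat duplicate registrations, so take this only as an illustration of the type-identity argument.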

UglyToad commented on C# almost has implicit interfaces   clipperhouse.com/c-sharp-... · Posted by u/mwsherman
jayd16 · a year ago
If you want to enforce things, use an interface. If you want to accept anything that fits use a delegate.

I'm not sure I understand your use case where you need to conflate the two. You want to enforce the contract but with arbitrary method names?

I suppose you could wire up something like this but it's a bit convoluted.

    interface IFoo {
        string F(string s);
    }

    class Bar {
        public string B(string s) {
            return "";
        }
    }

    // internal class, perhaps in your test framework
    class BarContract : Bar, IFoo {
        public string F(string s) => B(s);
    }

UglyToad · a year ago
My aim is to use dependency injection to inject the minimal dependency and nothing more. Versus the grab bag every interface in a medium-complexity C# project eventually devolves into.

I've had this on my blogpost-to-write backlog for a year at this point, but in every project I've worked on an interface eventually becomes a holding zone for related but disparate concepts. And so when injecting the whole interface it becomes unclear what the dependency actually is.

E.g. you have some service that does data access for users, then someone adds some Salesforce stuff, or a notification call or whatever. Now any class consuming that service could be doing a bunch of different things.

The idea is basically single method interfaces without the overhead of writing the interface. Just being able to pass around free functions but with the superior DevX most C# tools offer.

I guess I want a more functional C# without having to learn F# which I've tried a few times and bounced off.

u/UglyToad

Karma: 1662 · Cake day: May 3, 2019
About
Maintainer of https://github.com/UglyToad/PdfPig

Writes mainly C#
