desgeeko (u/desgeeko)

desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure github.com/desgeeko/pdfsy... · Posted by u/desgeeko

nathan_f77 · 6 months ago

This is really cool! I've spent the last few years debugging lots of PDFs while working on DocSpring, so I'm always looking for new tools to make this easier. Thanks for working on pdfsyntax!

desgeeko · 6 months ago

Thank you very much!

desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure github.com/desgeeko/pdfsy... · Posted by u/desgeeko

adelpozo · 6 months ago

It does not have any dependency on a PDF parsing library, correct? That's a cool way to learn the file format and be able to work around weird PDF files. But what was the motivation for not using a library to do the PDF parsing work? Is it the case that there is none available? Nice work!

desgeeko · 6 months ago

Correct, PDFSyntax implements everything at the lowest level. You can ignore the HTML visualization and use it as an API to access PDF objects. Why? Because I started a very small tool as a weekend project and I got hooked reading the PDF Specification, so it is becoming a general-purpose PDF library for Python. I am not familiar with other libraries, but I have the impression that mine implements things that are often overlooked in others, like incremental updates.

desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure github.com/desgeeko/pdfsy... · Posted by u/desgeeko

kevmo314 · 6 months ago

Is the UI tooling that does the visualization a library? I really like the UI format, would love to use this for breaking down and debugging video byte streams too.

EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_...

desgeeko · 6 months ago

Yes, I value simplicity and the interactivity offered by basic HTML and CSS is sufficient for my use case :)

desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure github.com/desgeeko/pdfsy... · Posted by u/desgeeko

tyilo · 6 months ago

Looks nice.

Would be better if all of the PDF's bytes were shown. Seems like `endobj` and `xref` are not shown.

desgeeko · 6 months ago

Thanks for noticing! You're right, I will fix that very soon.

desgeeko commented on Show HN: IPA, a GUI for exploring inner details of PDFs github.com/seekbytes/IPA... · Posted by u/nicolodev

AlanYx · a year ago

Does anyone have any recommendations for a good tool that allows both programmatic inspection and modification of PDF primitives? For example, let's say someone wants to iterate through every embedded image in a PDF and apply some form of signal processing to the images in-place, then re-save the PDF?

desgeeko · a year ago

My tool (PDFSyntax[1], mentioned in this thread) is a Python library that is able to both inspect and transform PDF files.

Depending on your transformation use case, you may write an incremental update with only a few bytes at the end of the original file instead of rewriting it entirely. To my knowledge this feature of the PDF specification is often overlooked, and not many libraries implement it.

It is a work in progress and I have not developed functions for images yet, though.

[1] https://github.com/desgeeko/pdfsyntax
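To illustrate the incremental update idea mentioned above, here is a minimal sketch in plain Python. It does not use the PDFSyntax API; the helper names `minimal_pdf` and `append_incremental_update` are hypothetical, made up for this example. It assumes a file with a classic cross-reference table (not a cross-reference stream) and locates `startxref`, `/Size` and `/Root` with naive regular expressions, which would not be robust on real-world files.

```python
import re


def minimal_pdf() -> bytes:
    """Build a tiny one-page PDF with a classic cross-reference table."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def append_incremental_update(pdf: bytes, obj_num: int, new_body: bytes) -> bytes:
    """Return pdf followed by a new section that redefines object obj_num.

    The original bytes are left untouched; only a new object, a one-entry
    xref section and a trailer pointing back via /Prev are appended.
    """
    # Naive lookups (assumption: classic xref table, values not inside streams).
    prev_xref = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])
    size = int(re.findall(rb"/Size\s+(\d+)", pdf)[-1])
    root = re.findall(rb"/Root\s+(\d+\s+\d+\s+R)", pdf)[-1]

    update = bytearray()
    if not pdf.endswith(b"\n"):
        update += b"\n"
    obj_pos = len(pdf) + len(update)  # offset of the new object in the final file
    update += b"%d 0 obj\n%s\nendobj\n" % (obj_num, new_body)
    xref_pos = len(pdf) + len(update)
    update += b"xref\n%d 1\n%010d 00000 n \n" % (obj_num, obj_pos)
    update += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (
        max(size, obj_num + 1), root, prev_xref)
    update += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return pdf + bytes(update)


original = minimal_pdf()
updated = append_incremental_update(
    original, 3, b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 400 400] >>"
)
print(len(updated) - len(original), "bytes appended")
```

A reader following the last `startxref` finds the new cross-reference section first, then follows `/Prev` to the original one for every object that was not redefined, which is why the first part of the file can stay byte-for-byte identical.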

desgeeko commented on Show HN: IPA, a GUI for exploring inner details of PDFs github.com/seekbytes/IPA... · Posted by u/nicolodev

svat · a year ago

This is cool!

Here are some other similar(?) tools, for seeing the inner contents of a PDF file (the raw objects etc), but I haven't compared them to this tool here:

- https://pdf.hyzyla.dev/

- https://github.com/itext/i7j-rups (java -jar ~/Downloads/itext-rups-7.2.5.jar)

- https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax inspect foo.pdf > output.html)

- https://github.com/trailofbits/polyfile (polyfile --html output.html foo.pdf)

- https://www.reportmill.com/snaptea/PDFViewer/ = https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag PDF onto it)

- https://sourceforge.net/projects/pdfinspector/ (an "example" of https://superficial.sourceforge.net/)

- https://www.o2sol.com/pdfxplorer/overview.htm

More?

desgeeko · a year ago

I am the author of PDFSyntax, thanks for mentioning it!

The HTML output is like a pretty print where you can view objects and follow links to other objects.

Since then, I have added a new command (disasm) that is CLI-oriented and displays a greppable summary of the structure. Here is an explanation: https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...

desgeeko commented on So you want to modify the text of a PDF by hand (2020) gist.github.com/senderle/... · Posted by u/mutant_glofish

desgeeko · 2 years ago

If you want to continue this journey and learn more about PDF, you can read the anatomy of a file I documented recently: https://pdfsyntax.dev/introduction_pdf_syntax.html