desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure   github.com/desgeeko/pdfsy... · Posted by u/desgeeko
nathan_f77 · 6 months ago
This is really cool! I've spent the last few years debugging lots of PDFs while working on DocSpring, so I'm always looking for new tools to make this easier. Thanks for working on pdfsyntax!
desgeeko · 6 months ago
Thank you very much!
desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure   github.com/desgeeko/pdfsy... · Posted by u/desgeeko
adelpozo · 6 months ago
It does not depend on any PDF parsing library, correct? That's a cool way to learn the file format and be able to work around weird PDF files. But what was the motivation for not using a library to do the PDF parsing work? Is it the case that there is none available? Nice work!
desgeeko · 6 months ago
Correct, PDFSyntax implements everything at the lowest level. You can ignore the HTML visualization and use it as an API to access PDF objects. Why? Because I started a very small tool as a weekend project and got hooked reading the PDF Specification, so it is becoming a general-purpose PDF library for Python. I am not familiar with other libraries, but I have the impression that mine implements things that are often overlooked in others, like incremental updates.
desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure   github.com/desgeeko/pdfsy... · Posted by u/desgeeko
kevmo314 · 6 months ago
Is the UI tooling that does the visualization a library? I really like the UI format, would love to use this for breaking down and debugging video byte streams too.

EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_...

desgeeko · 6 months ago
Yes, I value simplicity and the interactivity offered by basic HTML and CSS is sufficient for my use case :)
desgeeko commented on Show HN: HTML visualization of a PDF file's internal structure   github.com/desgeeko/pdfsy... · Posted by u/desgeeko
tyilo · 6 months ago
Looks nice.

Would be better if all of the PDF's bytes were shown. Seems like `endobj` and `xref` are not shown.

desgeeko · 6 months ago
Thanks for noticing! You're right, I will fix that very soon.
desgeeko commented on Show HN: IPA, a GUI for exploring inner details of PDFs   github.com/seekbytes/IPA... · Posted by u/nicolodev
AlanYx · a year ago
Does anyone have any recommendations for a good tool that allows both programmatic inspection and modification of PDF primitives? For example, let's say someone wants to iterate through every embedded image in a PDF and apply some form of signal processing to the images in-place, then re-save the PDF?
desgeeko · a year ago
My tool (PDFSyntax[1], mentioned in this thread) is a Python library that is able to both inspect and transform PDF files.

Depending on your transformation use case, you may write an incremental update with only a few bytes at the end of the original file instead of rewriting it entirely. To my knowledge this feature of the PDF specification is often overlooked and not a lot of libraries implement it.

It is a work in progress and I have not developed functions for images yet, though.

[1] https://github.com/desgeeko/pdfsyntax
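
For readers unfamiliar with incremental updates, here is a minimal sketch of the idea in plain Python. It does not use PDFSyntax and is not how that library implements the feature; the function names (`build_minimal_pdf`, `last_startxref`, `append_info_update`) and the hard-coded object numbers are assumptions made for illustration. The original bytes stay untouched, and the new revision is only a new object, a one-entry xref section, and a trailer whose `/Prev` points back at the previous xref table.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF with a classic xref table (illustrative only)."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0 (head of the free list)
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def last_startxref(pdf: bytes) -> int:
    """Return the xref offset announced by the last 'startxref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)", pdf)
    return int(matches[-1])


def append_info_update(pdf: bytes, title: bytes) -> bytes:
    """Append a revision that adds object 4, an Info dictionary.

    Assumes the input came from build_minimal_pdf(), so object 4 is unused
    and the catalog is object 1. The title is assumed to contain no
    parentheses or backslashes, since no string escaping is done here.
    """
    prev_xref = last_startxref(pdf)
    out = bytearray(pdf)  # the original bytes are kept as-is
    obj_pos = len(out)
    out += b"4 0 obj\n<< /Title (%s) >>\nendobj\n" % title
    xref_pos = len(out)
    # The new xref section lists only the object added by this revision.
    out += b"xref\n4 1\n%010d 00000 n \n" % obj_pos
    out += b"trailer\n<< /Size 5 /Root 1 0 R /Info 4 0 R /Prev %d >>\n" % prev_xref
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


original = build_minimal_pdf()
updated = append_info_update(original, b"Hello")
print("bytes appended:", len(updated) - len(original))
print("revisions (count of %%EOF markers):", updated.count(b"%%EOF"))
```

A reader that follows the last `startxref` finds the new xref section first and then walks the `/Prev` chain back to the original table, which is why the earlier revision can be left byte-for-byte intact.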

desgeeko commented on Show HN: IPA, a GUI for exploring inner details of PDFs   github.com/seekbytes/IPA... · Posted by u/nicolodev
svat · a year ago
This is cool!

Here are some other similar(?) tools, for seeing the inner contents of a PDF file (the raw objects etc), but I haven't compared them to this tool here:

- https://pdf.hyzyla.dev/

- https://github.com/itext/i7j-rups (java -jar ~/Downloads/itext-rups-7.2.5.jar)

- https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax inspect foo.pdf > output.html)

- https://github.com/trailofbits/polyfile (polyfile --html output.html foo.pdf)

- https://www.reportmill.com/snaptea/PDFViewer/ = https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag PDF onto it)

- https://sourceforge.net/projects/pdfinspector/ (an "example" of https://superficial.sourceforge.net/)

- https://www.o2sol.com/pdfxplorer/overview.htm

More?

desgeeko · a year ago
I am the author of PDFSyntax, thanks for mentioning it!

The HTML output is like a pretty print where you can view objects and follow links to other objects.

Since then, I have added a new command (disasm) that is CLI oriented and displays a greppable summary of the structure. Here is an explanation: https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...

desgeeko commented on So you want to modify the text of a PDF by hand (2020)   gist.github.com/senderle/... · Posted by u/mutant_glofish
desgeeko · 2 years ago
If you want to continue this journey and learn more about PDF, you can read the anatomy of a file I documented recently: https://pdfsyntax.dev/introduction_pdf_syntax.html

u/desgeeko

Karma: 330 · Cake day: July 17, 2021