I've been making my reports in self-contained HTML files[0] and it works out so much better than PDF. It is not constrained by paper sizes, and it lets me add some nifty features. For example, I recently added support for hiding columns in a table using exclusively CSS. The only downside is browsers can render things slightly differently, but for my use cases I don't need pixel-perfect identical rendering.
[0] Images are inlined base64-encoded, CSS/JS embedded with style and script tags. No external assets / no http requests.
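For anyone wanting to try the self-contained approach, here is a minimal sketch of what the footnote describes: the image goes in as a base64 data URI and the CSS sits in a style tag, so the file makes no requests. The checkbox rule is just one CSS-only way to hide a column and is an assumption, not necessarily how the parent does it; `build_report`, `data_uri` and the `hide-notes` id are made-up names.

```python
import base64
import html

# Embedded stylesheet. The checkbox rule is one CSS-only way to hide a column;
# it is an assumption, not necessarily the technique the parent comment uses.
CSS = """
table { border-collapse: collapse; }
th, td { border: 1px solid #999; padding: 4px 8px; }
#hide-notes:checked ~ table th:nth-child(3),
#hide-notes:checked ~ table td:nth-child(3) { display: none; }
"""


def data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI so no external file is needed."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_report(title: str, image_bytes: bytes) -> str:
    """Return one HTML document with the image and CSS embedded inline."""
    safe_title = html.escape(title)
    return "\n".join([
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{safe_title}</title>",
        f"<style>{CSS}</style>",
        "</head>",
        "<body>",
        f"<h1>{safe_title}</h1>",
        f'<img alt="chart" src="{data_uri(image_bytes)}">',
        # The checkbox must precede the table as a sibling for "~" to match.
        '<input type="checkbox" id="hide-notes">',
        '<label for="hide-notes">Hide notes column</label>',
        "<table>",
        "<tr><th>Item</th><th>Value</th><th>Notes</th></tr>",
        "<tr><td>Widgets</td><td>42</td><td>placeholder row</td></tr>",
        "</table>",
        "</body>",
        "</html>",
    ])
```

Write the returned string to a `.html` file and it opens anywhere. Ticking the checkbox hides the third column without any JS.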
Being constrained by page sizes is “a feature, not a bug” in most contexts. If I’m calling out numbers on the 3rd line of page 38 of a report, it helps if that’s consistent.
The only reason PDFs still have a job is: pixel perfect consistency; the built-in validity stuff (ensuring the document wasn't altered, etc.); or the customer doesn't need the other things, but isn't open to alternatives. Otherwise, PDF is just a major headache.
Also page-level consistency, and generally layouting in a printable format
Even with the same Word document opened only in various MS Word versions (web, desktop, etc.) you won't get consistent page numbers. And HTML tables work great on screen but don't print very well if they span more than what fits on a single sheet of paper.
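On the table printing point, a few print rules may soften the problem, though browser support varies and it won't give you page-level consistency. A rough sketch, assuming the table uses a real `<thead>`; `PRINT_CSS` and `add_print_styles` are hypothetical names:

```python
# Print rules that may help long tables; results differ between browsers.
PRINT_CSS = """
@page { margin: 15mm; }
@media print {
  thead { display: table-header-group; }  /* repeat the header row on each page */
  tr { break-inside: avoid; }             /* try not to split a row across pages */
}
"""


def add_print_styles(document: str) -> str:
    """Insert the print stylesheet just before the closing head tag."""
    marker = "</head>"
    if marker not in document:
        raise ValueError("document has no </head> to attach styles to")
    return document.replace(marker, f"<style>{PRINT_CSS}</style>\n{marker}", 1)
```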
Not only can you embed the fonts, but you can make it interactive and output a PDF if you really wanted to. The HTML might grow if you embed enough JS, but on the other hand... some PDFs are insanely large.
The wider .NET ecosystem is lacking when trying to step outside the mainline. I don't bother hunting for unused, partially implemented .NET libraries anymore and just call out to a process or API when I need to get something done.
It's not ideal, but when a good option isn't available in .NET, it's usually available in Python/npm. Typically I'll use background jobs when calling out of process for added resiliency/replayability and observability.
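To make the "call out of process from a background job" idea concrete, here is a rough sketch of the job body: run the external tool with a timeout, log each attempt, and retry on failure. The retry counts and the `run_external` name are illustrative assumptions; a real job framework would own scheduling and replay.

```python
import logging
import subprocess
import sys
import time

log = logging.getLogger("external-job")


def run_external(cmd, attempts=3, timeout=60.0, backoff=0.5):
    """Run an external command, retrying on failure, and return its stdout."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout, check=True
            )
            log.info("attempt %d succeeded: %s", attempt, cmd)
            return result.stdout
        except (subprocess.CalledProcessError,
                subprocess.TimeoutExpired, OSError) as exc:
            last_error = exc
            log.warning("attempt %d failed: %r", attempt, exc)
            if attempt < attempts:
                time.sleep(backoff * attempt)  # simple linear backoff
    raise RuntimeError(
        f"command failed after {attempts} attempts: {cmd}"
    ) from last_error
```

For example, `run_external([sys.executable, "-c", "print('ok')"])` returns the child's output, and a command that keeps failing surfaces as a single `RuntimeError` the job runner can record.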
Not sure I agree. It also depends on the domain. The Python ecosystem is of course a lot richer for anything AI. But try to open, manipulate and export spreadsheets. In Python you pretty much need a different library for every Excel file format (xls, xlsx, etc.) and usually the more file formats a library can handle, the less capable it is (e.g. pandas). In .NET you have libraries like SpreadsheetGear that are super powerful, including their own Excel calculation engine. I see nothing remotely close in Python.
The point is that a good library usually exists for some language, which is not necessarily the one you are currently using.
IMHO, we don't lack good libraries in XY, we are lacking good interop. Going through REST or stdio is quite painful just to render PDF (or export spreadsheet, ...)
I'm using a lot of ComfyUI Workflows, Custom Nodes, Image and Audio classifiers relying on PyTorch, supervision, ultralytics, MediaPipe, OpenCV, onnxruntime, pandas, numpy that say otherwise. There are some equivalents, but the ecosystems aren't playing in the same ballpark.
We've been using Aspose.PDF for the last 10 years or so in our C# platform, and paying for the license. It's expensive and buggy and has shite support, so a year or so back I decided to see if there was some other library or combination of libraries that could meet our needs. Basically, we needed:
* HTML to PDF
* Compress PDF
* Manual PDF generation
* Text extraction
* No browser engine or other weird dependencies
I researched every library I could find, and downloaded, integrated and tested anything that looked remotely promising.
At the end of all that, I reluctantly handed my company credit card back to Aspose. There simply wasn't any open-source or even just cheaper PDF library that I could actually make work, and all the other paid ones that did work were even more expensive.
I am in the same boat. Aspose has been the go to for Word and PDF documents. Will say, Adobe's PDF Services API offers a ton of interesting features but comes with a price tag and in my scenario, it's not HIPAA compliant.
Aspose is the library I’ve used commercially in the past, too. My experience was similar. The company I worked for at the time eventually charged more for PDF export as a paid add on. The software is very sticky so the people who truly needed pdf export directly paid, the rest relied on export to word then “printed” the pdf themselves.
I create PDF files from C# using LaTeX as an intermediate format. This works very reliably but sometimes takes a bit of tinkering until everything fits.
People here on HN recently recommended Typst as a replacement for LaTeX, but I haven't tried it myself yet.
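For readers curious what "LaTeX as an intermediate format" can look like, here is a minimal sketch: escape the data, emit a `.tex` file, then shell out to `pdflatex`. It assumes `pdflatex` is on the PATH and is not the parent's actual setup; the function names are invented for illustration.

```python
import subprocess
from pathlib import Path

# Characters that have special meaning in LaTeX and their escaped forms.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text: str) -> str:
    """Escape one character at a time so nothing is escaped twice."""
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)


def build_tex(rows) -> str:
    """Render (label, value) pairs as a two-column table in a small document."""
    body = "\n".join(
        f"{latex_escape(label)} & {latex_escape(value)} \\\\"
        for label, value in rows
    )
    return "\n".join([
        r"\documentclass{article}",
        r"\begin{document}",
        r"\begin{tabular}{lr}",
        body,
        r"\end{tabular}",
        r"\end{document}",
        "",
    ])


def compile_pdf(tex_source: str, out_dir) -> Path:
    """Write the source to disk and run pdflatex (assumed to be installed)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    tex_path = out_dir / "report.tex"
    tex_path.write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", "-halt-on-error",
         f"-output-directory={out_dir}", str(tex_path)],
        check=True, capture_output=True, timeout=120,
    )
    return out_dir / "report.pdf"
```

Escaping is where most of the "tinkering" tends to hide: any user-supplied text with `&`, `%` or `_` breaks the build if it goes in raw.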
It seems the PDFSharp rabbit hole goes even deeper than I've realized!
The latest MigraDoc & PDFsharp seem to have been updated and ported to .NET 6 after a lot of the forks happened, so it was unclear to me whether there's merit in looking at other, mostly abandoned forks.
I might add PdfSharpCore, though the use of SixLabors.ImageSharp and SixLabors.Fonts leads to a disqualification from the "quest", given their custom split license [1]
Edit: Actually, the license seems to turn into an Apache 2.0 license, when used with an open source licensed project and also as transitive dependency. Certainly a confusing license.
Edit: PSA - PdfSharpCore uses older releases of SixLabors.ImageSharp v1.0.4 and Fonts-1.0.0-beta17 which both were (and are still) distributed under plain Apache-2.0.
>Naturally, I first started looking for permissively licensed libraries, which could be used free of charge and without additional license requirements.
There is a lot of work in a good PDF library, expecting to get it for free feels unreasonable to me.
If you ever revisit alternatives, you might want to try YakPDF
It gives you:
- HTML → PDF without any browser engine
- PDF compression & optimization
- Simple API for manual PDF generation
- Text extraction
- No native dependencies and cheaper than Aspose
It’s not a full drop-in replacement for every Aspose feature, but it covers the core workflow you mentioned and is much lighter to integrate.
YakPDF (as far as I can tell) is an API and not a library that generates a PDF. If you're going to go that route, host https://github.com/gotenberg/gotenberg yourself and call it a day.
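If you do self-host Gotenberg, the call is a multipart POST to its Chromium HTML route with a file named `index.html`. A small sketch using only the standard library; the `http://localhost:3000` address in the usage note assumes the default port, and `build_gotenberg_request` is a made-up helper name.

```python
import urllib.request
import uuid


def build_gotenberg_request(base_url: str, html_source: str) -> urllib.request.Request:
    """Build (but do not send) a multipart request for HTML to PDF conversion."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="files"; filename="index.html"\r\n'
        "Content-Type: text/html\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/forms/chromium/convert/html",
        data=head + html_source.encode("utf-8") + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def convert_html_to_pdf(base_url: str, html_source: str) -> bytes:
    """Send the request to a running Gotenberg instance and return PDF bytes."""
    request = build_gotenberg_request(base_url, html_source)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

With a container running locally, `convert_html_to_pdf("http://localhost:3000", "<h1>Hello</h1>")` should hand back the PDF bytes. Note that Gotenberg drives headless Chromium, so this does not satisfy the "no browser engine" requirement from upthread.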
I needed this post a year ago when I was looking for this exact thing. I did end up going with Puppeteer because I needed it for something else that I couldn't avoid. I use a large list of flags with it to launch the most minimal version of headless Chrome that I can.
I am going to look into switching to MigraDoc and see if I can drop Puppeteer.
Having played around with MigraDoc for the past few weeks, I do still recommend it, as long as you don't need more complex layouts. Here's a short and certainly incomplete list of limitations that I've run into so far:
- No tables within other tables
- No multi-column page layouts
- No multi-section on the same page (new section = new page)
- No letter spacing
- MigraDoc doesn't know about the final spacing, so you can't adjust, say, the width of some table column automatically. Either calculate an estimate based on the text/content or space them equally.
- Can't shade (background color) only a selection of words in a text
- Lists can only have up to three different symbols
- List indentation can behave quite strangely, due to tabstops
- No horizontal rule (can be emulated)
- There's a bug with bottom border of a paragraph
On the other hand, MigraDoc & PDFsharp are less than 1MB and plenty fast, so it's a great package, as long as you can build some workarounds to achieve the desired look.
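For the column width limitation in the list above, the "estimate based on the text" workaround could look roughly like this. It uses character counts as a crude stand-in for rendered width, which is an assumption (real text width depends on font and size), and it does not touch the MigraDoc API; you would feed the resulting numbers into your column definitions.

```python
def estimate_column_widths(rows, total_width_cm, min_width_cm=1.5):
    """Split total_width_cm across columns, weighted by longest cell text."""
    if not rows:
        return []
    column_count = len(rows[0])
    if total_width_cm < column_count * min_width_cm:
        raise ValueError("total width is too small for the minimum column width")

    # Longest cell per column, measured in characters (a rough proxy only).
    longest = [
        max(len(str(row[i])) for row in rows) for i in range(column_count)
    ]
    weight_sum = sum(longest)
    if weight_sum == 0:
        # Nothing to weigh by, so fall back to equal spacing.
        return [total_width_cm / column_count] * column_count

    # Every column gets the minimum; the rest is shared proportionally.
    spare = total_width_cm - column_count * min_width_cm
    return [min_width_cm + spare * length / weight_sum for length in longest]
```

The widths always add up to the total you pass in, so the table still fits the page even when the estimate is off.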
https://developer.mozilla.org/en-US/docs/Web/CSS/Guides/Medi...
How do you handle deployment / packaging of multiple, different ecosystems?
Python and others have similar issues, with them having limitations as well
Do you use any library or are you just calling the standard TeX CLI tools?
IMHO the list is incomplete without it.
1: https://github.com/ststeiger/PdfSharpCore
[1] https://github.com/SixLabors/ImageSharp/blob/main/LICENSE
https://web.archive.org/web/20251104163604/https://codeload....
That said, for many niche products you are correct.
https://rapidapi.com/yakpdf-yakpdf/api/yakpdf (open via firefox)
edit: Stop spamming your own service.
Thanks for this great research!