Ask HN: What is the current (open-source) way of converting HTML to PDF?

pabs3 · 5 months ago

Just print to PDF in a browser, or automate that using a browser automation tool. For a non-browser-based open source solution, WeasyPrint.

https://weasyprint.org/

For a proprietary solution, try Prince XML:

https://www.princexml.com/

grounder · 5 months ago

WeasyPrint works really well for me. It can support all of the languages and fonts I need. I run it on AWS Lambda and in Docker as a web service.

I previously used wkhtmltopdf, but it hasn't been maintained for years and doesn't support recent CSS. It does support JS, but if you need JS I'd probably look at headless Chromium or a similar solution instead.

Edit: Previous post with some good discussion: https://news.ycombinator.com/item?id=26578826

stuaxo · 5 months ago

This is my experience and recommendation too.

rossdavidh · 5 months ago

+1 to weasyprint; I have used weasyprint with a Django production system for a few years now, and it works well enough that I never have to think about it. I'm not doing anything fancy, but for me it has worked well.
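
For anyone who hasn't seen it, here is a minimal sketch of what the WeasyPrint route can look like from Python. It assumes the weasyprint package is installed; the invoice layout and the names build_invoice_html and render_pdf are made up for illustration and are not taken from anyone's setup in this thread. The @page rule puts page numbers in the footer using CSS page-margin boxes.

```python
from html import escape


def build_invoice_html(customer: str, lines: list[tuple[str, float]]) -> str:
    """Return a small, self-contained HTML document (hypothetical invoice layout)."""
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{price:.2f}</td></tr>" for name, price in lines
    )
    total = sum(price for _, price in lines)
    return f"""<!doctype html>
<html>
<head>
<meta charset="utf-8">
<style>
  @page {{
    size: A4;
    margin: 2cm;
    /* Page-margin box: a "Page 1 of 3" style footer on every page. */
    @bottom-center {{ content: "Page " counter(page) " of " counter(pages); }}
  }}
  table {{ width: 100%; border-collapse: collapse; }}
  tr {{ break-inside: avoid; }}
</style>
</head>
<body>
<h1>Invoice for {escape(customer)}</h1>
<table>{rows}</table>
<p>Total: {total:.2f}</p>
</body>
</html>"""


def render_pdf(html: str, out_path: str) -> None:
    """Render an HTML string to a PDF file with WeasyPrint."""
    # Imported lazily so the HTML builder works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=html).write_pdf(out_path)
```

Usage would be something like render_pdf(build_invoice_html("ACME", [("Widget", 9.5)]), "invoice.pdf"). Keeping HTML generation separate from rendering makes the template easy to check in a browser first.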

jmyeet · 5 months ago

I’ve had excellent experience with Prince XML and poor experience with everything else I’ve tried. Prince is fast, robust and reliable.

Yes it costs money. So does developer time.

angst_ridden · 5 months ago

Agreed. Prince also has a lot of good features for headers, footers, page numbering, etc, that make it very powerful.

thenews · 5 months ago

https://stirlingpdf.io also uses weasyprint !!

alsetmusic · 5 months ago

There was a critical book that I read two years ago that is only available online. The web presentation is full of images of maps, artifacts, etc to help contextualize the content. No PDF converter tool has ever been up to the job of just extracting the text until this one. Thank you!

Semaphor · 5 months ago

I'll join the choir. We use weasyprint for ebooks and invoices and it's a joy to use. Feature support has grown massively over the last few years (partially thanks to some monetary sponsorships); it started out pretty bare-bones and is now close to commercial solutions.

The maintainers are also very responsive, and helpful.

Amazing project

hulitu · 5 months ago

> Just print to PDF in a browser

I tried yesterday. With compliments to the moms of the SWEs who coded the functionality in Firefox. Apparently putting the screen on a PDF page is an insurmountable task in 2025 (20 years ago it was still doable). I had to take a screenshot and process the picture to print it.

carlosjobim · 5 months ago

Orion browser produces PDFs which are exactly what you see on screen.

jiehong · 5 months ago

Most websites do not have a print CSS, so they don’t print that nicely to PDF.

But, I upvote weasyprint for that instead.

hulitu · 5 months ago

> Most websites do not have a print CSS, so they don’t print that nicely to PDF.

Can't they just render the screen content into a PDF? It seems easy for other programs to do this.

sureglymop · 5 months ago

Prince XML looks nice but what about creating a PDF directly from a website? This often adds some problems, for example links still pointing to other pages on the web. But in my experience printing to PDF is often not good enough.

chinathrow · 5 months ago

Yes, I did that for a recent small program. The @media print media query is powerful enough for most of the stuff I wanted to format nicely. Even page breaks are possible.

rcarmo · 5 months ago

These two are the only right answers if you want a reliable, reproducible, relatively low resource experience. Running a browser engine has always been hard to maintain in the long run for me.

sodimel · 5 months ago

+1 - Weasyprint is an excellent tool to make pdf from html content, and we're using it at work (with django) to export various documents.

bluebarbet · 5 months ago

Seconded. In my eccentric workflow, I use Weasyprint to convert HTML emails to more portable PDFs. A surprisingly successful experiment.

kappadi3 · 5 months ago

Puppeteer and Playwright are the main open-source options nowadays, both solid for HTML → PDF once your print CSS is sorted. Don’t forget proper page breaks (break-before/after/inside): for example, break-after: page works in Chromium, while break-after: always doesn’t. For trickier pagination you can look at Paged.js, and I’d test layouts in Chrome/Edge before automating.
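
A rough sketch of that workflow using Playwright's Python bindings, assuming the playwright package and its Chromium build are installed. The PRINT_CSS rules, the .page-break and .no-print class names, and the html_to_pdf helper are illustrative assumptions, not part of any project mentioned here.

```python
# Print stylesheet injected before printing; class names are hypothetical.
PRINT_CSS = """
@media print {
  nav, .no-print { display: none; }
  .page-break { break-after: page; }
  figure, table, pre { break-inside: avoid; }
  h2, h3 { break-after: avoid; }
}
"""


def html_to_pdf(url: str, out_path: str) -> None:
    """Load a page in headless Chromium, add the print CSS, and save a PDF."""
    # Imported lazily so the module loads even where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait for network activity to settle so JS-rendered content is present.
        page.goto(url, wait_until="networkidle")
        page.add_style_tag(content=PRINT_CSS)
        page.pdf(path=out_path, format="A4", print_background=True)
        browser.close()
```

If the site already ships its own print stylesheet, the add_style_tag call can be dropped.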

Shameless plug: I run yakpdf.com, a hosted Puppeteer-based service if you want to avoid self-hosting. https://rapidapi.com/yakpdf-yakpdf/api/yakpdf

johnh-hn · 5 months ago

Seconded. I went with C# + Playwright. I tried iTextSharp, iText, PDFSharp, and wkhtmltopdf, but they all had limitations. I had good results with Playwright in minutes, outside of tweaking the CSS like you mention.

I documented the process here[0] if anyone needs examples of the CSS and loading web fonts. Apologies for the article being long-winded – it was the first one I published.

[0] https://johnh.co/blog/creating-pdfs-from-html-using-csharp

benoau · 5 months ago

Thirded, you can build this straight into your backend or into a microservice very easily.

You can also easily generate screenshots if that's more suitable than PDFs.

You can also easily use this to do stuff like jam a set of images into a HTML table and PDF or screenshot them in that format.

ChuckMcM · 5 months ago

You made me realize that tractor feed roll paper would be really great for printed web pages, no page breaks! Kinda like reading scrolls of yore.

Aachen · 5 months ago

Please don't turn nice formats into a format that's similar to screenshots of text. Pandoc has an option to pack all images and styles needed to render the page into one html file:

    pandoc --self-contained input.html -o output.html

crazygringo · 5 months ago

Or, please do?

I use PDFs so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact.

I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.

nine_k · 5 months ago

PDF is literally digital paper. HTML has logical structure, it can adapt to different displays, etc.

Sometimes you want one, sometimes, the other.

mr_mitm · 5 months ago

You could, though. What you are describing are features of an editor, not a file format. I can imagine a browser addon performing the same tasks.

craftkiller · 5 months ago

I was excited to try this today, but this is unusable. It absolutely mangles the page.

  - It duplicated the headline, one in the correct place top-center but then a 2nd copy of the headline left-aligned below that.
  - It shrunk the width of the content of the page (in fact, it seems to have completely discarded the css for the #content selector)
  - It discarded the CSS for my code blocks, so now they are unreadable.
  - My images are no longer center-aligned
  - It added CSS that was not in the original document. For some reason, it added hyphens: auto, overflow-wrap: break-word, text-rendering: optimizeLegibility, font-kerning: normal. None of those rules existed in the original document anywhere. Now my text is breaking mid-word with hyphens inserted.
  - It pointlessly HTML-escaped some characters (like every quotation mark in every paragraph). This didn't break anything, but just... why?

Implementing the same functionality takes less than 100 lines of Python, so I'm just going to go that route. I've implemented it once before, but that was for a previous company, so I no longer have access to the code. It's about one afternoon of scripting and doesn't randomly destroy your documents. I don't know how pandoc got this so wrong.

For context: the document I am attempting to process has no javascript. It is a simple Emacs Org document (similar to markdown) rendered to HTML and then processed with pandoc. The only external content was a couple of images.
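
To give a feel for the "under 100 lines of Python" approach, here is a standard-library-only sketch. It is not the script described above (that code isn't available), just one possible take: it uses regexes rather than a real HTML parser, only handles double-quoted attributes, expects rel to come before href on stylesheet links, and leaves anything that isn't a local file untouched. The function name embed_resources is made up.

```python
import base64
import mimetypes
import re
from pathlib import Path

IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)
CSS_LINK = re.compile(
    r'<link\b[^>]*?\brel="stylesheet"[^>]*?\bhref="([^"]+)"[^>]*>', re.IGNORECASE
)


def _is_local(ref: str) -> bool:
    """True for relative references; False for http:, data:, //host/ and the like."""
    return not re.match(r"^(?:[a-z][a-z0-9+.-]*:|//)", ref, re.IGNORECASE)


def _data_uri(path: Path) -> str:
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def embed_resources(html: str, base_dir: Path) -> str:
    """Inline local images as data URIs and local stylesheets as <style> blocks."""

    def inline_img(match: re.Match) -> str:
        ref = match.group(2)
        target = base_dir / ref
        if not _is_local(ref) or not target.is_file():
            return match.group(0)  # leave remote or missing files alone
        return match.group(1) + _data_uri(target) + match.group(3)

    def inline_css(match: re.Match) -> str:
        ref = match.group(1)
        target = base_dir / ref
        if not _is_local(ref) or not target.is_file():
            return match.group(0)
        return "<style>\n" + target.read_text(encoding="utf-8") + "\n</style>"

    html = IMG_SRC.sub(inline_img, html)
    return CSS_LINK.sub(inline_css, html)
```

Because everything else in the document is passed through byte for byte, this kind of script cannot add CSS rules or re-escape characters, which is the main appeal over a full conversion pipeline.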

Aachen · 5 months ago

Huh, that's a bummer! I only used it once myself to send colleagues draft versions of some markdown file that would later go on our blog, maybe it somehow helped that the source was markdown instead of html? Not sure, I'm sorry to hear of this disappointing experience :/

jasode · 5 months ago

Fyi... the preferred new syntax since 2022 is:

  --embed-resources --standalone.

https://github.com/rstudio/rmarkdown/issues/2382

https://pandoc.org/MANUAL.html#:~:text=Deprecated%20synonym%...

Aachen · 5 months ago

I noticed when trying it out for this comment, but then looked around when it was introduced and it seems recent (as in, an LTS distribution won't have it). Someone on stackoverflow said they get "unknown option --embed-resources". The old option will work for everyone and is also simpler, one instead of two parameters. People whose client supports the new option will see the upgrade suggestion when they run this. In the end I saw mainly downsides to mentioning the new rather than the old way
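
If a script has to work on both old and new pandoc installs, one option is to pick the flags from the reported version. A small sketch follows; the 2.19 cutoff is my understanding of when --embed-resources appeared and should be treated as an assumption, and all function names here are made up.

```python
import re
import subprocess


def parse_pandoc_version(version_output: str) -> tuple[int, ...]:
    """Extract (major, minor, ...) from the output of `pandoc --version`."""
    match = re.search(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", version_output)
    if not match:
        raise ValueError("could not find a pandoc version number")
    return tuple(int(part) for part in match.group(1).split("."))


def embed_flags(version: tuple[int, ...]) -> list[str]:
    """Choose between the new flag pair and the deprecated single flag."""
    # Assumption: --embed-resources first shipped in pandoc 2.19.
    if version >= (2, 19):
        return ["--embed-resources", "--standalone"]
    return ["--self-contained"]


def pandoc_command(src: str, dst: str, version: tuple[int, ...]) -> list[str]:
    return ["pandoc", *embed_flags(version), src, "-o", dst]


def installed_pandoc_version() -> tuple[int, ...]:
    """Ask the pandoc binary on PATH for its version."""
    result = subprocess.run(
        ["pandoc", "--version"], capture_output=True, text=True, check=True
    )
    return parse_pandoc_version(result.stdout)
```

Running subprocess.run(pandoc_command("input.html", "output.html", installed_pandoc_version()), check=True) would then use whichever spelling the local pandoc understands.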

agedclock · 5 months ago

Pandoc would be my preferred tool. It is excellent at converting between other formats as well.

kelnos · 5 months ago

> Please don't turn nice formats into a format that's similar to screenshots of text

Converting HTML to PDF shouldn't result in an image wrapped in a PDF. Text will be preserved as text in the final PDF. (Unless the converter is garbage, of course.)

Aachen · 5 months ago

If you've ever copied text out of a PDF, you'll know it's not the original text anymore. Besides ligatures, you get broken sentences with extra hyphens inserted in wrong places (that were word/line breaks in the PDF-rendered version), if it'll properly let you select more than a few words at all. It works like "put these couple words at position x,y" and not (html's) semantic "here comes a heading" tag that helps people accessibly read your text, and if you're not suffering from any impairment or mobile devices with narrower screens than this particular render was designed for, it also lets you work with the document more easily. It's like you remove all HTML and keep only the CSS: all definitions of what's a section, sentence, emphasis, or caption are gone

I didn't mean literally an image, hence saying image-like. You get similar limitations to when using OCR, which seems very image-like to me
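
To make the clean-up burden concrete, this is roughly the kind of post-processing that text copied out of a PDF tends to need. It is a heuristic sketch using only the standard library: NFKC normalization also rewrites other compatibility characters (not just ligatures), and the hyphen rule will wrongly join genuinely hyphenated words that happen to break at a line end. The name clean_pdf_text is made up.

```python
import re
import unicodedata

# A hyphen at the end of a line, followed by a lowercase continuation.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n(?=[a-z])")
# A single newline that is not part of a blank-line paragraph break.
_SOFT_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")


def clean_pdf_text(raw: str) -> str:
    """Undo ligatures, line-end hyphenation and hard line wraps (best effort)."""
    # NFKC folds compatibility ligatures such as U+FB01 into plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words that were split across lines with a hyphen.
    text = _LINE_BREAK_HYPHEN.sub(r"\1", text)
    # Treat remaining single newlines as layout, blank lines as paragraphs.
    return _SOFT_NEWLINE.sub(" ", text)
```

None of this guesswork is needed when the source is HTML, since the paragraph and word boundaries are still in the markup.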

layer8 · 5 months ago

HTML+CSS+media files isn’t a nice format, and much less portable through time and space than PDF.

Aachen · 5 months ago

Not sure if I'm misreading your comment, but it's not plural files with all those formats separately

That's what the "self contained" option does: turn it into one nice file. Makes no difference if you copy example.pdf or example.html when both contain all images and styles (except one of them also contains the original semantic text)

TylerE · 5 months ago

Being (not so easily) edited is often a feature, not a bug.

craftkiller · 5 months ago

If that is your goal, you should be cryptographically signing your documents with your PGP key. That way you actually have assurance the document has not been modified rather than just hoping someone hasn't modified the document. Additionally, PGP can sign anything so you are open to use whatever format you want.
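
A sketch of what that could look like when scripted, assuming GnuPG is installed as gpg and a default signing key is configured. The helper names are hypothetical; the flags are standard gpg options for an ASCII-armored detached signature.

```python
import subprocess


def sign_command(document: str) -> list[str]:
    """argv for an ASCII-armored detached signature written to <document>.asc."""
    return ["gpg", "--armor", "--output", document + ".asc", "--detach-sign", document]


def verify_command(document: str) -> list[str]:
    """argv that checks <document> against its detached signature."""
    return ["gpg", "--verify", document + ".asc", document]


def sign(document: str) -> None:
    subprocess.run(sign_command(document), check=True)


def is_unmodified(document: str) -> bool:
    """True if gpg reports a good signature for the document."""
    result = subprocess.run(verify_command(document), capture_output=True)
    return result.returncode == 0
```

The signature lives in a separate .asc file, so the document itself can stay HTML, PDF, or anything else.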

Aachen · 5 months ago

May I recommend .html in that case? You can embed scripts that control who can run it, having it fetch a decryption token from a server or require a decryption password with a safe password hashing algorithm of your choice

It's much more versatile than PDF and, if the algorithm decides the user is allowed to read the document, then the user gets to make use of all of the document's options like a better search function (PDF can't find words that are bro-

ken across lines because that information of what's a word is gone, transformed into coordinates of what characters need to go where). It's also much more readable on different screen sizes, as the user can resize the window to whatever is comfortable on a 27" screen, or fits on their pocket e-reader. You can even draw it on a canvas if you want to prevent people from extracting the decrypted strings (though it's evil, you have that option). There's only benefits!

PDF is the lazy way to half-ass a read-only document while screwing, ahem, making anyone using a mobile phone zoom, pan, and squint. Thankfully, phones are falling out of fash— wait, scratch that, I just heard text reflow is more relevant than ever as phone use continues to soar

ryandrake · 5 months ago

Is this really that much of a motivation in 2025? Maybe in 2000 you could publish a PDF with the assurance that only the people who paid for Acrobat would be able to edit it, but today there are a lot of accessible ways to edit PDFs. I don't think I'd choose PDF if I, for whatever reason, wanted to limit others from editing.

guywithahat · 5 months ago

I was thinking this too; PDFs exist so people don't mess with the document. That said, it's still a clever feature, and pandoc can also convert HTML into a PDF with a conversion engine, though I suspect it'll fail on anything sufficiently complex:

    pandoc input.html -o output.pdf --pdf-engine=<your engine>

moralestapia · 5 months ago

Please don't police what other people do.

Aachen · 5 months ago

If I were police, I could still not enforce that this is what they run until it's law. They're free to choose this option if they like the merits

Snawoot · 5 months ago

chrome --headless --disable-gpu --print-to-pdf https://example.com

piptastic · 5 months ago

same: google-chrome --headless --disable-gpu --no-pdf-header-footer --hide-scrollbars --print-to-pdf-margins="0,0,0,0" --print-to-pdf --window-size=1280,720 https://example.com

ended up using headless chrome specifically to make sure javascript things rendered properly
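
When calling this from a backend, a thin wrapper along these lines can help. It is a sketch that only reuses flags already shown in this thread, plus the --print-to-pdf=<path> form for choosing the output file. The binary name varies by platform (google-chrome, chromium, chrome), so the default here is an assumption, as are the function names.

```python
import subprocess


def chrome_pdf_command(
    url: str,
    out_path: str,
    binary: str = "google-chrome",
    header_footer: bool = False,
) -> list[str]:
    """Build the argv for a headless Chrome print-to-PDF run."""
    cmd = [binary, "--headless", "--disable-gpu", f"--print-to-pdf={out_path}"]
    if not header_footer:
        # Drops the default date/title/URL header and footer.
        cmd.append("--no-pdf-header-footer")
    cmd.append(url)
    return cmd


def print_to_pdf(url: str, out_path: str, timeout: float = 60.0, **options) -> None:
    """Run Chrome and raise if it fails or takes longer than `timeout` seconds."""
    subprocess.run(chrome_pdf_command(url, out_path, **options), check=True, timeout=timeout)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain & or ?.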

hhthrowaway1230 · 5 months ago

Used this, sigh of relief, thank you

HPsquared · 5 months ago

Can Chromium do this?

Edit: it appears so- https://news.ycombinator.com/item?id=15131840

nine_k · 5 months ago

Yes, routinely works for me.

mmphosis · 5 months ago

Can Firefox do this?

with an elaborate script that relies on xdotool

andrehacker · 5 months ago

Yes, kind of...

/path/to/firefox --window-size 1700 --headless -screenshot myfile.png file://myfile.html

Easy, right ?

Used this for many years... but beware:

- caveat 1: this is (or was) a more or less undocumented function and a few years ago it just disappeared only to come back in a later release.

- caveat 2: even though you can convert local files it does require internet access as any references to icons, style sheets, fonts and tracker pixels cause Firefox to attempt to retrieve them without any (sensible) timeout. So, running this on a server without internet access will make the process hang forever.
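
One way to guard against the hang in caveat 2 is to run the command under an external timeout. A sketch, mirroring the flags from the command above; the firefox binary name and the helper names are assumptions. subprocess.run kills the process it started when the timeout expires, but any helper processes the browser spawned may need separate cleanup.

```python
import subprocess


def run_with_timeout(cmd: list[str], seconds: float) -> bool:
    """Run cmd; return True on exit code 0, False on failure or timeout."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode == 0
    except subprocess.TimeoutExpired:
        # The child has already been killed by subprocess.run at this point.
        return False


def firefox_screenshot_command(
    page_url: str, out_png: str, width: int = 1700, binary: str = "firefox"
) -> list[str]:
    """argv equivalent to the one-liner above."""
    return [binary, "--headless", "--window-size", str(width), "-screenshot", out_png, page_url]
```

A call such as run_with_timeout(firefox_screenshot_command("file:///tmp/myfile.html", "myfile.png"), 30) then returns False instead of blocking forever on an offline server.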

jlokier · 5 months ago

Last time I explored this, Firefox rendered thin lines in subtly bordered tables as thick lines, so I had to use Chromium. But back then Chrome did worse at pagination than Firefox.

So I used Firefox for multi-page documents and Chromium for single-page invoices.

I spent a lot of time with different versions of both browsers, and numerous quirks made a very unpleasant experience.

Eventually I settled on Chromium (Ungoogled), which I use nowadays for invoices.

nine_k · 5 months ago

Why, Firefox has a headless mode. It just can't print a document via a simple CLI command; you have to go for Selenium (or maybe Playwright, I did not try it in that capacity). Foxdriver would work, but its development ceased.

lizimo · 5 months ago

If generating PDF dynamically is what you really care about, consider Typst. https://typst.app/ We use it in production to generate reports, and it is amazing.

leephillips · 5 months ago

See https://lwn.net/Articles/1037577/ for a recent summary of what you can do with Typst.

RiverCrochet · 5 months ago

If you don't really need the PDF but just want to archive pages, SingleFile is better. It'll capture the entire page to a single HTML file and I find this is better than the PDF if I don't want to print it. It's a browser extension, but there's also a command line version (https://github.com/gildas-lormeau/single-file-cli) that uses Chrome or Chromium's headless mode.

dredmorbius · 5 months ago

Shortcutting much of the discussion here (what are your goals / why would you do that / don't use format X): a key problem is that neither HTML (as published on today's Web) nor PDF is reliable as a canonical document format. Tagged markup such as Markdown (or other lightweight markup languages) or LaTeX (or other heavy markup languages) is far more robust. Markdown has its variants, but all are pretty simple and easy to produce. LaTeX is slightly more complex, but remains quite straightforward for simple works.

Once you've got an appropriate canonical version in any of these options, you have an embarrassment of riches to convert to any given document format (what I call endpoints) you'd care for: PDF, HTML, RTF, DOCX, or many, many others. I generally reach for Pandoc first, which itself, yes, of course, often relies on additional tools/libraries to parse or generate endpoints, but is quite versatile.

You can simplify the intake of HTML by stripping out cruft. Readability, Beautiful Soup, or other HTML filtering tools can target the core content and metadata you most likely want.

Otherwise, think through what you're doing and why to more narrowly define your goals and tools. E.g., if you want a faithful printed representation of a mainstream-browser-rendered page (that is, Google Chrome), you'd probably do best to use its print-to-PDF options (mentioned several times here). If you want to extract core text, filtering out much of today's WWW cruft will be a high priority.
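
As a rough illustration of the "strip out cruft" step, here is a standard-library-only sketch. It is far cruder than Readability or Beautiful Soup: the SKIP and BLOCK tag sets are arbitrary choices for this example, and unbalanced markup inside skipped elements will confuse the depth counter.

```python
from html.parser import HTMLParser

# Elements whose entire content is dropped (illustrative choice).
SKIP = {"script", "style", "nav", "footer", "aside", "form", "noscript"}
# Elements that start a new line in the output.
BLOCK = {"p", "div", "article", "section", "blockquote", "li", "tr", "br",
         "h1", "h2", "h3", "h4", "h5", "h6"}


class CoreText(HTMLParser):
    """Collect visible text, skipping boilerplate containers."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in BLOCK and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html: str) -> str:
    parser = CoreText()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The output is plain text with one block per line, which is easy to feed into Markdown or Pandoc as the canonical version.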

trollbridge · 5 months ago

I wrote a solution in 2010 that used headless Firefox with some plugins to generate a PDF and then had the graphic designer write print CSSes. It was driven by Perl and was a convenient way for non-programmers to design forms.

Unfortunately, that server and software stack is still around and still in production.

znpy · 5 months ago

> Unfortunately, that server and software stack is still around and still in production.

that means you did a good job.

Dwedit · 5 months ago

2010-era Firefox is probably plagued by security holes.