Launch HN: Onedoc (YC W24) – A better way to create PDFs

FYI: the open source state of the art in this area is Playwright (the successor to Puppeteer) with Paged.js (https://pagedjs.org/). I highly recommend that everyone check out and donate to paged.js, it's a fantastic project with lots to like. It certainly blows commercial alternatives like Prince XML out of the water.

That forms a solid foundation, and I find it hard to imagine paying for what it already covers. The things where you might still command a premium are basically safety mechanisms, CI checks, and library components that ensure the PDF renders correctly in the presence of variable-length content, as well as maybe PDF-specific features like metadata and fillable forms. Naive ways to format headers, footers, tables, grids, flexboxes, etc. often fail in PDFs because of unexpected layout complications. So having a methodology, process, and validation system for ensuring that a mission-critical piece of information appears on a PDF in the presence of these constraints could be attractive.

caesil · 2 years ago

I think https://github.com/diegomura/react-pdf is closer to what this company is doing.

In fact their open source library, https://github.com/OnedocLabs/react-print-pdf, seems like a higher-level library that sits above react-pdf. Reminds me a lot of the set of react-pdf based components I built for a corporate job where letting users create PDFs was a huge part of the value proposition.

They're solving a really cool problem, actually, because building out into certain difficult use cases like SVG support was a huge pain.

AugusteLef · 2 years ago

Exactly. We aim to offer a solution for building complex PDF designs, which means having 100% control over the layout (margins, header, footer), the style, and the content. That's why we integrated Tailwind, Chakra UI, Markdown, and LaTeX, and also wanted to support SVG, etc.

Titou325 · 2 years ago

We are currently experimenting with this approach. A good thing about paged.js is that we would be able to provide hot-reload and live preview of files without actually converting to PDF.

Your second point is very interesting, seems like some kind of .assert('text').isVisible() API. We may want to dig into that further!
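A check along those lines could be sketched roughly as follows. This is only an illustration, not an existing Onedoc or paged.js API: it assumes the text of each page has already been extracted from the generated PDF by some other tool, and `find_missing` is a hypothetical helper name.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and case so line wraps in extracted text don't cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_missing(pages: list[str], required: list[str]) -> list[str]:
    """Return every required string that appears on no page of the extracted text."""
    haystacks = [normalize(page) for page in pages]
    return [item for item in required
            if not any(normalize(item) in haystack for haystack in haystacks)]


# Hypothetical extracted text, one string per page.
pages = ["Invoice #1042\nTotal due:\n$1,250.00", "Terms and conditions"]
missing = find_missing(pages, ["Total due: $1,250.00", "IBAN"])
print(missing)  # ['IBAN']
```

Note that this only proves the text is present somewhere in the document. It says nothing about clipping or overlap, and a string split across two pages would be reported as missing.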

rudasn · 2 years ago

Or maybe some visual diffing against the expected output for the template/layout/theme used, since you'd want to perform this check on every PDF generated in prod (with real, sensitive data), not just in CI or testing mode, if you're aiming for critical docs.

Cool project btw, congrats for the launch!

timvdalen · 2 years ago

(How) does it handle CMYK and print PDFs? I see images of printed books created by Paged.js; were these post-processed, or printed using a printer that does a best-effort RGB conversion?

ak217 · 2 years ago

I'm not sure - we don't do color correction on our PDFs because we don't have photos in them and color rendering is not mission critical - but paged.js is focused on the concern of layout for print media. I would imagine color rendering can be solved orthogonally to what paged.js does for you, as long as you specify the color data in CSS. I'm pretty sure paged.js will pass it through without messing with it, so you're good if the browser that Playwright/puppeteer is driving supports the correct color profile when emitting the PDF. I honestly don't know if browsers have sufficient support for that when emitting a PDF, though.

Overall you're right that color correction is another area where you could probably command a premium.

Mick-Jogger · 2 years ago

Isn't Playwright a testing framework? I am not sure how this solves the use case that Onedoc is aiming for. I would be highly interested in some more background, as we are evaluating alternatives to Prince XML right now.

ak217 · 2 years ago

Playwright at its core is a headless browser driver. In this case, we are using it to tell the browser to generate a PDF.
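For anyone who has not seen it, a minimal sketch with Playwright's Python bindings might look like this. It assumes the `playwright` package and its Chromium build are installed; `pdf_options` and `render_pdf` are names made up for this example, and the chosen options are just one plausible configuration.

```python
def pdf_options(path: str, paper: str = "A4") -> dict:
    """Keyword arguments passed to Playwright's page.pdf()."""
    return {
        "path": path,
        "format": paper,
        "print_background": True,       # keep CSS backgrounds in the output
        "prefer_css_page_size": True,   # let @page rules in the document win
    }


def render_pdf(html: str, path: str) -> None:
    """Load an HTML string in headless Chromium and ask the browser to emit a PDF."""
    # Third-party dependency, imported lazily so the module loads without it.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html, wait_until="networkidle")
        page.pdf(**pdf_options(path))
        browser.close()
```

Calling `render_pdf("<h1>Hello</h1>", "out.pdf")` would then write a one-page PDF. If paged.js is used, its script would have to be included in the HTML so that pagination runs before the PDF is emitted.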

Maybe this is just me, but this looks extremely costly! It will cost $2,500 to generate 50,000 PDFs. Do edits/corrections cost extra?
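To put the figure quoted above in per-document terms, here is the arithmetic as a tiny sketch. `unit_price` is a hypothetical helper, and the numbers are simply the ones from this comment, not an official price list.

```python
def unit_price(total_usd: float, documents: int) -> float:
    """Average price per document for a given total spend."""
    return total_usd / documents


print(f"${unit_price(2500, 50_000):.2f} per PDF")  # $0.05 per PDF
```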

jot · 2 years ago

It sounds like this is as advanced as DocRaptor[1]. They have what I consider to be the best PDF generation API, giving complete control over the documents you need to create. The pricing is similar.

If you'd rather do it for free, WeasyPrint[2] is the best open source alternative.

Another more affordable option you might want to consider is Urlbox[3]. (Disclosure: I work on this)

Urlbox's rendering engine is based on Chrome. It's been refined over the last 11 years to render pages as images or PDFs[4] that look great. I was a customer for 5 years before I joined the team. Everything we'd tried before Urlbox was a disappointment.

Urlbox probably can't match the power of either Onedoc or DocRaptor, but pricing starts at less than $0.01 per document and drops significantly with scale. If your PDF looks great when saving as PDF in Chrome it should look identically brilliant with Urlbox.

[1]: https://docraptor.com

[2]: https://weasyprint.org

[3]: https://urlbox.com

[4]: https://urlbox.com/html-to-pdf

Titou325 · 2 years ago

This is a good point, and we are still trying to figure out how to price things fairly. Depending on the type of PDF, whether it is a simple receipt or a large multi-page report, the associated costs are very different on our side. At this time, we rely on other proprietary software that we are aiming to replace but that incurs high costs on our side as well.

Edits and corrections on generated PDFs are not provided, as the PDFs are signed as-is. However, you can attach the metadata to the PDF and rerender with the modifications.

mediaman · 2 years ago

As a point of reference on pricing, convertAPI charges $0.05 per document conversion at their most expensive tier, and with any level of fixed commitment ($80 - $300 per month) it goes down to $0.016-0.006 per document.

Their PDF conversion is pretty good (I use it for PPT/Word -> PDF conversion), though your product is obviously different and has different/better capabilities for programmatic PDF creation. Still, a reference point.

Pricing page: https://www.convertapi.com/prices

passion__desire · 2 years ago

Edits would be limited to certain pages but may spill over (e.g. tables) so the whole PDF need not be generated. Only edited pages can be inserted back to previously generated PDFs. Could be an optimization to reduce cost.

snadal · 2 years ago

I second this. Maybe I'm missing something in the value proposition, but we already generate PDFs from .docx/.html templates using open source libraries and Docker microservices.

Don't get me wrong. A Stripe for generating PDFs can be great, but for a small team, $0.50/PDF is way more than I can afford (after all, you can create a small number of PDFs without too much fuss). Maybe you are oriented towards large companies?

AugusteLef · 2 years ago

Indeed, and as you mentioned, open-source libraries are always an option. It's worth noting that our open-source library assists in document design, allowing freedom in renderer choice. While the open-source library is aimed at individuals, our API targets businesses of any size. Our pricing can be as low as $0.05 per PDF for high-volume or annual commitments. Additionally, we offer cloud hosting for your documents for up to 90 days, and our pricing includes analytics.

pdabbadabba · 2 years ago

> $0.50/PDF is way more than I can afford

But isn't that 100x what they're actually charging, at least for an enterprise account? Their pricing page says "from $0.005/doc." (Though I'm not sure how much work "from" is doing there.) Pro tier is, admittedly, more like $0.12 per document (assuming you use your full quota). But that is still much less than $0.50.

I'm generally very confused by the various assertions in this thread about their pricing. What am I missing?

adnans · 2 years ago

We use https://www.api2pdf.com/pricing/ and it's priced per bandwidth and usage ($0.001 per MB of bandwidth and $0.00019551 per second of computation).

You can choose which API to use: Headless Chrome, wkhtmltopdf, LibreOffice, etc.
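As a rough sketch of how usage-based rates like these add up per job, assuming the two rates quoted above are simply summed (the job size below is made up, and `job_cost` is a hypothetical helper):

```python
PER_MB_USD = 0.001           # bandwidth rate quoted above
PER_SECOND_USD = 0.00019551  # compute rate quoted above


def job_cost(megabytes: float, seconds: float) -> float:
    """Cost of one conversion under bandwidth-plus-compute pricing."""
    return megabytes * PER_MB_USD + seconds * PER_SECOND_USD


# Hypothetical job: a 2 MB PDF that takes 3 seconds to render.
cost = job_cost(megabytes=2, seconds=3)
print(f"${cost:.6f} per document")
```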

egnehots · 2 years ago

The main issue is conflating templating and pdf generation.

Using HTML-to-PDF solutions allows you to do the templating in HTML, where it is pretty much a solved problem.

And as many said, headless Chrome is a robust HTML-to-PDF solution, even though it feels like a hack.

But, yeah, there seems to be a lack of awareness about these options within corporations. So, kudos to you for addressing a genuine problem!

pedro120 · 2 years ago

Indeed, we aim to bundle this in a way that makes it easy and obvious for enterprises to build their PDFs that way.

yencabulator · 2 years ago

Typst is a typesetting language that makes programmatic layout and processing JSON input pretty darn simple. I make invoices by having a Typst template read line items from a JSON file.

https://github.com/typst/typst

adfaure · 2 years ago

Just spent my Sunday creating my invoice template in typst as well. I enjoyed it, and I could do what I wanted quickly!

plopz · 2 years ago

The problem with Chrome is performance: it is very slow and uses a lot of memory. There was a neat post here a while ago about generating PDFs faster: https://news.ycombinator.com/item?id=39379690

Indeed, speed is an issue (and it's hard to tackle). Additionally, when using Chrome, what you see is not always what you get. The layout often doesn't match expectations, especially with complex elements. It's OK for simple use cases, but for professional and scalable solutions, you usually need to switch to something else!

gzapp · 2 years ago

There are also a few good options in a lot of languages for streamlining chromium use.

In C# I'd look to use the Playwright library or perhaps even embed Chromium via CefSharp if I were trying to avoid extra processes.

It seems there isn't a solution that satisfies everyone so far, indeed. With concerns about languages supported, functionalities, security, etc., there's certainly a lot of room for improvement in this space to offer a better solution!

midenginedcoupe · 2 years ago

I've also spent much longer than I'd like on this same problem. Having a lightweight-enough service to convert html->pdf on the fly, with good fidelity, and that can create an accessible pdf seems to be impossible.

If you can nail accessible PDFs then you'd open up a very big government market.

We felt the same, and that's precisely why we built this tool! The key, as you mentioned, is fidelity, especially for designing complex layouts. We hope to bring something new and valuable to the table. And yes, documents are central to many industries including government, legal, banking etc.

dmazzoni · 2 years ago

Can you directly answer whether your tool generates tagged PDFs?

Of course, you can't guarantee that the resulting document is 100% compliant because you can't enforce that the input is valid, but are you at least outputting a complete tag tree with as much semantics as possible given the input?

matteason · 2 years ago

Really interesting product. I do agree that the pricing seems steep ($0.25/document on Pro on the most generous tier) but I don't know enough about pricing B2B products to know if that would be a blocker.

I agree that HTML -> PDF can be a really powerful tool. I worked on the UK government's tool to generate energy efficiency labels for consumer goods [0] and we ended up doing PDF generation with SVG templates, using Open HTML to PDF [1] for the conversion. That ended up working very well, though as you allude to there can be some gotchas (e.g. unsupported CSS features) that you need to work around.

A few questions:

- Do the rendered documents support PDF's various accessibility features?

- How suitable is this for print PDF generation? For example, what version of the PDF spec do you target? What's your colour profile support like? Do you support the different PDF page boxes (MediaBox, CropBox, BleedBox, TrimBox, ArtBox)?

[0] https://github.com/UKGovernmentBEIS/energy-label-service

[1] https://github.com/danfickle/openhtmltopdf

The pricing does go down for larger volumes and is something we still have to narrow down to the exact place that makes sense to companies and is also viable.

- We do not force PDF/* profiles on the user, but it seems that for most of them PDF/UA-1 would be a sensible default. We can extract most of the tags from the HTML semantics by themselves, which makes it much easier.

- We target the PDF 1.7 spec. Color profiles can be changed and you can use a custom .icc profile, with the corresponding embedding restrictions based on the document format. MediaBox is supported through the @page size property. Bleed, trim and marks can be added using vendor-specific CSS properties. We don't support ArtBox yet but this is something we can look into! So far none of our customers really wanted to take this out to a real print shop, but we would be glad to help people go down this route :)
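For readers unfamiliar with the page boxes mentioned here, the sketch below shows how they relate for a trimmed sheet with bleed. It is a simplified illustration, not how Onedoc computes them: it assumes the MediaBox is exactly the trim size plus bleed on every side (no extra area for printer marks), and `page_boxes` is a hypothetical helper. PDF boxes are rectangles in points (1/72 inch), given as lower-left and upper-right corners.

```python
MM_TO_PT = 72 / 25.4  # PDF user space units (points) per millimetre


def page_boxes(trim_w_mm: float, trim_h_mm: float, bleed_mm: float) -> dict:
    """Return MediaBox, BleedBox and TrimBox as (llx, lly, urx, ury) tuples in points."""
    bleed = bleed_mm * MM_TO_PT
    width = trim_w_mm * MM_TO_PT + 2 * bleed
    height = trim_h_mm * MM_TO_PT + 2 * bleed
    media = (0.0, 0.0, width, height)
    return {
        "MediaBox": media,
        "BleedBox": media,  # assumption: no marks area outside the bleed
        "TrimBox": (bleed, bleed, width - bleed, height - bleed),
    }


# A4 trim size (210 x 297 mm) with a 3 mm bleed on each side.
for name, box in page_boxes(210, 297, 3).items():
    print(name, [round(v, 2) for v in box])
```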

So are you saying that you don't output tagged PDFs now?

For those who don't know, if you use Chromium's print-to-pdf feature you get a tagged PDF. And it's scriptable from the command-line too.
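A minimal sketch of scripting that from Python follows. The binary name is an assumption (it may be `chromium`, `chromium-browser`, `google-chrome`, or a full path depending on the system), and the helper names are invented for this example; only the `--headless` and `--print-to-pdf` switches are used.

```python
import subprocess


def print_to_pdf_command(url: str, output: str, binary: str = "chromium") -> list[str]:
    """Build the argv for a headless Chromium print-to-PDF run."""
    return [binary, "--headless", f"--print-to-pdf={output}", url]


def print_to_pdf(url: str, output: str, binary: str = "chromium") -> None:
    """Run the browser and raise CalledProcessError if it exits with an error."""
    subprocess.run(print_to_pdf_command(url, output, binary), check=True)


print(print_to_pdf_command("https://example.com", "out.pdf"))
```

Whether the output is tagged depends on the Chromium version in use, so it is worth checking the result with a PDF accessibility checker.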

somberi · 2 years ago

Useful service and a large problem space. Congrats and all the best. As someone who is a target customer, my 2 cents:

a. If this is a strategic value for my pipeline (and it is), we are going to code it ourselves, only because we can host it inside our fences. Critical customer data and hence.

b. The pricing is way off and is not reflective of the cost or value (for us). Even if it was 1/10th of the prices you charge, it will still be a no-go. At the volumes we have, it makes sense to build this ourselves.

c. SOC2 / ISO27001 - You might want to obtain them asap if you are looking to sell to outsourcing companies or FSG.

certifications (SOC2 / ISO27001) and offer an on-premise solution! I see there's already a discussion about pricing, so I'll leave that be. However, would an unlimited volume at a fixed cost (and self-host) be an attractive solution? It could be interesting for very high volumes.

I can tell you that the world I operate in will want something like what you are proposing (fixed rate + OnPrem) and the pricing is going to have a ceiling because building this in-house is a real and viable alternative. Our problem is not so much lack of talent but other product-roadmap priorities. What is the ceiling? I do not know, but can hazard a guess. 1/4th of the yearly cost of a good developer.

HatchedLake721 · 2 years ago

Curious, with ~$0.005 per document, what volumes do you do that pricing becomes a no-go for you?

In the long term, ~$0.005 per page (as opposed to document, which I assume hatchedlake meant) say on a mortgage document (~300 pages per) it adds up. The other alternative, which is to build this in-house (say 3 months and custom build, edge cases, such goodies), is more desirable (for us).

Brajeshwar · 2 years ago

Leoko · 2 years ago

I had to deal a lot with PDF generation over the past few years and I was very unhappy with the ecosystem that was available:

1. HTML-to-PDF: The web has a great layout system that works well for dynamic content. So using that seems like a good idea. BUT it is not very efficient, as a lot of these libraries simply spin up a headless browser or deal with virtual DOMs.

2. PDF Libraries (like jsPDF): They mostly just have methods like ".text(x, y, string)", which is an absolute pain to work with when building dynamic content or creating complex layouts.

This was such a pain point in various projects I worked on that I built my own library that has a component system to build dynamic layouts (like tables over multiple pages) and then computes that down to simple jsPDF commands. Giving you the best of both worlds.
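The general idea of computing a flowing layout down to absolute draw commands can be sketched like this. It is not the painless-pdf API, just an illustration with invented names (`TextCommand`, `layout_rows`) and fixed row heights, showing how table rows end up as positioned text calls spread over several pages.

```python
from dataclasses import dataclass


@dataclass
class TextCommand:
    """One absolute draw call, similar in spirit to a low-level text(x, y, string)."""
    page: int   # zero-based page index
    x: float
    y: float    # distance from the top of the page
    text: str


def layout_rows(rows: list[str], page_height: float = 100.0, margin: float = 10.0,
                row_height: float = 20.0, x: float = 10.0) -> list[TextCommand]:
    """Place rows top to bottom, starting a new page when the next row would cross the bottom margin."""
    commands = []
    page, y = 0, margin
    for row in rows:
        if y + row_height > page_height - margin:
            page += 1   # page break: continue at the top of the next page
            y = margin
        commands.append(TextCommand(page, x, y, row))
        y += row_height
    return commands


for cmd in layout_rows(["row 1", "row 2", "row 3", "row 4", "row 5"]):
    print(cmd)
```

With these defaults four rows fit on a page, so the fifth row lands at the top of the second page. A real layout engine would also have to measure text, repeat table headers, and handle rows taller than a page.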

Hope this makes somebody's life a bit easier: https://github.com/DevLeoko/painless-pdf

chrisfinazzo · 2 years ago

Is there a reason you didn't consider something like Weasyprint?

https://weasyprint.org

Going all the way down to raw HTML is a bit verbose, but with almost anything I've thrown at it - CV's, business cards, you name it - it hasn't let me down yet.

epgui · 2 years ago

I just considered weasyprint and couldn't figure out where to put my credit card or where to go to get started or to see some docs, so that was a very short-lived consideration.

Crowberry · 2 years ago

I'm with you..

We ended up writing a similar wrapper around the https://github.com/jung-kurt/gofpdf library. We haven't open sourced it yet, but it's made it a lot easier to deal with rendering a PDF, especially over page breaks, etc.

aforwardslash · 2 years ago

A while ago I created a pdf report generation engine for Python, supporting Jinja2 template syntax, and server and client-side generation of content. Page formatting is handled by https://pagedjs.org/, and PDF generation is performed via a separate api daemon based on chrome-headless: https://zipreport.github.io/zipreport/ It is not fast, but it works quite well.

Yes, page breaks are probably the most significant difference between the layout of a web page and a PDF document, and thereby a major drawback when using HTML-to-PDF. There is little to no tooling for this on the web.

If you want granular control over how your PDF will look with content that is more than one page long, you will have a hard time using HTML.

Gualdrapo · 2 years ago

It seems TeX/LaTeX is a major inspiration for this, though there is some room for improvement in details like hyphenation, expansion/protrusion, and microtypography. Not sure if/how a web engine can get to that level, but it still seems this has a potential niche and market, so congrats.

Though personally I wish stuff like ConTeXt was more popular and approachable - to my humble knowledge their Lua backend seems to have huge potential, I am doing my invoices with ConTeXt/Lua.

It definitely is! Typesetting quality was the main reason we chose not to go down the Puppeteer/headless browser route but rather use a completely separate engine where typography is a first-class citizen.

We like LaTeX, but even for advanced users laying things out can be a difficult thing. Given that documents are a frontend, we wanted to bring the same tools frontend developers already use.