Why start over? We’ve worked a lot with Chrome headless at our previous company, scraping millions of web pages per day. While it’s powerful, it’s also heavy on CPU and memory usage. For scraping at scale, building AI agents, or automating websites, the overheads are high. So we asked ourselves: what if we built a browser that only did what’s absolutely necessary for headless automation?
Our browser is made of the following main components:
- an HTTP loader
- an HTML parser and DOM tree (based on Netsurf libs)
- a JavaScript runtime (V8)
- partial Web API support (currently DOM and XHR/Fetch)
- and a CDP (Chrome DevTools Protocol) server to allow plug & play connection with existing scripts (Puppeteer, Playwright, etc.).
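For example, an existing Puppeteer script should only need to be pointed at the CDP endpoint. A minimal sketch, assuming the server is listening on ws://127.0.0.1:9222 as in the run shown further down (coverage of page-level APIs may vary):

```
import puppeteer from "puppeteer-core";

// Connect an existing Puppeteer script to the CDP server instead of
// launching Chrome. Assumes Lightpanda was started with
// --host 127.0.0.1 --port 9222.
const browser = await puppeteer.connect({
  browserWSEndpoint: "ws://127.0.0.1:9222",
});

const page = await browser.newPage();
await page.goto("https://wikipedia.com/");
console.log(await page.title());

await page.close();
await browser.disconnect();
```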
The main idea is to avoid any graphical rendering and just work with data manipulation, which in our experience covers a wide range of headless use cases (excluding some, like screenshot generation).
In our current test case Lightpanda is roughly 10x faster than Chrome headless while using 10x less memory.
It's a work in progress: there are hundreds of Web APIs, and for now we just support some of them. It's a beta version, so expect most websites to fail or crash. The plan is to increase coverage over time.
We chose Zig for its seamless integration with C libs and its comptime feature, which allows us to generate bi-directional native-to-JS APIs (see our zig-js-runtime lib: https://github.com/lightpanda-io/zig-js-runtime). And of course for its performance :)
As a company, our business model is based on a managed cloud offering: browser as a service. Currently, this is primarily powered by Chrome, but as we integrate more web APIs it will gradually transition to Lightpanda.
We would love to hear your thoughts and feedback. Where should we focus our efforts next to support your use cases?
Our idea is to build a lightweight browser optimized for AI use cases like LLM training and agent workflows. And more generally any type of web automation.
Happy to answer any questions.
I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website. But it should even out or get worse as the websites get richer. The nature of web scraping is that the worst sites take up the vast majority of your CPU cycles. I don't think lowering the RAM usage of the browser process will have much real-world impact.
Regarding the RAM usage, it's still ~10x better than Chrome :) It seems to come mostly from V8; I guess we could do better with a lightweight JS engine alternative.
Then, as the article points out, the Big Guns making the LLMs are a big use case for this because they get a 10x speedup and can begin contemplating running JS.
It sounds like the people you've talked to are in a messy middle: no incentive to improve efficiency of loading pages, simply because there's something else in the system that has a fixed cost to it.
I'm not sure why that would rule out improving anything else; it doesn't seem like they should be stuck doing nothing other than flailing around for cheaper LLM inference.
> I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website.
I'm a bit lost: the RAM usage benchmark says it's ~10x less, and you feel it's deceptive because you'd expect RAM usage to be less? Steelmanning: 10% of Chrome's usage is still too high?
Can you share if you've looked into how your browser fares against bot detection tools like these?
At a _bare_ minimum, that means obeying robots.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that. It goes without saying that you should not allow users to make hundreds or thousands of "blind" parallel requests, as these tend to have the effect of DoSing sites that are hosted on modest hardware. You should also be measuring response times and throttling your requests accordingly. If a website issues a response code or other signal that you are hitting it too fast or too often, slow down.
I say this because since around the start of the new year, AI bots have been ravaging what's left of the open web and causing REAL stress and problems for admins of small and mid-sized websites and their human visitors: https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...
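Even a rough version of those defaults is not much code. A hedged sketch (naive robots.txt handling, illustrative names and thresholds, not tied to any particular browser or library):

```
const USER_AGENT = "example-crawler"; // illustrative name

// Very naive robots.txt check: only honours bare "Disallow:" prefixes and
// ignores per-agent groups, wildcards, and Crawl-delay.
async function isDisallowed(url: string): Promise<boolean> {
  const { origin, pathname } = new URL(url);
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return false;
  const rules = (await res.text())
    .split("\n")
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim());
  return rules.some((prefix) => prefix !== "" && pathname.startsWith(prefix));
}

async function politeFetch(url: string): Promise<Response | null> {
  if (await isDisallowed(url)) return null; // no override for disallowed paths

  let backoffMs = 1_000;
  for (let attempt = 0; attempt < 4; attempt++) {
    const res = await fetch(url, { headers: { "User-Agent": USER_AGENT } });
    // 429/503 is the site saying "too fast or too often": slow down and retry.
    if (res.status !== 429 && res.status !== 503) return res;
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
    backoffMs *= 4;
  }
  return null; // give up rather than keep hammering the site
}
```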
The comparison to DRM makes sense. Gimping software to disempower the end user based on the desires of content publishers. There's even probably a valid syllogism that could make you bite the bullet on browsers forcing you to render ads.
Software I install on my computer needs to do what I want as the user. I don't want every random thing I install to come with DRM.
The project looks useful, and if it ends up getting popular I imagine someone would make a DRM-free version anyway.
There are so many variables involved that it’s hard to predict what it will mean for the open web to have a faster alternative to headless Chrome. At least it isn’t controlled by Google directly or indirectly (Mozilla’s funding source) or Apple.
Also, throttling requests and following robots.txt actually makes it less likely that your scraper will be blocked, so even for those who don't care about the ethics, it's a good thing to have ethical defaults.
One question: which JS engines did you consider, and why did you choose V8 in the end?
The code is designed to support other JS engines in the future. We do want to add a lightweight alternative like QuickJS or Kiesel (https://kiesel.dev/).
https://github.com/BrowserBox/BrowserBox/
```
./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222
info(websocket): starting blocking worker to listen on 127.0.0.1:9222
info(server): accepting new conn...
info(server): client connected
info(browser): GET https://wikipedia.com/ 200
info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/index-2...: http.Status.ok
info(browser): eval script portal/wikipedia.org/assets/js/index-24c3e2ca18.js: ReferenceError: location is not defined
info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/gt-ie9-...: http.Status.ok
error(events): event handler error: error.JSExecCallback
info(events): event handler error try catch: TypeError: Cannot read properties of undefined (reading 'length')
info(server): close cmd, closing conn...
info(server): accepting new conn...
thread 5274880 panic: attempt to use null value
zsh: abort ./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222
```
Looks like you couldn't download https://wikipedia.com/portal/wikipedia.org/assets/js/gt-ie9-... for some reason.
In my contributions to the Joplin S3 backend, "Cannot read properties of undefined (reading 'length')" usually meant you were trying to access an object that wasn't instantiated (it can't figure out the length of <undefined>).
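Something like this reproduces it (illustrative only; the point is that the object itself, not its length, is what's missing):

```
// Reading .length on a value that is undefined, e.g. a DOM collection
// returned by an API that isn't implemented, throws the same TypeError.
const anchors = (globalThis as any).document?.links; // undefined without a DOM
console.log(anchors.length);
// TypeError: Cannot read properties of undefined (reading 'length')
```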
So for some reason it seems you can't execute JS?
Thanks for opening the issue in the repo. To be clear here, the crash seems related to a socket disconnection issue in our CDP server.
> info(events): event handler error try catch: TypeError: Cannot read properties of undefined (reading 'length')
This message relates to the execution of gt-ie9-ce3fe8e88d.js. It's not the origin of the crash.
I have to dig in, but it could be due to a missing web API.
I mean, you could say Rust isn’t a modern language because it doesn’t use garbage collection. But it’s a nonsensical statement. Different languages serve different purposes.
Besides, Zig is focusing a lot more on heavily integrating testing, debug modes, fuzzing, etc. into the compiler itself, which, taken together, will catch almost all of the bugs a borrow checker catches, plus a whole ton of other classes of bugs that Rust doesn't have compile-time checks for.
I would probably still pick Rust in cases where it’s absolutely critical to avoid bugs that compromise security.
But this project isn't that kind of project. I'd imagine that the super fast compile times and rapid iteration that Zig provides are much more useful here.
At my company we have a small project where we are running the equivalent of 6.5 hours of end-to-end tests daily using Playwright. Running the tests in parallel takes around half an hour. Your project is still in very early stages, but assuming a 10x speedup, that would mean we could pass all our tests in roughly 3 minutes (best case scenario).
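For context, swapping the browser under an existing Playwright suite would presumably just mean connecting over CDP instead of launching the bundled Chromium; a hedged sketch, assuming the pages under test only need the web APIs Lightpanda already implements:

```
import { chromium } from "playwright";

// Attach Playwright to an external CDP endpoint (e.g. a Lightpanda instance
// started with --host 127.0.0.1 --port 9222) instead of launching Chromium.
const browser = await chromium.connectOverCDP("ws://127.0.0.1:9222");
const context = browser.contexts()[0] ?? (await browser.newContext());
const page = await context.newPage();

await page.goto("https://example.com/");
console.log(await page.title());

await browser.close();
```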
That being said, I would make use of your browser, but would likely not make use of your business offering (our tests require an internal VPN, we have a custom solution for reporting, and it would be a lot of work to change for little savings; we currently run all tests in spot/preemptible instances, which are already 80% cheaper).
Business-wise I found very little info on your website. "4x the efficiency at half the cost" is a good catchphrase, but compared to what? I mean, you can have servers in Hetzner or in AWS and one is already a fraction of the cost of the other. How convenient is it to launch things on your remote platform vs. launching them locally or setting it up yourself? Does it provide any advantages for web scraping compared to other solutions? How parallelizable is it? Do you have any paying customers already?
Supercool tech project. Best of luck!
Of course it's quite easy to spin up a local instance of a headless browser for occasional use. But having a production platform is another story (monitoring, maintenance, security and isolation, scalability), so there are business use cases for a managed version.
Something in the spirit of wkhtmltoimage or WeasyPrint that does not require a full-blown browser, but more modern, with support for recent HTML and CSS?
In a sense this is Lightpanda's complement to a "full panda". Just the fully rendered DOM to pixels.
(proper announcement of project coming soon)
But of course we plan to add other features, including:
- tight integration with LLMs
- embed mode (as a C library and as a WASM module) so you can add a real browser to your project the same way you add libcurl
The use cases are mainly:
- LLM training (RAG, fine tuning)
- AI agents
- scraping
- SERP
- testing
- any kind of web automation basically
Bot protection might of course be a problem, but it also depends on the volume of requests, IPs, and other parameters.
AI agents will perform more and more actions on behalf of humans in the future, and I believe bot protection mechanisms will evolve to treat them as legitimate.