Why start over? We’ve worked a lot with Chrome headless at our previous company, scraping millions of web pages per day. While it’s powerful, it’s also heavy on CPU and memory usage. For scraping at scale, building AI agents, or automating websites, the overheads are high. So we asked ourselves: what if we built a browser that only did what’s absolutely necessary for headless automation?
Our browser is made of the following main components:
- an HTTP loader
- an HTML parser and DOM tree (based on Netsurf libs)
- a JavaScript runtime (V8)
- partial Web API support (currently DOM and XHR/Fetch)
- and a CDP (Chrome DevTools Protocol) server to allow plug & play connection with existing scripts (Puppeteer, Playwright, etc.).
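For example, an existing Puppeteer script should only need to be pointed at the CDP endpoint. A minimal sketch, assuming the server is listening on ws://127.0.0.1:9222 as in the run shown further down (coverage of page-level APIs may vary):

```
import puppeteer from "puppeteer-core";

// Connect an existing Puppeteer script to the CDP server instead of
// launching Chrome. Assumes Lightpanda was started with
// --host 127.0.0.1 --port 9222.
const browser = await puppeteer.connect({
  browserWSEndpoint: "ws://127.0.0.1:9222",
});

const page = await browser.newPage();
await page.goto("https://wikipedia.com/");
console.log(await page.title());

await page.close();
await browser.disconnect();
```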
The main idea is to avoid any graphical rendering and just work with data manipulation, which in our experience covers a wide range of headless use cases (excluding some, like screenshot generation).
In our current test case Lightpanda is roughly 10x faster than Chrome headless while using 10x less memory.
It's a work in progress: there are hundreds of Web APIs, and for now we just support some of them. It's a beta version, so expect most websites to fail or crash. The plan is to increase coverage over time.
We chose Zig for its seamless integration with C libs and its comptime feature, which allows us to generate bi-directional native-to-JS APIs (see our zig-js-runtime lib: https://github.com/lightpanda-io/zig-js-runtime). And of course for its performance :)
As a company, our business model is based on a managed cloud offering: browser as a service. Currently, this is primarily powered by Chrome, but as we integrate more web APIs it will gradually transition to Lightpanda.
We would love to hear your thoughts and feedback. Where should we focus our efforts next to support your use cases?
Our idea is to build a lightweight browser optimized for AI use cases like LLM training and agent workflows. And more generally any type of web automation.
Happy to answer any questions.
I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website. But it should even out or get worse as the websites get richer. The nature of web scraping is that the worst sites take up the vast majority of your CPU cycles. I don't think lowering the RAM usage of the browser process will have much real-world impact.
Regarding the RAM usage, it's still ~10x better than Chrome :) It seems to come mostly from V8; I guess we could do better with a lightweight JS engine alternative.
Then, as the article points out, the Big Guns making the LLMs are a big use case for this because they get a 10x speedup and can begin contemplating running JS.
It sounds like the people you've talked to are in a messy middle: no incentive to improve efficiency of loading pages, simply because there's something else in the system that has a fixed cost to it.
I'm not sure why that would rule out improving anything else; it doesn't seem like they should be stuck doing nothing other than flailing around for cheaper LLM inference.
> I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website.
I'm a bit lost: the RAM usage benchmark says it's ~10x less, and you feel it's deceptive because you'd expect RAM usage to be less? Steelmanning: 10% of Chrome's usage is still too high?
Can you share if you've looked into how your browser fares against bot detection tools like these?
At a _bare_ minimum, that means obeying robots.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that. It goes without saying that you should not allow users to make hundreds or thousands of "blind" parallel requests, as these tend to have the effect of DoSing sites that are hosted on modest hardware. You should also be measuring response times and throttling your requests accordingly. If a website issues a response code or other signal that you are hitting it too fast or too often, slow down.
I say this because since around the start of the new year, AI bots have been ravaging what's left of the open web and causing REAL stress and problems for admins of small and mid-sized websites and their human visitors: https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...
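Even a rough version of those defaults is not much code. A hedged sketch (naive robots.txt handling, illustrative names and thresholds, not tied to any particular browser or library):

```
const USER_AGENT = "example-crawler"; // illustrative name

// Very naive robots.txt check: only honours bare "Disallow:" prefixes and
// ignores per-agent groups, wildcards, and Crawl-delay.
async function isDisallowed(url: string): Promise<boolean> {
  const { origin, pathname } = new URL(url);
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return false;
  const rules = (await res.text())
    .split("\n")
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim());
  return rules.some((prefix) => prefix !== "" && pathname.startsWith(prefix));
}

async function politeFetch(url: string): Promise<Response | null> {
  if (await isDisallowed(url)) return null; // no override for disallowed paths

  let backoffMs = 1_000;
  for (let attempt = 0; attempt < 4; attempt++) {
    const res = await fetch(url, { headers: { "User-Agent": USER_AGENT } });
    // 429/503 is the site saying "too fast or too often": slow down and retry.
    if (res.status !== 429 && res.status !== 503) return res;
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
    backoffMs *= 4;
  }
  return null; // give up rather than keep hammering the site
}
```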
The comparison to DRM makes sense. Gimping software to disempower the end user based on the desires of content publishers. There's even probably a valid syllogism that could make you bite the bullet on browsers forcing you to render ads.
Software I install on my computer needs to do what I want as the user. I don't want every random thing I install to come with DRM.
The project looks useful, and if it ends up getting popular I imagine someone would make a DRM-free version anyway.
There are so many variables involved that it’s hard to predict what it will mean for the open web to have a faster alternative to headless Chrome. At least it isn’t controlled by Google directly or indirectly (Mozilla’s funding source) or Apple.
Also, throttling requests and following robots.txt actually makes it less likely that your scraper will be blocked, so even for those who don't care about the ethics, it's a good thing to have ethical defaults.
One question: which JS engines did you consider, and why did you choose V8 in the end?
The code is designed to support other JS engines in the future. We do want to add a lightweight alternative like QuickJS or Kiesel (https://kiesel.dev/).
https://github.com/BrowserBox/BrowserBox/
```
./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222
info(websocket): starting blocking worker to listen on 127.0.0.1:9222
info(server): accepting new conn...
info(server): client connected
info(browser): GET https://wikipedia.com/ 200
info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/index-2...: http.Status.ok
info(browser): eval script portal/wikipedia.org/assets/js/index-24c3e2ca18.js: ReferenceError: location is not defined
info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/gt-ie9-...: http.Status.ok
error(events): event handler error: error.JSExecCallback
info(events): event handler error try catch: TypeError: Cannot read properties of undefined (reading 'length')
info(server): close cmd, closing conn...
info(server): accepting new conn...
thread 5274880 panic: attempt to use null value
zsh: abort ./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222
```
Looks like you couldn't download https://wikipedia.com/portal/wikipedia.org/assets/js/gt-ie9-... for some reason.
In my contributions to the Joplin S3 backend, "Cannot read properties of undefined (reading 'length')" usually meant you were trying to access an object that wasn't instantiated (it can't figure out the length of <undefined>).
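Something like this reproduces it (illustrative only; the point is that the object itself, not its length, is what's missing):

```
// Reading .length on a value that is undefined, e.g. a DOM collection
// returned by an API that isn't implemented, throws the same TypeError.
const anchors = (globalThis as any).document?.links; // undefined without a DOM
console.log(anchors.length);
// TypeError: Cannot read properties of undefined (reading 'length')
```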
So for some reason it seems you can't execute JS?
Thanks for opening the issue in the repo. To be clear here, the crash seems related to a socket disconnection issue in our CDP server.
> info(events): event handler error try catch: TypeError: Cannot read properties of undefined (reading 'length')
This message relates to the execution of gt-ie9-ce3fe8e88d.js. It's not the origin of the crash.
I have to dig in, but it could be due to a missing web API.
I mean, you could say Rust isn’t a modern language because it doesn’t use garbage collection. But it’s a nonsensical statement. Different languages serve different purposes.
Besides, Zig is focusing a lot more on heavily integrating testing, debug modes, fuzzing, etc. into the compiler itself, which, taken together, will catch almost all of the bugs a borrow checker catches, plus a whole ton of other classes of bugs that Rust doesn't have compile-time checks for.
I would probably still pick Rust in cases where it’s absolutely critical to avoid bugs that compromise security.
But this project isn't that kind of project. I'd imagine that the super fast compile times and rapid iteration that Zig provides are much more useful here.
At my company we have a small project where we are running the equivalent of 6.5 hours of end-to-end tests daily using Playwright. Running the tests in parallel takes around half an hour. Your project is still in very early stages, but assuming a 10x speedup, that would mean we could pass all our tests in roughly 3 minutes (best case scenario).
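For context, swapping the browser under an existing Playwright suite would presumably just mean connecting over CDP instead of launching the bundled Chromium; a hedged sketch, assuming the pages under test only need the web APIs Lightpanda already implements:

```
import { chromium } from "playwright";

// Attach Playwright to an external CDP endpoint (e.g. a Lightpanda instance
// started with --host 127.0.0.1 --port 9222) instead of launching Chromium.
const browser = await chromium.connectOverCDP("ws://127.0.0.1:9222");
const context = browser.contexts()[0] ?? (await browser.newContext());
const page = await context.newPage();

await page.goto("https://example.com/");
console.log(await page.title());

await browser.close();
```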
That being said, I would make use of your browser, but would likely not make use of your business offering (our tests require an internal VPN, we have a custom solution for reporting, and it would be a lot of work to change for little savings; we currently run all tests in spot/preemptible instances, which are already 80% cheaper).
Business-wise I found very little info on your website. "4x the efficiency at half the cost" is a good catchphrase, but compared to what? I mean, you can have servers in Hetzner or in AWS and one is already a fraction of the cost of the other. How convenient is it to launch things on your remote platform vs. launching them locally or setting it up yourself? Does it provide any advantages for web scraping compared to other solutions? How parallelizable is it? Do you have any paying customers already?
Supercool tech project. Best of luck!
Of course it's quite easy to spin up a local instance of a headless browser for occasional use. But having a production platform is another story (monitoring, maintenance, security and isolation, scalability), so there are business use cases for a managed version.
Something in the spirit of wkhtmltoimage or WeasyPrint that does not require a full-blown browser, but more modern, with support for recent HTML and CSS?
In a sense this is Lightpanda's complement to a "full panda". Just the fully rendered DOM to pixels.
(proper announcement of project coming soon)
But of course we plan to add other features, including:
- tight integration with LLMs
- embed mode (as a C library and as a WASM module) so you can add a real browser to your project the same way you add libcurl
The use cases are mainly:
- LLM training (RAG, fine tuning)
- AI agents
- scraping
- SERP
- testing
- any kind of web automation basically
Bot protection might of course be a problem, but it also depends on the volume of requests, IPs, and other parameters.
AI agents will perform more and more actions on behalf of humans in the future, and I believe bot protection mechanisms will evolve to treat them as legitimate.