Posted by u/nikisweeting a year ago
ArchiveBox is evolving: the future of self-hosted internet archives (docs.sweeting.me/s/archiv...)
We've been pushing really hard over the last 6mo to develop this release. I'd love to hear feedback from people who've worked on big plugin systems in the past, or anyone who's tried our betas!
bravura · a year ago
@nikisweeting ArchiveBox is awesome and we'd really love it to be more awesome. And sustainable!

I've posted issues and PRs for showstopper issues that took months to get merged in: https://github.com/ArchiveBox/ArchiveBox/issues/991 and https://github.com/ArchiveBox/ArchiveBox/pull/1026

You have an opportunity to let the community lean in on ArchiveBox. I understand it's hard to do everything as a solo dev; we've seen many cases in the community where solo devs get burned out or have personal challenges that take priority, etc.

It's hard for us users to lean in on ArchiveBox when, after a happy month of archiving, things start to break and you're left maintaining a branch of your own fixes that aren't in main. Meanwhile, your solution of soliciting one-time donations just makes the whole project feel more rickety and fly-by-night. How about thinking bigger?

We NEED ArchiveBox to be a real thing. Decentralized tooling for archiving is SO IMPORTANT. I care about it and I suspect many people do. I'm posting this so other people who care about it can also comment and chime in and suggest how it can become something we can rely on. Because archiving isn't just about the past, it's about the future.

Maybe it needs to be a dev org of three committed part-time maintainers, funded by grants from a small foundation that people support with recurring donations? IDK. I'm not an expert at how to make open source resilient. There have been discussions about this in the past, but I think it's worth a serious look, because ArchiveBox is IMPORTANT and I want it to work any month I decide to re-activate my interest in it. I invite people to discuss ways to make this valuable project more sustainable and resilient.

nikisweeting · a year ago
Let's chat more. I'm almost ready to raise some seed money and hire a second staff dev or find a cofounder, and I'm looking for people who care deeply about the space.

It's only been during the last few months that I decided to go all in on the project, so this is still just the first few pages of a new chapter in the project's history.

(I should also mention that if you're a commercial entity relying on ArchiveBox, you can hire us for dedicated support and uptime guarantees. We have a closed source fork that has a much better test suite and lots of other goodies)

nyx · a year ago
It looks like you're doing great work here, thanks a bunch; looking forward to seeing this project develop.

Selling custom integrations, managed instances, white-glove support with an SLA, and so on seems like a reasonable funding model for a project based on an open-source, self-hostable platform. But I'm a little disheartened to read that you're maintaining a closed fork with "goodies" in it.

How do you decide which features (better test suite?) end up in the non-libre, payware fork of your software? If someone contributed a feature to the open-source version that already exists in the payware version, would you allow it to be merged or would you refuse the pull request?

bigiain · a year ago
"I too would like commit access to your promising looking project's git repo and CI/CD pipeline. Thanks, Jia Tan"
giancarlostoro · a year ago
Do you guys have a Discord by chance? I have a close friend who is insanely passionate about archiving; he has a personal instance of ArchiveBox and is working on a video downloading project as well. He has used it almost every day and archived thousands of news articles over the years. He's aware of a lot of the nuances.
manofmanysmiles · a year ago
I love this project. I "independently" "invented" it in my head the other day, and happy to see it already exists!

I'd love to see blockchain proof/notary support: the ability to say "content matching this hash existed at this time."

I'm exceptionally busy now but that being said, I may choose to contribute nonetheless.

I'd love to connect directly, and will connect to the Zulip instance later.

If we align on values, I may be able to connect you with some cash. People often call me an "anarchist" or "libertarian", though I'm just me; no labels necessary.

toomuchtodo · a year ago
https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives from a target, upload the WARC files to object storage (whether that is IA, S3, Backblaze B2, etc.), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred). The Internet Archive serves a torrent file for every item they host; one can do the same with WARC archives to enable a distributed archive. CDX indexes can be used for rapidly querying the underlying WARC archives.

You might support cryptographically signing WARC archives; Wayback is particular about archive provenance and integrity, for example.

https://www.loc.gov/preservation/digital/formats/fdd/fdd0005... ("CDX Internet Archive Index File")

https://www.loc.gov/preservation/digital/formats/fdd/fdd0002... ("WARC, Web ARChive file format")

https://github.com/internetarchive/wayback/tree/master/wayba... ("Wayback CDX Server API - BETA")
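
For a taste of what CDX lookups give you, here's a minimal sketch of querying the Wayback CDX Server API linked above (the target URL is just an example):

  import requests

  # Ask the CDX server for captures of a URL; output=json returns a
  # header row followed by one row per capture.
  resp = requests.get(
      "https://web.archive.org/cdx/search/cdx",
      params={"url": "example.com", "output": "json", "limit": 5},
      timeout=30,
  )
  rows = resp.json()
  header, captures = rows[0], rows[1:]
  for row in captures:
      record = dict(zip(header, row))
      print(record["timestamp"], record["original"], record["statuscode"])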

nikisweeting · a year ago
I recommend Browsertrix for WARC creation; I think they're currently the best available for WARC/WACZ.

ArchiveBox is also gearing up to support real cryptographic signing of archives using https://tlsnotary.org/ in an upcoming plugin (in a way that actually solves the TLS non-repudiation issue, which traditional "signing a WARC" does not; more info: https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...)

digitaldragon · a year ago
Unfortunately, Browsertrix relies on the Chrome Devtools Protocol, which strips transfer encoding (and possibly transforms the data in other ways). This results in Browsertrix writing noncompliant WARC files, because the spec requires that the original transfer encoding be preserved.
toomuchtodo · a year ago
Keep in mind, what signing methodology you use is a function of who accepts it. If I can confirm "ArchiveTeam ripped this", that is superior to whatever tlsnotary is doing with MPC, blockchain, distributed ledger, whatever (in my use case). You have to trust someone at the end of the day. ArchiveTeam's Warrior doesn't use tlsnotary, for example, and rips entire sites just fine.
fasa99 · 10 months ago
>ArchiveBox is also gearing up to support real cryptographic signing of archives

That's a really interesting point. The gut reaction is "why are we wasting time adding a nice-to-have, a very fancy cousin of the MD5 checksum, when the real meat of the time & effort is maximizing data download and scale?"

But then go read the book 1984 and the importance of ensuring the data is unchanged down the road may become clear.

But if this is a hedge against a hypothetical future 1984 world, one would have to ask: what if the only file available has the wrong md5sum? Most people would say "welp, something is better than nothing" and that's it. Perhaps something that could provide additional detail about what/how/where the data was changed would help.

pzmarzly · a year ago
Can you recommend some tools to manage mutable torrents? I.e., create them, edit them, download them, and keep the downloaded copies up to date.

BTW I recently tried using IPFS for a mutable public storage bucket and that didn't go well - downloads were very slow compared to torrents, and IPNS update propagation took ages. Perhaps torrents will do the job.

nikisweeting · a year ago
My plan is to use a separate control plane for the discovery/announcements of changes, and torrents just for the data transfer. The specifics are still being actively discussed, and it's a few releases away anyway.
Apocryphon · a year ago
Man, it looks like the first posts about IPFS cropped up on HN a decade ago. I remember seeing Neocities' announcement of support for it. I wonder if that protocol has gotten anywhere since then.
0cf8612b2e1e · a year ago

  The Internet Archive serves a torrent file for every item they host
I had no idea. I have found the IA serving speed to be pretty terrible. Are the torrents any better? Presumably the only ones seeding the files are IA themselves.

toomuchtodo · a year ago
The benefit is not in seeding speed directly from IA, but in the potential for distributed access and seeding of the item. Think of it as the filename of a zip file in a flat distributed filesystem, with the ability to cherry-pick the files that make up the item via traditional BitTorrent mechanisms. Anyone can consume each item via torrent, continue to seed, and then also access the underlying data. IA acts as the storage system of last resort (and the metadata index).
pabs3 · a year ago
The torrents have better speeds because they have WebSeeds for multiple IA servers, so you can download from multiple servers at once.
bityard · a year ago
So, after reading through the comments and website, I just realized I used ArchiveBox a month or two ago for a very specific purpose.

You see, I inherited a boat.

This boat belonged to my father. He was not materialistic, but he took very good care of the things he cared about, and he cared about this boat. It's an old 18' aluminum fishing/cruising boat built in the early 1960s. It's not particularly valuable as a collectible, but it is fairly rare and has some unique modifications. I spent a lot of time trying to dig up all of the info that I could on it, but this is one of those situations where most of the companies involved have been gone for decades and most everyone who was around when these were made is either dead or not really on the Internet.

It's a shame that I waited so long to start my research because 10 or 20 years ago, there were quite a few active web forums containing informational/tutorial threads from the proud owners of these old boats. I know because I have seen references to them. Some of the URLs are in archive.org, some are not. But the forums are gone, so a large chunk of knowledge on these boats is too, probably forever.

I did manage to dig up some interesting articles, pictures, and forum threads and needed a way to save them so that they didn't disappear from the web as well. There is probably an easier way to go about it, but in the end I ran ArchiveBox via Docker and set it to fetching what I could find and then downloaded the resulting pages as self-contained HTML pages.

shiroiushi · a year ago
>because 10 or 20 years ago, there were quite a few active web forums containing informational/tutorial threads from the proud owners of these old boats. ... But the forums are gone, so a large chunk of knowledge on these boats is too, probably forever.

These days, that kind of info would be locked up in a closed Discord chat somewhere, so you can forget about people 20 years from now ever seeing it.

stavros · a year ago
Or people today ever discovering it.
Magnets · a year ago
Lots of private groups on Facebook, too.
nfriedly · a year ago
I've been using an instance of https://readeck.org/ for personal archives of web pages and I really like it, but I might try out ArchiveBox at some point too.

I also run an instance of ArchiveTeam Warrior which is constantly uploading things to archive.org, and I like the direction ArchiveBox is heading with the distributed/federated archiving on the roadmap, so I may end up setting up an instance like that even if I don't use it for personal content.

venusenvy47 · a year ago
I've been using the Single File extension to save self-contained HTML files of pages I want to keep for posterity. I like it because any browser can open the files it creates. Is it easy to view the archived files from Readeck? I haven't looked at fancier alternatives to my existing solution.

https://addons.mozilla.org/en-US/firefox/addon/single-file/

nikisweeting · a year ago
SingleFile is excellent; Gildas is a great developer. ArchiveBox has had SingleFile as one of its extractors built in for years :)
ninalanyon · a year ago
Readeck saves a page as a zip file. It's not hard to open from the command line or a file manager: just unzip it and open index.html in a web browser.

But it strips out a lot of detail. Zipping it also means that it's hard to deduplicate. I use WebScrapBook and run rdfind to hardlink all the identical files.
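
For the curious, the hardlink pass rdfind does is roughly this (a simplified sketch; the "archive" path is illustrative):

  import hashlib
  import os
  from pathlib import Path

  # Hash every file, then hardlink later duplicates to the first copy
  # seen -- roughly what `rdfind -makehardlinks true` does.
  seen = {}
  for path in sorted(Path("archive").rglob("*")):
      if not path.is_file():
          continue
      digest = hashlib.sha256(path.read_bytes()).hexdigest()
      if digest in seen:
          path.unlink()                # remove the duplicate...
          os.link(seen[digest], path)  # ...and hardlink it to the original
      else:
          seen[digest] = path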

nfriedly · a year ago
I haven't looked at the on-disk format, I just use the browser interface. (It's fairly common for me to save something from my phone that I'll want to review on a computer later.)

Here's an example of an Amazon "review" I recently archived that has instructions for using a USB tester I have: https://readeck.home.nfriedly.com/@b/tCngVjkSFOrCbwb9DnY2yw

And, for comparison, here's the original: https://www.amazon.com/gp/customer-reviews/R3EF0QW6MAJ0VP

It'd be nice if I could edit out the extra junk near the top, but the important bits are all there.

nikisweeting · a year ago
I love ArchiveTeam warrior, it's such a good idea! We run several instances ourselves, and it's part of our Good Karma Kit for computers with spare capacity: https://github.com/ArchiveBox/good-karma-kit

There are a bunch of other alternatives like Readeck listed on our wiki too; we encourage people to check it out!

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

ninalanyon · a year ago
I've just tried Readeck and it doesn't save a good quality copy of the pages using the Firefox extension. SingleFile and WebScrapBook do a much better job.

I prefer WebScrapBook because it saves all the assets as files under the original names in a directory rather than a zip file. This means that I can use other tools such as find, grep, and file managers like Nemo to search the archive without needing to rely on the application that saved the page.

404mm · a year ago
Somewhat similar topic: does anyone have recommendations for a self-hosted website change monitoring system? I've been running Huginn for many years and it works well; however, I have a feeling the project is on its last legs. Also, it's based on text scraping (XPath/CSS/HTML) and RSS, but it struggles with newer JS-based sites.
pabs3 · a year ago
I recommend urlwatch, you run it from a terminal and send the output wherever you want, such as email via cron.

https://thp.io/2008/urlwatch/

nikisweeting · a year ago
Changedetection.io
404mm · a year ago
Thank you! That looks great!
arminiusreturns · a year ago
Why do you feel like Huginn is on its last legs? It's been on my list of things to play with for years now, but I never got around to it...
404mm · a year ago
It looks like it’s being maintained by a single remaining developer. No new features are being added, just some basic maintenance. The product as a whole still works well, so unless you find something better, I do recommend it. I run it in k3s and the image is probably the easiest way of maintaining it.
favorited · a year ago
As someone who was archiving a doomed website earlier today using wget, I was reminded that I really need to get ArchiveBox working...

I used to rely on my Pinboard subscription, but apparently archive exports haven't worked for years, so those days are over.

VTimofeenko · a year ago
I recently found omnivore.app through HN comments -- it works great for sharing a reading list across machines. I am exporting articles through Obsidian, but there is an API option. I don't think it supports outbound RSS, but they have inbound RSS (i.e. omnivore as an RSS reader) in beta.
nikisweeting · a year ago
Pocket also doesn't offer archived page exports (or even RSS export). I feel like both are really dropping the ball in this area!
pronoiac · a year ago
Oh, writing my own Pinboard archive exporter is somewhere on my too-long to-do list. I should find out what would be good for importing into Archivebox. (WARC?)
lgvld · 10 months ago
FYI I am able to export (as JSON/HTML/XML) my Pinboard bookmarks.
rcarmo · a year ago
This is nice. I'm actually much more excited about the REST API (which will let me do searches and pull information out, I hope) than the plugin ecosystem, since the last thing I need is for another tool to have a half-baked LLM integration -- I prefer to do that myself and have full control.

Being able to do RAG on my ArchiveBox is something that I have very much wanted to do for over a year now, and it might finally be within reach without my going and hacking at the archived content tree...

Edit: Just looked at the API schema at https://demo.archivebox.io/api/v1/docs.

No dedicated search endpoint? This looks like a HUGE missed opportunity. I was hoping to be able to query an FTS index on the SQLite database... Have I missed something?
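
Something like this is what I had in mind, against a hypothetical FTS5 table (I haven't checked what ArchiveBox's actual schema looks like):

  import sqlite3

  # Hypothetical snapshot_fts table -- ArchiveBox's real index.sqlite3
  # schema may differ; this just shows the query shape.
  db = sqlite3.connect("index.sqlite3")
  db.execute(
      "CREATE VIRTUAL TABLE IF NOT EXISTS snapshot_fts "
      "USING fts5(url, title, body)"
  )
  for url, title in db.execute(
      "SELECT url, title FROM snapshot_fts WHERE snapshot_fts MATCH ? "
      "ORDER BY rank LIMIT 10",
      ("usb tester",),
  ):
      print(url, "-", title)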

nikisweeting · a year ago
The /cli/list endpoint is the search endpoint you're looking for. It provides FTS, but I can make that clearer in the docs; thanks for the tip.

As for the AI stuff don't worry, none of it is touching core, it's all in an optional community plugin only for those who want it.

I'm not personally a huge AI person, but I have clients who are already using it and getting massive value from it, so it's worth mentioning. (They're doing automated QA on thousands of collected captures and feeding the results into spreadsheets.)

rcarmo · a year ago
Thanks, I'll have a look.

My use for this is very different--I want to be able to use a specific subset of my archived pages (which is mostly reference documentation) to "chat" with, providing different LLM prompts depending on subset and fetching plaintext chunks as reference info for the LLM to summarize (and point me back to the archived pages if I need more info).

sunshine-o · a year ago
I have been using ArchiveBox recently and love it.

About search, one thing I haven't yet figured out how to do easily is to plug it into my SearXNG instance, as they only seem to support Elasticsearch, Meilisearch, or Solr [0].

So I guess this new plugin architecture will allow for a Meilisearch plugin (with relevancy ranking).

- [0] https://docs.searxng.org/dev/engines/offline/search-indexer-...
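
If the plugin API allows it, indexing into Meilisearch should just be a matter of pushing each snapshot's extracted text to its documents endpoint; a sketch with made-up index and field names:

  import requests

  # Push one archived snapshot into a Meilisearch index via its REST
  # API; the index name and fields here are made up for illustration.
  doc = {
      "id": "snapshot-1",
      "url": "https://example.com/",
      "title": "Example",
      "body": "extracted page text ...",
  }
  requests.post(
      "http://localhost:7700/indexes/snapshots/documents",
      json=[doc],
      headers={"Authorization": "Bearer MEILISEARCH_KEY"},
      timeout=30,
  )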

orblivion · a year ago
Have you (and I wonder the same about archive.org) considered making a Merkle tree of the data that gets archived? Since data (including photos and videos) are getting easier to fake, it may be nice to have a provable record that at least a certain version of the data existed at a certain time. It would be most useful in case of some sort of oppressive regime down the line that wants to edit history. You'd want to publish the tip somewhere that records the time, and a blockchain seems to make the most sense to me but maybe you don't like blockchains.
nikisweeting · a year ago
Yup, we're already doing that in the betas. That's what I'm referring to as the beginnings of a "content-addressable store" in the article.

In the closed-source fork we currently store a merkle-tree summary of each dir in a dotfile containing the sha256 and blake3 hashes of all entries/subdirs. When a result is "sealed", the summary is generated, and the final salted hash can be submitted to Solana or ETH or some other network to attest to the time of capture and the content. (That part is coming via a plugin later.)
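
Conceptually, the per-directory summary is just this (simplified sketch with sha256 only; the real version also records blake3 hashes and salts the final hash):

  import hashlib
  from pathlib import Path

  def merkle_summary(root: Path) -> str:
      # Hash each entry (recursing into subdirs), then hash the sorted
      # name:hash lines, so any change below propagates up to the root.
      h = hashlib.sha256()
      for entry in sorted(root.iterdir()):
          if entry.is_dir():
              child = merkle_summary(entry)
          else:
              child = hashlib.sha256(entry.read_bytes()).hexdigest()
          h.update(f"{entry.name}:{child}\n".encode())
      return h.hexdigest()

  print(merkle_summary(Path("archive/1700000000.0")))  # illustrative path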

zvr · a year ago
You might be interested in taking a look at SWHID (Software Hash IDentifiers), which defines a way (on its way to become an ISO standard) to reference files and directories with content-based identifiers, like swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505. Yes, it uses Merkle trees for filesystem hierarchy. https://www.swhid.org/specification/v1.1/5.Core_identifiers/
orblivion · a year ago
Wow that's great!
beefnugs · a year ago
Not just all that nonsense; it also makes a lot of sense to share just the parts of a website that matter, like a single video, without having to download an entire archive or the rest of the site.
nikisweeting · a year ago
$ archivebox add --extractor=media,readability https://...

We try to make that easy by allowing people to select one or more specific ArchiveBox extractors when adding, so you don't have to archive everything every time.

Makes it more useful for scraping in a pipeline with some other tools.