Springtime · a year ago
Just in terms of privacy, it's worth noting that anyone who has uploaded something on IA already has their email address publicly viewable.

This isn't something that's commonly known (even judging by comments here), but the publicly viewable metadata of every upload contains the uploader's IA account email address. So from a security perspective it's bad, but from a privacy perspective a lot of users probably weren't aware of this detail if they've uploaded anything.

hunter2_ · a year ago
This raises an interesting question: should email addresses be private? Addresses of buildings aren't private, and they're somewhat analogous as with many computing concepts. (Aside: Before spam filters were quite good, it was typical to avoid scraping of addresses by mild obfuscation, but I think those days are gone, and this is distinct from privacy anyway.)

If someone wants to upload and never be found out, then they need to use a throwaway address in any case, lest they be providing their "private" address to the administrators of the service without explicitly forbidding further disclosure. If I say something to Alice without demanding that Alice keep it from Bob, then I implicitly don't mind if Alice tells Bob what I said.

tjoff · a year ago
Whether the email is considered private or not is completely orthogonal to whether you are allowed / should tie an action to your email. And then again completely orthogonal whether you can/should make that connection public.

Even if your email is public information and even if what is uploaded is public information that doesn't imply that the email address behind the account that uploaded that information should be public.

emidoots · a year ago
There is software which is intended to e.g. locate the GitHub profiles of people working at companies, then scrape all public repositories they've contributed to for their email address and the emails of their coworkers - to enable targeted advertising to those individuals. Very common in enterprise sales.

With ChatGPT, this can be extended to create emails that look very personal - as if someone has followed all of your work and is genuinely interested in what you are up to - with extremely low effort. And people are already doing this, I already get emails like this today.

Should emails be private? I don't know - I personally consider them to be public because I know for a fact mine will eventually be public whether I like it or not. But I am aware AI is out there slurping up every public communication I've ever had, and is likely trying to manipulate me in various ways already today.

II2II · a year ago
> This raises an interesting question: should email addresses be private? Addresses of buildings aren't private, and they're somewhat analogous as with many computing concepts.

There are several ways to look at that.

The organization that I work for considers anything that ties two pieces of information about a person together as private information. That is to say that a person's name is not private and a phone number is not private, but connecting a phone number to a name is private. In one form or another, an email is frequently tied to a name (e.g. the email address is based on their name, or an account record includes both a name and an email address).

Another way is to consider how accessible the information is. There was a lot of information that was not considered as private prior to the widespread adoption of the internet. One issue that I remember popping up in the early 1990's involved property (i.e. land) records. Historically, people had to go to a government office to access them but they were publicly available. Since they were publicly available, some governments made them available online. Once they were available online, the barriers to access were removed (e.g. having to physically visit an office) and the ability to abuse that information was vastly increased. All of a sudden, people started considering something that used to be considered as public information as private information.

Springtime · a year ago
An issue is for most sites/services an email has just become a standard authentication method, rather than something that can easily be more unique per account. So any usernames across sites/services that share it identify that user as being the same person (for data broker profiling, doxxing, etc), which is the privacy issue (not the email address per se, unless it perhaps contained one's real name).

For contrast, truly unique email aliases, for example, aren't possible on common services like free Gmail*, only on things like self-hosting or certain paid email hosts, which makes them less feasible for many. So from a privacy perspective, while in an ideal world everyone would be able to freely create entirely unique per-account creds, we're mostly stuck with the email implementation.

* One could create entirely separate accounts but it's high friction and IIRC the same phone number (now a requirement) can only be used for 2-3 accounts.

KronisLV · a year ago
> This raises an interesting question: should email addresses be private?

I sadly don't think that's viable.

What might be, in our current world, would be having a mail server/client setup where you can generate random addresses for yourself like Wf1JJUBHLu@domain.com and never re-use an e-mail address, much like with passwords, while being able to see all of the incoming mail in the same place and respond with the corresponding accounts.

Then, when your address gets traded around, it'd be fairly obvious (with some basic bookkeeping, e.g. a text field with purpose/URL for why a certain address was created) who is to blame for it and blocking incoming traffic from somewhere would be trivial as well.
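
The bookkeeping part of this could look roughly like the sketch below. It is only an illustration under assumptions: `new_alias`, `who_leaked`, and the in-memory `registry` dict are hypothetical names, `domain.com` is the placeholder domain from above, and actually routing mail for these addresses would still be up to the mail server.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def new_alias(purpose, registry, domain="domain.com", length=10):
    """Create a random, never-reused address and record why it was created."""
    while True:
        local = "".join(secrets.choice(ALPHABET) for _ in range(length))
        address = f"{local}@{domain}"
        if address not in registry:  # never hand out the same alias twice
            registry[address] = purpose
            return address


def who_leaked(address, registry):
    """Look up the purpose note for the alias that unwanted mail arrived on."""
    return registry.get(address, "unknown alias")


registry = {}
alias = new_alias("signup at example shop", registry)
print(alias, "->", who_leaked(alias, registry))
```

If spam later arrives on that alias, the purpose note shows which signup it was handed to, and the alias can be dropped without touching any other account.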

I do have a self-hosted mail server and there are commands to create new accounts pretty easily, I'd just need to figure out the configuration for collecting everything in one place, as well as maybe make a web UI for automating some of the bits. I wonder if there are any off the shelf solutions for this out there.

squarefoot · a year ago
> This raises an interesting question: should email addresses be private?

Yes and no. Both of them. Like any powerful tool, email is going to be abused, as any alternative would be whenever one arrives. Those services allowing creation of dynamic email addresses do their job (until they're banned, that's why I'm not mentioning them), however using them isn't automatic and most people don't even know about their existence. What if we then upgraded email protocols to reflect current needs wrt privacy and modified existing mail servers so that they could create dynamic addresses when asked via a simple flag? Example: I want to subscribe to a service from company XYZ, however I'm not sure how much I can trust them, therefore, when writing an email or filling a web form I can activate the option to create a new address that is tied to the recipient I'll be writing to, and will work as a dedicated proxy for my real address. That is, every mail I send to the recipient using my real address will actually be sent from the new dynamic address, then all replies to the dynamic address will be routed to my real one, but a field in its headers will always contain either a memo by me (example: "signup with XYZ") or the original recipient (example: "info@xyz_trustuswerenotspammers_yeahsure.com"). This way one can immediately spot whoever sold their address to others and blacklist them. As said, those services work well, but since they aren't built into mail servers and clients, their adoption is quite restricted. I don't see why that function shouldn't be embedded in a new upgraded email protocol, as the modification would neither be that hard nor consume any serious resources. I would however expect heavy resistance against the adoption, of course.

tomjen3 · a year ago
In a world where email costs ten cents to send (per receiver) email addresses need not be private. In our world? They kinda need to for sanity.
numpad0 · a year ago
I think it just needs to be communicated. Some websites allow login only by login name and not by email, some people have an identifying last name, others a hardly identifying full name, and whatnot. There's no universal or universally agreed answer to that, so it needs to be said whether your service _considers_ it public information or not.
makach · a year ago
By definition, the email address is considered private information and should be protected accordingly.
figassis · a year ago
It should, mainly because an email is not just an email, it's a channel to reach out to you, your internet address. And we know how that is going in your inbox.
weinzierl · a year ago
> This raises an interesting question: should email addresses be private?

GDPR is clear on this and there have been significant fines for revealing email addresses against the will of their owners (e.g. using cc instead of bcc). Not saying this is the ultimate wisdom, just a data point to consider.

iicc · a year ago
>Addresses of buildings aren't private, and they're somewhat analogous as with many computing concepts.

Buildings are analogous to domains, not email addresses.

fortyseven · a year ago
> should email addresses be private?

I dunno. Should your personal phone number be private? Or your home address? Would you be okay if I knew it and shared it with a stranger? Or would you rather be asked permission to share it first?

Seems pretty cut and dry to me. Yeah, there's going to be someone out there (there always is) who doesn't care, but I'd wager the majority would be pretty ticked off if you gave those pieces of information out to a rando on the street.

szundi · a year ago
This question could not be more academic
keybpo · a year ago
It's not just uploads but any item that uses the email address as a unique user identifier (I'm not technical enough to explain this more clearly, but see [1]).

An email address will be part of the XML in a user's uploads but also in their profile, which anyone can access by simply changing the URL from https://archive.org/details/@foobar to https://archive.org/download/foobar. So, in essence, one just needs to have a registered account, independently of any uploads made.

[1] https://help.archive.org/help/accounts-a-basic-guide-2/

steffanA · a year ago
This is bad enough. This alone is a privacy bug/data leak.

Theoretically, someone could scrape the pages and compile a list of exposed email addresses.

spease · a year ago
> Theoretically, someone could scrape the pages and compile a list of exposed email addresses.

I laughed. Oh no! Anyways…

The people interested in identity theft are probably too busy figuring out what to do with all the SSNs they stole (not from this breach, but from the annual catastrophic breach of a credit bureau or government repository).

And the people who want your email probably already got it from one of the hundreds of other services you have to create an account for now.

I’m not really sure if there are circumstances where donating to the internet archive could be held against you and lead to persecution. Maybe in certain Luddite communities? The Amish? But then, how would they know…

rrwo · a year ago
One solution is to use a unique email address for every website, and change the address if the site gets compromised (with the old address getting added to a spam filter).
999900000999 · a year ago
I pulled an old friend's website down from Internet Archive.

He's moved on to the next stage, but I was glad I was able to put his site back up.

It'll be a shame if IA goes down permanently, but we need a decentralized solution anyway.

Having a single mega organization in charge of our collective heritage isn't a good idea.

gabeio · a year ago
I have always thought about this. It would be interesting to have users actually store small amounts of redundant info on a device connected to the internet. Very similar to what a torrent does, but with more peers (more data shards than full copies) and fewer seeds. And try to keep a huge database for everyone. Obviously open source, and it would end up something like Tor, where they just assist the network with security patches but don't actually have any real "control" (admin dashboard control) over the network at large. We already do something like that with website static file caching, but at a much smaller scale. Obviously the security implications of this would be very hard, but maybe not impossible, to overcome. IPFS comes close, but it again has more seeds than peers.

if anyone knows something like what I'm suggesting, I'd love to hear about it!

pbhjpbhj · a year ago
IIRC there were a few storage-based projects that popped up using alt coins to encourage people to offer excess storage space for other randos on the internet. The possibility you might be storing illegal content might have been what killed it/them.

https://en.wikipedia.org/wiki/Cooperative_storage_cloud gives a few examples, like Filecoin.

IAmGraydon · a year ago
Are you, by any chance, named Richard Hendricks?
xyzsparetimexyz · a year ago
The main issue that such hosting faces is that it's less efficient and more expensive than just regular centralized servers.
rottc0dd · a year ago
Does https://ipfs.tech/ fit the bill?

Geezus_42 · a year ago
This was a plot line in Silicon Valley.
Xen9 · a year ago
I believe that it would be possible to cost effectively build and implement an architecture for a distributed IA backup—this comment entails some notes.

The system asks volunteers about their age, sex, location, and storage format details (the model, past use etc. can be used to predict the durability of a single storage) without sharing most of this data anywhere.

The downloaders are then algorithmically allocated pieces of the archive, e.g. such that there is at least a limited amount of overlap between the pieces, and two people in the same country won't provide redundancy for each other.
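
One way such an allocation could work is sketched below. This is a hypothetical greedy approach, not a worked-out design: the `allocate` function, the volunteer names, and the choice of two copies per piece are all assumptions made for illustration. Each piece goes to the least-loaded volunteers, skipping anyone whose country already holds a copy of that piece.

```python
from collections import defaultdict


def allocate(pieces, volunteers, copies=2):
    """Map each piece to `copies` volunteers from distinct countries.

    volunteers: dict of volunteer name -> country code (hypothetical data).
    """
    load = defaultdict(int)  # number of pieces already given to each volunteer
    plan = {}
    for piece in pieces:
        chosen, used_countries = [], set()
        # Least-loaded volunteers first; the name breaks ties deterministically.
        for name in sorted(volunteers, key=lambda n: (load[n], n)):
            country = volunteers[name]
            if country in used_countries:
                continue  # same country would not add real redundancy
            chosen.append(name)
            used_countries.add(country)
            if len(chosen) == copies:
                break
        if len(chosen) < copies:
            raise ValueError(f"not enough distinct countries for {piece!r}")
        for name in chosen:
            load[name] += 1
        plan[piece] = chosen
    return plan


volunteers = {"ana": "BR", "bo": "SE", "chen": "CN", "dee": "SE"}
plan = allocate(["p0", "p1", "p2", "p3"], volunteers)
for piece, holders in plan.items():
    print(piece, holders)
```

A real allocator would also weigh the storage reliability estimates mentioned above; this sketch only balances load and enforces the country constraint.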

When a downloader verifies that they have completed the download by giving (unique, to prevent fake-download sabotage) SHA hashes of the data, the information that these pieces have been downloaded in this or that country, plus an estimate of the reliability of the storage, is added to a public database, for the algorithm to use in the future.

Every downloader is then generated a public and private key so that they can give the hash of their download again once in a while or just verify that the piece is still there. The reliability estimates (based on storage / hardware details) would be empirically calibrated based on the data about the actual storage failures.

A public counter, estimating how well the archive is currently backed up via this scheme, could be displayed.

For copyright issues, it would be possible to encrypt some of the data, e.g. such that normally borrowable items become readable files only when X% of downloads are pieced together.

The scheme would be primarily based on existing designs and algorithms but work roughly as depicted above. I am not an expert of what compression, hashing and other algorithms should be used, and it needs lots of good work, to determine how to avoid errors in the scientific part of estimating the reliability of the downloads—and generally a situation where it would turn out that lots of data was lost when attempting to put the pieces back together again.

Remark (engineering): Empirically validating the correctness of the backup architecture's software by testing it on grids of real hard drives in single places will probably give safety against catastrophic failure. Even better would be to obtain a large number of old hard drives and SSDs kept in a single place for a long time, to validate that the software works over time.

Remark (integrity): That a downloader actually has the downloads can be verified efficiently by IA server adding small part to the piece the downloader has, hashing it again, and requesting the new hash.
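
A minimal sketch of that challenge, assuming SHA-256 and assuming the server still holds (or can recompute from) its own copy of the piece; the function names are hypothetical. The random part is prepended rather than appended here, since with an appended value a downloader could in principle keep only the intermediate hash state of the piece instead of the data itself.

```python
import hashlib
import hmac
import secrets


def answer_challenge(piece: bytes, nonce: bytes) -> str:
    """What the downloader computes: a hash over the fresh nonce plus the piece."""
    return hashlib.sha256(nonce + piece).hexdigest()


def verify(piece_on_server: bytes, nonce: bytes, claimed: str) -> bool:
    """What the server computes from its own copy to check the claimed hash."""
    expected = answer_challenge(piece_on_server, nonce)
    return hmac.compare_digest(expected, claimed)


piece = b"example archive piece"
nonce = secrets.token_bytes(16)  # fresh for every check, so old answers are useless

honest = answer_challenge(piece, nonce)
replayed = answer_challenge(piece, b"old nonce")  # answer to an earlier challenge

print(verify(piece, nonce, honest))
print(verify(piece, nonce, replayed))
```

Because the nonce changes on every check, a stored answer from a previous round does not verify, which is the point of hashing again rather than comparing a fixed hash.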

Remark (redundancy): It may be possible to develop a social program that analyzes whether a volunteer in a certain place can provide more redundancy by buying themselves a hard drive or by supporting the acquisition of hard drives for volunteers who have proved themselves reliable elsewhere. This is speculative and the benefit may be lower than the risks.

Finally, instead of "public database" it may be much more optimal to decide to use a blockchain of some sort. Not a cryptocurrency, but a blockchain. This is because if the idea is to distribute copies over the world to ensure contingency in case of IA main architecture collapse, then the more parts of the distributed backup architecture (which must actually not be "the backup architecture" but "a scheme", that no everyday IA decisions rely upon, and that just exists out there) are on a blockchain network run by a "decentralized" system, the more reliable it will be.

My heuristic plausibility analysis:

0. IA backup would not need to be constantly accessed or changed (this makes storage easier, cheaper and prolongs the maximum age of the storage).
1. Not all IA has to be backed up: a distributed backup that successfully recovers 10% of IA in a catastrophe is by all means a great success (consequently prioritization of what might / should be stored should probably be part of the algorithm that decides what volunteers download; and what existing "big" archives already store that overlaps with IA should be taken into account in this analysis).
2. I recall you estimated 30-40 M USD ballparks for a single copy: a properly led open source project may be able to develop this for free, and a fairly compensated one could be ~ 0.1% to 1% of the cost.
3. The Sia network https://siascan.com/ has space for 7PB; and it's for storage where one can download their own files at any time; and they have had very little publicity.
4. A 2TB hard drive costs 50-100 USD and 20PB would be 10 000 humans buying one 2TB hard drive, which by itself is possible. Hobbyists and organizations may be able to provide even larger capacities.
5. Most IT projects fail, but since lots of technology already exists, and in this we know what we are doing, and IA might be able to recruit above-average talent, we can conservatively give a 50% chance for the groundwork development to succeed, or 45% without funding.
6. If the development succeeds, then there may already be around ~ 100 potential volunteers. I estimated that 0.1% of IA visitors may volunteer, plus 1% from Hacker News traffic were the project to be mentioned there, plus growth over the first few years and traffic from elsewhere. Perhaps 75% chance to get 10% of IA backed up by volunteers, given development succeeds.
7. If that much is backed up, there is perhaps a 5% chance of attaining 200 TB in the next few decades.

Conservatively, given that open-source development starts, one gets apprx. 33% - 38% chance that 10% backup is achieved & apprx. 1-2% that 100% of what is now in the IA could be backed up. These are of course rather meaningless numbers, but the fact seems to be that, lacking the funding to build a complete backup, IA can best guarantee contingency by starting to build a distributed one. Perhaps this was needlessly lots of words for a simple proposal.
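
For anyone who wants to check the arithmetic, the figures above appear to come from simply multiplying the step estimates together. The sketch below reproduces that under two assumptions: the steps are treated as independent, and step 7's 5% is read as the chance of reaching a full backup. The variable names are made up for illustration.

```python
# Step estimates taken from the list above.
p_dev = {"funded": 0.50, "unfunded": 0.45}  # step 5: groundwork development succeeds
p_ten_given_dev = 0.75                      # step 6: 10% of IA backed up, given development
p_full_given_ten = 0.05                     # step 7: read here as reaching a full backup

results = {}
for label, p in p_dev.items():
    p_ten = p * p_ten_given_dev
    p_full = p_ten * p_full_given_ten
    results[label] = (p_ten, p_full)
    print(f"{label}: 10% backup {p_ten:.4f}, full backup {p_full:.4f}")

# Step 4: 20 PB is 20,000 TB in decimal units, so at 2 TB per drive:
drives_needed = 20_000 // 2
print("drives needed:", drives_needed)
```

The products land inside the quoted 33% - 38% and 1-2% ranges, and the drive count matches the 10 000 volunteers in step 4.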

- X

---

Note: It's probable that at least the NSA has a private full IA backup.

max-throat · a year ago
This is why BitTorrent and other P2P solutions were invented, but alas: A. The RIAA, MPAA, and ESA have given these technologies a terrible reputation. B. Nobody likes to seed. Some kind of seeding-based crypto would have been a great incentive if cryptocurrency wasn't also demonized by now.
fwip · a year ago
Part of the reason people don't/didn't like seeding is that many residential lines are so terribly asymmetric. If you had 100down/5up, seeding your torrent at a useful speed was often enough to degrade your connection into unusability.
aucisson_masque · a year ago
It's called torrent protocol and it doesn't work, no one wants to spend money and bandwidth hosting a god forsaken movie or book that only a handful of people care about.
squarefoot · a year ago
Not much money and bandwidth if you aren't on a metered connection. You can share tens of gigabytes or more on a cheap read-only flash drive plugged into a $25 single board computer that draws way less than a full PC and can be left sitting there near the router. Just limit its bandwidth on the torrent client and you won't even notice it during online gaming. The client can be as small as the Transmission daemon running headless on one of the many Debian based embedded distros: all control through either the web interface or from its client: no monitor, mouse, keyboard etc. just a small cheap box.

https://www.friendlyelec.com/index.php?route=product/product...

(just an example, as it's way overkill for the task)

https://transmissionbt.com/

https://github.com/transmission-remote-gui/transgui

oxygen_crisis · a year ago
I see 24 seeders for the entire 72-episode run of the 1991 sitcom "Herman's Head" which was so poorly rated that it's never seen a home media or streaming release, your premise doesn't hold any water at all.
0x1ch · a year ago
It does work, when you don't notice it. We need sane limits and permanent seeders. This is why so many regular people get hit with ISP notices, they don't know they've seeded Captain America for the last six months every time they started their PC.
Timber-6539 · a year ago
If the whole world has bandwidth available for TikTok, it can make the same available for sharing torrent files.
homebrewer · a year ago
I've been seeding some unpopular torrents for ten years (would have done for even longer if I did not change the torrent client a decade ago). "No one" is too strong a word, as usual with these absolutist things.
trinix912 · a year ago
In addition to the costs, I'd say it's also that no one wants to risk getting sued like the IA is getting.
EamonnMR · a year ago
I keep wanting to do this for old sites, make like a personal mini IA. Besides just using wget or curl, any tips for pulling down useable complete websites from IA?
account42 · a year ago
Agreed, especially an organization that has already shown itself to not always be impartial.
Simran-B · a year ago
A decentralized solution, doesn't that scream internet archive on blockchain? What could go wrong.
brundolf · a year ago
This is one of the very few real use-cases I can think of for the blockchain
micromacrofoot · a year ago
torrents maybe
steffanA · a year ago
More details here about the data breach. Stolen database contains 31 million records.

https://www.bleepingcomputer.com/news/security/internet-arch...

ano-ther · a year ago
> the Have I Been Pwned data breach notification service created by Troy Hunt, with whom threat actors commonly share stolen data to be added to the service

Do they? Why?

Maxious · a year ago
Proves they really did hack something. There's other sites where hackers register defacements etc.
richbell · a year ago
If Troy authenticates the data, they can use that as an 'endorsement' when trying to sell it.
xproot · a year ago
Anyone who buys it or finds it in the wild can also upload it.
mkl · a year ago
> The data will soon be added to HIBP

My unique-to-archive.org email address is not there yet.

nikisweeting · a year ago
I just checked and my unique-to-archive.org email is showing up in the breach as of 2024-08-09.
paulnpace · a year ago
Many hackers will remove addresses that are obviously unique, including tags, to keep silent which database has been hacked, but it seems inconsistent.

I have checked and known my address was in a hack and it isn't there, while other times it is there. I also wonder if they start filtering out by domain, as they see a domain across multiple databases with unique addresses in each database exactly one time.

mobeigi · a year ago
Out of curiosity, do you use a unique email address for every single service?
ranger_danger · a year ago
How do they get a hold of all these leaks so fast?
maltris · a year ago
My question is: How did Scott Helme end up with a password hash that features his own name?
jgrahamc · a year ago
He didn't. If you break down that field you see:

    $2a$
    10$
    Bho2e2ptPnFRJyJKIn5Bie
    hIDiEwhjfMZFVRM9fRCarKXkemA3Pxu
    ScottHelme
2a = bcrypt, 10 = 2^10 rounds, Bho2e2ptPnFRJyJKIn5Bie is the 22 character salt, hIDiEwhjfMZFVRM9fRCarKXkemA3Pxu is the 31 character hash value, and then there's ScottHelme. Best guess is that the archive.org folks just appended the user name to the stored hash. Maybe once upon a time they didn't have a username column in their table and this was a creative way of adding it.
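
To make the breakdown reproducible, here is a small sketch that splits such a field by the fixed bcrypt lengths (22-character salt, 31-character hash). It assumes the stored field is just the five pieces above concatenated; `split_bcrypt_field` is a made-up helper name, not anything from archive.org.

```python
def split_bcrypt_field(field: str) -> dict:
    """Split '$<version>$<cost>$<salt><hash><extra>' into its parts."""
    _, version, cost, rest = field.split("$", 3)
    return {
        "version": version,
        "cost": int(cost),
        "rounds": 2 ** int(cost),  # the cost is a log2 work factor
        "salt": rest[:22],         # bcrypt salts are 22 characters
        "hash": rest[22:53],       # followed by a 31-character hash
        "suffix": rest[53:],       # anything left over is not part of bcrypt
    }


# Assumed to be the leaked field: the five pieces above joined together.
field = "$2a$10$Bho2e2ptPnFRJyJKIn5BiehIDiEwhjfMZFVRM9fRCarKXkemA3PxuScottHelme"
parts = split_bcrypt_field(field)
for key, value in parts.items():
    print(f"{key}: {value}")
```

Whatever remains after the 53 characters of salt and hash is extra data, which is where the user name ends up.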

Funes- · a year ago
Friendly reminder to generate a unique password for every account you create so database leaks like this one don't bother you (besides on the site they're used).
AStonesThrow · a year ago
JohnMakin · a year ago
MFA
haha112 · a year ago
I use login with google, idk if it is safe
ewenjo · a year ago
Just noticed the site now alerts this:

> Have you ever felt like the Internet Archive runs on sticks and is constantly on the verge of suffering a catastrophic security breach? It just happened. See 31 million of you on HIBP!

mewpmewp2 · a year ago
Joke's on them... I'm already on HIBP countless times...
jsheard · a year ago
It's all good, as long as you're not in that recent AI Girlfriend breach which exposed a ton of users who were trying to coax it into generating CSAM images.

https://x.com/troyhunt/status/1843788319785939422

to-too-two · a year ago
I'm also on HIBP over 10x. What are we supposed to do? Create a new email address for every service we sign up for?

I don't know what the best practice is for keeping our personal data safe anymore.

nxobject · a year ago
And my SSN's probably available for purchase with 9 types of crypto, too.
mendym · a year ago
I assume that if this is a bad actor, then account email/name will be leaked?
uticus · a year ago
Is it a genuine alert, or hacking artifact?

Sometimes with friendly / attempt-at-humorous error messages it’s difficult to tell

jrochkind1 · a year ago
I feel like it's safe to assume the official Internet Archive would not write a "friendly"/attempt-at-humorous/unprofessional/confusing/delivered-by-popup message advertising a devastating security breach. Oh, also while announcing that nowhere else.

Obv an attacker's ability to insert a message does imply a breach beyond a DoS. But I am pretty confident that message was not from the IA.

n_i_k_h_i_l · a year ago
It's a literal window.alert()
EKSolutions · a year ago
It looks like someone has compromised one of their subdomains for Polyfill

Update: Subdomain seems to be returning normal responses again now.

Aachen · a year ago
You mean the IA included some JS polyfill from a subdomain and that's what's compromised / where the alert is coming from?
qnsc · a year ago
yes, "https://polyfill.archive.org/v3/polyfill.min.js?features=fet..." is the URL with the malicious code
EKSolutions · a year ago
Correct. The source subdomain of the popup seems to be hxxps[:]//polyfill[.]archive[.]org
jrochkind1 · a year ago
That would perhaps explain how they managed to inject the JS alert popup, right?
TZubiri · a year ago
Yeah, but the leak has been confirmed by HIBP, I found my address in there.
EasyMark · a year ago
One of those instances when you really wish curses worked on whoever was pulling this stunt “may you and your descendants suffer the bites of 10000 fleas for 10000 nights as punishment for your misdeeds”
PenguinRevolver · a year ago
Probably not the best time to say this, but it's surprisingly easy to go through a collection with items and grab every email along with the usernames.

https://archive.org/metadata/naturally_a_girl/metadata

One way or another, there was going to be someone who would take loads of emails with a username attached to it. A bit intrigued by how the hacker compromised the database and got the passwords.

fewgrehrehre · a year ago
Damn, I had no idea about this. Definitely would've changed some things had I known that emails were public.

This honestly seems like a bit of a design flaw.

Gingeas · a year ago
Yeah, they have ignored everyone's concerns about the email thing. https://github.com/internetarchive/iaux/issues/892
Nathans220 · a year ago
Why go for the Internet Archive? Go for something else, not the fucking archive!
mewpmewp2 · a year ago
We all need our easily accessible decentralized archive of some sort...
Nathans220 · a year ago
yes