> Contrary to what npm states, this package actually depends on one of our aforementioned spam packages. This is a by-product of how npm handles and displays dependencies to users on its website.
For me personally, this is the biggest surprise and takeaway here. By simply having a key inside package.json's dependencies reference an existing NPM package, the NPM website links it up and counts it as a dependency, regardless of the actual value that the package references (which can be a URL to an entirely different package!). I think this puts an additional strain on an already fragile dependency ecosystem, and is quite avoidable with some checks and a little bit of UI work on NPM's side.
We could do a full write-up on npm's quirks and how one could take advantage of them to hide intent. Consider the dependency entry from the post's package.json: here it's clear that the package links to something in a weird, non-standard way. A manual review would tell you that this is not axios.
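The snippet itself isn't reproduced here, but from the surrounding discussion it's presumably a dependencies entry whose value is a URL rather than a version range, something along these lines (hypothetical values):

```json
{
  "dependencies": {
    "axios": "https://cdnnpmjs.com/axios-2.4.7.tgz"
  }
}
```

npm will fetch and install whatever tarball that URL serves, while the npm website still renders the dependency as axios.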
The package.json lets you link to things that aren't even on npm [1]. You could update this to something like:
"axios": "git://cdnnpmjs.com/axios"
And it becomes less clear that this is not the thing you intended. But at least in this case, it's clear that you're hitting a git repository somewhere. What if we update it to the following?
"axios": "axiosjs/latest"
This would pull the package from GitHub, from the org named "axiosjs" and the project named "latest". This is much less clear, and it is part of the package.json spec [2]. Couple this with the fact that the npm website tells you the project depends on Axios, and I doubt many people would ever notice.
[1] https://docs.npmjs.com/cli/v10/configuring-npm/package-json#...
[2] https://docs.npmjs.com/cli/v10/configuring-npm/package-json#...
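To make the distinction concrete, here's a rough sketch of flagging dependency specifiers that don't look like plain registry version ranges. This is not an official npm tool; the regex is a loose heuristic and the manifest below is made up:

```javascript
// Loose heuristic: plain semver-ish ranges stay, anything else (URLs,
// git specs, GitHub "org/repo" shorthands) resolves outside the registry.
const semverLike = /^[\^~]?\d+(\.\d+){0,2}[-+.\w]*$|^[*x]$|^latest$/;

function suspiciousDeps(pkg) {
  return Object.entries(pkg.dependencies || {})
    .filter(([, spec]) => !semverLike.test(spec))
    .map(([name, spec]) => `${name} -> ${spec}`);
}

// Hypothetical manifest modeled on the spoofing examples above.
const pkg = {
  dependencies: {
    express: "^4.18.2",                  // ordinary registry dependency
    axios: "axiosjs/latest",             // actually a GitHub org/repo
    lodash: "git://cdnnpmjs.com/lodash"  // actually a git URL
  }
};

console.log(suspiciousDeps(pkg)); // flags axios and lodash, not express
```

A real audit would need to handle range operators and workspace/alias specifiers, but even this crude check surfaces entries that the npm website happily renders as their key names.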
You should think of the package metadata as originating from the publisher, not from the registry. Aside from the name, version, and (generated) dist and maintainers fields, I don't think any of it is even supposed to be validated by the registry?
Agreed, the website UX is confusing and could be better, but in general package metadata is just whatever the publisher put there, and it's up to you to verify it if you care about veracity.
the fucking website processes it and after some mighty compute somehow shits out the wrong link. it's actively making things worse by trying to be helpful.
confusing is one thing, but there's a screaming security chasm around that innocent little UX problem.
MS bought npmjs and now it's LARPing as some serious ecosystem (by showing how many unresolved security notices installed packages have) while they cannot be arsed to correctly show what's actually in the metadata?
this is a little too stoic a take with respect to a tool that very unserious people building things for serious but non-technical people use on a daily basis. i think we should strive for more. npm can continue to exist in its very libertarian form, but perhaps there's room for something that cares a bit more about caution
How about removing the incentive? Take down every package with tea.yaml in it, after say 1 month's warning, so legitimate packages trying to use it don't leave their users in the lurch. The tea protocol is clearly not going to accomplish what it set out to (see below), and is instead incentivising malicious behaviour and damaging the system it set out to support.
From https://docs.tea.xyz/tea/i-want-to.../faqs: "tea is a decentralized protocol secured by reputation and incentives. tea enhances the sustainability and integrity of the software supply chain by allowing open-source developers to capture the value they create in a trustless manner."
> allowing open-source developers to capture the value they create
But... then why would I use their code if whatever value it creates is captured by them, the developers, leaving me no better off than before? That's like paying your employees the full additional value they produce instead of market wages: you then literally have no reason to hire them, since their work is exactly profit-neutral.
I combed through their docs to try to find how these tokens would actually make maintainers money, and it seems like people pay projects for fixing bug reports (and penalize them if they don't)? The other demand drivers of the token seem to just shuffle money around and are at best a pyramid scheme. I'm a little confused how someone seriously thought this was going to be a good idea.
That would be a clear violation of the npm Unpublish Policy[0]. If all it takes is some spam and pissing people off to walk away from principles, they never meant anything. A proper response needs to not break expectations like this.
[0]: https://docs.npmjs.com/policies/unpublish
The entire NPM ecosystem is a garbage fire. Who cares about whatever 'principles' it supposedly has? Other than avoiding malware I can't think of something I care about less than whatever principles NPM / JS developers in general have because they've mostly been bad so far.
I wouldn't be surprised if principles in this case leave us with thousands of spam packages degrading the node ecosystem forever. It'd be exactly what I expect. So I guess I should thank the principle of consistency.
The unpublish document describes the options that users of NPM have to remove packages themselves. It was created after the left-pad incident, in which someone unpublished an important package and broke builds across the ecosystem.
A whole different set of terms governs which packages NPM itself can remove. That definitely includes these packages, either as "abusive" or as "name squatting".
Not only that, but NPM's TOS makes it very clear that you have no recourse if they decide to remove your package for any reason.
Principles are a means to an end, not an end in themselves. The end here (presumably) is a healthy ecosystem, an end which this principle arguably harms more than it helps. Rigid and unthinking adherence to principles is dogmatic, and dogma has no place in engineering.
Why are these spam accounts not perma banned and removed?
For example, this[1] account mentioned in the article has 1781 packages of gibberish.
Also, the whole reporting process is onerous: there is a large form to fill out. Gatekeeping on reporting is fine in principle, but there should be a way to report a package publisher's entire profile.
[1] https://www.npmjs.com/~eleanorecrockets
Isn't it better to leave accounts that correlate spam than to force spammers to obscure the connection by creating a new account for each piece of spam?
That primarily works if you can shadow-ban the account. Otherwise the spam is still negatively impacting the community (e.g., by polluting search results).
That's not how spammers work. There is this one profile with thousands of packages, and there are still hundreds of spam profiles with just a handful of packages so far. If you let them grow unchecked, they grow exponentially. The broken-windows theory fits well here.
> Next, because the AI hype train is at full steam, we must point out the obvious. AI models that are trained on these packages will almost certainly skew the outputs in unintended directions. These packages are ultimately garbage, and the mantra of “garbage in, garbage out” holds true.
hmm, inspiring thoughts. An answer to "AI is going to replace software developers in the next 10 years" is to create 23487623856285628346 spam packages that contain pure garbage code. Humans will avoid, LLMs will hallucinate wildly.
We can also seed false information more generally, especially on Reddit which every AI company loves to scrape - less so on Hacker News. I recently learned that every sodium vapor streetlamp is powered by a little hamster running on a wheel. Isn't that interesting?
Most of the recent gains in LLM quality came from improving the quality of inputs (i.e. recognizing that raw unfiltered internet is not the ideal diet for growing reason).
I don't know how good the filters are though, since they're mostly powered by LLMs...
That's not what "hallucination" is. Hallucinations in LLMs are when they unexpectedly and confidently extrapolate outside of their training set when you expected them to generate something interpolated from their training set.
In your example that's just a pollution of the training set by spam, but that's not that much of an issue in practice, as AI has been better than humans at classifying spam for over a decade now.
If I agree with your definition of hallucinations in the context of LLMs... Then isn't your second paragraph literally just a way to artificially increase the likelihood of them occurring?
You seem to differentiate between a hallucination caused by poisoning the dataset vs a hallucination caused by correct data, but can you honestly make such a distinction considering just how much data goes into these models?
Frankly, hallucination as used with LLMs today is not even really a technical term at all. It literally just means "this particular randomly sampled stream of language produced sentences that communicate falsehoods".
There's a strong argument to be made that the word is actually dangerously misleading by implying that there's some difference between the functioning of a model while producing a hallucinatory sample vs when producing a non-hallucinatory sample. There's not. LLMs produce streams of language sampled from a probability distribution. As an unexpected side effect of producing coherent language these streams will often contain factual statements. Other times the stream contains statements that are untrue. "Hallucination" doesn't really exist as an identifiable concept within the architecture of the LLM, it's just a somewhat subjective judgement by humans of the language stream.
The Tea protocol's flawed incentive model is a disaster, effectively encouraging developers to pollute npm with spam. It's a prime example of what happens when protocols prioritize quantity over quality, compromising the entire ecosystem.
A "better" way is to modify the package-lock.json. You can still spoof the package, but almost no one actually reviews it, since npm will usually modify thousands of lines. For example, take mongoose: as long as the integrity check passes for the resolved URL, npm will happily install it.
I also discovered that npm doesn't actually verify what's in node_modules when using "npm install". I found this out a while ago after I had some corrupted files due to a flaky internet connection. Hugely confusing. There also doesn't seem to be a straightforward way to check for this (as near as I could find in a few minutes).
But luckily "npm audit" will warn us about 30 "high severity" ReDoS "vulnerabilities" that can never realistically be triggered and are not really vulnerabilities in the first place, let alone "high impact" ones.
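The kind of lockfile entry being described might look something like this (hypothetical host and hash; the point is that the resolved URL is not the npm registry, yet the name and version still read as ordinary mongoose):

```json
{
  "node_modules/mongoose": {
    "version": "8.1.0",
    "resolved": "https://cdnnpmjs.com/mongoose/-/mongoose-8.1.0.tgz",
    "integrity": "sha512-<hash-of-the-attacker-tarball>",
    "license": "MIT"
  }
}
```

npm only checks that the downloaded tarball matches the integrity hash, and here that hash is simply the hash of whatever the attacker is serving, so the check passes.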
That (and anything else relying on the lockfile) won't take effect for users who install the package from the npm registry, unlike changes in package.json.
You just demonstrated the uglier, package-manager-independent alternative: overrides (npm) / resolutions (yarn). Because for whatever reason they couldn't play nice with each other.
https://docs.npmjs.com/cli/v9/configuring-npm/package-json#o...
https://classic.yarnpkg.com/lang/en/docs/selective-version-r...
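For illustration, the two fields look roughly like this (hypothetical target package; overrides is npm's field, resolutions is classic yarn's, and each tool ignores the other's):

```json
{
  "overrides": {
    "axios": "npm:not-actually-axios@1.0.0"
  },
  "resolutions": {
    "axios": "npm:not-actually-axios@1.0.0"
  }
}
```

Either field silently redirects every transitive request for axios to a different package, which is exactly the kind of thing a reviewer skimming package.json is unlikely to catch.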
npmjs.com seems to be interpreting the field incorrectly, but 1) AIUI that does not affect actual npm usage, and 2) if you rely on that website for supply-chain-security input, I have a bridge to sell you... Basically all the manifest metadata is taken as-is; if the facts are important, they should be separately verified out-of-band. Publishers could arbitrarily assign unassociated authors, repo URLs, and so on.
I was sad to read this and thought "this is why we can't have nice things."
But following the links was fun and educational:
"The end goal here [of the Tea protocol] is the creation of a robust economy around open source software that accurately and proportionately rewards developers based on the value of their work through complex web3 mechanisms, programmable incentives, and decentralized governance."
Which led to:
"The term cobra effect was coined by economist Horst Siebert based on an anecdotal occurrence in India during British rule. The British government, concerned about the number of venomous cobras in Delhi, offered a bounty for every dead cobra. Initially, this was a successful strategy; large numbers of snakes were killed for the reward. Eventually, however, people began to breed cobras for the income. When the government became aware of this, the reward program was scrapped. When cobra breeders set their snakes free, the wild cobra population further increased."
Which led to:
"Goodhart's law is an adage often stated as, 'When a measure becomes a target, it ceases to be a good measure.'"
I recently stumbled upon a bunch of repos which were clearly copied from popular projects but then renamed with a random Latin name and published to npm.
I reported some of them as spam, but there were hundreds of them. I couldn't figure out why somebody would waste the time to do that, but now it makes sense.
https://www.npmjs.com/package/sournoise?activeTab=dependenci...
If it showed axios and linked to the package provided in package.json, that would at least be better.
But here they actually link to the wrong package.
So much mangling of meaning. The "AI" that detects spam is way different from LLMs.
1. a cryptocurrency scheme for funding OSS development[1] is incentivizing spammers to try and monetize NPM spam
2. it's easy to spoof your dependencies with package.json[2]
[1]: https://tea.xyz/blog/the-tea-protocol-tokenomics
[2]: https://www.npmjs.com/package/sournoise?activeTab=code