Simple but probably wrong solution: why not ban obfuscation libraries, compressed code, and self-loading code within the PyPI ecosystem? Any package that even refers to illegible non-source techniques gets flagged and blocked. The whole PyPI ecosystem seems undisciplined and could be tightened up. Why can't we make progress here?
You can pip install complex stand-alone executables, such as nodejs, and this is relied on across the entire ecosystem.
In fact, most packages are now wheels, which are not sources: they are compressed archives and may contain binaries for compiled extensions, something extremely popular (the scientific and AI stacks exist only because of this).
Some packages need to be compiled after the fact, something that setup.py will trigger, and some even embed a fallback compiler, as some Cython-based packages do.
Also, remember there are very few people working on PyPI: there is no moderation and anybody can publish anything, so you would need a bullet-proof automated heuristic. That's either impractical or too expensive.
If you want a secure package distribution platform, there are commercial ones, such as Anaconda. You get what you pay for.
Self-loading code is a huge part of the value-add of Python libraries. Many of the popular libraries (e.g. NumPy and friends) trigger a bewildering chain of events to compile from source if you're not installing from pre-built wheels. And if you do have wheels, you have opaque binary blobs. So pick your poison: compile-on-install with a possible backdoor, or prebuilt .so/.dylib/.pyc with a possible backdoor.
The most obvious (but not necessarily easiest) approach is to phase out setup.py and move everything to the declarative pyproject.toml approach. This is not just better for metadata (setup scripts make it really hard to statically infer what deps a lib has), it also allows for better control over what installers/toolchains run on install.
Attackers still have quite a lot of latitude during the build phase, but at least libraries have the option to specify declaratively what permissions they need (and presumably the user has the option to forbid them).
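For reference, the declarative approach in question looks roughly like this: a minimal, illustrative pyproject.toml (the package name and dependencies here are made up) using setuptools as the build backend:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "example-package"
version = "0.1.0"
dependencies = [
    "requests>=2.28",
]
```

Because the metadata is static, an installer can read the dependency list without executing any package-supplied code, which is exactly what a setup.py makes impossible.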
Also eval/exec are terrible and I wish there were a mode to disable their usage, but I don't know if the Python runtime has some deep dependency on them. Maybe there's a way to restrict it so that only low-level frames can call the eval opcode.
Would it be possible for the wheels to be built in a more trusted / hardened environment? Having a binary blob isn't as serious when it comes from a trusted source. Almost all Debian-style Linux distributions have this feature (a binary-downloading package manager).
The hardening could mitigate tampering during compilation.
Obviously, this leaves "compile in the backdoor and wait for the user to fall into it", but at least this isn't an issue of compiling on the user's computer and it isn't an issue of binary blobs. And possibly there's a greater chance of detection if actual source code has to be available to compile.
>Also eval/exec are terrible and I wish there were a mode to disable their usage,
You can use audit hooks in the sys module (as long as you load it first) to disable eval/exec/process spawning or even arbitrary imports or network requests.
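As a minimal sketch of that idea, assuming Python 3.8+ (the hook name is mine): a process-wide audit hook can veto the `exec` audit event, which fires when `eval()` or `exec()` runs a code object:

```python
import sys

def deny_dynamic_code(event, args):
    # The "exec" audit event is raised when eval()/exec() execute a code object.
    if event == "exec":
        raise RuntimeError("dynamic code execution blocked")

sys.addaudithook(deny_dynamic_code)  # audit hooks cannot be removed once added

try:
    eval("1 + 1")
except RuntimeError as exc:
    print("blocked:", exc)  # prints: blocked: dynamic code execution blocked
```

The caveat is that this is process-wide and must be installed before any untrusted code runs; stdlib code that legitimately uses exec (e.g. dataclasses) will break too, so it's a blunt instrument rather than a sandbox.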
I’ve been building Packj [1] to flag PyPI/NPM/Ruby packages that contain suspicious decode+exec and other “risky” APIs using static analysis. It also uses strace-based dynamic analysis to monitor install-time filesystem/network activities. We have detected a bunch of malware with the tool.
The short answer is that this can’t be easily mitigated at the package index level, at least not without massive breaking changes to the Python packaging ecosystem: PyPI would have to ban all setup.py-based source distributions.
Even then, that only pushes the problem down a layer: you’re still fundamentally installing third party code, which can do whatever it pleases. The problem then becomes one of static analysis, for which precision is the major limitation (in effect, just continuing the cat-and-mouse game.)
Why would you think that would change a thing? Also, obfuscation has legitimate uses for people making stuff they don't want easily reversed. This isn't a Python-specific problem.
Yeah, just get rid of anything that has a binary blob. Cool. And then when PyPI gets swapped out for whatever immediately replaces it because PyPI is useless, then at least PyPI will be secure.
A malicious author could embed malicious code in the package and still get the package signed. Hashing won't prevent this sort of thing on PyPI; it only addresses in-transit and alternate-supplier attacks.
Requiring anything from open-source authors is a losing proposition. Items of interest just won't end up on PyPI. IIRC this chain of events already happened on another distribution platform.
One of the underappreciated benefits of Richard Stallman getting what he wants would be that antivirus programs could then be updated to flag on all obfuscated code or anti-debugging actions.
Those things you named are just one of the checks it made. The Python part of it was also an encoded bzip2 file that creates a bit of a debugging headache; then it downloads a .pyc file which was run through an obfuscator, which adds more of a Python headache. Your "in fact" is not a fact.
The methods this malware uses for anti-debugging wouldn't cause a headache for anyone who isn't completely new to the subject. Download 10 random Python malware samples and you'll notice that probably at least 8 of them follow this exact same packing and execution pattern. The Discord hook and laughable end payload are a good indication that whoever wrote this is probably some high-school kid.
The only surprising thing about this article is the claim that this type of malware hasn't been spotted on PyPI before. That would suggest there aren't many credible actors trying to spread through PyPI at all.
Huh. It never ceases to amaze me when another demonstration is presented to me that “plus ça change, plus c'est la même chose” (“the more things change, the more they stay the same”) in this industry. I suppose it is only to be expected that some of the old anti-piracy techniques found in 8-bit floppy- and cassette-distributed software might eventually find new, philosophically similar implementations in malware.
Some of that self-modifying and anti-defeat code back then was truly a work of art, squeezed into mind-bogglingly small memory and CPU footprints. The malware authors will have a field day re-implementing its cousins in spirit, and some of the greybeards amongst the white hats will get to relive their 8-bit glory days hunting and defeating them.
The article describes a really primitive technique compared to the last generation of those anti-piracy techniques, but I still see a family resemblance.
The more I hear this stuff, the more I write things in Go with no external dependencies pulled in. I can do 95% of what I need without involving a supply chain or downloading anything random off the internet other than the Go distribution itself.
I like the sentiment, and I'm usually first in line to ridicule the 'npm install left-pad' crowd, but this doesn't always fly. Python is a great glue language for mashing high-performance C/Fortran components together. One does not simply write sklearn or PyTorch from scratch.
I get the general idea, but at the same time, I don't have the time to write my own libraries from scratch: all modern web standards are complex, and most libraries are filled with years to decades worth of experience with all the edge cases that crop up, particularly as most standards don't come with a compliance test suite.
It's one thing if I were paid by my employer to re-invent the wheel, but for personal projects... I don't have that much free time for them in the first place any more, I want to get shit done and not shave yaks all day. When I want a good grind, I'll pack out Factorio or one of the LEGO Switch games...
I always build my whole computer from scratch, from NAND gates all the way up to the full OS, build my own switches, cut the network cables myself, dependencies be damned. /s
For Python at least, most of the dependencies are very justifiable. The Python stdlib is huge and satisfies most regular programs, such as glue code. But for web and ML it is not possible to include these libraries in the stdlib, nor is it feasible to write them from scratch.
Standards and requirements will change, bits will rot, and I'm not expecting any ecosystem to keep up with coming and going demands.
A better solution, IMHO, would be project-level capabilities, so you can pull in a dependency but restrict its lib/syscall access, so it would fail to build when it turns malicious.
Maybe it will solve at least something, maybe some day.
Agreed. I'd like to see an OpenBSD pledge(2)-type system for libraries, so you can mask individual library capabilities rather than just programs. I don't want a web server that can write to the file system, and I don't want a CSV reader that can talk to the network.
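Python has nothing per-library, but the same audit-hook machinery gives a crude process-wide approximation of pledge-style masking. A sketch (the hook name is illustrative) that denies all outbound network connections:

```python
import socket
import sys

def no_network(event, args):
    # "socket.connect" is raised before any connection attempt is made.
    if event == "socket.connect":
        raise PermissionError("network access not pledged")

sys.addaudithook(no_network)

s = socket.socket()
try:
    s.connect(("127.0.0.1", 9))
except PermissionError as exc:
    print("denied:", exc)
finally:
    s.close()
```

Unlike pledge(2) applied per-library, this cannot tell which library made the call; true per-library capabilities would need something like stack inspection in the hook, or language-level support.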
I don't think the Go stdlib is significantly better than the Python batteries. For normal stuff, you can build without dependencies in Python too. The problem starts when you use more complex stuff or want to save time by using a lib that delivers certain benefits. After all, you can't build and maintain everything by yourself.
I’m wondering if there is room for a security model based around “escrow builds”.
Imagine if PyPI could take pure source code and run a standardized wheel build for you. That pipeline would include running security linters on the source. Then you could install the escrow version of the artifact instead of the one produced by the project maintainers.
You can even have a capability model - most installers should not need to run onbuild/oninstall hooks. So by default don’t grant that.
This sidesteps a bunch of supply-chain attacks. The cost is that there is some labor required to maintain these escrow pipelines.
With modern build tools I think this might not be unworkable, particularly given that small libraries would be incentivized to adopt standardized structures if it means they get the “green padlock” equivalent.
Libraries that genuinely have special needs like numpy could always go outside this system, and have a “be careful where you install this package from” warning. But most libraries simply have no need for the machinery being exploited here.
What does it mean for a package to have been signed with the key granted to the CI build server?
Does a Release Manager (or primary maintainer) again sign what the build farm produced once? What sort of consensus on PR approval and build output justifies use of the build artifact signing key granted to a CI build server?
How open are the build farm and signed package repo and pubkey server configurations? https://github.com/dev-sec https://pulpproject.org/content-plugins/
https://reproducible-builds.org/
>Libraries that genuinely have special needs like numpy could always go outside this system, and have a “be careful where you install this package from” warning. But most libraries simply have no need for the machinery being exploited here.
My personal experience with any situation where I need to get some crusty random Python library to run has always involved a lot of "-y"ing, swearing, and sketchy conda repositories. Usually it's code that was written years ago and implements some very particular algorithm that's essential, so any warnings in the pipeline basically get ignored given the sheer difficulty of the task.
Apologies for the naive or off-topic question. I'm still a relatively new hobby Pythoner, with no formal training in CS.
I clearly get the security risks associated with random libs available for Python. Is this also the case for other languages like Java? Are the dependencies available to them also a relative free-for-all, or are bugs mostly accidental?
Thanks!
I think there is always a danger, in every language, when you install a 3rd-party dependency from a package repository. But usually this is restricted to the runtime of the application that uses the 3rd-party library (and maybe, depending on the language, the code paths that are executed).
That's a difficult enough problem to deal with already, but with Python, it's possible to execute code at install time of such a 3rd-party library (basically, when you do a 'pip install stuff'). So you might never have run the application you installed, but you'd still have executed whatever malware was hiding. This is not the case for a lot of other languages. Also, Python allows the execution of code on an `import stuff` statement, which is often not the case in other languages. That's not directly related to install-time execution, just another Python-specific attack vector.
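The import-time part is easy to demonstrate: module top-level code runs as a side effect of `import` itself. A self-contained sketch (the module name is made up for the demo):

```python
import os
import sys
import tempfile

# Write a throwaway module whose top-level code records that it ran.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "sideeffect_demo.py"), "w") as f:
    f.write("print('top-level code executed on import')\nRAN_AT_IMPORT = True\n")

sys.path.insert(0, tmpdir)
import sideeffect_demo  # the print fires here, before any function is called

print(sideeffect_demo.RAN_AT_IMPORT)  # → True
```

A malicious package only needs to put its payload at module top level (or in setup.py for install time) to run without the victim ever calling anything.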
If you run a security linter like ‘bandit’ you’ll get warnings for eval and other security holes.
It seems you can’t run bandit on deps, but perhaps if you fork them and build yourself?
If you are security conscious, having a rule that you can only install from a local pypi with packages you have forked would be a more defensible perimeter. But, a maintenance pain for sure.
1. https://github.com/ossillate-inc/packj flags malicious/risky packages.
It essentially bans binary blobs, yet it is very useful.
Why would it be "useless"? Explain your reasoning, please.
The problem is a failure to understand security.
Basically, if the library uses eval(), it's probably a good idea to avoid it if possible.
For example here's Python dataclasses in the standard library using exec() to create the `__init__` and other methods that go on your dataclass:
https://github.com/python/cpython/blob/main/Lib/dataclasses....
Here's Pydantic using it for a Jupyter notebook check:
https://github.com/pydantic/pydantic/blob/594effa279668bd955...
Here's pytest using it to rewrite modules so that statements like assert are instrumented by pytest:
https://github.com/pytest-dev/pytest/blob/eca93db05b6c5ec101...
Here's the decorator module using it (as is the only way to do this in Python) to create a signature matching decorator for an arbitrary function:
https://github.com/micheles/decorator/blob/ad013a2c1ad796996...
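The pattern behind all of these is roughly the following: a simplified sketch of runtime code generation, not the actual library code (names are made up):

```python
# Build a function from a library-generated source string, then attach it,
# the way dataclasses and decorator do internally.
src = (
    "def __init__(self, x, y):\n"
    "    self.x = x\n"
    "    self.y = y\n"
)
namespace = {}
exec(src, namespace)  # runs code the library itself generated, not user input
generated_init = namespace["__init__"]

class Point:
    pass

Point.__init__ = generated_init
p = Point(1, 2)
print(p.x, p.y)  # → 1 2
```

The crucial property is that the source string is produced by the library itself, never taken from user input.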
All of these libraries are completely secure as eval/exec are used with code fragments that are generated by the libraries, not based on untrusted input.
eval() /exec() are not running executable files, just Python code, the same way all the rest of the package is already doing.
My favorite case was when a newbie coder used eval() to evaluate something that looked JSON-ish, which came from an API request.
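For that JSON-ish case, the safe tools are `json.loads` for actual JSON and `ast.literal_eval` for Python literal syntax; neither will execute arbitrary expressions:

```python
import ast
import json

payload = '{"user": "alice", "admin": false}'  # untrusted, JSON-ish input

data = json.loads(payload)   # parses data only; cannot run code
print(data["admin"])         # → False

nums = ast.literal_eval("[1, 2, 3]")  # literals only: no names, no calls
print(nums)                  # → [1, 2, 3]

try:
    ast.literal_eval("__import__('os')")  # a function call, so it is rejected
except ValueError as exc:
    print("rejected:", type(exc).__name__)  # → rejected: ValueError
```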