saurik · 2 years ago
A number of replies here are noting (correctly) how this doesn't have much to do with AI (despite some sentences in this article kind of implicating it; the title doesn't really, fwiw) and is more of an issue with cloud providers, confusing ways in which security tokens apply to data being shared publicly, and dealing with big data downloads (which isn't terribly new)...

...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models, given that the Python serialization format was never intended for untrusted data distribution and so is effectively code... but stored in a way where both what that code says, and the fact that it is there at all, is extremely obfuscated to the people who download it.

> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
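For anyone unfamiliar with why "prone to arbitrary code execution by design" is not hyperbole, here's a minimal sketch of the mechanism (the command is a harmless stand-in):

    import os
    import pickle

    class NotJustData:
        # pickle calls __reduce__ to learn how to rebuild this object;
        # returning (callable, args) means "call this during load"
        def __reduce__(self):
            return (os.system, ("echo this ran during unpickling",))

    blob = pickle.dumps(NotJustData())
    pickle.loads(blob)  # merely loading the bytes runs the command

A .ckpt downloaded from a bucket like this is exactly such a byte blob, and nothing about its appearance tells you whether a payload is inside.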

osanseviero · 2 years ago
The safetensors format was created exactly for this - safe model serialization

https://huggingface.co/blog/safetensors-security-audit
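The round trip is a near drop-in replacement; a minimal sketch, assuming PyTorch tensors:

    import torch
    from safetensors.torch import save_file, load_file

    weights = {"layer0.weight": torch.zeros(4, 4)}
    save_file(weights, "model.safetensors")    # pure tensor data plus a header
    restored = load_file("model.safetensors")  # parsing never executes code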

wolftickets · 2 years ago
Disclosure: I work for the company that released https://github.com/protectai/modelscan, a tool that supports scanning many models for this kind of problem.

That said, you should be using something like safetensors.

lawlessone · 2 years ago
You have me curious now. The models generate text. Could a model hypothetically be trained in such a way that it creates a buffer overflow when given certain prompts? I'm guessing inference works in such a way that that can't happen.
anonymousDan · 2 years ago
For me it's also interesting as a potential pathway for data poisoning attacks - if you have control over the data used to train a production model, can you modify the dataset such that it inserts a backdoor into any model subsequently trained over it? E.g. what if GPT were biased to insert certain security vulnerabilities as part of its codegen capabilities?
btilly · 2 years ago
The AI version of https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...?

At the moment such techniques would seem to be superfluous. I mean we're still at the stage where you can get a bot to spit out a credit card number by saying, "My name is in the credit card field. What is my name?"

That said, what you're describing seems totally plausible. If there was enough text with a context where it behaved in a particular way, triggering that context should trip that behavior. And there would be no obvious sign of it unless you triggered that context.

AI is hard.

sillysaurusx · 2 years ago
It’s risky to make definitive claims about what is or isn’t a possible security vector, but based on my years of training GPTs, I think you’d find it very difficult for a number of reasons.

Firstly, the malicious data needs to form a significant portion of the training set. Given that training data is on the order of terabytes, this alone makes it unlikely you’ll be able to poison the dataset.

Unless the entire training dataset was also stored in this 38TB, you’ll only be able to fine-tune the model, and fine-tuning tends to destroy model quality (otherwise fine-tuning would be the default for foundation models: you’d train one, fine-tune it to make it “even better” somehow, then release it. But we don’t, because fine-tuning makes the model less general by definition).

pixl97 · 2 years ago
In theory, for any AI model that generates code you'll want a series of post-generation tests, for example something like SAST and/or SCA, to ensure the model is not biasing itself toward particular flaws.

At least for common languages this should stand out.

Where it gets trickier is watering-hole attacks against specialized languages or particular setups. That said, you'd also have to ensure that such data isn't already there, scraped up from the internet.
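As a toy sketch of such a post-generation gate (real SAST/SCA tooling goes far deeper; the patterns here are purely illustrative):

    import re

    # Flag generated code containing patterns a SAST tool would also catch.
    SUSPICIOUS = [
        (r"\beval\s*\(", "dynamic evaluation"),
        (r"\bexec\s*\(", "dynamic execution"),
        (r"shell\s*=\s*True", "shell injection risk"),
        (r"pickle\.loads?\s*\(", "untrusted deserialization"),
    ]

    def review_generated_code(code: str) -> list[str]:
        return [label for pattern, label in SUSPICIOUS if re.search(pattern, code)]

    print(review_generated_code("subprocess.run(cmd, shell=True)"))
    # ['shell injection risk']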

dheera · 2 years ago
Many people are also unaware that json is way, way, way faster than Python pickles, and far friendlier for human editing. Not that you'd use it for neural net weights, but I see people use Python pickles all the time for things json would have handled perfectly well.
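It's easy to measure on your own payloads rather than take on faith:

    import json
    import pickle
    import timeit

    payload = {"ids": list(range(10_000)), "label": "example"}

    print("json:  ", timeit.timeit(lambda: json.dumps(payload), number=1_000))
    print("pickle:", timeit.timeit(lambda: pickle.dumps(payload), number=1_000))

Which one wins depends on the Python version and the shape of the data, so benchmark before committing either way.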
romanows · 2 years ago
Are you sure json is faster than pickle in recent python versions? That's not intuitive to me and search result blurbs seem to indicate the opposite.
BlueTemplar · 2 years ago
So, a little bit like a lot of people think that (non-checksummed/non-encrypted) PDFs cannot be modified, even though they are easily editable with Libre freaking Office?
failuser · 2 years ago
You can’t edit them in Word, so that must be too advanced for most people. LibreOffice never opened PDFs too well for me, but Inkscape was pretty good, though only one page at a time.
rodgerd · 2 years ago
The other aspect that pertains to AI is the data-maximalist mindset around these tools: grab as much data, aggregate it all together, and to hell with any concerns about what and how the data is being used; more data is the competitive advantage. This means a failure that might otherwise be quite limited in scope becomes huge.
hedora · 2 years ago
Occasionally, I’ll talk to someone suggesting a dynamically typed language (or stringly-typed java) for a very large scale (in developer count) security or mission critical application.

This incident is a good one to point back to.

sillysaurusx · 2 years ago
laughs in log4j vuln

A good fraction of the flaws we found at Matasano involved pentests against statically typed languages. If an adversary has root access to your storage box, they can likely find ways to pivot their access. Netpens were designed to do that, and those were the most fun; they’d parachute us into a random network, give us non-root creds, and say “try to get into as many other servers as you can.” It was hard, but we’d find ways, and it almost never involved modifying existing files. It wasn’t necessary; the bash history always had so many useful points of interest.

It’s true that the dynamics are a little different there, since that’s a running server rather than a storage box. But those two employees’ hard drive backups have an almost 100% chance of containing at least one pivot vector.

Sadly, the choice of technology turns out to be irrelevant, and can even lead to overconfidence. The solution is to pay for regular security testing, and not just the automated kind. Get someone in there to try to sleuth out attack vectors by hand. It’s expensive, but it pays off.

mattnewton · 2 years ago
The typing of Python isn’t the issue; it’s effectively the eval problem of not having a separation between code and data in the pickle format often used out of convenience. There are lots of pure data containers, like huggingface’s safetensors or tensorflow’s protobuf checkpoints, that could have been used instead.
evertedsphere · 2 years ago
types have nothing to do with this, strictly speaking; the same problems would exist if you serialised structures containing functions in a typed language to e.g. a dll or a .class file and asked users to load it at runtime

the problem is in fact the far more subtle principle of "don't download and run random code, and definitely don't make it the idiomatic way to do things," and i'm not sure you can blame your use of eval()-like things on the fact that they exist in your language in the first place

make3 · 2 years ago
that has literally nothing to do with the topic, which is just misconfigured cloud stuff. people really like starting these old crappy language arguments anywhere they can
nostoc · 2 years ago
Yeah, because statically typed languages never had any kind of deserialization vulnerabilities.
chinchilla2020 · 2 years ago
What is the best practice? I'm assuming something that isn't a programming language object...

benreesman · 2 years ago
I’ll venture that it’s at least adjacent that the indiscriminate assembly of massive, serious pluralities of the commons on a purely unilateral basis for profit is sort of a “just try and stop us” posture that whether or not directly related here, and clearly with some precedent, is looking to create a lot of this sort of thing over and above the status-quo ick.
short_sells_poo · 2 years ago
I have no idea what you are saying. If it is: "bad incentives cause people to misbehave", you generated an impressive verbiage around it :)

sillysaurusx · 2 years ago
The article tries to play up the AI angle, but this was a pretty standard misconfiguration of a storage token. This kind of thing happens shockingly often, and it’s why frequent pentests are important.
cj · 2 years ago
> it’s why frequent pentests are important.

Unfortunately a lot of pen testing services have devolved into "We know you need a report for SOC 2, but don't worry, we can do some light security testing and generate a report for you in a few days and you'll be able to check the box for compliance"

Which I guess is better than nothing.

If anyone works at a company that does pen tests for compliance purposes, I'd recommend advocating internally for doing a "quick, easy, and cheap" pen test to "check the box" for compliance, _alongside_ a more comprehensive pen test (maybe call the latter something other than a "pen test" to reassure internal stakeholders who might worry that a second, in-depth pen test could weaken their compliance posture, since the report is typically shared with sales prospects).

Ideally grey-box or white-box testing (provide access to the codebase / infrastructure to make finding bugs easier). Most pen tests done for compliance purposes are black-box and limit their findings as a result.

dylan604 · 2 years ago
I recently ran into something along the lines of your devolved-pentest concept. I have a public-facing webapp, and the report came back with a list of "critical" issues that are solved by yum update. Nothing about vulnerability to session hijacking or anything along the lines of requiring actual work. I was a few steps removed from the actual testing, so who knows what was lost in translation; it was also the first time I've ever had something I worked on pen tested. Still, it felt like a script-kiddie port scan rather than an actual attempt to provide useful security advice. The whole process was very disappointing.
oooyay · 2 years ago
Narrowly scoped tests designed for specific compliance requirements are fine. They lower the barrier to entry for getting any testing at all, and still, often enough, return viable results. There are also SaaS companies that have emerged that effectively run a scripted analysis of cloud resources. The two together are more economical and still accomplish the goals that having compliance in the first place sets out.

When I was consulting architecture and code review were separate services with a very different rate from pentesting. Similar goals but far more expensive.

mymac · 2 years ago
Pentests where people actually get out of bed to do stuff (read code, read API docs, etc.) and then try to really hack your system are rare. Pentests where people go through the motions and send you a report with a few unimportant bits highlighted, while patting you on the back for your exemplary security so you can check the box on whatever audit you're going through, are common.
nbk_2000 · 2 years ago
If you're a large company that's actually serious about security, you'll have a Red Team that is intimately familiar with your tech stacks, procedures, business model, etc. This team will be far better at emulating motivated attackers (as well as providing bespoke mitigation advice, vetting and testing solutions, etc.).

Unfortunately, compliance/customer requirements often stipulate having penetration tests performed by third parties. So for business reasons, these same companies will also hire low-quality pen tests from "check-box pen-test" firms.

So when you see that $10K "complete pen-test" being advertised as being used by [INSERT BIG SERIOUS NAME HERE], good chance this is why.

_jal · 2 years ago
Let me tell you about the laptop connected to our network with a cellular antenna we found in a locked filing cabinet after getting a much-delayed forced-door alert. This, after some social engineering attempts that displayed unnerving familiarity with employees and a lot of virtual doorknob-rattling.

They may be rare, but "real" pentests are still a thing.

iamflimflam1 · 2 years ago
Yep, most pentests go through the OWASP list and call it done.
j245 · 2 years ago
From my understanding as a non security expert:

A pentest comes across more as checking that all the common attack vectors don’t exist.

Getting out of bed to do the so-called “real stuff” is typically called a bug bounty program or security research.

Both exist, and I don’t see why most companies couldn’t start a bug bounty program if they really cared a lot about the “real stuff”.

evntdrvn · 2 years ago
what I always want to know when people talk about this is "what reputable companies can I actually pay to do a real pentest (without costing hundreds of thousands of dollars)."
trebligdivad · 2 years ago
How would a pentest find that? OK, in this case it's splattered onto GitHub; but the main point here is that you might have some unknown number of SAS tokens issued against unknown storage, with no easy way to revoke them.
sillysaurusx · 2 years ago
A number of ways, including:

- finding the token directly in the repo

- reviewing all tokens issued
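The first is standard secret scanning; a crude sketch (the pattern and heuristics here are assumptions, and real scanners are far more careful):

    import pathlib
    import re

    # Heuristic: a SAS URL is an ordinary blob URL whose query string
    # carries sv= (service version) and sig= (the HMAC signature).
    sas_like = re.compile(r"https://[^\s\"']+\?[^\s\"']*sv=[^\s\"']+")

    for path in pathlib.Path(".").rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for url in sas_like.findall(text):
            if "sig=" in url:
                print(f"{path}: possible SAS URL ({url[:60]}...)")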

acdha · 2 years ago
It didn’t seem to be focused on AI except for the very reasonable concerns that AI research involves lots of data and often also people without much security experience. Seeing things like personal computer backups in the dump immediately suggests that this was a quasi-academic division with a lot less attention to traditional IT standards: I’d be shocked if a Windows engineer could commit a ton of personal data, passwords, API keys, etc. and first hear about it from an outside researcher.
sneak · 2 years ago
It was so common that S3 added several features to make it really, really hard to accidentally leave a whole bucket public.

Looks like Azure hasn't done similarly.
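For comparison, the S3-side guardrail is Block Public Access; a sketch of turning it on with boto3 (the bucket name is hypothetical):

    import boto3

    s3 = boto3.client("s3")
    # Once set, neither ACLs nor bucket policies can expose the bucket,
    # which is what makes "accidentally public" hard on S3.
    s3.put_public_access_block(
        Bucket="my-bucket",
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )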

mcast · 2 years ago
Is there any valid use case for publicly exposing an S3 bucket?
doctorpangloss · 2 years ago
Cloud buckets have all sorts of toxic underdevelopment of features. They play make-believe that they're file systems for the sake of adoption.

Like for starters, why is it so hard to determine effective access in their permissions models?

Why is the "type" of files so poorly modeled? Do I ever allow people to give effective public access to a file "type" that the bucket can't understand?

For example, what is the "type" of code? It doesn't have to be this big complex thing. The security scanners GitHub uses knows that there's a difference between code with and without "high entropy strings" aka passwords and keys. Or if it looks like data:content/type;base64, then at least I know it's probably an image.

What if it's weird binary files like .safetensors? Someone here is saying you might "accidentally" release the GPT-4 weights. I guess just don't let someone put those on a publicly resolvable bucket, ever, without an explicit, uninherited manifest / metadata permitting that specific file.

Microsoft owns the operating system! I bet in two weeks the Azure and Windows teams could figure out how to make a unified policy manifest / metadata for NTFS & ReFS files that Azure's buckets can understand. Then again, they don't give deduplication to Windows 11 users; their problem isn't engineering, it's the financialization of essential security features. Well, jokes on you guys: if you make it a pain for everybody, you make it a pain for yourself, and you're the #1 user of Azure.

xbar · 2 years ago
AI data is highly centralized and not stored in a serially-accessed database, which makes it unusual inasmuch as 40TB of interesting data does not often get put into a single storage bucket.
hdesh · 2 years ago
On a lighter note - I saw a chat message that started with "Hey dude! How is it going". I'm disappointed that the response was not https://nohello.net/en/.
monkpit · 2 years ago
I strongly support the “no hello” concept but I also fear being seen as “that guy” so I never mention it. Sigh
dymk · 2 years ago
I've made peace with people sending me a bare "hello" with no context. I ignore it until there's something obvious to respond to. Responding with the "no hello" webpage will often be received as (passive) aggressive, and that's a bad way to start off a conversation.

Usually within a few minutes there's followup context sent. Either the other party was already in the process of writing the followup, or they realized there was nothing actionable to respond to and they elaborate.

monkpit · 2 years ago
I should have a slack bot that replies automatically to generic greetings… that way they’ll get on with whatever the issue is and I won’t have to reply.
tgsovlerkhgsel · 2 years ago
"No hello" implies that people shouldn't be friendly at all, and comes across as rude.

The concept simply needs a more descriptive name to be accepted. It's not about not saying hello. It's about including the actual request in the first message, usually after the hello.

hiddencost · 2 years ago
I make it my status message.
fireflash38 · 2 years ago
I have seen people never ask their question after multiple days of saying "hello @user", despite having nohello as a status. And despite having asked them in the past to just ask their question and I'll respond when I can.

You just can't win.

sneak · 2 years ago
Be that guy. In the long run it's better to be right than popular.
bootloop · 2 years ago
This is quite funny to me because at first I didn't understand what the problem was.

In German, if you ask this question, it is expected that your question is genuine and you can expect an answer (although usually people don't use this opportunity to unload their emotional baggage, but it can happen!).

Whereas in English you assume this is just a hello and nothing more.

manojlds · 2 years ago
In England people say "You all right" and move on without even waiting for a response!
syndicatedjelly · 2 years ago
I love that an entire website was made around this, without any attempt to sell me anything. So rare to see that these days
hahn-kev · 2 years ago
Glad I've never had to deal with that in chat.

Though I have had the equivalent in tech support: "App doesn't work", which is basically just hello; obviously you're having an issue, otherwise you wouldn't have contacted our support.

jovial_cavalier · 2 years ago
Destroying camaraderie with a co-worker - Any% (WR)
low_tech_punk · 2 years ago
Unfortunately, the AI researcher did not use an LLM to automatically respond with the nohello content.
quickthrower2 · 2 years ago
Two of the things that make me cringe are both mentioned: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys IMO.

SOC 2-type auditing should have been done here, so I am surprised at the reach: the SAS token had no expiry, and it gave deep access, including machine backups with their own tokens. A lot of lack of defence in depth going on there.

My view is burn all secrets. Burn all environment variables. I think most systems can work based on roles. Important humans access via username, password, and other factors.

If you are working in one cloud you don't, in theory, need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adapters to convert them into RBAC too. But I am not a security expert, just paranoid lol.
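A sketch of what secret-free access looks like from Python (account, container, and blob names are hypothetical):

    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    # No SAS token, no account key: the credential resolves to a managed
    # identity in Azure (or your local az login), and access is pure RBAC.
    service = BlobServiceClient(
        account_url="https://myaccount.blob.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    blob = service.get_blob_client(container="models", blob="weights.bin")
    data = blob.download_blob().readall()

There is no string in that code worth leaking; revoking access is a role-assignment change, not a token hunt.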

prmoustache · 2 years ago
Many SOC 2 audits are a joke. We were audited this year and were asked to provide screenshots of various categories (most of them being of our own choosing, in the end). The only requirement was that screenshots show the date on the computer where they were taken, as if that couldn't be forged just as easily as the file/EXIF data.
lijok · 2 years ago
If you forge your SOC2 evidence you will legitimately wish you were never born once caught
bunderbunder · 2 years ago
Pickle files are cringe, but they're also basically unavoidable when working with Python machine learning infrastructure. None of the major ML packages provide a proper model serialization/deserialization mechanism.

In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.

osanseviero · 2 years ago
You should check out safetensors. They are used widely in diffusion models and LLMs https://huggingface.co/blog/safetensors-security-audit
jklehm · 2 years ago
ONNX[0], models-as-protobufs, continuing to gain adoption will hopefully solve this issue.

[0] https://github.com/onnx/onnx

mxz3000 · 2 years ago
At work we use the ONNX serialisation format for all of our prod models. Those get loaded by the ONNX runtime for inference. Works great.

Perhaps it'd be viable to add support for the ONNX format even for use cases like model checkpointing during training, etc.?
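For the curious, inference against such a file is only a few lines (the model path and input shape here are hypothetical):

    import numpy as np
    import onnxruntime as ort

    # The .onnx file is a protobuf graph of ops and weights; loading it
    # parses data rather than unpickling arbitrary Python objects.
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: np.zeros((1, 3, 224, 224), np.float32)})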

hypeatei · 2 years ago
Absolutely, RBAC should be the default. I would also advocate separate storage accounts for public-facing data, so that any misconfiguration doesn't affect your sensitive data. Just typical "security in layers" thinking that apparently this department in MSFT didn't have.
ozim · 2 years ago
So SAS tokens are worse than some admin setting up a "FileDownloaderAccount" and then sharing its password with multiple users, or using the same one for different applications?

I take SAS tokens with expiration over people setting up shared RBAC account and sharing password for it.

Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks does not solve the issue, SAS tokens, while still not perfect, help quite a lot.

quickthrower2 · 2 years ago
FileDownloaderAccount had no copy-pastable secret that can be leaked. Shared passwords are unnecessary of course and not good. If people are going to do that, just use OneDrive/Dropbox rather than letting people use advanced things.

Dead Comment

stevanl · 2 years ago
Looks like it was up for 2 years with that old link[1]. Fixed two months ago.

[1] https://github.com/microsoft/robust-models-transfer/blame/a9...

jl6 · 2 years ago
Kind of incredible that someone managed to export Teams messages out from Teams…
pradn · 2 years ago
It's not reasonable to expect human security-token generation to be perfectly secure all the time. The system needs to be safe overall. The organization should have set an OrgPolicy on this entire project to prevent blanket sharing of auth tokens/credentials like this. Ideally, blanket access tokens should be opt-in, not opt-out.

Google banned generation of service account keys for internally-used projects, so a stray JSON key file doesn't grant access to Google data/code. This is enforced at the highest level by OrgPolicy. There are a bunch more restrictions, too.

mola · 2 years ago
It's always funny that Wiz's big security revelations are almost always about Microsoft, when Wiz's founder was the highest-ranking person in charge of cybersecurity at Microsoft in his previous job.
alphabetting · 2 years ago
Would be kind of surprising if that weren't the case.