Just two days ago I flipped through a slide deck from a security conference where the author, Jossef Harush Kadouri, found that using a model from a place like Huggingface means the author of the model can execute any code on your machine. Not sure if the slides are uploaded elsewhere, I got them sent as file: https://dro.pm/c.pdf (45MB) slide 188
I didn't realise, as I flipped through the slides, that this means not only the model's author gets to run code on your machine: so does anyone who hacks Huggingface, and Huggingface itself if it receives a court-signed letter (especially dangerous if they don't notice a breach for a while¹)
As someone not in the AI scene, I've never run these models but was surprised at how quickly the industry standardised the format. I had assumed model files were big matrices of numbers and some metadata perhaps, but now I understand how they managed so quickly: a model is (eyeing slides 186 and 195) a Python script that can do whatever it wants. That makes "standardisation" exceedingly easy: everyone can do their own thing and you sidestep the problem altogether. But that comes with a cost.
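The mechanism behind those slides is the standard `pickle` one: a pickle stream can name any callable plus arguments for it, so loading *is* executing. A minimal self-contained sketch (a harmless `eval` stands in for the attacker's payload, which could just as well be `os.system`):

```python
import pickle

# A "model" that abuses pickle's __reduce__ hook: on load, pickle calls
# the returned callable with the given arguments.
class MaliciousModel:
    def __reduce__(self):
        return (eval, ("40 + 2",))

payload = pickle.dumps(MaliciousModel())  # what gets uploaded as "weights"
result = pickle.loads(payload)            # code runs during deserialization
print(result)                             # the eval already happened
```

Nothing in that payload looks like weights; the "file format" is a program for pickle's little virtual machine.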
> I had assumed model files were big matrices of numbers and some metadata perhaps
ONNX [1] is more or less this, but the challenge you immediately run into is models with custom layers/operators with their own inference logic - you either have to implement those operators in terms of the supported ops (not necessarily practical or viable) or provide the implementation of the operator to the runtime, putting you back at square one.
As others have pointed out, this is format-dependent. One format that hasn't been mentioned in this thread yet is GGUF, used by llama.cpp and derivatives. It's pretty much "big matrices of numbers and some metadata." Some vulnerabilities were found [1] and patched.
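A sketch of why GGUF sits in the safe camp: its header is a fixed binary layout (magic, version, tensor count, metadata-KV count in v3), so a loader only ever interprets bytes as numbers, never as code. Hand-rolling a tiny header with the stdlib:

```python
import struct

# Minimal GGUF v3 header: 4-byte magic, u32 version,
# u64 tensor count, u64 metadata-KV count (little-endian).
header = b"GGUF" + struct.pack("<IQQ", 3, 2, 5)

magic = header[:4]
version, n_tensors, n_kv = struct.unpack_from("<IQQ", header, 4)
assert magic == b"GGUF"
print(version, n_tensors, n_kv)
```

The vulnerabilities found in parsers of this kind are ordinary memory-safety bugs (bad bounds on those counts), not deserialization-runs-code by design.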
(The dro.pm link will expire any minute now. It's so short because it's temporary, should maybe have used a more permanent service. I've found the talk here in case you're reading this later: https://m.youtube.com/watch?v=8XysLIq-e3s)
> Just two days ago I flipped through a slide deck from a security conference where the author, Jossef Harush Kadouri, found that using a model from a place like Huggingface means the author of the model can execute any code on your machine.
That's precisely why it's unexpected that a model, i.e. data, can run code. I wouldn't expect a pdf to start executing code on my system either; it should be data!
This is more akin to downloading a jpeg and the jpeg running arbitrary code. Models should be like jpegs and I believe safetensors treat them that way, while the old pickle format didn't.
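That is indeed the safetensors design: an 8-byte little-endian length, a JSON header describing the tensors, then a flat byte buffer, so loading is pure parsing. A toy round-trip with the stdlib (real file layout, made-up tensor name):

```python
import json
import struct

# Build a minimal in-memory "safetensors"-style file: u64 header length,
# JSON header, then the raw tensor bytes.
meta = {"weight": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
header = json.dumps(meta).encode()
blob = struct.pack("<Q", len(header)) + header + struct.pack("<2f", 1.0, 2.0)

# Loading: read the length, decode JSON, slice the buffer. No code runs.
(n,) = struct.unpack_from("<Q", blob, 0)
parsed = json.loads(blob[8:8 + n])
start, end = parsed["weight"]["data_offsets"]
values = struct.unpack("<2f", blob[8 + n + start:8 + n + end])
print(values)
```

The worst a malicious file can do here is be malformed JSON or lie about offsets, both of which a parser can reject.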
Data intended to be read as instructions by an interpreter or the CPU is a whole different ballgame from data intended to convey the values of something. High-order sparse/dense matrices serialized in some xyz format is what most people think of when they hear the word "model". To switch it up and send an arbitrary Python file to execute on the client is a security nightmare. This is outrageous.
I thought it was pretty good actually. Most of these leak disclosures usually say things like "We do not have evidence they accessed any secrets" or something like that, because they don't "know" what the hackers did once they were in. At least huggingface is saying "Yeah, they probably accessed secrets but we can't confirm it"
> Over the past few days, we have made other significant improvements to the security of the Spaces infrastructure, including completely removing org tokens (resulting in increased traceability and audit capabilities), implementing key management service (KMS) for Spaces secrets, robustifying and expanding our system’s ability to identify leaked tokens and proactively invalidate them, and more generally improving our security across the board.
That's a serious amount of non-trivial work to be done in "a few days". The kind of work that should trigger more time consuming activities like security audits, pen tests and the like, before going live, right?
At a larger organization with a whole SRE department that includes a dedicated security team, sure, but (my impression is) huggingface isn't that size of an org (yet).
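For scale, at least the "identify leaked tokens" item has a cheap first pass: Hugging Face user tokens carry a distinctive `hf_` prefix, so a crude scanner is little more than a regex. (A sketch; the prefix is documented, but the 30+ character tail is my heuristic, not a documented length.)

```python
import re

# "hf_" prefix is real; the alphanumeric-tail length is a guess.
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{30,}\b")

def find_candidate_tokens(text):
    """Return likely Hugging Face tokens found in text for review/revocation."""
    return TOKEN_RE.findall(text)

sample = "config = {'token': 'hf_" + "a" * 34 + "'}  # oops, committed a secret"
print(find_candidate_tokens(sample))
```

Proactive invalidation on a hit is the harder, stateful part; the matching itself is trivial.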
My openAI key was leaked and I noticed someone was using it; luckily the damage wasn’t nearly as bad as yours. A few dollars worth of GPT4, a model none of my apps were using at the time.
I’m almost entirely certain it was leaked via secrets on HF space, I got a message a few days ago warning me some of my spaces were affected
Anthropic is too new to have built that functionality I guess. I only found out because they were mad that whoever had my key was abusing their ToS, and they notified the organization owner.
I noticed a few weeks ago that some of my OpenAI keys got compromised; they were only active as secrets on a huggingface space. I got an email a few days ago informing me that the spaces were compromised, so I suspect this issue has been going on for at least a few weeks.
¹ https://www.verizon.com/business/resources/articles/s/how-to... says 20% of victims don't notice for months; of course, it depends on the situation and what actions the attackers take
[1] https://onnx.ai/
[1] https://www.databricks.com/blog/ggml-gguf-file-format-vulner...
To my knowledge this is only a problem if the model is serialized/de-serialized via pickle[0].
[0]: https://huggingface.co/docs/hub/en/security-pickle
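Right, and the mitigation the Python docs themselves suggest for that root cause is an allow-listing `Unpickler` subclass: override `find_class` so the stream can only resolve globals you've approved, and everything else fails before any code runs. A sketch, with a deliberately tiny allow-list:

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    # Only these (module, name) globals may be resolved from a stream.
    ALLOWED = {("collections", "OrderedDict")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safe_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers never touch find_class, so they load fine:
print(safe_loads(pickle.dumps({"lr": 0.01})))

# A payload that smuggles in a callable is rejected before it runs:
class Evil:
    def __reduce__(self):
        return (eval, ("1+1",))

try:
    safe_loads(pickle.dumps(Evil()))
except pickle.UnpicklingError as e:
    print("rejected:", e)
```

Even so, allow-listing real model classes safely is fiddly, which is why moving to a data-only format beats hardening pickle.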
Proceeds to link to pdf of unknown origins
Another day..
Or is this purely about theft of data/code?
https://huggingface.co/docs/hub/en/spaces-overview
The front end/portal. I speculate that it's coded in Python, maybe some Django thing...