Ah, so it's not only me that uses AWS primitives to hackily implement all sorts of synchronization primitives.
My other favorite pattern is implementing a pool of workers by querying EC2 instances with a certain tag in a stopped state and starting them.
Starting the instance can succeed only once - that means I managed to snatch the machine. If it fails, I try again, grabbing another one.
This is one of those things that I never advertised out of professional shame, but it works, it's bulletproof and dead simple, and it does not require additional infra to work.
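For anyone curious what that looks like in practice, here is a minimal boto3 sketch of the claim-by-start pattern (my illustration, not the original code; the role=worker tag is a made-up example):

    # Claim a stopped worker: a successful StartInstances call is the "lock".
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    def claim_worker():
        pages = ec2.get_paginator("describe_instances").paginate(Filters=[
            {"Name": "tag:role", "Values": ["worker"]},            # example tag
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ])
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    try:
                        # Only one caller can move a given instance out of "stopped".
                        ec2.start_instances(InstanceIds=[instance["InstanceId"]])
                        return instance["InstanceId"]
                    except ClientError as e:
                        if e.response["Error"]["Code"] == "IncorrectInstanceState":
                            continue  # someone else snatched it; try the next one
                        raise
        return None  # pool exhausted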
yeah. one of the goals was startup time, so it made sense to precreate them. In practice we never ran out of free machines (and if we did, I have a CDK script to make more), and infinite scaling is a pain in the butt anyway due to having to manage subnets etc.
Cost-wise we're only paying for the EBS volumes of the stopped instances, which are like 4 GB each, so they cost practically nothing; we spend less than a dollar per month for the whole bunch.
not sure, probably an EKS cluster with a job scheduler pod that creates jobs via the batch API. The scheduler pod might be replaced by a lambda.
Another possibility is something cooked up with a lambda creating EC2 instances via CDK, with the whole thing tracked in a DynamoDB table.
the first one is probably cleaner (though I don't like it: it means the instance needs to be a Kubernetes node, and that comes with a bunch of baggage).
My biggest wishlist item for S3 is the ability to enforce that an object is named with a name that matches its hash. (With a modern hash considered secure, not MD5 or SHA1, though it isn't supported for those either.) That would make it much easier to build content-addressable storage.
While it can't be done server-side, this can be done straightforwardly in a signer service, and the signer doesn't need to interact with the payloads being uploaded. In other words, a tiny signer can act as a control plane for massive quantities of uploaded data.
The client sends the request headers (including the x-amz-content-sha256 header) to the signer, and the signer responds with a valid S3 PUT request (minus body). The client takes the signer's response, appends its chosen request payload, and uploads it to S3. With such a system, you can implement a signer in a lambda function, and the lambda function enforces the content-addressed invariant.
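A rough sketch of the enforcement step, assuming a Lambda-style handler; sign_put_headers is a hypothetical stand-in for whatever SigV4 signing you use (e.g. via botocore), and the bucket name is made up:

    # The signer never sees the payload; it only checks that the requested key
    # equals the client-declared SHA-256 and then signs a PUT bound to that hash.
    def handler(event, context):
        key = event["key"]
        declared_sha256 = event["headers"]["x-amz-content-sha256"]

        # Content-addressed invariant: key must equal the declared payload hash.
        # SigV4 binds the signature to that hash, so S3 will reject any body
        # that does not actually hash to it.
        if key != declared_sha256:
            return {"statusCode": 403, "body": "key must equal sha256 of content"}

        signed_headers = sign_put_headers(      # hypothetical helper
            bucket="my-cas-bucket",             # assumed bucket name
            key=key,
            content_sha256=declared_sha256,
        )
        return {"statusCode": 200, "headers": signed_headers}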
Unfortunately it doesn't work natively with multipart: while SigV4+S3 enables you to enforce the SHA256 of each individual part, you can't enforce the SHA256 of the entire object. If you really want, you can invent your own tree hashing format atop SHA256, and enforce content-addressability on that.
I have a blog post [1] that goes into more depth on signers in general.

[1] https://josnyder.com/blog/2024/patterns_in_s3_data_access.ht...
S3 has supported SHA-256 as a checksum algo since 2022. You can calculate the hash locally and then specify that hash in the PutObject call. S3 will calculate the hash and compare it with the hash in the PutObject call and reject the Put if they differ. The hash and algo are then stored in the object's metadata. You simply also use the SHA-256 hash as the key for the object.
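For single-part uploads, that approach is just a few lines with boto3 (bucket name assumed; note the checksum is passed base64-encoded while the key here is the hex digest):

    import base64
    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def put_content_addressed(bucket: str, body: bytes) -> str:
        digest = hashlib.sha256(body).digest()
        key = digest.hex()                     # content-addressed key
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=body,
            ChecksumSHA256=base64.b64encode(digest).decode(),  # S3 rejects a mismatch
        )
        return key

Note that S3 only verifies the body against the checksum you supply; nothing server-side forces the key to equal that checksum, which is the gap discussed above.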
Unfortunately, for a multi-part upload it isn't a hash of the total object, it is a hash of the hashes for each part, which is a lot less useful. Especially if you don't know how the file was partitioned during upload.
And even if it were for the whole file, it isn't used for the ETag, so it can't be used for conditional PUTs.
I had a use case where this looked really promising, then I ran into the multipart upload limitations, and ended up using my own custom metadata for the sha256sum.
That's interesting. Would you want it to be something like a bucket setting, along the lines of "any time an object is uploaded, don't let the write complete unless S3 verifies, using a pre-defined hash function (like SHA-256), that the object's name matches the object's contents"?
That will probably never happen because of the fundamental nature of blob storage.
Individual objects are split into multiple blocks, each of which can be stored independently on different underlying servers. Each server can see its own block, but not any other block.
Calculating a hash like SHA256 would require a sequential scan through all blocks. This could be done with a minimum of network traffic if instead of streaming the bytes to a central server to hash, the hash state is forwarded from block server to block server in sequence. Still though, it would be a very slow serial operation that could be fairly chatty too if there are many tiny blocks.
What could work would be a Merkle tree hash construction where the subdivision boundaries match the block sizes.
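As a toy illustration of that idea (block size is an arbitrary choice here), a "checksum of checksums" in the spirit of S3's composite multipart checksums:

    import hashlib

    BLOCK_SIZE = 8 * 1024 * 1024  # 8 MiB, purely illustrative

    def composite_sha256(data: bytes) -> str:
        # Hash each block independently (each storage server only needs its own
        # block), then hash the concatenated digests to combine them cheaply.
        block_digests = [
            hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)
        ]
        return hashlib.sha256(b"".join(block_digests)).hexdigest()

As noted in the multipart discussion above, a scheme like this only lets a client verify end to end if both sides agree on the block boundaries.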
Why would you PUT an object, then download it again to a central server in the first place? If a service is accepting an upload of the bytes, it is already doing a pass over all the bytes anyway. It doesn't seem like a ton of overhead to calculate SHA-256 in 4096-byte chunks as the upload progresses. I suspect that sort of calculation happens anyway.
Why does the architecture of blob storage matter? The hash can be calculated as data streams in for the first write, before data gets dispersed into multiple physically stored blocks.
Isn't that the point of the metadata? Calculate the hash ahead of time and store it in the metadata as part of the atomic commit for the blob (at least for S3).
Is there any reason you can't enforce that restriction on your side? Or are you saying you want S3 to automatically set the name for you based on the hash?
> Is there any reason you can't enforce that restriction on your side?
I'd like to set IAM permissions for a role, so that the role can add objects to the content-addressable store, but only if their name matches the hash of their content.
> Or are you saying you want S3 to automatically set the name for you based on the hash?
I'm happy to name the files myself, if I can get S3 to enforce that. But sure, if it were easier, I'd be thrilled to have S3 name the files by hash, and/or support retrieving files by hash.
To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. Been waiting for this since the day we launched on Google Cloud Storage ~1 year ago. Our bet that S3 would get it in a reasonable time-frame worked out!
Interesting that what’s basically an ad is the top comment - it’s not like this is open source or anything - you can’t even use it immediately (you have to apply for access). Totally proprietary. At least Elasticsearch is AGPL, to say nothing of OpenSearch, which also supports use of S3.
Someone made an informed technical bet that worked out. Sounds like HN material to me. (Also, is it really a useful ad if you can't easily use the product?)
I started looking into this but DeleteObject doesn't support these conditional headers on general purpose buckets; only directory buckets (S3 Express One Zone).
I’d wager that the algorithm is slightly eager to throw a consistency error if it’s unable to verify across partitions. Since the caller is naturally ready for this error, it’s likely not a problem. So in short it’s the P in CAP :)
If my memory of parallel algorithms class serves me right, you can build any synchronization algorithm on top of compare-and-swap as an atomic primitive.
As a toy example (horribly inefficient under non-trivial write contention), you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this (a rough sketch follows the list):
- Download the current database copy
- Perform your write locally
- Upload it back with a conditional PUT ("Put-If-Match"), using the ETag of the pre-edit copy you downloaded.
- If you get success, consider the transaction successful.
- If you get failure, go back to step 1 and try again.
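Here is roughly what that loop looks like with boto3, assuming an SDK recent enough to expose the If-Match condition on PutObject; apply_write is a placeholder for the local SQLite edit:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def transact(bucket: str, key: str, apply_write):
        while True:
            current = s3.get_object(Bucket=bucket, Key=key)
            etag = current["ETag"]
            new_body = apply_write(current["Body"].read())  # do the write locally
            try:
                s3.put_object(Bucket=bucket, Key=key, Body=new_body, IfMatch=etag)
                return  # nobody wrote in between; transaction committed
            except ClientError as e:
                status = e.response["ResponseMetadata"]["HTTPStatusCode"]
                if status in (409, 412):   # precondition failed or concurrent conflict
                    continue               # lost the race; re-read and retry
                raise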
It is often very important to know, when you write an object, what the previous state was. Say you sell plushies and you have 100 plushies in a warehouse. You create a file "remainingPlushies.txt" that stores "100". If somebody buys a plushie, you read the file, and if the count is greater than 0, you subtract 1, write the new version of the file, and okay the sale.
Without conditional writes, two instances of your application might both read "100", both subtract 1, and both write "99". If they checked the file afterward, both would think everything was fine. But things aren't fine, because you've actually sold two.
The other cloud storage providers have had these sorts of conditional write features since basically forever, and it's always been really weird that S3 has lacked them.
The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.
Thanks! Do they mention when the comparison is done? Is it before, after, or during an upload? (For instance, if I have a 4 TB file in a multipart upload, would I only find out it failed once the whole file is uploaded?)
Noting that Azure Blob Storage supports ETag / optimistic concurrency controls as well (via If-Match conditions) [1], how does this differ? Or is it the same feature?

[1]: https://learn.microsoft.com/en-us/azure/storage/blobs/concur...
This combined with the read-after-write consistency guarantee is a perfect building block (pun intended) for incremental append only storage atop an object store. It solves the biggest problem with coordinating multiple writers to a WAL.
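For example, a sketch of multi-writer append where each writer races to create the next numbered segment with put-if-absent (the key layout here is made up):

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def append_wal(bucket: str, prefix: str, next_seq: int, record: bytes) -> int:
        while True:
            key = f"{prefix}/{next_seq:020d}.wal"
            try:
                # If-None-Match: * means "create only if this key doesn't exist yet".
                s3.put_object(Bucket=bucket, Key=key, Body=record, IfNoneMatch="*")
                return next_seq  # we own this log position
            except ClientError as e:
                if e.response["ResponseMetadata"]["HTTPStatusCode"] in (409, 412):
                    next_seq += 1  # another writer claimed the slot; try the next one
                    continue
                raise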
If the default ETag algorithm for non-encrypted, non-multipart uploads in AWS is a plain MD5 hash, is this subject to failure for object data with MD5 collisions?
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.
Sure, but theoretically you could have a system where a distributed log of user-generated content is built via this CAS-on-MD5-ETag primitive. A malicious actor could craft the data such that entries are dropped.
With Google Cloud Storage, you can solve this by conditionally writing based on the "generation number" of the object, which always increases with each new write, so you can know whether the object has been overwritten regardless of its contents. I think Azure also has an equivalent.
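For comparison, a sketch with the google-cloud-storage client, where if_generation_match pins the write to the generation you read (0 would mean "only create if absent"):

    from google.cloud import storage
    from google.api_core.exceptions import PreconditionFailed

    client = storage.Client()

    def cas_update(bucket_name: str, key: str, transform) -> bool:
        blob = client.bucket(bucket_name).blob(key)
        blob.reload()                      # fetch current metadata, incl. generation
        generation = blob.generation
        data = blob.download_as_bytes()
        try:
            blob.upload_from_string(
                transform(data),
                if_generation_match=generation,  # fails if overwritten since the read
            )
            return True
        except PreconditionFailed:
            return False  # generation changed; caller should re-read and retry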
https://aws.amazon.com/blogs/aws/new-additional-checksum-alg...
https://turbopuffer.com/blog/turbopuffer
It's no longer top comment, which is fine.
Genuinely, we've wanted this for ages and we got half way there with strong consistency.
So coordinating writes to multiple objects still requires… creativity.