S3: "Block Public Access is now enabled by default on new buckets."
On the one hand, this is obviously the right decision. The number of giant data breaches caused by incorrectly configured S3 buckets is enormous.
But... every year or so I find myself wanting to create an S3 bucket with public read access so I can serve files out of it. And every time I need to do that I find something has changed and my old recipe doesn't work any more and I have to figure it out again from scratch!
Yeah, what month?
The thing to keep in mind with the "Block Public Access" setting is that it is a redundancy built in to save people from making really big mistakes.
Even if you have a terrible and permissive bucket policy or ACLs (legacy but still around) configured for the S3 bucket, if you have Block Public Access turned on - it won't matter. It still won't allow public access to the objects within.
If you turn it off but you have a well scoped and ironclad bucket policy - you're still good! The bucket policy will dictate who, if anyone, has access. Of course, you have to make sure nobody inadvertently modifies that bucket policy over time, or adds an IAM role with access, or modifies the trust policy for an existing IAM role that has access, and so on.
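For anyone else who keeps re-deriving the recipe: here's a rough boto3 sketch of the public-read setup being described - relax the Block Public Access guardrail for that one bucket, then rely on a tightly scoped bucket policy. The bucket name is made up and the exact flags you want may differ.

    import json
    import boto3

    s3 = boto3.client("s3")
    bucket = "example-public-assets"  # hypothetical bucket name

    # Relax Block Public Access just enough for a public bucket *policy* to be accepted.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,        # still no public ACLs
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": False,     # otherwise the policy below is rejected
            "RestrictPublicBuckets": False,
        },
    )

    # Anonymous read on objects only, via a bucket policy (no ACLs involved).
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))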
I just stick CloudFront in front of those buckets. Then you don't need to expose the bucket at all, and you can point a canonical hostname in your DNS at it.
That’s definitely the “correct” way of doing things if you’re writing infra professionally. But I do also get that more casual users might prefer not to incur the additional costs or complexity of having CloudFront in front. Though at that point, one could reasonably ask if S3 is the right choice for casual users.
Not always that simple - for example if you want to automatically load /foo/index.html when the browser requests /foo/ you'll need to either use the web serving feature of S3 (bucket can't be private) or set up some Lambda@Edge or similar fiddly shenanigans.
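If you do go the CloudFront route, one common workaround for the /foo/ -> /foo/index.html problem is a tiny Lambda@Edge (or CloudFront Function) that rewrites the URI before it reaches the origin. A hedged Python sketch of the Lambda@Edge origin-request variant - the extension check is a naive heuristic, not a complete solution:

    # Lambda@Edge "origin request" handler: rewrite directory-style URIs so a
    # private S3 origin can still serve /foo/index.html for /foo/ requests.
    def handler(event, context):
        request = event["Records"][0]["cf"]["request"]
        uri = request["uri"]
        if uri.endswith("/"):
            request["uri"] = uri + "index.html"
        elif "." not in uri.rsplit("/", 1)[-1]:
            # bare /foo with no file extension: treat it like a directory request
            request["uri"] = uri + "/index.html"
        return request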
Back when GPT4 was the new hotness, I dumped the markdown text from the Azure documentation GitHub repo into a vector index and wrapped a chatbot around it. That way, I got answers based on the latest documentation instead of a year-old LLM model's fuzzy memory.
I now have the daunting challenge of deploying an Azure Kubernetes cluster with... shudder... Windows Server containers on top. There's a mile-long list of deprecations and missing features that were fixed just "last week" (or whatever). That is just too much work to keep up with for mere humans.
I'm thinking of doing the same kind of customised chatbot but with a scheduled daily script that pulls the latest doco commits, and the Azure blogs, and the open GitHub issue tickets in the relevant projects and dumps all of that directly into the chat context.
Once I have that I can also ask it for the custom tweaks I need.
I'm going to roll up my sleeves next week and actually do that.
Then, then, I'm going to ask the wizard in the machine how to make this madness work.
Pray for me.
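For what it's worth, a rough sketch of what that daily pull could look like using the public GitHub REST API; the repo names are just examples and the "dump it into context" step is left as plain concatenation:

    import datetime
    import requests

    REPOS = ["MicrosoftDocs/azure-docs", "Azure/AKS"]  # example repos, adjust to taste
    since = (datetime.datetime.now(datetime.timezone.utc)
             - datetime.timedelta(days=1)).isoformat()

    chunks = []
    for repo in REPOS:
        # Recent doc/code commits
        commits = requests.get(
            f"https://api.github.com/repos/{repo}/commits", params={"since": since}
        ).json()
        chunks += [f"[{repo} commit] {c['commit']['message']}" for c in commits]

        # Recently updated open issues (note: this endpoint also returns PRs)
        issues = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"state": "open", "since": since},
        ).json()
        chunks += [f"[{repo} issue] {i['title']}" for i in issues]

    context_blob = "\n".join(chunks)  # feed this straight into the chat context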
You're braver than me if you're willing to trust the LLM here - fine if you're ready to properly review all the relevant docs once you have code in hand, but there are some very expensive risks otherwise.
I'm still not sure I know how to do it if I need to again.
I honestly don't mind that you have to jump through hoops to make your bucket publicly available and that it's annoying. That to me seems like a feature, not a bug.
>In EC2, you can now change security groups and IAM roles without shutting the instance down to do it.
Hasn't it been this way for many years?
>Spot instances used to be much more of a bidding war / marketplace.
Yeah, because there's no bidding at all any more, which is great: you don't get those super-high price spikes as availability drops, where only the ones who bid super high to ensure they wouldn't be priced out are able to get instances.
>You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.
This one was a nightmare and it took ages to convince some of my more pig-headed coworkers in the past that they didn't need to do it any more. The funniest part is that they were storing their data as millions and millions of 10-100 KB files, so the S3 backend scaling wasn't the thing bottlenecking performance anyway!
>Originally Lambda had a 5 minute timeout and didn’t support container images. Now you can run them for up to 15 minutes, use Docker images, use shared storage with EFS, give them up to 10GB of RAM (for which CPU scales accordingly and invisibly), and give /tmp up to 10GB of storage instead of just half a gig.
This was/is killer. It used to be such a pain to have to manage pyarrow's package size if I wanted a Python Lambda function that used it. One thing I'll add that took me an embarrassingly long time to realize is that your Python global scope is actually persisted, not just the /tmp directory.
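A small illustration of that last point, in case it saves someone else the same embarrassment - anything at module level survives across warm invocations of the same sandbox (the event shape here is made up):

    import boto3

    s3 = boto3.client("s3")   # created once per container, reused on warm starts
    _cache = {}               # module-level dict also persists between invocations

    def handler(event, context):
        key = event["key"]
        if key not in _cache:
            obj = s3.get_object(Bucket=event["bucket"], Key=key)
            _cache[key] = obj["Body"].read()
        return len(_cache[key])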
> You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.
Sorry, this is absolutely still the case if you want to scale throughput beyond the few thousand IOPS a single shard can serve. S3 will automatically reshard your key space, but if your keys are sequential (e.g. a leading timestamp) all your writes will still hit the same shard.
Source: direct conversations with AWS teams.
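For reference, the old workaround being argued about looks roughly like this - derive a short hash prefix so sequential keys fan out instead of piling onto one partition. Whether you still need it presumably depends on your throughput, per the disagreement above:

    import hashlib

    def spread_key(natural_key: str, prefix_len: int = 4) -> str:
        # e.g. "2024-06-01T12:00:00Z/event-123.json" -> "<4 hex chars>/2024-06-01T12:00:00Z/event-123.json"
        prefix = hashlib.md5(natural_key.encode()).hexdigest()[:prefix_len]
        return f"{prefix}/{natural_key}"

The obvious cost is that you can no longer list objects by their natural (e.g. date) prefix without fanning out across all the hash prefixes.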
Re: SG, yeah I wasn't doing any cloud stuff when that was the case. Never had to restart anything for an SG change, and this must be at least 5-6 years.
Every AWS update can potentially affect your SOC 2 or HIPAA compliance posture. I've seen companies fail audits because they assumed their security configurations were still current.
The cloud moves fast. Compliance processes need to keep up. Manual annual reviews aren't enough when your infrastructure is changing constantly.
This is also why we built automated compliance monitoring - because what worked last quarter might not work today.
> Glacier restores are also no longer painfully slow.
I had a theory (based on no evidence I'm aware of except knowing how Amazon operates) that the original Glacier service operated out of an Amazon fulfillment center somewhere. When you put in a request for your data, a picker would go to a shelf, pick up some removable media, take it back, and slot it into a drive in a rack.
This, BTW, is how tape backups on timesharing machines used to work once upon a time. You'd put in a request for a tape and the operator in the machine room would have to go get it from a shelf and mount it on the tape drive.
https://www.reddit.com/r/DataHoarder/comments/12um0ga/the_ro...
Which is basically exactly what you described but the picker is a robot.
Data requests go into a queue; when your request comes up, the robot looks up the data you requested, finds the tape and the offset, fetches the tape and inserts it into the drive, fast-forwards it to the offset, reads the file to temporary storage, rewinds the tape, ejects it, and puts it back. The latency of offline storage is in fetching/replacing the cassette and in forwarding/rewinding the tape, plus waiting for an available drive.
Realistically, the systems probably fetch the next request from the queue, look up the tape it's on, and then process every request from that tape so they're not swapping the same tape in and out twenty times for twenty requests.
I can't talk about it, but I've yet to see an accurate guess at how Glacier was originally designed. I think I'm in safe territory to say Glacier operated out of the same data centers as every other AWS service.
It's been a long time, and features launched since I left make clear some changes have happened, but I'll still tread a little carefully (though no one probably cares there anymore):
One of the most crucial things to do in all walks of engineering and product management is to learn how to manage the customer expectations. If you say customers can only upload 10 images, and then allow them to upload 12, they will come to expect that you will always let them upload 12. Sometimes it's really valuable to manage expectations so that you give yourself space for future changes that you may want to make. It's a lot easier to go from supporting 10 images to 20, than the reverse.
It feels odd that this is some sort of secret. Why can't you talk about it?
I'm like 90% sure I've seen folks (unofficially) disclose the original storage and API decisions over the years, in roughly accurate terms. Personally I think the multi-dimensional striping/erasure-code ideas are way more interesting than the “it's just a tape library” speculation/arguments. That and the real lessons learned around product differentiation as supporting technologies converge.
I think folks have missed what I think would have been clever about the implementation I (apparently) dreamt up. It's not that "it's just a tape library", it's that it would have used the existing FC and picker infrastructure that Amazon had already built, with some racks containing drives for removable media. I was thinking that it would not have been some special facility purely for Glacier, but rather one or more regular FCs would just have had some shelves with Glacier media (not necessarily tapes).
Then the existing pickers would get special instructions on their handhelds: Go get item number NNNN from Row/shelf/bin X/Y/Z and take it to [machine-M] and slot it in, etc.
They would definitely be using robots given how uniform hard drives are. The only reason warehouses still have humans is all that heterogeneity (different sizes, different textures, different squishiness, etc.).
I'll add: When doing instance-to-instance communication (in the same AZ) always use private IPs. If you use public IP routing (even in the same AZ) this is charged as regional data transfer.
Even worse, if you run self-hosted NAT instance(s), don't use an EIP attached to them. Just use an auto-assigned public IP (no EIP).
NAT instance with EIP
- AWS routes it through the public AWS network infrastructure (hairpinning).
- You get charged $0.01/GB regional data transfer, even if in the same AZ.
NAT instance with auto-assigned public IP (no EIP)
- Traffic routes through the NAT instance’s private IP, not its public IP.
- No regional data transfer fee — because all traffic stays within the private VPC network.
- auto-assigned public IP may change if the instance is shut down or re-created, so have automation to handle that. Though you should be using the network interface ID reference in your VPC routing tables.
My understanding is that transfer gets charged on both sides as well. So if you own both sides you'll pay $0.02/GB.
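To make the private-IP advice above concrete, a hedged boto3 sketch that looks up a peer instance's private address so you connect to that instead of its public IP or EIP (the instance ID is made up):

    import boto3

    ec2 = boto3.client("ec2")

    def private_ip_for(instance_id: str) -> str:
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
        return reservations[0]["Instances"][0]["PrivateIpAddress"]

    peer_ip = private_ip_for("i-0123456789abcdef0")  # hypothetical instance ID
    # ...then connect to peer_ip:port rather than the instance's public address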
I think there are more of us who kind of degenerated from doing it the AWS way - API Gateway, serverless Lambdas, messing around with IAM roles until it works, ... - to: give me an EC2 / Lightsail VPS instance, maybe an S3 bucket, let's set the domain up through Route 53, and go away with the rest of your AWS orchestration.
At what point is AWS worth using over other compute competitors when you’re using them as a storage bucket + VPS? They’re wholly more expensive at that point. Why not go with a more traditional but rock-solid VPS provider?
I have the opposite philosophy for what it’s worth: if we are going to pay for AWS I want to use it correctly, but maximally. So for instance if I can offload N thing to Amazon and it’s appropriate to do so, it’s preferable. Step Functions, Lambda, DynamoDB etc., over time, have come to supplant their alternatives and it’s overall more efficient and cost effective.
That said, I strongly believe developers don’t give enough consideration to how to maximize vendor usage in an optimal way
> That said, I strongly believe developers don’t give enough consideration to how to maximize vendor usage in an optimal way
Because it's not straightforward. 1) You need to have general knowledge of AWS services and their strong and weak points to be able to choose the optimal one for the task, 2) you need to have good knowledge of the chosen service (like DynamoDB or Step Functions) to be able to use it optimally; being mediocre at it is often not enough, 3) local testing is often a challenge or plain impossible, you often have to do all testing on a dev account on AWS infra.
Truly a marketing success.
AWS can be used in a different, cost-effective way.
It can be used as a middle-ground capable of serving the existing business, while building towards a cloud agnostic future.
The good AWS services (s3, ec2, acm, ssm, r53, RDS, metadata, IAM, and E/A/NLBs) are actually good, even if they are a concern in terms of tracking their billing changes.
If you architect with these primitives, you are not beholden to any cloud provider, and can cut over traffic to a non AWS provider as soon as you’re done with your work.
I agree that using them as a VPS provider is a mistake.
If you don't use the E(lasticity) of EC2, you're burning cash.
For prod workloads, if you can go from 1 to 10 instances during an average day, that's interesting.
If you have 3 instances running 24/7/365, go somewhere else.
For dev workloads, being able to spin up instances in a matter of seconds is bliss.
I installed the wrong version of a package on my instance? I just terminate it, wait for the auto-scaling group to pop a fresh new one and start again.
No need to waste my time trying to clean my mess on the previous instance.
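If you want to be tidy about it, there's an Auto Scaling API call for exactly this "throw it away and get a fresh one" move (the instance ID is made up):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Terminate the broken instance but keep desired capacity, so the ASG
    # immediately launches a clean replacement from the launch template/AMI.
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId="i-0123456789abcdef0",
        ShouldDecrementDesiredCapacity=False,
    )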
You speak about Step Functions as an efficient and cost effective service from AWS, and I must admit that it's one that I avoid as much as I can...
Given the absolute mess that it is to set up/maintain, and that you completely lock yourself into AWS with this, I never pick it to do anything.
I'd rather have a containerized workflow engine running on ECS, even though I miss out on the few nice features that SF offers within AWS.
The approach I try to have is:
- business logic should be cloud agnostic
- infra should swallow all the provider's pills it needs to be as efficient as possible
Because the compartmentalization of business duties means that devs are fighting uphill against the wind to sign a deal with a new vendor for something. It's business bikeshedding: as soon as you open the door to a new vendor everyone, especially finance, has opinions, and you might end up stuck with a vendor you didn't want. Or you can use the pre-approved money furnace and just ship.
There are entire industries that have largely de-volved their clouds primarily for footprint flexibility (not all AWS services are in all regions) and billing consistency.
Honestly just having to manage IAM is such a time-suck that the way I've explained it to people is that we've traded the time we used to spend administering systems for time spent just managing permissions, and IAM is so obtuse that it comes out as a net loss.
There's a sweet spot somewhere in between raw VPSes and insanely detailed least-privilege serverless setups that I'm trying to revert to. Fargate isn't unmanageable as a candidate, not sure it's The One yet but I'm going to try moving more workloads to it to find out.
Usually I write some IaC to automate this tedium so I only have to go through the IAM setup pain once. Now if requirements change, that's an entirely different story...
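As a sketch of what "go through the IAM setup pain once" can look like in plain boto3 - the role name, trust principal, and bucket are hypothetical, and in practice this would usually live in Terraform/CloudFormation/CDK instead:

    import json
    import boto3

    iam = boto3.client("iam")
    role_name = "app-worker-role"  # hypothetical

    # Trust policy: let ECS tasks (e.g. Fargate) assume the role.
    assume_role = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(assume_role))

    # Inline least-privilege policy scoped to one bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::app-worker-bucket/*",
        }],
    }
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="app-worker-s3-access",
        PolicyDocument=json.dumps(policy),
    )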
Want to set up MFA ... login required to request device.
Yes, I know, they warned us far ahead of time. But not being able to request one of their MFA devices without a login is ... sucky.
> I understand your situation is a bit unique, where you are unable to log in to your AWS account without an MFA device, but you also can't order an MFA device without being able to log in. This is a scenario that is not directly covered in our standard operating procedures.
> The best course of action would be for you to contact AWS Support directly. They will be able to review your specific case and provide guidance on how to obtain an MFA device to regain access to your account. The support team may have alternative options or processes they can walk you through to resolve this issue.
> Please submit a support request, and one of our agents will be happy to assist you further. You can access the support request form here: https://console.aws.amazon.com/support/home
That last URL? You need to log in to use it ...
You know what's still stupid? That if you have an S3 bucket in the same region as your VPC, you still get billed on your NAT Gateway to send data out to the public internet and right back into the same datacenter. There is simply no reason that shouldn't be opt-out rather than opt-in (via a VPC endpoint), beyond AWS profiting off of people's lack of knowledge in this realm. The amount of people who would want the current opt-in behavior is... if not zero, infinitesimally small.
It's a design that is secure by default. If you have no NAT gateway and no VPC Gateway Endpoint for S3 (and no other means of Internet egress) then workloads cannot access S3. Networking should be closed by default, and it is. If the user sets up things they don't understand (like NAT gateways), that's on them. Managed NAT gateways are not the only option for Internet egress and users are responsible for the networks they build on top of AWS's primitives (and yes, it is indeed important to remember that they are primitives, this is an IaaS, not a PaaS).
Fine for when you have no NAT gateway and have a subnet with truly no egress allowed. But if you're adding a NAT gateway, it's crazy that you need to set up the gateway endpoint for S3/DDB separately. And even crazier that you have to pay for private links per AWS service endpoint.
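For anyone who hasn't done the opt-in: the gateway endpoint itself is a single API call (and free for S3/DynamoDB); the IDs below are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Free S3 gateway endpoint: same-region S3 traffic takes the VPC route
    # instead of going out through the NAT gateway.
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",               # placeholder
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],     # placeholder
    )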
Having experienced the joy of setting up VPC, subnets and PrivateLink endpoints the whole thing just seems absurd.
https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpo...
(Disclaimer: I work for AWS, opinions are my own.)
They spent the effort of branding private VPC endpoints "PrivateLink". Maybe it took some engineering effort on their part, but it should be the default out of the box, and an entirely unremarkable feature.
In fact, I think if you have private subnets, the only way to use S3 etc is Private Link (correct me if I'm wrong).
It's just baffling.
S3 can use either, and we recommend establishing VPC Gateway endpoints by default whenever you need S3 access.
(Disclaimer: I work for AWS, opinions are my own.)
VPC endpoints in general should be free and enabled by default. That you need to pay extra to reach AWS' own API endpoints from your VPC feels egregious.
They should be, of course, at least when the destination is an AWS service in the same region.
[edit: I'm speaking about interface endpoints, but S3 and DynamoDB can use gateway endpoints, which are free to the same region]
People who are price insensitive will not invest the time to fix it. People who are, probably shouldn't be on aws - but they usually have to for unrelated reasons, and they will work to reduce their bill.
> People who are price insensitive will not invest the time to fix it
This just sounds like a polite way of saying "we're taking peoples' money in exchange for nothing of value, and we can get away with it because they don't know any better".
Hideous.
If you had an ALB inside the VPC that routed the requests to something that lives inside the VPC, which called the AWS PutObject API on the bucket, would that still be the case?