Software Infrastructure 2.0: A Wishlist

Years and years ago I saw an advertisement by a SAN array vendor where their gimmick was that they supported very cheap read/write snapshots of very large datasets, in combination with a set of transforms such as data masking or anonymisation.

Their target market was same as the OP: developers that need rapid like-for-like clones of production. The SAN could create a full VM server pool with a logical clone of hundreds of terabytes of data in seconds, spin it up, and blow it away.

The idea was that instead of the typical "DEV/TST/PRD" environments, you'd have potentially dozens of numbered test environments. It was cheap enough to deploy a full cluster of something as complex as SAP or a multi-tier Oracle application for a CI/CD integration test!

Meanwhile, in the Azure cloud: Practically nothing has a "copy" option. Deploying basic resources such as SQL Virtual Machines or App Service environments can take hours. Snapshotting a VM takes a full copy. Etc...

It's like the public cloud is in a weird way ahead of the curve, yet living with 1990s limitations.

Speaking of limitations: One reason cheap cloning is "hard" is because of IPv4 addressing. There are few enough addresses (even in the 10.0.0.0/8 private range) that subnets have to be manually preallocated to avoid conflicts, addresses have to be "managed" and carefully "doled out". This makes it completely impossible to deploy many copies of complex multi-tier applications.

The public cloud vendors had their opportunity to use a flat IPv6 address space for everything. It would have made hundreds points of complexity simply vanish. The programming analogy is like going from 16-bit addressing with the complexity of near & far pointers to a flat 64-bit virtual address space. It's not just about more memory! The programming paradigms are different. Same thing with networking. IPv6 simply eliminates entire categories of networking configuration and complexity.

PS: Azure's network team decided to unnecessarily use NAT for IPv6, making it exactly (100%!) as complex as IPv4...

pgwhalen · 5 years ago

Your comment makes me realize how lucky I am at my job: we do this production environment replication via “SAN magic” and it is so insanely useful.

rurounijones · 5 years ago

How do you deal with replicating customer data / data the developer / tester should not have access to?

Do you copy that into the ephemeral test environment as well? How are permissions managd for accessing that data? Does it have the same restrictions as in prod? (i.e. even in this env the developer / tester has no access to systems / data they would not have in prod).

mcny · 5 years ago

> Meanwhile, in the Azure cloud: Practically nothing has a "copy" option. Deploying basic resources such as SQL Virtual Machines or App Service environments can take hours. Snapshotting a VM takes a full copy. Etc...

To add to the list of etc, I can't query across two databases in azure SQL.

jiggawatts · 5 years ago

You can't even copy an Azure SQL Database across subscriptions! This is something that comes up regularly at my day job: there's an Enterprise Agreement production subscription, and an Enterprise Dev/Test subscription for non-production that's much cheaper. This would be great, except for stupid limitations like this.

steverb · 5 years ago

I believe you can now query across DBs in Azure via a Linked Server. It still feels a little clunky to me though.

https://docs.microsoft.com/en-us/sql/relational-databases/li...

dilyevsky · 5 years ago

gcp doesn’t even give you ipv6 at all. Talk about limitations...

znpy · 5 years ago

> There are few enough addresses (even in the 10.0.0.0/8 private range) that subnets have to be manually preallocated to avoid conflicts

I hate to be that guy, but it's 2021 and we should start deploying stuff on ivp6 networks to avoid this kind of problems (beside other kind of problems)...

edit: ok I just saw the PS about ipv6 and Azure... sorry.

whoisburbansky · 5 years ago

Do you remember the name of the original SAN array provider?

znpy · 5 years ago

It could be NetApp, they had a huge marketing wave 2-3 years ago. In fairness, their appliances are very good, I've seen them in action both on-prem and in the cloud (on AWS). The pricing however... They're very expensive, particularly so in the cloud (since you're paying licensing on top of cloud resources).

olavgg · 5 years ago

I want to a copy of our Azure SQL database dumped to my local dev MS SQL instance very often. At the moment I do this by using SSMS, where I generate a script of the Azure SQL database and use sqlcmd to import that script. It is a really painful process, and sqlcmd sometimes just hang and becomes unresponsive.

Why is there no simple export/import functionality?

jiggawatts · 5 years ago

You can export Azure SQL Databases to a "BACPAC" file, which is much faster. This can then be restored onto on-prem SQL equally fast.

It's about 20-30 lines of PowerShell to do this, and then you can just double-click the script and drink a coffee.

Alternatively, for data-only changes you can use Azure SQL Sync, but I've found that it's primitive and a bit useless for anything complicated...

otabdeveloper4 · 5 years ago

> Azure's network team decided to unnecessarily use NAT for IPv6, making it exactly (100%!) as complex as IPv4...

People like the NAT. It's a feature, not a bug. If your selling point for IPv6 is "no more NAT" then no wonder it never went anywhere!

P.S. No, "you're doing it wrong" and "you're not allowed to like the NAT" are not valid responses to user needs.

dijit · 5 years ago

> "you're not allowed to like the NAT"

Well, the bigger question might be why do they like NAT?

If it's about having a single /128 address so they can do ACLs then that's easily fixed by just lowering the CIDR number. (unless you have an ancient version of fortigate on prem, which likely doesn't work with ipv6 anyway).

If it's about not having things poking at your servers through the NAT then the "NAT" really isn't helping anything, it's the stateful firewall doing _all_ the work there and those things are entirely independent systems. -- They're just sold to consumers as a single package.

funcDropShadow · 5 years ago

What do people like about NAT? I am guessing that the perceived increase in security. But perhaps there are more real or perceived advantages.

Virtualization eventually will be seen as the unnecessary layer added to make up for operating systems that lack capability based security.

It's going to take a decade to refactor things to remove that layer. Once done, you'll be able to safely run a process against a list of resources.

philipswood · 5 years ago

Agreed. Operating systems abstract resources, the fact that we needed containers and VMs point to the fact that the first set of implementations weren't optimal and needs to be refactored.

For VMs, security is the one concern, the others would be more direct access to lower levels and greater access to the hardware and internals than just a driver model.

For containers I'd say that the abstraction/interface presented to an application was too narrow: clearly network and filesystem abstractions needed to be included as well (not just memory, OS APIs and hardware abstractions).

I imagine that an OS from the far future could perform the functions of a hypervisor and a container engine and would allow richer abilities to add code to what we consider kernel space, one could write a program in Unikernel style, as well as have normal programs look more container-like.

nickjj · 5 years ago

> one could write a program in Unikernel style, as well as have normal programs look more container-like.

Projects similar to this exist today with https://nanos.org/.

You can run any typical web app (without having to re-write it) in a POSIX compliant Unikernel on most major cloud providers and bare metal. Each service runs in its own Unikernel.

I had its creator on my podcast the other day at: https://runninginproduction.com/podcast/79-nanovms-let-you-r...

otabdeveloper4 · 5 years ago

Most people use containers as a packaging format - instead of .tar.gz.

People run containers with host networking and bind-mounting everything and nobody minds.

(The story for hosting providers is a bit different, but they're not the main driver for Docker adoption.)

philipswood · 5 years ago

(I'm not holding my breath for the refactor though... We're stuck at a local maximum and we're better at adding stuff,rather than removing/refactoring them.

Also- just about EVERYTHING is built on top of these current layers)

nonameiguess · 5 years ago

It's not just security. There's incompatible dependencies. Virtualization allows you to run Windows and Linux on the same server, 40 different C libraries with potentially incompatible ABIs, across X time zones and Y locales, and they'll all work.

You can provide the same level of replication and separation of dependencies in an OS, but at a certain point you're just creating the same thing as a hypervisor and calling it an OS and the distinction becomes academic.

mikewarot · 5 years ago

I think that eventually every OS will converge on the same API, for now there's APE - Actually Portable Executables from the Cosmopolitan C library to get one code base working everywhere without a ton of ifdefs.

w7 · 5 years ago

We've had capability based security frameworks aka MAC (ex: AppArmor) in Linux since 1999 or earlier. Containers (which also existed long before docker) have been popularized for convenience, and virtualization would still be useful for running required systems that are not similar to the host. If anything it looks like we're going towards a convergence with "microvms".

mikewarot · 5 years ago

Capabilities are like taking a $5 bill out of your wallet, handing it to your child, and letting them buy ice cream with it.

You delegate $5, nothing more than $5 can possibly leave your wallet as a result.

AppArmor is like putting a vault around the ice cream truck and giving a strict list of who is allowed to buy what ice cream. Hardly the same thing.

ryukafalz · 5 years ago

Ehh, isn't MAC nearly the opposite of capability based security though? At the core of capability based security is that you don't separate authority to access a resource from your designation of that resource. MAC though seems to go all-in on separate policies to control what gets access to what.

pjmlp · 5 years ago

A convinience that only exists when the target hardware and underlying kernel are compatible with the development environmnet, when that isn't the case, oopla, a VM layer in the middle to fake the devenv, or in the devenv to fake the serverenv.

js8 · 5 years ago

It already exists in the IBM mainframe. But nobody wants to write apps for it..

ironicsonic · 5 years ago

Who would want to use those weird, outdated languages and technologies, though?

It's better to use a modern DevOps stack involving Kubernetes, Ansible, YAML, Jinja2, Python, cron jobs, regexes, shell scripts, JSON, Jenkinsfiles, Dockerfiles, etc.

marcosdumay · 5 years ago

Yep. At this time it wouldn't be great to just imitate the best mainframes compartmentalization, because we have much better experimental systems. But they are still much better than anything common on commodity hardware.

selfhoster11 · 5 years ago

Likely because IBM mainframes are outside of the reach of hobbyists and SMEs.

nine_k · 5 years ago

Virtualization, no. A hypervisor running a Windows kernel and a Linux kernel side by side is not about capability-based security. You can even see it like cap-based security approach: VMs only see what the hypervisor gave them, and have no way to refer to anything that the hypervisor did not pass to them.

Containers, yes. They are a pure namespacing trick, and can be replaces by cap-based security completely.

kazen44 · 5 years ago

Also, VM's make multitenancy really easy.

It allows you to divide up physical computing power across multiple people/organizations etc.

Containers make this kind of distinction far more hazy.

marcosdumay · 5 years ago

Emulation and API compatibility layers can be done in more ways than an hypervisor-based VM. And many of those other ways perform better.

sneak · 5 years ago

I am fairly certain that virtualization is going to be a core tenet of computing up to and including the point where the atoms of the gas giants and Oort cloud are fused into computronium for the growing Dyson solar swarm to run ever more instances of people and the adobe flash games they love.

tifadg1 · 5 years ago

I highly doubt that. If so inclined you can do that today with SEL as the granularity and capability is there, but how long would tuning it in would take? What about updates - will it still work as expected? What about edge cases? What about devs requiring more privilege than is needed?

encryptluks2 · 5 years ago

Agreed 100%, but you still are going to get the old-timers who insist that virtualization and containers just make things more difficult and that their old school approach is the best.

emteycz · 5 years ago

Good, that's a very universal interview question that will get me a practically direct answer to the actual question. It's not easy today... I'm asking about JS/HTML/CSS if the candidate isn't web-oriented, but what do I ask the JS guys? Perhaps a view layer bait?

chii · 5 years ago

going by how the web is doing today, the refactor you refer to will never be done.

bob1029 · 5 years ago

I love the test-in-production illustration. This is a fairly accurate map of my journey as well. I vividly remember scolding our customers about how they needed a staging environment that was a perfect copy of production so we could guarantee a push to prod would be perfect every time.

We still have that customer but we have learned a very valuable & expensive lesson.

Being able to test in production is one of the most wonderful things you can ever hope to achieve in the infra arena. No ambiguity about what will happen at 2am while everyone is asleep. Someone physically picked up a device and tested it for real and said "yep its working fine". This is much more confidence inspiring for us.

I used to go into technical calls with customers hoping to assuage their fears with how meticulous we would be with a test=>staging=>production release cycle. Now, I go in and literally just drop the "We prefer to test in production" bomb on them within the first 3 sentences of that conversation. You would probably not be surprised to learn that some of our more "experienced" customers enthusiastically agree with this preference right away.

You can test in production by moving fast and breaking things (the clueless guy).

You can test in production by having canaries, filters, etc, and allowing some production traffic to the version under test. This is the "wired" guy.

For many backend things, you can test in production by shadowing the current services, and copying the traffic into your services under test. It has limitations, but can produce zero disruption.

erhk · 5 years ago

Until you discover that prod has an edgecase string that has the blood mother of cthulu's last name written in swahili-hebrew, (you know it as swabrew, or SHE for short) which due to a lack of goat blood on migration day is one of the ~3% of edge cases that aren't replicated and now youve got an entirly new underground civilization born that is expecting you to serve them traffic with 5 9's and you can only do 4 because of the flakiness.

Deleted Comment

erikerikson · 5 years ago

IMO this still keeps production a previous thing. Next comes multi-version concurrent environments on the same replicated data streams. There is only production and delivery to customers is access version 3 instead of version 2. This can be proceeded by an automated comparison of v2's outputs to v3's and a review (or code description of the expected differences).

edoceo · 5 years ago

you still have unit tests and similar right? and maybe test in a staging env lightly first? or, right to Live?

Yes we have a 2 stage process with the customer. There is still a test & production environment, but test talks to all production business systems too. Only a few people are allowed into this environment. We test with 1-5 users for a few days then decide to take it to the broader user base.

zmmmmm · 5 years ago

This post mingles different interpretations of "serverless" - the current fad of "lambda" style functions etc and premium / full stack managed application offerings like Heroku. Although they are both "fully managed" they have a lot of differences.

Both though assume the "hard" problem of infrastructure is the one-time setup cost, and that your architecture design should be centered around minimising that cost - even though it's mostly a one-time cost. I really feel like this is a false economy - we should optimise for long term cost, not short term. It may be very seductive that you can deploy your code within 10 minutes without knowing anything, but that doesn't make it the right long term decision. In particular when it goes all the way to architechting large parts of your app out of lambda functions and the like we are making a huge archtitectural sacrifice which is something that should absolutely not be done purely on the base of making some short term pain go away.

deepzn · 5 years ago

what would you say is the architectural sacrifice of pushing things out to the edge, or using lambda functions per say in your opinion?

There are issues about resolving state at the edge, though Cloudflare has solved some of that with durable objects, and Fastly is working on something too. Think there are memory constraints as of now however. Security too is better in these platforms at the edge - true zero trust models.

It's similar to the issues with microservices. You've broken something that could be "simple" single monolithic application into many pieces and now there is huge complexity in how they are all joined together, fragemented state, latency, transactional integrity, data models, error handling, etc etc. Then your developers can't debug or test them properly, what used to be a simple breakpoint in the debugger and stepping into a function call and directly observing state requires a huge complexity to trace what is happening.

All these things got much harder than they used to be, and you're living with that forever - to pay for a one-time saving on setting up some self-updating, auto-scaling infrastructure.

dx034 · 5 years ago

Not the person you asked but for me, the discussion of Edge computing often misses if there are actually enough benefits. Slow response times are nearly always because the application is slow, not because the server is 100ms away instead of 5ms. Saving 100ms (or less for most probably) by deploying logic at the edge introduces much more complexity than shaving that time off through optimization would. Take this page as an example. I get response times of ~270ms (Europe to US-West I presume) and it still loads instantly. On the other hand I have enough systems at my office where I'm <1ms away from the application server and still wait forever for stuff to load.

I don't say that you don't need edge computing. But it'll always be more complex to run compared to a centralized system (plus vendor lock-in) and performance isn't a valid reason.

Serverless is more than functions. Consider S3, LogicApps, or DataFlow.

Serverless is about starting with highly available, auto scaling, et cetera... assets and concentrating more on delivering business value.

ngngngng · 5 years ago

> but it takes 45 steps in the console and 12 of them are highly confusing if you never did it before.

I am constantly amazed that software engineering is as difficult as it is in ways like this. Half the time I just figure I'm an idiot because no one else is struggling with the same things but I probably just don't notice when they are.

intergalplan · 5 years ago

1) It's just as shitty for everyone except the person who's selling themselves as a "consultant" for X. And actually—they're lying, it's shitty for them too.

2) A bunch of other goddamn morons managed to use this more-broken-than-not thing to make something useful, so this goddamn moron (i.e. me) surely can.

Both almost always true. After the second or third time you realize something made by the "geniuses" at FAANG (let alone "lesser" companies) is totally broken, was plainly half-assed at implementation and in fact is worse than what you might have done (yes, even considering management/time constraints), and has incorrect documentation, you start to catch on to these two truths. It's all shit, and most people are bad at it. Let that be a source of comfort.

[EDIT] actually, broader point: the degree to which "fucking around with shitty tools" is where one spends one's time in this job is probably understated. This bull-crap is why some people "nope" the hell out and specialize in a single language/stack, so at least they only have one well-defined set of terrible, broken crap to worry about. Then people on HN tell them they're not real developers because they aren't super-eager to learn [other horribly-broken pile of crap with marginal benefits over what they already know]. (nb. I'm in the generalist and willing-to-learn-new-garbage category, so this judgement isn't self-serving on my part)

esjeon · 5 years ago

I'm slightly afraid of all the negativity here, but still kinda agree with the sentiment.

However, I still want to note that there are many cases where smart people simply don't have enough time to handle all the stupidity in their products. Often times, it's just little issues like communication cost and politics. But, also often, one should care about the revenue of one's own company or clients', which slows down changes a lot. Even a simple feature can take weeks and months to roll out.

In short, even without stupid people, life sucks. :\

many things are confusing if you've never done it before.

my first HAproxy setup wasn't great, my 10th is rock solid, just needed a bit of context-specific learning.

same with PG replication (locical and archive) but 3rd go was awesome.

its a flawed expectation to get it right the first time.

mLuby · 5 years ago

> its a flawed expectation to get it right the first time.

If your toilet overflows when you push the lever down and flushes when you pull the lever up, you'll learn pretty quickly how to use it. And other users will too, after they screw up or after you carefully explain so they're sure to understand. But it's just a shitty design! The fact that you've learned to deal with it doesn't excuse it or mean it shouldn't be fixed for all future users.

I want to get it right the first time; I don't expect to only because I've been hurt too many times by shoddy UIs. A good UI, whether a GUI, API, CLI, or other TLA, guides users through their workflow as effortlessly as possible by means of sensible defaults, law of least surprise, and progressive reveal of power features.

Relevant http://xkcd.com/1168

jcelerier · 5 years ago

I'm so happy to just develop desktop apps in C++ / Qt. I commit, I tag a release, this generates mac / windows / linux binaries, users download them, the end.

m_st · 5 years ago

I'm very happy in the same camp using C# and SQL (though Windows only).

Thank you for making desktop apps.

abraxas · 5 years ago

I want to live in that world again. It ended for me about 19 years ago when I left my last "desktop software" job. How do you manage to avoid "web stacks" in this day and age?

Most people still use desktop software for large projects ? Adobe stuff, music software, basically any advanced authoring software - the performance of the web platform is just not there for projects measured in dozens of gigabytes of media, hundreds of AVX2-optimized plug-ins, etc (and it's a field with a large tradition of benchmarking software between each other to find the most efficient, see e.g. dawbench)

I write backend code, and it's basically Java + Spring. No web technologies beyond HTTP in sight.

drinchev · 5 years ago

There is hardly an app that I use these days that doesn't require internet connection. I guess once your app needs it, you will have the same problems.

> There is hardly an app that I use these days that doesn't require internet connection.

well, most of the ones I use (my note taking app, my text editor, my music player, my file manager) don't so YMMV

XorNot · 5 years ago

Some of this I agree with, even if the overall is to me asking for "magic" (I wouldn't mind trying to get there though).

For me the one I really want is "serverless" SQL databases. On every cloud platform at the moment, whatever cloud SQL thing they're offering is really obviously MySQL or Postgres on some VMs under the hood. The provisioning time is the same, the way it works is the same, the outage windows are the same.

But why is any of that still the case?

We've had the concept of shared hosting of SQL DBs for years, which is definitely more efficient resource wise, why have we failed to sufficiently hide the abstraction behind appropriate resource scheduling?

Basically, why isn't there just a Postgres-protocol socket I connect to as a user that will make tables on demand for me? No upfront sizing or provisioning, just bill me for what I'm using at some rate which covers your costs and colocate/shared host/whatever it onto the type of server hardware which can respond to that demand.

This feels like a problem Google in particular should be able to address somehow.

dragonwriter · 5 years ago

> Basically, why isn't there just a Postgres-protocol socket I connect to as a user that will make tables on demand for me? No upfront sizing or provisioning, just bill me for what I'm using at some rate which covers your costs and colocate/shared host/whatever it onto the type of server hardware which can respond to that demand.

How isn’t RDS Aurora Serverless from AWS, available in both MySQL-and Postgres-compatible flavors, exactly what you are looking for?

rswail · 5 years ago

Yes and no. Aurora Serverless isn't really "serverless", it's PG or MySQL binary compatible wire protocol with AWS custom storage engine underneath.

Provisioning of the compute part of the database is still at the level of "ACU" which essentially map to the equivalent underlying EC2s. The scale up/down of serverless V1 is clunky and when we tested, there was visible pauses in handling transactions in progress when a scaling event occurred.

There is a "V2" of serverless in beta that is much closer to "seamless" scaling, I assume using things like Nitro and Firecracker under the covers to provision compute in a much more granular way.

Huh I missed that particular annoucement I guess?

I'll look into it next time I'm building something.

paxys · 5 years ago

All cloud providers offer exactly what you are describing, just not for SQL DBs.

In my experience shared servers for MySQL (and I assume Postgres) never work because I always have to set at least some custom configurations very early into the development cycle.

stadium · 5 years ago

There is AWS aurora serverless https://aws.amazon.com/rds/aurora/serverless/

klintcho · 5 years ago

I know this perspective is not relevant for most companies: But for side-projects and a-like the cold-start time for Aurora is like 15 - 30 seconds, and the first query always times out. Having it always on (if just 1 "compute") will cost you 30 USD a month. I'm hoping for Aurora to eventually be closer to DynamoDB pricing and startup (I'm fine with 5 - 10 second cold start as long as it doesn't time out the first request everytime.)

AnimalMuppet · 5 years ago

So, XorNot, do you want to run a YC startup? You've got an idea to run with.

But, if you're going to run with it, can you make it as fast as a "real" database on a cloud instance?

tdeck · 5 years ago

Google Cloud offers Spanner, which is a proprietary distributed database that uses SQL. Does that fit with the kind of thing you're thinking of?

nelsonenzo · 5 years ago

They did, it's called Big Table by Google. Closed behind thier cloud offering few use.

knz42 · 5 years ago

Have you tried hosted CockroachDB, aka CockroachCloud?

anurag · 5 years ago

Testing in 'production' has been possible for a while: Heroku has Review Apps, and Render (where I work) has Preview Environments [1] that spin up a high-fidelity isolated copy of your production stack on every PR and tear it down when you're done.

[1] https://render.com/docs/preview-environments

nrmitchi · 5 years ago

There are a variety of other, similar Preview Environment offerings out there, and most PaaS offer their own options built in. If you're using a PaaS with a built-in solution, you'll almost always be better off using theirs. A lot of the difficulty comes when you've "outgrown" a PaaS for one-reason-or-anyother, and need to try to back-fill this functionality in a production-like fashion.

If you're intereseted, I wrote a thing attempting to cover some of the options ~6 months ago, but recognize that there has been a fair amount of change in the space since then (I really should update the post). [thing: https://www.nrmitchi.com/2020/10/state-of-continuous-product...]

Great article and certainly worth an update! Specifically for Render, you can now:

1. Specify a TTL on inactive PRs

2. Pick less expensive plans for preview services.

sebmellen · 5 years ago

Hey, a question about Render — do you run your own base infra/colocate or do you use a VPS/bare metal setup or do you sit on top of a cloud? I love the service but I’m wondering if you’ll expand outside of the two available regions anytime soon. I understand this might be hard based on your infra setup.

> I love the service

Thanks!

We will absolutely expand beyond Oregon and Frankfurt later this year. Our infrastructure was built to be cloud agnostic: we run on AWS, GCP, and Equinix Metal.

rutchkiwi · 5 years ago

This is cool - does it also copy over the data and schema of the production db? For instance on an eCommerce site, will I get my production product data to test on in stage?