andsens · a year ago
I never understood the appeal of service meshes. Half of their reason to exist is covered by vanilla Kubernetes; the rest is inter-node VPN (e.g. WireGuard) and tracing (Cilium Hubble). Unless I'm missing something, encrypting intra-node traffic is pretty silly.

K8S has service routing rules, network policies, access policies, and can be extended up the wazoo with whatever CNI you choose.
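For the netpol piece specifically, a plain NetworkPolicy already covers the "only these pods may talk to those pods" use case. Rough sketch (app names and port are made up):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: api            # policy applies to the "api" pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only "frontend" pods may connect
      ports:
        - protocol: TCP
          port: 8080
```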

It's similar to Helm: Helm puts a DSL (values.yaml) on top of a DSL (Go templates) on top of a DSL (k8s YAML), except here it's routing, authentication, and encryption on top of... well, routing (Service routing rules), authentication (netpols), and encryption.

It boggles the mind!

perrygeo · a year ago
I've worked on several k8s clusters professionally but only a few that used a service mesh, Istio mainly. I'll give you the promise first, then the reality.

The promise is that all of the inter-app communication channels are fully instrumented for you. Four things, mainly: 1) mTLS between the pods; 2) network resilience machinery (rate limiting, timeouts, retries, circuit breakers); 3) fine-grained traffic routing/splitting/shifting; and 4) telemetry with a huge ecosystem of integrated visualization apps.
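To make 2) and 3) concrete, the Istio version of retries, circuit breaking, and traffic splitting looks roughly like this (simplified sketch; the `checkout` service and version labels are made up):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout
  http:
    - retries:
        attempts: 3
        perTryTimeout: 2s
      route:
        - destination: { host: checkout, subset: v1 }
          weight: 90            # 90/10 traffic split between versions
        - destination: { host: checkout, subset: v2 }
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  trafficPolicy:
    outlierDetection:           # circuit breaking: eject misbehaving endpoints
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
  subsets:
    - name: v1
      labels: { version: v1 }
    - name: v2
      labels: { version: v2 }
```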

Arguably, in any reasonably large application, you're going to need all of these eventually. The core idea behind the service mesh is that you don't need to implement any of this yourself. And you certainly don't want to duplicate all of this in each of your dozens of microservices! The service mesh can do all the non-differentiated work. Your services can focus on their core competency. Nice story, right?

In reality, it's a little different. Istio is a resource hog (I've evaluated Linkerd, which is slightly less heavyweight, but still). Rule of thumb: for every node with 8 CPUs, expect your service mesh to consume at least one CPU. If you're using smaller nodes on smaller clusters, the overhead is absurd. After setting up your k8s cluster + service mesh, you might not have room left for your app.

Second, as you mention, k8s has evolved. And much of this can be done, or even done better, in k8s directly. Or by using a thinner proxy layer to only do a handful of service-mesh-like tasks.
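For instance, weighted traffic splitting is now expressible with the Gateway API alone, no mesh required. Rough sketch (gateway and service names are invented):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout-split
spec:
  parentRefs:
    - name: main-gateway        # an existing Gateway resource
  rules:
    - backendRefs:
        - name: checkout-v1     # plain Services as weighted backends
          port: 8080
          weight: 90
        - name: checkout-v2
          port: 8080
          weight: 10
```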

Third, do you really need all that? Like I said, eventually you probably do if you get huge. But a service mesh seems like buying a gigantic family bus just in case you happen to have a few dozen kids.

kuhsaft · a year ago
One major usage of service meshes that I've come across is as a transparent L7 load balancer. gRPC, which is now very common, uses long-running connections to multiplex calls. This breaks L4 load balancing, because all gRPC calls multiplexed over a single connection land on the same endpoint and are never redistributed to new ones. There is the option of DNS-based balancing, but DNS caching can interfere. So the L7 service mesh proxy is used to load balance individual gRPC calls without modifying the services.

https://learn.microsoft.com/en-us/aspnet/core/grpc/loadbalan...
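The DNS-based alternative mentioned above is usually a headless Service, so the gRPC client resolves individual pod IPs and can balance across them itself. Minimal sketch (names and port are made up):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-grpc
spec:
  clusterIP: None        # headless: DNS returns the pod IPs, not a single VIP
  selector:
    app: orders
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
```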

fragmede · a year ago
Look, back in the day, things weren't encrypted, so you could listen in on your neighbor's phone calls, read their email, hack their bank accounts. Wireshark and etherdump and, the most fun of all, driftnet. So, since then, everything has to be encrypted, lest someone hack their way to the family jewels. Never mind that the number of breaks needed to get there means there are usually bigger fish to fry. The important thing is to sprinkle magic encryption dust on everything, because then we know it's Very Secure. (That's not to deride the fact that encryption is important, because it is, but sometimes it goes a bit far when there are other gaping holes that should be patched first.)
anonzzzies · a year ago
Usually, unless someone is really doing naive things, you need access to a lot of almost-physical things to sniff traffic. You almost need physical access to the room where either the server or the client is, even with unencrypted traffic. People say "but they can sniff it at Level 3"; they sure can, IF they have actual access to Level 3 beyond just using them for normal traffic: a hacked switch or router or so. State actors probably can and do pull that off, but outside that, it's really not so easy to get at the unencrypted traffic of a random target. You should still encrypt things when you can, of course, but you don't have to get quite that paranoid about it.

All major hacks are 0-days (well, an un-updated WordPress is not necessarily a 0-day; a lot of 0-days are exploited months or years later), stolen credentials (usually social engineering), brute-forced passwords, or applications left wide open (root/root for MySQL with 3306 open to the world). None of those have anything to do with (un)encrypted traffic.

bushbaba · a year ago
Service meshes are necessary for a small portion of Fortune 500s that have 1000s of microservices. Sure, you could use load balancers, but at that scale it becomes cost-efficient to move towards client-side load balancing.

If you aren't a Google, Apple, Microsoft, etc. scale company, then a service mesh might be a tad overkill.

xyzzy_plugh · a year ago
You're close, but it's really when you have thousands of microservices using either shitty languages or shitty client RPC libraries where you can't easily perform client-side load balancing.

There are plenty of languages and RPC frameworks where you can solve this without resorting to a service mesh.

Practically, and to your point, service meshes solve an organizational problem, not a technical one.

marcosdumay · a year ago
I don't get this either. Doesn't the mesh become a scalability bottleneck, just like load balancers?

On that scale I'd expect people to use client-selected replicated services (like SMTP), and never something that centralizes connections (no matter how close to the workload it sits).

You can always add observability at the endpoints. Unless your infrastructure is very unusual (like some random part of it costing millions of times more for no good reason, as on the cloud), this is not a big challenge; you add it to the frameworks your people use. (I mean, people don't go and design an entire service with whatever random components they pick, or do they?)

packetlost · a year ago
So, like a DNS SRV record with multiple entries. Or Anycast, if you're being fancy.
remram · a year ago
Isn't kube-proxy already a client-side load-balancer?
champtar · a year ago
I agree that intra-node encryption, if implemented by sidecars, is just wasting CPU cycles.

Small note: unless it has changed recently, containerd's default capability list includes CAP_NET_RAW, so hostNetwork=true pods can sniff all of the node's traffic.
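If that matters to you, dropping the capability per container is cheap. Minimal sketch (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: host-net-app          # hypothetical pod
spec:
  hostNetwork: true
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      securityContext:
        capabilities:
          drop:
            - NET_RAW         # no raw sockets, no packet sniffing
```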

pylua · a year ago
I like that Istio does mTLS. It also helps with monitoring the requests.
neya · a year ago
I actually never understood the appeal of Kubernetes in the first place. I have production apps running on bare bones VMs serving millions of customers. Is this sort of complexity really necessary? At this point I would just consider serverless options. Sure, they would be a little more expensive, but that's a huge savings if we account for engineering teams' time.
growse · a year ago
Counter-take: I never understood the appeal of virtualizing the hardware. Is that complexity really necessary?

Of course there are tradeoffs, but it takes a specific perspective to say that Kubernetes is any more complex than virtualizing the hardware and scheduling multiple VMs across real machines.

osigurdson · a year ago
If you didn't have Helm, you would be writing your own regex scripts. I don't see how that would be better.
numbsafari · a year ago
If you didn't have Helm, you'd be using one of the other, much better tools, and be happier for it.
pdimitar · a year ago
Really? What's wrong with https://github.com/mikefarah/yq? Works quite fine.
romantomjak · a year ago
> Infrastructure teams can usually implement a feature faster than every app team in a company, so this tends to get solved by them.

Well, that's comparing apples to oranges. Product teams have completely different goals (adoption, retention, engagement), so internal cluster encryption is naturally so far out of their scope that only the platform team can reasonably implement it. I don't see how that statement is relevant. You don't send an electrician to build a brick wall.

flumpcakes · a year ago
Application security should be everyone's responsibility. Architects, developers, and operations.

Too many times have I seen architects and developers completely ignore it to make their own jobs easier, leaving it to operations/infrastructure to implement. It's easy to twist the arm of business people with an "I can't ship feature X if you want me to look at security Y".

If everyone took this seriously perhaps we would have fewer issues.

romantomjak · a year ago
I agree. I was just making the point that different teams have different priorities and thus different scope. Saying "PodA can only talk to PodB over mTLS" is very different from "users need to log in using OAuth". Who is going to build the product if the product team is working on the service mesh?
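For reference, the "PodA can only talk to PodB" half of that, expressed in Istio terms, is roughly an AuthorizationPolicy keyed on workload identity (namespace, labels, and service account here are invented):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: pod-b-allow-pod-a
  namespace: prod
spec:
  selector:
    matchLabels:
      app: pod-b              # applies to PodB's workload
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/prod/sa/pod-a   # PodA's mTLS identity
```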
gerad · a year ago
I read the GP as: it's easier to have the single infrastructure team implement it than to have every single product team add support in their service.

I mean, most app servers abstract away HTTPS at the server level, and most dev is done unencrypted. So this seems reasonable.

pm90 · a year ago
> Istio has become infamous among k8s infrastructure staff as being the cause of more problems than any other part of the stack

True, in my opinion. It's very complicated and the docs are confusing af.

jq-r · a year ago
I attended a talk at KubeCon last year on how one company adopted the Istio service mesh. I lost the guy in the first 10 minutes of the talk, it was so complicated, and decided that a service mesh is 100% not going into our k8s clusters.

Recently an overly confident security engineer came to us and demanded that we get a service mesh because that's a SOX requirement. I have no idea where these people get pipe dreams like that.

AlecBG · a year ago
I think the Istio docs are great! I do agree that it's complicated, and I think their API is more confusing than it needs to be. The ontology of DestinationRules, VirtualServices, ServiceEntries, Gateways (as in the K8s resource), and gateways (as in the istio gateway Helm chart) is not the best.
throwitaway222 · a year ago
At my last company, the devops guy spent 2 years trying to install Istio before he gave up. K8s by itself was just fine.
rexarex · a year ago
The docs are TERRIBLE once you need to actually use them in prod.
qazxcvbnm · a year ago
Not having worked with K8s, it seems to me that a number of things service meshes are capable of can be done by an SDN (e.g. Tailscale, ZeroTier). As far as I'm aware, an SDN can do encryption and service discovery (via things like custom DNS) just fine. Can someone explain the differences and tradeoffs involved?
ibotty · a year ago
Cilium is a service mesh by virtue of being an SDN. That's hinted at in the article.
atombender · a year ago
That's been my thinking for a while, too. I work extensively on Kubernetes-hosted apps, but our org has (wisely, probably) eschewed service meshes in favour of ingress-based solutions. However, the simplicity of those solutions makes things creaky and error-prone.

Rather than injecting sidecar containers that set up networking and so on, having pods join an existing SDN that just works with no app-side config would be a much more elegant solution.

Other than Cilium, I'm not aware of an SDN like ZeroTier or WireGuard that works seamlessly with Kubernetes this way (and which also works on managed Kubernetes like GKE and EKS).
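Cilium's transparent encryption is pretty close to that ideal, for what it's worth: if I have the chart values right, it's roughly a couple of Helm values, with no per-app or sidecar configuration:

```yaml
# Cilium Helm values (sketch; verify against the chart docs for your version)
encryption:
  enabled: true
  type: wireguard   # node-to-node WireGuard, transparent to workloads
```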

whatthesmack · a year ago
As great as reducing complexity is, I just don't see how it's possible to avoid implementing a service mesh in a FedRAMP moderate or high impact level environment. You essentially need to implement mTLS to meet SC-8(1), and to implement mTLS at scale, you need something like a service mesh.
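Concretely, the control SC-8(1) pushes you towards is roughly mesh-wide strict mTLS, which in Istio is a one-resource policy (assuming istio-system is the root namespace):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace, so this applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plaintext pod-to-pod traffic
```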

Are there other ways of going about this for FedRAMP moderate or high IL?

remram · a year ago
> to implement mTLS at scale, you need something like a service mesh

What makes you think that?

parhamn · a year ago
I think people often don't realize that, depending on the language runtime, microservices can easily be a must.

Most service boundaries at organizations come from "I need a different version of a pinned package we can't upgrade." This is common in languages that support only one version of a given package at a time, and it's worse if there isn't a culture of preserving function APIs. E.g. any Python company with pandas/numpy in the stack will need to split the environments at some point in the future, no ifs, ands, or buts!

pclmulqdq · a year ago
I have heard that the reason Docker (and containers in general) took off was that they solved the problem of Python's awful package management. I didn't believe it until I saw people put Python in production and have to deal with this. At this point, I would rather have a physical snake in my server racks than any Python code.
noitpmeder · a year ago
Do we live in different worlds? Virtual environments solve 99% of all python packaging and installation use cases.
vbezhenar · a year ago
I have a Python service which does some AI job, but it just doesn't build anymore. I have a "golden image" which I carefully back up, because losing it would be a catastrophe.
marcosdumay · a year ago
Python has a real problem with version incompatibilities of the interpreter, and a few packages require C libraries of specific versions. But outside of that, virtualenvs solve all of the "how do I run these two programs together" issues.

After the Py2 vs. Py3 thing settled down, almost all of the operational issues went away.

That said, Python has a really bad situation with dependency upgrades on the development side. But Docker won't help you there anyway. Personally, at this point I just assume any old Python program won't work anymore.

nurettin · a year ago
I have used python in production for years, multiple servers, multiple racks, and deployment has always been as simple as

./deploy.sh pull sync migrate seed restart

pull calls git pull, sync runs pipenv sync, migrate runs Django's migrate, seed runs a Django management command, and restart calls systemctl --user restart

jmspring · a year ago
"Microservices can easily be a must"? Please explain.

Your example talks of packaging issues.

LegibleCrimson · a year ago
Microservices (or really services in general) solve some of these packaging issues. If my application depends on package A, which pulls in dependency C version 1.x, and also on package B, which pulls in dependency C version 2.x, this just doesn't work in Python and many other languages. The only way to make it work is either to rectify the dependencies so all the versions match (by pinning one of them to an outdated release), or to split the application into one service that pulls in package A and another that pulls in package B, and have them talk over some IPC.
eptcyka · a year ago
Did you read the same article I did? How is this relevant?
parhamn · a year ago
I meant to reply to "good lord is this what modern microservices are like?".
jmspring · a year ago
The article is about service meshes and the tradeoffs amongst them. Going back several years, teams at companies would ask about feature X, mTLS being a big one. The discussion would turn to "should we use a service mesh?", and often the answer was no.

K8s is a great platform with many options, but many decision makers have little knowledge of (or don't research) the implications of their choices.

xyst · a year ago
> decision makers have little knowledge (or don’t research) the implication of their choices.

I hate how true this statement is within the industry. Too many of these C-level executives base decisions off whatever CEO summit they recently attended.

“Every app to use microservices!111!”

“Hybrid cloud. We are doing it”

“Serverless, let’s start using this”

“We are fully going to the cloud!11!!”

Then when the results come in, the complaints start rolling in:

1) wHy iS aPp SlOwEr (after MSA)

2) gUys, iNfRaStruCtUrE cost is SoArInG (shifting to “cloud”)

3) the ApP is ToO cOmPlEx (after MSA, and “serverless”)

Some of these aging dinosaurs need to be put out to pasture.

neilv · a year ago
> Too many of these C-level executives base decisions off whatever CEO summit they recently attended. [...] Some of these aging dinosaurs need to be put out to pasture

AFAIK, the virus of C-suite IT bad ideas doesn't discriminate on the basis of age.

mianos · a year ago
It is worse than that. Many decision makers are making their decisions based on advice from people who fired up k8s and all its gadgets for their pet project or google.
billfor · a year ago
I read the article as being about service meshes now being a cost item whereas they used to be free or low-cost. I'm not sure debating the technical merits speaks to that.
jmspring · a year ago
There were technical components mentioned in the article. Yes, cost comparison was the main thrust.

gnarbarian · a year ago
this is a personal attack
jmspring · a year ago
How so?