bigmutant (u/bigmutant)

bigmutant commented on Oracle made a $300B bet on OpenAI. It's paying the price finance.yahoo.com/news/or... · Posted by u/pera

jeffbee · 4 days ago

The idea that Java has been destroyed is pretty wild. I don't see how that belief could survive contact with the real world.

bigmutant · 4 days ago

Pretty common attitude from folks who have never worked in one of the BigTech companies where Java rules (Amazon being a prime example). Since they never encounter Java in the "SF-style Startup" world, they assume that it must be dead. Meanwhile, hundreds of thousands of Engineers deal with hundreds of millions (billions?) of lines of Java every day.

bigmutant commented on Ask HN: Why isn't Amazon.com impacted by AWS outages? · Posted by u/trevoragilbert

binsquare · 2 months ago

Same here, couldn't add things to cart or see the prices for a lot of things.

bigmutant · 2 months ago

DynamoDB is used *everywhere* in AMZN Retail, this is absolutely not surprising. Plus the vast majority of internal Services are using EC2 in the form of Apollo/ECS. So OP probably hit some parts of the site that are hosted in us-west-2. For all I know they started routing all requests for us-east-1 traffic to other DCs, figuring latency is a fine trade-off for availability

bigmutant commented on Ask HN: Why isn't Amazon.com impacted by AWS outages? · Posted by u/trevoragilbert

JustExAWS · 2 months ago

Source: Former AWS employee. For the most part Amazon Retail doesn’t run on AWS infrastructure and doesn’t use AWS services. I’m simplifying a little bit. But Amazon (the company) runs two sets of infrastructure “AWS” and “CDO” (or COE I don’t remember).

It’s an old wives' tale that AWS came out of “excess capacity” from Amazon Retail.

bigmutant · 2 months ago

To clarify, most of CDO (Consumer Devices Other) does run on AWS in the sense that NAWS is the target state, MAWS is legacy and actively (slowly) being migrated off of. CDO (including Alexa) has been using DynamoDB/Lambda/Kinesis/SQS etc forever, it's just the compute and kind-of network layers that are still MAWS. Even then, a large part of CDO has moved from Apollo to ECS/Fargate/whatever unholy Hex or DataPath thing they're pushing these days

Source: Ex-AMZN

bigmutant commented on Distributed systems programming has stalled shadaj.me/writing/distrib... · Posted by u/shadaj

EtCepeyd · 10 months ago

I was studying for my MSc in CS some 25 years ago. Our curriculum included both automata/formal languages (multiple courses over multiple semesters) and parallel programming.

The latter course (a) was built on a mathematical formalism that had been developed at the university proper and not used anywhere else, (b) used PVM: <https://www.netlib.org/pvm3/>, <https://en.wikipedia.org/wiki/Parallel_Virtual_Machine>, for labs.

Since then, I've repeatedly felt that I've seriously benefited from my formal languages courses, while the same couldn't be said about my parallel programming studies. PVM is dead technology (I think it must have counted as "nearly dead" right when we were using it). And the only aspect I recall about the formal parallel stuff is that it resembles nothing that I've read or seen about distributed and/or concurrent programming ever since.

A funny old memory regarding PVM. (This was a time when we used landlines with 56 kbit/s modems and pppd to dial in to university servers.) I bought a cheap second computer just so I could actually "distribute" PVM over a "cluster". For connecting both machines, I used linux's PLIP implementation. I didn't have money for two ethernet cards. IIRC, PLIP allowed for 40 kbyte/s transfers! <https://en.wikipedia.org/wiki/Parallel_Line_Internet_Protoco...>

bigmutant · 10 months ago

Sure, I did the same, BS/MS with a focus on Compilers/Programming Languages. It's been personally gratifying to understand programming "end-to-end" and to solve some tricky problems, but 99% of folks aren't going to hit those problems. There are tons of people interacting with Cloud Services every day that aren't aware of the basic issues like:

- Consistency models (can I really count on data being there? What do I have to do to make sure that stale reads/write conflicts don't occur?)

- Transactions (this has really fallen off, especially in larger companies outside of BI/Analytics)

- Causality (how can I handle write conflicts at the App Layer? Are there Data Structures, i.e. CRDTs, that can help in certain cases?)
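
The data structures in the last bullet are usually called CRDTs (conflict-free replicated data types). A minimal sketch of the simplest one, a grow-only counter, is below. The class and method names are made up for illustration and do not come from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""
    replica_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever bumps its own slot, so concurrent writes never collide.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repeated merges.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)      # write accepted on replica "a"
b.increment(2)      # concurrent write accepted on replica "b"
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

Both replicas end up with the same total without any coordination at write time, which is the property that makes this kind of structure useful for resolving conflicts at the App Layer.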

Even basic things like "use system time/monotonic clocks to measure elapsed time instead of wall-clock time" aren't well known; I've personally corrected dozens of CRs for this. Yes, this can be built into libs, AI agents, etc., but it never seems to actually be, and I see the same issues repeated over and over. So something is missing at the education layer.
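
For the elapsed-time point, a small sketch in Python. The helper name `timed` is hypothetical; `time.monotonic()` and `time.time()` are the standard-library calls.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()          # monotonic: never goes backwards
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    return result, elapsed


# Anti-pattern for comparison: time.time() is wall-clock time and can be
# stepped by NTP or an operator, so a difference of two readings can come
# out negative or wildly inflated.
#   start = time.time(); ...; elapsed = time.time() - start

result, elapsed = timed(sum, range(1_000_000))
print(f"sum={result} took {elapsed:.4f}s")
```

Wall-clock time is still the right choice for timestamps shown to humans; the monotonic clock is only for measuring durations.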

bigmutant commented on Distributed systems programming has stalled shadaj.me/writing/distrib... · Posted by u/shadaj

hinkley · 10 months ago

I am 100% convinced we could delete the compiler class from college curricula and replace it with distributed computing and the world would be a better place.

bigmutant · 10 months ago

Def agree. Most people will never touch an Abstract Syntax Tree or even Expression Trees. Almost everyone working in back-end will use Cloud Services, and will make mistakes based on assumptions about what they provide.

bigmutant commented on Distributed systems programming has stalled shadaj.me/writing/distrib... · Posted by u/shadaj

rectang · 10 months ago

Ten years ago, I had lunch with Patricia Shanahan, who worked for Sun on multi-core CPUs several decades ago (before taking a post-career turn volunteering at the ASF which is where I met her). There was a striking similarity between the problems that Sun had been concerned with back then and the problems of the distributed systems that power so much of the world today.

Some time has passed since then — and yet, most people still develop software using sequential programming models, thinking about concurrency occasionally.

It is a durable paradigm. There has been no revolution of the sort that the author of this post yearns for. If "Distributed Systems Programming Has Stalled", it stalled a long time ago, and perhaps for good reasons.

bigmutant · 10 months ago

The fundamental problems are communication lag and lack of information about why issues occur (encapsulated by the Byzantine Generals problem). I like to imagine trying to build a fault-tolerant, reliable system for the Solar System. Would the techniques we use today (retries, timeouts, etc) really be adequate given that lag is upwards of hours instead of milliseconds? But that's the crux of these systems, coordination (mostly) works because systems are close together (same board, at most same DC)

bigmutant commented on Distributed systems programming has stalled shadaj.me/writing/distrib... · Posted by u/shadaj

bigmutant · 10 months ago

Good resources for understanding Distributed Systems:

- MIT course with Robert Morris (of Morris Worm fame): https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...

- Martin Kleppmann (author of DDIA): https://www.youtube.com/watch?v=UEAMfLPZZhE&list=PLeKd45zvjc...

If you can work through the above (and DDIA), you'll have a solid understanding of the issues in Distributed Systems, like Consensus, Causality, Split Brain, etc. You'll also gain a critical eye for Cloud Services and be able to articulate their drawbacks (ex: did you know that replication to DynamoDB Global Secondary Indexes is eventually consistent? What effects can that have on your applications?)
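
To make the secondary-index question concrete, here is a toy in-memory model of a table whose index is updated asynchronously. It is a hypothetical stand-in for illustration only: `ToyTable` and its methods are invented and are not the DynamoDB API.

```python
from collections import deque


class ToyTable:
    """In-memory stand-in for a table with an asynchronously updated index."""

    def __init__(self):
        self.base = {}            # primary key -> item
        self.index = {}           # "status" value -> set of primary keys
        self.pending = deque()    # index updates not yet applied

    def put(self, key, item):
        old = self.base.get(key)
        self.base[key] = item                     # base write is visible at once
        self.pending.append((key, old, item))     # index catches up later

    def replicate(self):
        """Apply queued index updates (this happens 'eventually' in a real system)."""
        while self.pending:
            key, old, new = self.pending.popleft()
            if old is not None:
                self.index.get(old["status"], set()).discard(key)
            self.index.setdefault(new["status"], set()).add(key)

    def query_index(self, status):
        return sorted(self.index.get(status, set()))


t = ToyTable()
t.put("order-1", {"status": "PAID"})
print(t.query_index("PAID"))   # → [] (index has not caught up yet)
t.replicate()
print(t.query_index("PAID"))   # → ['order-1']
```

An application that writes an item and immediately queries the index for it can miss its own write, so read-after-write flows generally need to go through the base table or tolerate the lag.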

bigmutant commented on Scalable OLTP in the Cloud: What's the Big Deal? muratbuffalo.blogspot.com... · Posted by u/SchwKatze

whartung · a year ago

Maybe someone can answer how this is done.

Simply, the mad crushing dash to get the last bit of committed inventory.

Ticketmaster has 50,000 General Admission Taylor Swift tickets and 1M fans eager to hoover them up.

This is a crushing load on a shared resource.

I don't know if there's any reasonable outcome from this besides the data center not catching on fire.

bigmutant · a year ago

As others have said, this is a solved problem in a lot of companies. Basic answers are: 1. Queuing 2. Asynchronous APIs (don't wait for the 'real' response, just submit the transaction) 3. Call-backs to the Client

A good async setup can easily handle 100k+ TPS
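
A rough sketch of how the three pieces (queue, asynchronous submit, call-back) fit together, using `asyncio` from the Python standard library. Everything here is illustrative: `run_sale`, `submit`, and `notify` are invented names, and a real system would use a durable queue and a real notification channel rather than an in-process one.

```python
import asyncio


async def run_sale(inventory: int, fans: list[str]) -> dict[str, str]:
    """Sell `inventory` tickets to `fans` via a queue and a single worker."""
    queue: asyncio.Queue = asyncio.Queue()
    results: dict[str, str] = {}

    def notify(fan: str, outcome: str) -> None:
        # Stand-in for the call-back to the client (webhook, push, poll target).
        results[fan] = outcome

    async def submit(fan: str) -> str:
        # Asynchronous API: enqueue and return an ack immediately,
        # without waiting for the real outcome.
        await queue.put(fan)
        return "ACCEPTED"

    async def worker() -> None:
        # One consumer owns the inventory counter, so there is no contention
        # on the shared resource.
        remaining = inventory
        while True:
            fan = await queue.get()
            if remaining > 0:
                remaining -= 1
                notify(fan, "TICKET")
            else:
                notify(fan, "SOLD_OUT")
            queue.task_done()

    task = asyncio.create_task(worker())
    await asyncio.gather(*(submit(f) for f in fans))
    await queue.join()      # wait until every queued request is processed
    task.cancel()
    return results


results = asyncio.run(run_sale(inventory=2, fans=["ann", "bob", "cy", "dee"]))
print(results)
```

The front end only has to absorb the enqueue rate; the contended decision (who gets the last ticket) is made by a single consumer at its own pace.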

If you want to go the synchronous route, it's more complicated but amounts to partitioning and creating separate swim-lanes (copies of the system, both at the compute and data layers)
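
A minimal sketch of the swim-lane idea, assuming inventory can be pre-split evenly across lanes. The lane count, the 5,000-per-lane split, and all names (`lane_for`, `SwimLane`, `buy`) are assumptions for illustration.

```python
import hashlib


def lane_for(key: str, lanes: int) -> int:
    """Map a key to a swim-lane using a stable hash (same key, same lane)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % lanes


class SwimLane:
    """One independent copy of the system: its own compute and its own data."""

    def __init__(self, inventory: int):
        self.inventory = inventory

    def reserve(self) -> bool:
        if self.inventory > 0:
            self.inventory -= 1
            return True
        return False


# Hypothetical split: 50,000 tickets divided evenly across 10 lanes.
LANES = [SwimLane(inventory=5_000) for _ in range(10)]


def buy(fan_id: str) -> bool:
    # Each request touches exactly one lane, so lanes never coordinate.
    return LANES[lane_for(fan_id, len(LANES))].reserve()


print(buy("fan-42"), lane_for("fan-42", len(LANES)))
```

The trade-off is that one lane can sell out while others still hold stock, so a real setup needs some policy for rebalancing or retrying in another lane.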

bigmutant commented on MySQL at Uber uber.com/blog/mysql-at-ub... · Posted by u/0xFA11

PaulHoule · a year ago

I remember when that blog used to be top notch.

Today it just seems odd that anybody is still using MySQL. Postgres? Sure. SQLite? Hell yeah! DuckDB? Of course. MySQL? Not so much.

bigmutant · a year ago

Absolutely not true in my experience. MySQL has its share of issues (all DBs do) but it is rock-solid when using the correct engine (InnoDB for most cases, RocksDB for high-throughput writes, Memory for caching). MySQL is very hard to beat for very high-volume OLTP workloads, both reads and writes. Its replication systems were years ahead of other systems (SQL Server, Postgres; SQLite doesn't have replication at all). DuckDB AFAIK is OLAP, so the two don't compete in the same space. Every DB system has "the things it's good at" and MySQL really shines at very high-volume OLTP spread across partitions.

bigmutant commented on MySQL at Uber uber.com/blog/mysql-at-ub... · Posted by u/0xFA11

aeyes · a year ago

So if these clusters are using binlog replication, do they just ignore the possibility of lost writes and inconsistent data after a failover?

bigmutant · a year ago

That all depends on the setup. The "standard" setup (not specific to MySQL) is:

- Single Write Leader per partition

- Backup Write Leader that is set up with synchronous replication (so WL -> WLB and waits for commit)

- Read Followers all connected asynchronously using either binlog file-position-based replication (not recommended anymore) or GTID-based row replication (recommended)

In the above scenario, the odds of loss are pretty small since the Write Leader has a direct backup, and any of the Read Followers can be promoted to a Write Leader/Backup. DDIA calls the above semi-synchronous replication, although MySQL now supports a similar-but-slightly different version out of the box: https://dev.mysql.com/doc/refman/8.4/en/replication-semisync...
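
The topology above can be sketched as a toy model: one synchronous hop to the backup, then asynchronous fan-out to the followers. This is an illustration of the idea only, with invented class names; it does not reflect how MySQL implements replication internally.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.log: list[str] = []

    def apply(self, entry: str) -> bool:
        self.log.append(entry)
        return True   # acknowledgement


class WriteLeader(Node):
    """Leader with one synchronous backup and N asynchronous read followers."""

    def __init__(self, name: str, backup: Node, followers: list[Node]):
        super().__init__(name)
        self.backup = backup
        self.followers = followers
        self.unshipped: list[str] = []   # entries followers have not seen yet

    def commit(self, entry: str) -> bool:
        self.log.append(entry)
        # Synchronous hop: do not confirm to the client until the backup acks.
        if not self.backup.apply(entry):
            raise RuntimeError("backup did not acknowledge; write not confirmed")
        # Asynchronous hop: followers are fed later, so they may lag.
        self.unshipped.append(entry)
        return True

    def ship_to_followers(self) -> None:
        for entry in self.unshipped:
            for follower in self.followers:
                follower.apply(entry)
        self.unshipped.clear()


backup = Node("WLB")
followers = [Node("RF1"), Node("RF2")]
leader = WriteLeader("WL", backup, followers)

leader.commit("INSERT 1")
print(backup.log, followers[0].log)   # backup has the write; followers lag
leader.ship_to_followers()
print(followers[0].log)
```

In this model a confirmed write always exists on at least two nodes, which is why losing the Write Leader alone does not lose acknowledged data; reads from followers can still be stale until shipping catches up.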