ideal_gas (u/ideal_gas)

ideal_gas commented on Golang's big miss on memory arenas avittig.medium.com/golang... · Posted by u/andr3wV

ideal_gas · 5 days ago

> By killing Memory Arenas, Go effectively capped its performance ceiling.

I'm still optimistic about potential improvements. (Granted, I doubt there will be anything landing in the near future beyond what the author has already mentioned.)

For example, there is an ongoing discussion on "memory regions" as a successor to the arena concept, without the API "infection" problem:

https://github.com/golang/go/discussions/70257

ideal_gas commented on Systems Correctness Practices at Amazon Web Services cacm.acm.org/practice/sys... · Posted by u/tanelpoder

spenczar5 · 6 months ago

I want something like this, but I work almost entirely in Go.

Am I correct in thinking that this would require a fork of the main Go implementation, in order to make a deterministic (and controllable) goroutine scheduler and network stack?

Does one also need to run on a particularly predictable OS?

ideal_gas · 6 months ago

For in-process go scheduling, some progress has been made here; see: https://go.dev/blog/synctest

But `synctest.Wait` won't work for non-durably blocked goroutines (such as in the middle of a TCP syscall) so requires memory-based implementations of e.g. net.Conn (I've plugged in https://pkg.go.dev/google.golang.org/grpc/test/bufconn with good success)

ideal_gas commented on Ask HN: It's 2023, how do you choose between MySQL and Postgres? · Posted by u/debo_

srcreigh · 3 years ago

Postgres is >50x slower for range queries(example below) and is akin to using array-of-pointers (ie Java) whereas MySQL supports array-of-struct (C). Illustration from Dropbox scaling talk below.

Sneak peek photo [1] (from [2]). Just imagine its literally 500-1000x more convoluted per B-tree leaf node. That's every Postgres table unless you CLUSTER periodically.

[1]: https://josipmisko.com/img/clustered-vs-nonclustered-index.w...

[2]: https://josipmisko.com/posts/clustered-vs-non-clustered-inde...

Mind boggling how many people aren't aware of primary indexes in MySQL that is not supported at all in Postgres. For certain data layouts, Postgres pays either 2x storage (covering index containing every single column), >50x worse performance by effectively N+1 bombing the disk for range queries, or blocking your table periodically (CLUSTER).

In Postgres the messiness loading primary data after reaching the B-tree leaf nodes pollutes caches and takes longer. This is because you need to load one 8kb page for every row you want, instead of one 8kb with 20-30 rows packed together.

Example: Dropbox file history table. They initially used autoinc id for primary key in MySQL. This causes everybodys file changes to be mixed together in chronological order on disk in a B-Tree. The first optimization they made was to change the primary key to (ns_id, latest, id) so that each users (ns_id) latest versions would be grouped together on disk.

Dropbox scaling talk: https://youtu.be/PE4gwstWhmc?t=2770

If a dropbox user has 1000 files and you can fit 20 file-version rows on each 8kb disk page (400bytes/row), the difference in performance for querying across those 1000 files is 20 + logN disk reads (MySQL) vs 1000 + logN disk reads (Postgres). AKA 400KiB data loaded (MySQL) vs 8.42MiB loaded (Postgres). AKA >50x improvement in query time and disk page cache utilization.

In Postgres you get two bad options for doing this: 1) Put every row of the table in the index making it a covering index, and paying to store all data twice (index and PG heap). No way to disable the heap primary storage. 2) Take your DB offline every day and CLUSTER the table.

Realistically, PG users pay that 50x cost without thinking about it. Any time you query a list of items in PG even using an index, you're N+1 querying against your disk and polluting your cache.

This is why MySQL is faster than Postgres most of the time. Hopefully more people become aware of disk data layout and how it affects query performance.

There is a hack for Postgres where you store data in an array within the row. This puts the data contiguously on disk. It works pretty well, sometimes, but it’s hacky. This strategy is part of the Timescale origin story.

Open to db perf consulting. email is in my profile.

ideal_gas · 3 years ago

(Admitting bias: I've only ever worked with postgres in production with update-heavy tables so I've dealt with more of its problems than MySQL's)

Postgres also has other gotchas with indexes - MVCC row visibility isn't stored in the index for obvious performance reasons (writes to non-indexed columns would mean always updating all indexes instead of HOT updates [1]) so you have to hope the version information is cached in the visibility map or else don't really get the benefit of index only scans.

But OTOH, I've read that secondary indexes cause other performance penalties with having to refer back to the data in clustered indexes? Never looked into the details because no need to for postgres which we've been very happy with at our scale :)

[1] https://www.postgresql.org/docs/current/storage-hot.html