If you want to access a bloom filter, cuckoo filter, list, set, bitmap, etc... from multiple instances of the same service, Redis (slash Valkey, MemoryDB, etc...) is really your only option.
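For example, a minimal sketch with redis-py, assuming a Redis server with the RedisBloom module loaded (the key name and sizing are made up):

```python
# A Bloom filter shared by every instance of a service, via redis-py +
# RedisBloom. "seen-urls" and the sizing below are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

# One-time setup (BF.RESERVE): ~1M items at a 0.1% false-positive rate.
r.bf().create("seen-urls", 0.001, 1_000_000)

# Any instance can add an item...
r.bf().add("seen-urls", "https://example.com/a")

# ...and every other instance sees it (modulo false positives).
print(r.bf().exists("seen-urls", "https://example.com/a"))  # -> 1
```

Since the filter lives server-side, there's no state to synchronize between instances.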
Here are a couple of problems I've run into using GQL for backend-to-backend communication:
* Auth. Good GQL APIs think carefully about permission management on a per-field basis (bad GQL APIs slap some auth on an entire query or type and call it a day). Back-end services, obviously, are not front-end clients, and want auth that grants their service access to an entire query, object, or set of queries/mutations. This leads to tension, and (often) hacky workarounds, like back-end services pretending to be "admin users" to get the access they need to a GQL API.
* Nested federation. Federation is super powerful, and, to be fair, data loaders do a great job of solving the N+1 query problem when a query only has one "layer" of federation. But, IME, GQL routers are not smart enough to handle nested federation; i.e. querying for a list of object `A`s, then federating object `B` on to each `A`, then federating object `C` on to each `B`. The latency for these kinds of queries is usually absolutely terrible, and I'd rather make these kinds of queries over gRPC (e.g. hit one endpoint for all the As, then use the result to get all the Bs, then use all the Bs to get all the Cs; sketched below).
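Here's a sketch of that batched three-hop fetch; the fetcher functions are hypothetical stand-ins for unary gRPC calls, not a real API:

```python
# Three round trips total, no matter how many As, Bs, and Cs there are.
# Each callable stands in for a batched gRPC endpoint (illustrative names).
from typing import Callable, List

def fetch_nested(
    get_as: Callable[[], List[dict]],                    # e.g. ListAs
    get_bs_by_a_ids: Callable[[List[str]], List[dict]],  # e.g. BatchGetBs
    get_cs_by_b_ids: Callable[[List[str]], List[dict]],  # e.g. BatchGetCs
):
    a_list = get_as()                                    # round trip 1: all As
    b_list = get_bs_by_a_ids([a["id"] for a in a_list])  # round trip 2: all Bs
    c_list = get_cs_by_b_ids([b["id"] for b in b_list])  # round trip 3: all Cs
    return a_list, b_list, c_list
```

The depth of the query, not the number of objects, bounds the number of round trips.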
You were taught wrong...
First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can up to issue up to 4 instructions, one for each of 4 warps per cycle (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".
Sometimes, more than 4×32 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:
* Amount of shared memory used per warp
* Makeup of the instructions to be executed
* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease; toy numbers below).
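To put rough numbers on the register point, a toy calculation with illustrative NVIDIA-ish figures (64K 32-bit registers per SM, a 255-register cap per thread; exact limits vary by architecture):

```python
# Toy arithmetic: how the per-thread register budget shrinks as more
# threads become resident on one SM. Figures are illustrative, not exact.
REGS_PER_SM = 65_536        # 64K 32-bit registers per SM
MAX_REGS_PER_THREAD = 255   # architectural cap per thread

for threads in (128, 256, 512, 1024):
    per_thread = min(MAX_REGS_PER_THREAD, REGS_PER_SM // threads)
    print(f"{threads:4d} resident threads -> up to {per_thread} registers each")
# 128 and 256 threads still hit the 255 cap; 512 -> 128, 1024 -> 64.
```

Past that crossover, either occupancy drops or the compiler spills registers to local memory, depending on how the kernel was compiled.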
I thought that warps weren't issued instructions unless they were ready to execute (i.e. had all the data they needed to execute the next instruction), and that it was therefore a best practice, in most (not all) cases, to have more threads per block than the SM can execute at once, so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?
So, how large do you make your threadblocks to get optimal SM/warp scheduling? Well, it "depends" on resource usage, divergence, etc. Basically: run it, profile, switch the threadblock size, profile again, etc. Repeat on every GPU/platform (if you're programming for multiple GPU platforms and not just CUDA, like games do). It's a huge pain, and very sensitive to code changes.
People new to GPU programming ask me "how big do I make the threadblock size?" and I tell them go with 64 or 128 to start, and then profile and adjust as needed.
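To make that concrete, a minimal sketch of the profile-and-adjust loop using numba.cuda (assumes numba and an NVIDIA GPU; the kernel and sizes are illustrative):

```python
# Launch the same kernel at several block sizes and time each variant.
import time
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)          # absolute thread index
    if i < out.size:
        out[i] = a * x[i] + y[i]

n = 1 << 24
x = cuda.to_device(np.random.rand(n).astype(np.float32))
y = cuda.to_device(np.random.rand(n).astype(np.float32))
out = cuda.device_array(n, dtype=np.float32)

for threads_per_block in (64, 128, 256, 512, 1024):
    blocks = (n + threads_per_block - 1) // threads_per_block
    saxpy[blocks, threads_per_block](2.0, x, y, out)   # warm-up / JIT compile
    cuda.synchronize()
    t0 = time.perf_counter()
    saxpy[blocks, threads_per_block](2.0, x, y, out)
    cuda.synchronize()
    print(f"{threads_per_block:5d} threads/block: {time.perf_counter() - t0:.6f}s")
```

A trivially memory-bound kernel like this won't show dramatic differences; the point is the loop structure, which you'd wrap around your real kernel (and repeat per GPU).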
Two articles on the AMD side of things:
https://gpuopen.com/learn/occupancy-explained
https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...
There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).
this seems like a classic case of impedance mismatch, trying to implement a Redis-ism using an RDBMS.
for a shared list in a relational database, you could implement it like you've said, using an array type or a jsonb column or whatever, and simulate how it works in Redis.
but to implement a "shared list" in a way that meshes well with the relational model...you could just have a table, and insert a row into the table. there's no need for a read-modify-write cycle like you've described.
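a minimal sketch of that, assuming psycopg2 and a hypothetical `list_items` table (names are made up):

```python
# Appending to a "shared list" in the relational model is just an INSERT;
# no read-modify-write cycle, and concurrent writers don't clobber each other.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS list_items (
            list_id  text NOT NULL,
            position bigserial,            -- preserves insertion order
            value    text NOT NULL
        )
    """)
    cur.execute(
        "INSERT INTO list_items (list_id, value) VALUES (%s, %s)",
        ("my-list", "new element"),
    )
```

reading the list back is just `SELECT value FROM list_items WHERE list_id = %s ORDER BY position`.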
or, if you really need it to be a column in an existing table for whatever reason, it's still possible to push the modification to the database without the heavy overhead. for example [0]:
> The concatenation operator allows a single element to be pushed onto the beginning or end of a one-dimensional array. It also accepts two N-dimensional arrays, or an N-dimensional and an N+1-dimensional array.
0: https://www.postgresql.org/docs/current/arrays.html#ARRAYS-M...
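so a single-statement push looks something like this, again via psycopg2 and a hypothetical `things` table with a `tags text[]` column:

```python
# The read-modify-write happens entirely inside the database; the client
# never fetches the array. Table/column names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE things SET tags = tags || %s::text WHERE id = %s",
        ("new-tag", 42),
    )
```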
You’re right that managing lists in RDBMSes is easy-ish, if you don’t have too many of them and they’re not too large. But, like I mentioned in my original comment, Redis really shines as a complex data structure server. I wouldn’t want to implement my own cuckoo filter in Postgres!