If you want to access a bloom filter, cuckoo filter, list, set, bitmap, etc... from multiple instances of the same service, Redis (slash Valkey, MemoryDB, etc...) is really your only option.
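For example, a minimal sketch with redis-py, assuming a Redis server with the RedisBloom module loaded (the key name and sizing are made up):

```python
# A Bloom filter shared by every instance of a service, via redis-py +
# RedisBloom. "seen-urls" and the sizing below are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

# One-time setup (BF.RESERVE): ~1M items at a 0.1% false-positive rate.
r.bf().create("seen-urls", 0.001, 1_000_000)

# Any instance can add an item...
r.bf().add("seen-urls", "https://example.com/a")

# ...and every other instance sees it (modulo false positives).
print(r.bf().exists("seen-urls", "https://example.com/a"))  # -> 1
```

Since the filter lives server-side, there's no state to synchronize between instances.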
Here are a couple of problems I've run into using GQL for backend-to-backend communication:
* Auth. Good GQL APIs think carefully about permission management on a per-field basis (bad GQL APIs slap some auth on an entire query or type and call it a day). Back-end services, obviously, are not front-end clients, and want auth that grants their service access to an entire query, object, or set of queries/mutations. This leads to tension, and (often) hacky workarounds, like back-end services pretending to be "admin users" to get the access they need to a GQL API.
* Nested federation. Federation is super powerful, and, to be fair, data loaders do a great job of solving the N+1 query problem when a query only has one "layer" of federation. But, IME, GQL routers are not smart enough to handle nested federation; i.e. querying for a list of object `A`s, then federating object `B` on to each `A`, then federating object `C` on to each `B`. The latency for these kinds of queries is usually absolutely terrible, and I'd rather make these kinds of queries over gRPC (e.g. hit one endpoint for all the As, then use the result to get all the Bs, then use all the Bs to get all the Cs; sketched below).
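Here's a sketch of that batched three-hop fetch; the fetcher functions are hypothetical stand-ins for unary gRPC calls, not a real API:

```python
# Three round trips total, no matter how many As, Bs, and Cs there are.
# Each callable stands in for a batched gRPC endpoint (illustrative names).
from typing import Callable, List

def fetch_nested(
    get_as: Callable[[], List[dict]],                    # e.g. ListAs
    get_bs_by_a_ids: Callable[[List[str]], List[dict]],  # e.g. BatchGetBs
    get_cs_by_b_ids: Callable[[List[str]], List[dict]],  # e.g. BatchGetCs
):
    a_list = get_as()                                    # round trip 1: all As
    b_list = get_bs_by_a_ids([a["id"] for a in a_list])  # round trip 2: all Bs
    c_list = get_cs_by_b_ids([b["id"] for b in b_list])  # round trip 3: all Cs
    return a_list, b_list, c_list
```

The depth of the query, not the number of objects, bounds the number of round trips.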
You were taught wrong...
First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can up to issue up to 4 instructions, one for each of 4 warps per cycle (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".
Sometimes, more than 4×32 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:
* Amount of shared memory used per warp
* Makeup of the instructions to be executed
* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease; toy numbers below).
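To put rough numbers on the register point, a toy calculation with illustrative NVIDIA-ish figures (64K 32-bit registers per SM, a 255-register cap per thread; exact limits vary by architecture):

```python
# Toy arithmetic: how the per-thread register budget shrinks as more
# threads become resident on one SM. Figures are illustrative, not exact.
REGS_PER_SM = 65_536        # 64K 32-bit registers per SM
MAX_REGS_PER_THREAD = 255   # architectural cap per thread

for threads in (128, 256, 512, 1024):
    per_thread = min(MAX_REGS_PER_THREAD, REGS_PER_SM // threads)
    print(f"{threads:4d} resident threads -> up to {per_thread} registers each")
# 128 and 256 threads still hit the 255 cap; 512 -> 128, 1024 -> 64.
```

Past that crossover, either occupancy drops or the compiler spills registers to local memory, depending on how the kernel was compiled.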
I thought that warps weren't issued instructions unless they were ready to execute (i.e. had all the data they needed to execute the next instruction), and that it was therefore a best practice, in most (not all) cases, to have more threads per block than the SM can execute at once, so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?
So, how large do you make your threadblocks to get optimal SM/warp scheduling? Well, it "depends" on resource usage, divergence, etc. Basically: run it, profile, switch the threadblock size, profile again, etc. Repeat on every GPU/platform (if you're programming for multiple GPU platforms and not just CUDA, like games do). It's a huge pain, and very sensitive to code changes.
People new to GPU programming ask me "how big do I make the threadblock size?" and I tell them go with 64 or 128 to start, and then profile and adjust as needed.
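To make that concrete, a minimal sketch of the profile-and-adjust loop using numba.cuda (assumes numba and an NVIDIA GPU; the kernel and sizes are illustrative):

```python
# Launch the same kernel at several block sizes and time each variant.
import time
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)          # absolute thread index
    if i < out.size:
        out[i] = a * x[i] + y[i]

n = 1 << 24
x = cuda.to_device(np.random.rand(n).astype(np.float32))
y = cuda.to_device(np.random.rand(n).astype(np.float32))
out = cuda.device_array(n, dtype=np.float32)

for threads_per_block in (64, 128, 256, 512, 1024):
    blocks = (n + threads_per_block - 1) // threads_per_block
    saxpy[blocks, threads_per_block](2.0, x, y, out)   # warm-up / JIT compile
    cuda.synchronize()
    t0 = time.perf_counter()
    saxpy[blocks, threads_per_block](2.0, x, y, out)
    cuda.synchronize()
    print(f"{threads_per_block:5d} threads/block: {time.perf_counter() - t0:.6f}s")
```

A trivially memory-bound kernel like this won't show dramatic differences; the point is the loop structure, which you'd wrap around your real kernel (and repeat per GPU).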
Two articles on the AMD side of things:
https://gpuopen.com/learn/occupancy-explained
https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...
There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).
this seems like a classic case of impedance mismatch, trying to implement a Redis-ism using an RDBMS.
for a shared list in a relational database, you could implement it like you've said, using an array type or a jsonb column or whatever, and simulate how it works in Redis.
but to implement a "shared list" in a way that meshes well with the relational model...you could just have a table, and insert a row into the table. there's no need for a read-modify-write cycle like you've described.
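a minimal sketch of that, assuming psycopg2 and a hypothetical `list_items` table (names are made up):

```python
# Appending to a "shared list" in the relational model is just an INSERT;
# no read-modify-write cycle, and concurrent writers don't clobber each other.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS list_items (
            list_id  text NOT NULL,
            position bigserial,            -- preserves insertion order
            value    text NOT NULL
        )
    """)
    cur.execute(
        "INSERT INTO list_items (list_id, value) VALUES (%s, %s)",
        ("my-list", "new element"),
    )
```

reading the list back is just `SELECT value FROM list_items WHERE list_id = %s ORDER BY position`.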
or, if you really need it to be a column in an existing table for whatever reason, it's still possible to push the modification to the database without the heavy overhead. for example [0]:
> The concatenation operator allows a single element to be pushed onto the beginning or end of a one-dimensional array. It also accepts two N-dimensional arrays, or an N-dimensional and an N+1-dimensional array.
0: https://www.postgresql.org/docs/current/arrays.html#ARRAYS-M...
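so a single-statement push looks something like this, again via psycopg2 and a hypothetical `things` table with a `tags text[]` column:

```python
# The read-modify-write happens entirely inside the database; the client
# never fetches the array. Table/column names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE things SET tags = tags || %s::text WHERE id = %s",
        ("new-tag", 42),
    )
```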
You’re right that managing lists in RDBMSes is easy-ish, if you don’t have too many of them and they’re not too large. But, like I mentioned in my original comment, Redis really shines as a complex data structure server. I wouldn’t want to implement my own cuckoo filter in Postgres!