meehai (u/meehai) - Readit News

meehai commented on LLM from scratch, part 28 – training a base model from scratch on an RTX 3090 gilesthomas.com/2025/12/l... · Posted by u/gpjt

rvnx · 2 months ago

I think I should have replied as a totally separate comment. This is my mistake.

It is nice that the author shared the results of his exercise / experiment. Just got sad as I was reminded (when the 100 USD were mentioned) that all this game is 90%+ about money and hardware rather than skills.

That being said I really like the initiative of the author.

meehai · 2 months ago

it's skills first and then money and hardware for scale

A more skilled person who understands all the underlying steps will always be more efficient at scaling up, because they know where to allocate more resources.

basically... you always need the skills and the money is the fine tuning.

meehai commented on Vortex: An extensible, state of the art columnar file format github.com/vortex-data/vo... · Posted by u/tanelpoder

meehai · 3 months ago

Can you append new columns to a file stored on disk without reading it all into memory? Somehow this is beyond Parquet's capabilities.

meehai commented on F3: Open-source data file format for the future [pdf] db.cs.cmu.edu/papers/2025... · Posted by u/eatonphil

sakras · 4 months ago

Giving it a quick look, seems like they've addressed a lot of the shortcomings of Parquet which is very exciting. In no particular order:

- Parquet metadata is Thrift, but with comments saying "if this field exists, this other field must exist", and no code actually verifying the fact, so I'm pretty sure you could feed it bogus Thrift metadata and crash the reader.

- Parquet metadata must be parsed out, meaning you have to: allocate a buffer, read the metadata bytes, and then dynamically keep allocating a whole bunch of stuff as you parse the metadata bytes, since you don't know the size of the materialized metadata! Too many heap allocations! This file format's Flatbuffers approach seems to solve this as you can interpret Flatbuffer bytes directly.

- The encodings are much more powerful. I think a lot of people in the database community have been saying that we need composable/recursive lightweight encodings for a long time. BtrBlocks was the first such format that was open in my memory, and then FastLanes followed up. Both of these were much better than Parquet by itself, so I'm glad ideas from those two formats are being taken up.

- Parquet did the Dremel record-shredding thing which just made my brain explode and I'm glad they got rid of it. It seemed to needlessly complicate the format with no real benefit.

- Parquet datapages might contain different numbers of rows, so you have to scan the whole ColumnChunk to find the row you want. Here it seems like you can just jump to the DataPage (IOUnit) you want.

- They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
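The DataPage point in the list above can be sketched as follows. This is only an illustration under stated assumptions, not Parquet's or F3's actual reader code: the function names and the page row counts are hypothetical.

```python
def find_page_variable(rows_per_page, row):
    """Variable-size pages: walk the per-page row counts until the row is covered."""
    first_row = 0
    for page_index, count in enumerate(rows_per_page):
        if row < first_row + count:
            # Return the page and the row's offset inside that page.
            return page_index, row - first_row
        first_row += count
    raise IndexError(f"row {row} out of range")


def find_page_fixed(page_size, row):
    """Fixed-size pages: the page index is plain arithmetic, no scan needed."""
    return divmod(row, page_size)


variable_pages = [3, 5, 2, 4]  # hypothetical row counts per page
print(find_page_variable(variable_pages, 9))  # (2, 1)
print(find_page_fixed(4, 9))                  # (2, 1)
```

The first lookup is linear in the number of pages unless a separate index is kept; the second is constant time.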

Overall great improvement, I'm looking forward to this taking over the data analytics space.

meehai · 4 months ago

https://stackoverflow.com/questions/31812780/append-a-new-co...

meehai commented on SimpleFold: Folding proteins is simpler than you think github.com/apple/ml-simpl... · Posted by u/kevlened

ronsor · 5 months ago

No, the problem is that with training, you do care about latency, and you need a crap-ton of bandwidth too! Think of the all_gather; think of the gradients! Inference is actually easier to distribute.

meehai · 5 months ago

Yeah, but if you can build topologies based on latencies, you may get some decent tradeoffs. For example, with N=1M nodes, each could do batch updates in a tree manner, i.e. the all-reduce is actually layered by latency between nodes.
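A minimal sketch of what "layered by latency" could look like, simulated in plain Python with two layers. The function name and the grouping are hypothetical, groups are assumed to be non-empty, and a real system would use a collective communication library rather than lists.

```python
def hierarchical_all_reduce(groups):
    """Sum gradients across all nodes in two layers.

    `groups` is a list of low-latency groups; each group is a list of
    per-node gradient vectors (equal-length lists of floats).
    """
    # Layer 1: each group reduces locally over its fast links.
    group_sums = [[sum(vals) for vals in zip(*group)] for group in groups]
    # Layer 2: one leader per group combines partial sums over the slow links.
    total = [sum(vals) for vals in zip(*group_sums)]
    # Broadcast: every node ends up with a copy of the same global sum.
    return [[list(total) for _ in group] for group in groups]


# Two hypothetical groups: one with two nearby nodes, one with a single node.
groups = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
print(hierarchical_all_reduce(groups))
```

Only one vector per group crosses the slow links in layer 2, which is where the tradeoff would come from.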

meehai commented on An LLM is a lossy encyclopedia simonwillison.net/2025/Au... · Posted by u/tosh

meehai · 5 months ago

A lossy encyclopedia that can also do some short-term memory (RAG) things.

meehai commented on Counter-Strike: A billion-dollar game built in a dorm room nytimes.com/2025/08/18/ar... · Posted by u/asnyder

meehai · 6 months ago

I played KZ and HNS for years, after years of playing competitive CS in local communities (old PGL in Romania!). I've got over 6k hours in Steam CS 1.6, plus many more on "non-Steam". That game shaped me. I even learned the basics of programming while modding a KZ plugin: https://forums.alliedmods.net/showthread.php?t=130417

Nowadays I code for a living, but for sure this is the game that started the spark for me.

It was a great time and I feel that I can always run this game and get back to that childhood feeling.

meehai commented on Do not download the app, use the website idiallo.com/blog/dont-dow... · Posted by u/foxfired

markbao · 7 months ago

Don’t agree, but to each their own. The native app experience for every app noted in the article is better and smoother than the mobile web version, in my opinion. Lots of people hate Electron apps, which suggests to me that my preference for native apps isn’t unique.

Web apps can ask for your location or microphone the same way native apps can. Just reject it, there’s nothing that says you have to accept on either platform, so to say that’s a negative for native apps is odd.

The biggest downside of native apps is you can’t customize them with extensions or user styles like you can with websites.

meehai · 7 months ago

Tbh, the web won the application platform mostly because it's a standard. Everybody knows html, css and a little JS.

On the other hand, for mobile apps, there is still a device-specific mentality.

Imagine web apps being built with a different flavor for all the major browsers...

I hope that the same level of standardization comes to mobile apps too with the option to use more device-specific features on top of the generic UI.

meehai commented on My Self-Hosting Setup codecaptured.com/blog/my-... · Posted by u/mirdaki

meehai · 7 months ago

Mine is much more barebone:

- one single machine

- nginx proxy

- many services on the same machine; some are internal, some are supposed to be public, and all are accessible via the web!

- internal ones have a very long password for HTTP basic auth that I store in a password manager (Firefox's built-in one)

- public ones are either fully public or use Google OAuth

I coded all of them from scratch as that's the point of what I'm doing with homelabbing. You want images? browsers can read them. Videos? Browsers can play them.

The hard part is the backend for me. The frontend is very much "90s html".