For text, with a two-byte tokenizer you get 2^16 (65,536) possible next tokens, and computing a probability distribution over them is very much doable. But the "possible next frames" in a video feed would already be an extremely large number. If one frame is 1 megabyte uncompressed (instead of just 2 bytes for a text token), there are 2^(8*2^20) possible next frames, which is far too large a number. So we somehow need to predict only an embedding of the next frame: an approximate representation of how it will look.
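To make the gap concrete, here's a quick back-of-the-envelope calculation (just illustrative arithmetic, nothing model-specific):

```python
import math

# Back-of-the-envelope comparison of output-space sizes.
text_vocab = 2 ** 16          # 65,536 possible next tokens with a 2-byte tokenizer
frame_bytes = 2 ** 20         # 1 MiB uncompressed frame
frame_bits = 8 * frame_bytes  # 8,388,608 bits per frame

# The number of possible frames is 2 ** frame_bits -- far too large to even print,
# so we only report how many decimal digits it has.
digits = math.floor(frame_bits * math.log10(2)) + 1
print(f"possible next tokens: {text_vocab:,}")
print(f"possible next frames: a number with {digits:,} decimal digits")
```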
Moreover, for robotics we don't want to just predict the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals do, including humans. We constantly anticipate, approximately, what will happen to us in the future, with the farther future predicted progressively less exactly. We are relatively sure of what happens in a second, but less and less sure of what happens in a minute, or a day, or a year.
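Very roughly, the idea is to predict in embedding space rather than over raw pixels. Here is a minimal sketch of that, assuming a toy encoder/predictor and made-up sizes (this is not LeCun's actual JEPA architecture, just the shape of the idea):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 256  # assumed embedding size

encoder   = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, EMB))            # frame -> embedding
predictor = nn.Sequential(nn.Linear(EMB, EMB), nn.ReLU(), nn.Linear(EMB, EMB))  # embedding -> predicted next embedding

frame_t, frame_t1 = torch.randn(2, 8, 3, 64, 64)  # toy current/next frames, batch of 8

z_t  = encoder(frame_t)            # embed the current frame
z_t1 = encoder(frame_t1).detach()  # target embedding (stop-gradient, as in many JEPA-style setups)

# Predict the *embedding* of the next frame instead of a distribution over all possible raw frames.
loss = F.mse_loss(predictor(z_t), z_t1)
loss.backward()
```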
Are you saying that LLMs hold concepts in latent space (weights?), but the actual predictions are always in tokens (thus inefficient and lossy), whereas JEPA operates directly on concepts in latent space (plus encoders/decoders)?
I might be using the jargon incorrectly!
I don't really expect to see performance on par with the SOTA hosted models, but I'm mainly curious what you could do with local models that you couldn't do with hosted ones (or at least, wouldn't want to do for other reasons, like privacy).
One thing I've realized lately is that Gemini, and even Gemma, are really, really good at transcribing images; they're much better and more versatile than OCR models because they can describe the images as well as transcribe them. With the realization that Gemma, a model you can self-host, is good enough to be useful, I have been tempted to play around with doing this sort of task locally.
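If you want to try it, here's roughly what I have in mind, assuming a local Ollama server with a vision-capable Gemma 3 build pulled (the model tag and image file are placeholders):

```python
import base64
import requests

# Hypothetical input image; any photo of a document works.
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "gemma3:27b",               # assumed local model tag
        "prompt": "Transcribe all text in this image, then briefly describe it.",
        "images": [image_b64],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```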
But again, $2,000 tempted? Not really. I'd need to find other good uses for the machine than just dicking around.
In theory, Gemma 3 27B in BF16 would fit easily in system RAM on my primary desktop workstation, but I haven't given it a go to see how slow it is. I think you're mainly memory-bandwidth constrained on these CPUs, but I wouldn't be surprised if the full BF16 weights, or a relatively light quantization, give tolerable t/s.
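A crude way to guess the throughput, assuming generation is bandwidth-bound and every token streams roughly the full weight set from RAM once (the bandwidth figure and bytes-per-weight numbers below are my assumptions, not measurements):

```python
params = 27e9            # Gemma 3 27B
ram_bandwidth_gbs = 80   # assumed dual-channel DDR5 figure; adjust for your machine

# Approximate bytes per weight for a few common formats.
bytes_per_weight = {"BF16": 2.0, "Q8_0": 1.06, "Q4_K_M": 0.60}

for name, bpw in bytes_per_weight.items():
    weight_gb = params * bpw / 1e9
    print(f"{name:7s} ~{weight_gb:5.1f} GB of weights -> ~{ram_bandwidth_gbs / weight_gb:.1f} tok/s")
```

On those assumptions BF16 lands around 1-1.5 t/s, so a quantized build is probably what makes this tolerable on CPU.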
Then again, right now AI Studio gives you a generous amount of free usage at better t/s than you could hope to get locally. So... maybe it would make sense to wait until the free lunch ends, but I don't want to build anything interesting that relies on the cloud, because I dislike the privacy implications, even though everything I'm interested in doing is fully within the ToS.
I tried out Gemma 27B on LM Studio a few days ago, and I was completely blown away! It has a warmth and character (and smarts!) that I was not expecting in a tiny model. It just doesn't have tool use (although there are hacky workarounds), which would have made it even better. Qwen 3 with 30B parameters (3B active) seems to be nearly as capable, but also supports tool use.
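For reference, this is the shape of a native tool call against LM Studio's OpenAI-compatible local server (default http://localhost:1234/v1); whether it round-trips cleanly depends on the model and server build. Qwen 3 handles it, while vanilla Gemma 3 needs prompt-level workarounds. The model name and tool here are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_uptime",  # hypothetical tool the model can request
        "description": "Return the machine's uptime.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="qwen3-30b-a3b",  # assumed local model name
    messages=[{"role": "user", "content": "How long has this box been up?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the model's requested tool call, if any
```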
I'm currently in the process of vibe coding an agent network with LangGraph orchestration, using Gemma 27B / Qwen 3 30B-A3B with memory, context management and tool management. The Qwen model even uses a tiny 1.7B "draft" model for speculative decoding, which improves performance. On my 7800X3D with an RTX 4090 and 64 GB RAM, I get ~200-400 ms latency and 20-30 tokens/s, which is plenty fast.
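As a rough sketch of what one node of that network looks like (LangGraph's prebuilt ReAct agent pointed at the local OpenAI-compatible endpoint; the model name and tool are illustrative, and these APIs move fast):

```python
import shutil

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def disk_usage(path: str) -> str:
    """Report disk usage for a path."""
    total, used, free = shutil.disk_usage(path)
    return f"{used / 1e9:.1f} GB used of {total / 1e9:.1f} GB ({free / 1e9:.1f} GB free)"

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio / llama.cpp-style local endpoint
    api_key="not-needed",
    model="qwen3-30b-a3b",                # assumed local model name
)

agent = create_react_agent(llm, tools=[disk_usage])
result = agent.invoke({"messages": [("user", "How full is my root partition?")]})
print(result["messages"][-1].content)
```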
My thought process is that this local stack will let me use agents to their fullest in administering my machine. I always felt uneasy letting Claude Code, Gemini CLI or Codex operate outside my code folders. Yet their utility in helping me troubleshoot problems (I'm a recent Linux convert) was too attractive to ignore. Now I have the best of both worlds: privacy, and AI models helping with sysadmin. They're also great for quick "what options does kopia backup use?" type questions, for which I've set up a helper bound to a global hotkey.
Additionally, if one has a NAS with the *arr stack for downloading, say, perfectly legal Linux ISOs, such a private model would be far more suitable.
It's early days, but I'm excited about other use cases I might discover over time! It's a good time to be an AI enthusiast.