Readit News
spmurrayzzz commented on Crush: Glamourous AI coding agent for your favourite terminal   github.com/charmbracelet/... · Posted by u/nateb2022
oceanplexian · a month ago
Actually not really.

I spent at least an hour trying to get OpenCode to use a local model and then found a graveyard of PRs begging for Ollama support or even the ability to simply add an OpenAI endpoint in the GUI. I guess the maintainers simply don't care. Tried adding it to the backend config and it kept overwriting/deleting my config. Got frustrated and deleted it. Sorry but not sorry, I shouldn't need another cloud subscription to use your app.

Claude Code you can sort of get to work with a bunch of hacks, but it involves setting up a proxy, it isn't supported natively, and the tool calling is somewhat messed up.

Warp seemed promising, until I found out the founders would rather alienate their core demographic despite ~900 votes on the GH issue asking to allow local models: https://github.com/warpdotdev/Warp/issues/4339. So I deleted their crappy app; even Cursor provides some basic support for an OpenAI endpoint.

spmurrayzzz · a month ago
> I spent at least an hour trying to get OpenCode to use a local model and then found a graveyard of PRs begging for Ollama support

Almost from day one of the project, I've been able to use local models. Llama.cpp worked out of the box with zero issues, same with vLLM and SGLang. The only tweak I had to make initially was manually changing the system prompt in my fork, but now you can do that via their custom modes feature.

The ollama support issues are specific to that implementation.
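
For anyone curious what that looks like in practice, this is roughly the shape of it outside of any particular agent: llama.cpp's llama-server speaks the OpenAI chat API (as do vLLM and SGLang), so any OpenAI-style client can point at it. The port and model name below are just placeholders for whatever you're running locally.

    from openai import OpenAI

    # llama-server defaults to port 8080; adjust base_url for your local setup.
    # api_key can be any string for a local server that doesn't check auth.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # placeholder; many local servers ignore this field
        messages=[{"role": "user", "content": "Write a haiku about terminals."}],
    )
    print(resp.choices[0].message.content)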

spmurrayzzz commented on LLM Embeddings Explained: A Visual and Intuitive Guide   huggingface.co/spaces/hes... · Posted by u/eric-burel
krackers · a month ago
The paper showing that dropping positional encoding entirely is feasible is https://arxiv.org/pdf/2305.19466. But I was misremembering its long-context performance: Llama 4 does use NoPE, but it's still interleaved with RoPE layers. Just an armchair commenter though, so I may well be wrong.

My intuition for NoPE was that the presence of the causal mask provides enough of a signal to implicitly distinguish token position. If you imagine the flow of information in the transformer network, tokens later on in the sequence "absorb" information from the hidden states of previous tokens, so in this sense you can imagine information flowing "down (depth) and to the right (token position)", and you could imagine the network learning a scheme to somehow use this property to encode position.

spmurrayzzz · a month ago
Ah, didn't realize you were referring to NoPE explicitly. And yeah, the intuitions gained from that paper are pretty much what I alluded to above: you don't get away with never modeling the positional data; the question is how you model it effectively and where you derive that signal from.

NoPE never really took off more broadly in modern architecture implementations. We haven't seen anyone successfully reproduce the proposed solution to the long context problem presented in the paper (tuning the scaling factor in the attention softmax).

There was a recent paper back in December[1] that talked about the idea of positional information arising from the similarity of nearby embeddings. It's again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit, though, to the NoPE idea you mentioned of the causal mask providing some amount of position signal, i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD on whether that proves out long term).

This all goes back to my original comment, though, about how challenging it is to communicate this idea to AI/ML neophytes. I don't think skipping the concept of positional information actually makes these systems easier to comprehend, since it's critically important to how we model language, but it's also really complicated to explain in terms of implementation.

[1] https://arxiv.org/abs/2501.00073

spmurrayzzz commented on LLM Embeddings Explained: A Visual and Intuitive Guide   huggingface.co/spaces/hes... · Posted by u/eric-burel
krackers · a month ago
I've read that "No Position Embedding" seems to be better for long-term context anyway, so it's probably not something essential to explain.
spmurrayzzz · a month ago
Do you have a citation for the paper on that? IME, that's not really something you see used in practice, at least not after 2022 or so. Without some form of positional adjustment, transformer-based LLMs have no way to differentiate between "The dog bit the man." and "The man bit the dog.", given the token ids are nearly identical. You just end up back in the bag-of-words problem space. The self-attention mechanism is permutation-invariant, so as long as it remains true that the attention scores are computed over an unordered set, you need some way to model the sequence.
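
To make that concrete, here's a toy numpy sketch (random embeddings, a single head, no learned projections, so purely illustrative rather than any real model): without a positional signal, the two sentences produce exactly the same set of representations, just reordered.

    import numpy as np

    def attention(x):
        # plain self-attention: no positional encoding, no causal mask
        scores = x @ x.T / np.sqrt(x.shape[-1])
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        return weights @ x

    rng = np.random.default_rng(0)
    vocab = {w: rng.normal(size=8) for w in ["the", "dog", "bit", "man"]}

    s1 = ["the", "dog", "bit", "the", "man"]
    s2 = ["the", "man", "bit", "the", "dog"]
    x1 = np.stack([vocab[w] for w in s1])
    x2 = np.stack([vocab[w] for w in s2])

    perm = [0, 4, 2, 3, 1]  # s2 is just s1 reordered
    print(np.allclose(attention(x1)[perm], attention(x2)))  # True: same outputs, shuffled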

Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this with the closed-source frontier models, but given that all the long context models in the open weight domain are absolutely encoding positional data, coupled with the fact that the majority of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it there as well.

EDIT: there is a recent paper that addresses the sequence modeling problem in another way, but it's somewhat orthogonal to the above as they're changing the tokenization method entirely: https://arxiv.org/abs/2507.07955

spmurrayzzz commented on LLM Embeddings Explained: A Visual and Intuitive Guide   huggingface.co/spaces/hes... · Posted by u/eric-burel
mbowcut2 · a month ago
The problem with embeddings is that they're basically inscrutable to anything but the model itself. It's true that they must encode the semantic meaning of the input sequence, but the learning process compresses it to the point that only the model's learned decoder head knows what to do with it. Anthropic's developed interpretable internal features for Sonnet 3 [1], but from what I understand that requires somewhat expensive parallel training of a network whose sole purpose is attempt to disentangle LLM hidden layer activations.

[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...

spmurrayzzz · a month ago
Very much agree re: inscrutability. It gets even more complicated when you add the LLM-specific concept of rotary positional embeddings to the mix. In my experience, it's been exceptionally hard to communicate that concept to even technical folks that may understand (at a high level) the concept of semantic similarity via something like cosine distance.

I've come up with so many failed analogies at this point that I've lost count (the concept of fast and slow clocks representing the positional index / angular rotation has been the closest I've come so far).
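
For what it's worth, the version of the clock analogy that has landed best is just showing the rotation itself. A toy sketch of the idea (GPT-NeoX-style "rotate half" layout; the dims and base here are arbitrary, not any specific model's): each channel pair is a clock hand spinning at its own speed as position increases, fast hands resolving local order, slow hands covering long range.

    import numpy as np

    def rope(x, base=10000.0):
        # x: (seq_len, dim) with even dim. Each channel pair gets rotated by an
        # angle proportional to token position -- the "fast and slow clocks".
        seq_len, dim = x.shape
        half = dim // 2
        freqs = base ** (-np.arange(half) / half)               # one speed per pair
        angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

    q = np.ones((6, 8))   # the same "query" vector at six positions
    print(rope(q)[0])     # position 0: unrotated
    print(rope(q)[5])     # position 5: same vector, each pair rotated differently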

spmurrayzzz commented on Show HN: RULER – Easily apply RL to any agent   openpipe.ai/blog/ruler... · Posted by u/kcorbitt
spmurrayzzz · 2 months ago
Might end up being some confusion with the RULER benchmark from NVIDIA given the (somewhat shared) domain: https://github.com/NVIDIA/RULER

EDIT: by "shared" I only mean the adjacency to LLMs/AI/ML; RL is a pretty big differentiator though, and the project looks great.

spmurrayzzz commented on About AI Evals   hamel.dev/blog/posts/eval... · Posted by u/TheIronYuppie
davedx · 2 months ago
I've worked with LLMs for the better part of the last couple of years, including on evals, but I still don't understand a lot of what's being suggested. What exactly is a "custom annotation tool", and for annotating what?
spmurrayzzz · 2 months ago
Concrete example from my own workflows: in my IDE whenever I accept or reject a FIM completion, I capture that data (the prefix, the suffix, the completion, and the thumbs up/down signal) and put it in a database. The resultant dataset is annotated such that I can use it for analysis, debugging, finetuning, prompt mgmt, etc. The "custom" tooling part in this case would be that I'm using a forked version of Zed that I've customized in part for this purpose.
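
To be concrete about the shape of the data (a simplified sketch, not the actual Zed plumbing; the schema and names here are made up for illustration):

    import sqlite3, time

    conn = sqlite3.connect("fim_annotations.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS completions (
        ts REAL, prefix TEXT, suffix TEXT, completion TEXT, accepted INTEGER)""")

    def log_completion(prefix: str, suffix: str, completion: str, accepted: bool) -> None:
        # called from the editor whenever a FIM suggestion is accepted or rejected
        conn.execute(
            "INSERT INTO completions VALUES (?, ?, ?, ?, ?)",
            (time.time(), prefix, suffix, completion, int(accepted)),
        )
        conn.commit()

Everything downstream (analysis, debugging, finetuning data prep, prompt iteration) just reads from that table.
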
spmurrayzzz commented on Making 2.5 Flash and 2.5 Pro GA, and introducing Gemini 2.5 Flash-Lite   blog.google/products/gemi... · Posted by u/meetpateltech
quelladora · 2 months ago
MRCR does go significantly beyond multi-needle retrieval - that's why the performance drops off as a function of context length. It's still a very simple task (reproduce the i^th essay about rocks), but it's very much not solved.

See contextarena.ai and the original paper https://arxiv.org/abs/2409.12640

It also seems to match up well with evals like https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

The other evals you mention are not necessarily harder than this relatively simple one.

spmurrayzzz · 2 months ago
Sure. I didn't imply (or didn't mean to imply, at least) that I thought MRCR was solved, only that it's closer to testing raw retrieval than it is to testing long-range dependency resolution the way Longproc does. If retrieval is great but the model still implodes on the downstream task, the benchmark doesn't tell you the whole story. The point of my original comment was that even the frontier models are nowhere near as good at long context tasks as what I see anecdotally claimed about them in the wild.

> The other evals you mention are not necessarily harder than this relatively simple one.

If you're comparing MRCR to, for example, Longproc, I do think the latter is much harder. Or at least, much more applicable to long-horizon task domains where long context accumulates over time. But I think it's probably more accurate to say it's a more holistic, granular eval by comparison.

The tasks require the model to synthesize and reason over information that is scattered throughout the input context and across previously generated output segments. Additionally, the required output is lengthy (up to 8K tokens) and must adhere to a specific, structured format. The scoring is also more flexible than MRCR: you can use row-level F1 scores for tables, execution-based checks for code, or exact matches for formatted traces.
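
As an example of how simple that flexibility is to implement, row-level F1 for the table-style tasks is basically just set overlap over rows (simplified here; the real harness presumably normalizes rows before comparing):

    def row_f1(predicted_rows, gold_rows):
        # score a predicted table against gold as precision/recall over rows
        pred, gold = set(predicted_rows), set(gold_rows)
        if not pred or not gold:
            return 0.0
        overlap = len(pred & gold)
        precision, recall = overlap / len(pred), overlap / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(row_f1(["a|1", "b|2", "c|9"], ["a|1", "b|2", "c|3"]))  # ~0.67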

Just like NIAH, I don't think MRCR should be thrown out wholesale. I just don't think it can be pressed into the service of representing a more realistic long context performance measure.

EDIT: also wanted to note that using both types of evals in tandem is very useful for research and training/finetuning. If Longproc tanks and you don't have the NIAH/MRCR context, it's hard to know what capabilities are regressing. So using both in a hybrid eval approach is valuable in certain contexts. For end users only trying to gauge current inference-time performance, I think evals like RULER and Longproc have much higher value.

spmurrayzzz commented on Making 2.5 Flash and 2.5 Pro GA, and introducing Gemini 2.5 Flash-Lite   blog.google/products/gemi... · Posted by u/meetpateltech
NitpickLawyer · 2 months ago
> The performance is abysmal after ~32k.

Huh. We've not seen this in real-world use. 2.5 pro has been the only model where you can throw a bunch of docs into it, give it a "template" document (report, proposal, etc), even some other-project-example stuff, and tell it to gather all relevant context from each file and produce "template", and it does surprisingly well. Couldn't reproduce this with any other top tier model, at this level of quality.

spmurrayzzz · 2 months ago
We're a G Suite shop, so I set aside a ton of time trying to get 2.5 Pro to work for us. I'm not entirely unhappy with it (it's a highly capable model), but the long context implosion significantly limits it for the majority of task domains.

We have long context evals using internal data that are leveraged for this (modeled after longproc specifically) and the performance across the board is pretty bad. Task-wise for us, it's about as real world as it gets, using production data. Summarization, Q&A, coding, reasoning, etc.

But I think this is where the in-distribution vs out-of-distribution distinction really carries weight. If the model has seen more instances of your token sequences in training and thus has more stable semantic representations of them in latent space, it would make sense that it would perform better on average.

In my case, the public evals align very closely with performance on internal enterprise data. They both tank pretty hard. Notably, this is true for all models after a certain context cliff. The flagship frontier models predictably do the best.

spmurrayzzz commented on Making 2.5 Flash and 2.5 Pro GA, and introducing Gemini 2.5 Flash-Lite   blog.google/products/gemi... · Posted by u/meetpateltech
ttul · 2 months ago
I can throw a pile of NDAs at it and it neatly pulls out relevant stuff from them within a few seconds. The huge context window and excellent needle in a haystack performance is great for this kind of task.
spmurrayzzz · 2 months ago
The NIAH performance is a misleading indicator for performance on the tasks people really want the long context for. It's great as a smoke/regression test. If you're bad on NIAH, you're not gonna do well on the more holistic evals.

But the long context eval they used (MRCR) is limited. It's multi-needle, so that's a start, but it's not evaluating long range dependency resolution or topic modeling, which are the things you actually care about beyond raw retrieval for downstream tasks. Better than nothing, but not great for just throwing a pile of text at it and hoping for the best. Particularly for out-of-distribution token sequences.

I do give google some credit though, they didn't try to hide how poorly they did on that eval. But there's a reason you don't see them adding RULER, HELMET, or LongProc to this. The performance is abysmal after ~32k.

EDIT: I still love using 2.5 Pro for a ton of different tasks. I just tend to have all my custom agents compress the context aggressively for any long context or long horizon tasks.

u/spmurrayzzz

Karma: 614 · Cake day: February 7, 2013
About
Cofounder, VP Engineering @ Starry Internet