And interestingly, that is indeed the feature I find most compelling about Cursor. I particularly love it when I'm doing a small refactor, like changing a naming convention for a few variables: after I make the first edit manually, Cursor jumps in with tab suggestions for the rest.
To me, that fully encapsulates the definition of a HUD. It's a delightful experience, and it's also why I think anyone who pushes the exclusively copilot-oriented Claude Code as a superior replacement is just wrong.
I've spent the last few months experimenting with both Claude Code and Cursor. For simple tasks, both are pretty good (like identifying a bug given console output) - but for a big change, like adding a brand-new feature to existing code that requires touching lots of files, writing tests, etc., they often make at least a few mistakes I catch on review, and prompting the model to fix those mistakes often causes it to fix things in strange ways.
A few days ago, I had a bug I just couldn't figure out. I prompted Claude to diagnose and fix the issue - but after five minutes or so of trying out different ideas, rerunning the test, and getting stuck just like I did - it simply disabled the test and called the job complete. If I hadn't been watching what it was doing, I could have missed that and deployed broken code.
For the last week or so, I've switched entirely from relying on prompting to writing the code myself and using tab complete to fill in maybe 80% of it. It's slower, but I have more control, and honestly, it's a much more enjoyable experience.
The naive solution I could come up with would be really expensive through the OpenAI API, but with an open-source model you can write custom inference that walks through the text one token at a time; at each position, you compare the logprob the model assigned to the token that's actually there against the logprob of its top prediction, and use that gap to color the token.
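For concreteness, here's a minimal sketch of that per-token pass. It assumes GPT-2 via Hugging Face transformers as the open-source model; the model choice, the top-prediction-minus-actual gap metric, the threshold, and the ANSI coloring are all my own guesses, not a tested tool:

```python
# Sketch: score each token by how much the model "disagrees" with it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprise(text: str):
    """Return (token, gap) pairs, where gap = logprob of the model's
    top prediction minus logprob of the token actually present."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    out = []
    # Logits at position i predict token i+1, so the first token
    # never gets a score (it has no context).
    for i in range(ids.shape[1] - 1):
        actual_id = ids[0, i + 1].item()
        actual_lp = logprobs[0, i, actual_id].item()
        top_lp = logprobs[0, i].max().item()
        out.append((tokenizer.decode(actual_id), top_lp - actual_lp))
    return out

def highlight(text: str, threshold: float = 4.0):
    """Print the text, coloring 'surprising' tokens red via ANSI codes."""
    for tok, gap in token_surprise(text):
        print(f"\033[91m{tok}\033[0m" if gap > threshold else tok, end="")
    print()
```

The gap is zero when the model would have written exactly what's there and grows as the actual token gets more surprising; where to set the red threshold would need tuning per model and per codebase.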
The downside I imagine with this approach is that it would probably tend to highlight the beginning of bad code rather than the entire block: once the model commits to a mistake, it will generally roll with it - i.e., a 'hallucination' - so tokens after the bug might look only slightly more surprising than normal.
Another option might be to use a diffusion-based model: add some noise to the input, let it iterate a few passes, and measure which parts of the text changed the most. I have only a light theoretical understanding of these models, though, so I'm not sure how well that would work; a rough sketch of a cheap approximation is below.
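I can't sketch a real text-diffusion model from memory, but as a rough stand-in for the perturb-and-denoise idea, here's iterated masked-LM denoising with bert-base-uncased: randomly mask tokens over several rounds, let the model fill them back in, and score each position by how often the fill-in disagrees with the original. The model choice, mask rate, and round count are all assumptions:

```python
# Sketch: approximate "add noise, denoise, see what changes" with a
# masked LM instead of a true diffusion model.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def instability(text: str, rounds: int = 10, mask_rate: float = 0.15):
    """For each token, return the fraction of denoising rounds in which
    the model's fill-in disagreed with the token actually there."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    n = ids.shape[0]
    changed = torch.zeros(n)
    masked_counts = torch.zeros(n)
    for _ in range(rounds):
        noisy = ids.clone()
        # "Noise": mask a random subset of tokens, keeping [CLS]/[SEP].
        mask = torch.rand(n) < mask_rate
        mask[0] = False
        mask[-1] = False
        noisy[mask] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(noisy.unsqueeze(0)).logits[0]
        filled = logits.argmax(dim=-1)
        # Where we masked, did the model restore the original token?
        changed += (mask & (filled != ids)).float()
        masked_counts += mask.float()
    scores = changed / masked_counts.clamp(min=1)
    return list(zip(tokenizer.convert_ids_to_tokens(ids.tolist()),
                    scores.tolist()))
```

Tokens the model keeps rewriting score near 1 and would be the ones to highlight; whether that actually correlates with bugs is exactly the open question.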