calebkaiser commented on Claude Code IDE integration for Emacs   github.com/manzaltu/claud... · Posted by u/kgwgk
megaloblasto · 18 days ago
This is awesome. I love Emacs and I love integrating AI into my coding workflow.

What I really want is to be able to run something like this locally for, say, less than $2,000 in computer hardware. Is this feasible now, or any time soon? Anyone out there using agents with local models for coding?

calebkaiser · 18 days ago
There's a lot of great work both on memory-efficient inference (i.e., running on closer-to-consumer hardware) and on open-source, code-focused models.

A lot of people are excited about the Qwen3-Coder family of models: https://huggingface.co/collections/Qwen/qwen3-coder-687fc861...

For running locally, there are tools like Ollama and LM Studio. Your hardware needs will vary depending on the size and quantization of the model you want to run, but $2k in hardware is enough for a lot of models. Some people have good experiences with the M-series Macs, which are probably a good bang for the buck if you're exclusively interested in inference.
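
If you go the Ollama route, the setup is minimal. Here's a rough sketch using Ollama's OpenAI-compatible local endpoint; the model name is just a stand-in for whatever coder model you've pulled:

    # Minimal sketch: chat with a locally served model via Ollama's
    # OpenAI-compatible endpoint. Assumes Ollama is running on this machine
    # and a coder model has already been pulled; the model name is illustrative.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    resp = client.chat.completions.create(
        model="qwen2.5-coder",  # whatever you pulled with `ollama pull`
        messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    )
    print(resp.choices[0].message.content)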

I'd recommend checking out the LocalLlamas subreddit for more: https://www.reddit.com/r/LocalLLaMA/

Getting results on par with big labs isn't feasible, but if you prefer to run everything locally, it is a fun and doable project.

calebkaiser commented on AI promised efficiency. Instead, it's making us work harder   afterburnout.co/p/ai-prom... · Posted by u/mooreds
ryandrake · 20 days ago
It's insane that any company would just be OK with "IDK, Claude did that" any more than a 2010 version of that company would be OK with "IDK, I copy-pasted it from StackOverflow." Have engineering managers actually drunk this Kool-Aid to the point where they're OK with their direct reports just chucking PRs over the wall that they don't even understand?
calebkaiser · 20 days ago
I don't think this is an AI-specific thing. I work in the field, so I'm around some of the most enthusiastic adopters of LLMs, and from what I see, engineering cultures surrounding LLM usage typically match the org's previous general engineering culture.

So, for example, by and large the orgs I've seen chucking Claude PRs over the wall with little review were previously chucking 100% human-written PRs over the wall with little review.

Similarly, the teams I see effectively using test suites to guide their code generation are the same teams that effectively use test suites to guide their general software engineering workflows.

calebkaiser commented on We may not like what we become if A.I. solves loneliness   newyorker.com/magazine/20... · Posted by u/defo10
rayiner · 22 days ago
Coal miners in 1890s Appalachia had healthier and more active social lives than American white-collar workers. This does not have anything to do with economics.
calebkaiser · 22 days ago
The 1890s were the launching point for widespread unionization among coal miners in places like my home state of Kentucky. Company towns were increasingly common, and major motivations for unionization were to combat things like being paid in company scrip or having neighborhood kids ("breaker boys" as young as eight) working in the mines. Their social lives, from their neighborhoods to their social "clubs" to the literal currency they were able to use, were entirely defined by their job and the company they worked for.

Tough to use them as proof that this "doesn't have anything to do with economics" when their entire social life was defined by the economics of coal mining.

calebkaiser commented on 'Positive review only': Researchers hide AI prompts in papers   asia.nikkei.com/Business/... · Posted by u/ohjeez
pcrh · 2 months ago
After a brief scan, I'm not competent to evaluate the essay by Chris Olah you posted.

I probably could get an LLM to do so, but I won't....

calebkaiser · 2 months ago
Neel Nanda is also very active in the field and writes some potentially more approachable articles, if you're interested: https://www.neelnanda.io/mechanistic-interpretability

Much of their work is focused on discovering "circuits" in the activations between layers as they process data, which correspond to dynamics the model has learned. So, as a simple hypothetical example, instead of embedding the answers to a million arbitrary addition problems in its weights, a model might learn a circuit that approximates the operation of addition.
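
To make the memorization-vs-learned-rule distinction concrete, here's a toy analogy in numpy (not actual mechanistic interpretability, just an illustration of why a learned rule generalizes where a lookup table can't):

    import numpy as np

    # Toy analogy only: contrast a "model" that memorizes answers with one
    # that has learned the underlying operation.
    rng = np.random.default_rng(0)
    train = rng.integers(0, 1000, size=(50, 2))   # 50 "seen" addition problems
    targets = train.sum(axis=1)

    # "Memorization": a lookup table over the training problems only.
    lookup = {(int(a), int(b)): int(a) + int(b) for a, b in train}

    # "Circuit": fit y = w1*a + w2*b by least squares; it recovers w ~ [1, 1].
    w, *_ = np.linalg.lstsq(train.astype(float), targets.astype(float), rcond=None)

    unseen = (1234, 5678)             # guaranteed not in the training set
    print(lookup.get(unseen))         # None -- the lookup table can't generalize
    print(float(np.dot(w, unseen)))   # ~6912.0 -- the learned rule does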

calebkaiser commented on 'Positive review only': Researchers hide AI prompts in papers   asia.nikkei.com/Business/... · Posted by u/ohjeez
pcrh · 2 months ago
How is an LLM supposed to review an original manuscript?

At their core (and as far as I understand), LLMs are based on pre-existing texts, and use statistical algorithms to stitch together text that is consistent with these.

An original research manuscript will not have formed part of any LLM's training dataset, so there is no conceivable way that it can evaluate it, regardless of claims that LLMs "understand" anything or not.

Reviewers who use LLMs are likely deluding themselves that they are now more productive due to use of AI, when in fact they are just polluting science through their own ignorance of epistemology.

calebkaiser · 2 months ago
You might be interested in work around mechanistic interpretability! In particular, if you're interested in how models handle out-of-distribution information and apply in-context learning, research around so-called "circuits" might be up your alley: https://www.transformer-circuits.pub/2022/mech-interp-essay
calebkaiser commented on About AI Evals   hamel.dev/blog/posts/eval... · Posted by u/TheIronYuppie
davedx · 2 months ago
I've worked with LLMs for the better part of the last couple of years, including on evals, but I still don't understand a lot of what's being suggested. What exactly is a "custom annotation tool", and for annotating what?
calebkaiser · 2 months ago
Typically, you would collect a ton of execution traces from your production app. Annotating them can mean a lot of different things, but it often means some mixture of automated scoring and manual review. At the earliest stages, you're usually annotating common modes of failure, so you can say things like "In 30% of failures, the retrieval component of our RAG app is grabbing irrelevant context" or "In 15% of cases, our chat agent misunderstood the user's query and did not ask clarifying questions."

You can then create datasets out of these traces, and use them to benchmark improvements you make to your application.
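
To sketch what that can look like in practice (the trace fields and failure labels here are purely illustrative, not any particular tool's schema):

    # Hypothetical sketch of trace annotation and failure-mode aggregation.
    from collections import Counter
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Trace:
        trace_id: str
        user_query: str
        retrieved_context: str
        model_output: str
        failure_mode: Optional[str] = None  # filled in during manual review

    traces = [
        Trace("t1", "What's your refund policy?", "shipping FAQ", "...", "irrelevant_context"),
        Trace("t2", "Cancel my order", "refund FAQ", "...", "missed_clarifying_question"),
        Trace("t3", "What's your refund policy?", "refund FAQ", "...", None),  # no failure
    ]

    failures = [t for t in traces if t.failure_mode]
    for mode, n in Counter(t.failure_mode for t in failures).items():
        print(f"{mode}: {n / len(failures):.0%} of failures")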

calebkaiser commented on About AI Evals   hamel.dev/blog/posts/eval... · Posted by u/TheIronYuppie
calebkaiser · 2 months ago
I'm biased in that I work on an open source project in this space, but I would strongly recommend starting with a free/open source platform for debugging/tracing, annotating, and building custom evals.

This niche of the field has come a very long way just over the last 12 months, and the tooling is so much better than it used to be. Trying to do this from scratch, beyond a "kinda sorta good enough for now" setup, is a full-time engineering project in and of itself.

I'm a maintainer of Opik, but you have plenty of options in the space these days for whatever your particular needs are: https://github.com/comet-ml/opik

calebkaiser commented on The AI Backlash Keeps Growing Stronger   wired.com/story/generativ... · Posted by u/01-_-
yahoozoo · 2 months ago
LLMs are plateauing. At this point, anyone who has cared enough knows their fundamental limitations. Don't get me wrong, they do provide immense value and have grown quickly in just a few years. But they will probably never get past the "we must treat it like a junior engineer" phase, especially as LLM outputs inevitably leak back into the training data since everyone is using them (or being forced to by employers). Notice that everything major from these companies lately hasn't been a vastly improved model over the previous iteration, but products and tooling: see Anthropic allowing you to create in-Claude apps, every company's own version of an agentic CLI, and so on.
calebkaiser · 2 months ago
I think this is recency bias. There were roughly 33 months between the initial release of GPT-3 and GPT-4, which was released in March 2023.
calebkaiser commented on Enterprises are getting stuck in AI pilot hell, say Chatterbox Labs execs   theregister.com/2025/06/0... · Posted by u/dijksterhuis
calebkaiser · 3 months ago
Before getting too invested in any conclusions drawn from this piece, it's important to recognize this is mostly PR from Chatterbox.

From the Chatterbox site:

> Our patented AIMI platform independently validates your AI models & data, generating quantitative AI risk metrics at scale.

The article's subtitle:

> Security, not model performance, is what's stalling adoption
