extr commented on Kimi K2 1T model runs on 2 512GB M3 Ultras   twitter.com/awnihannun/st... · Posted by u/jeudesprits
docjay · 2 days ago
Try this system prompt: “Your responses are always bald-on-record only; suppress FTA redress, maximize unmitigated dispreference marking and explicit epistemic stance-taking. Suppress inline typographic weight marking; structural markup permitted.”

Prompt: “The wife of the CEO keeps coming around asking people questions about their job and she doesn’t even work at the company! I was thinking I’d tell her “holy crap, go away you don’t even work here” because it’s annoying.”

# Sonnet 4.5

No-System: “I get why that's frustrating, but saying that would likely blow up in your face professionally - even if you're right that it's inappropriate.

Some better options: … “

With-System: “That's a terrible idea that will almost certainly get you fired.

Telling the CEO's spouse to "go away" in those terms is career suicide. She has direct influence over the person who controls your employment. Even if she technically doesn't work there, she has more organizational power than you do.

Your actual options: … “

# GPT 5.1

No-System: “I’d strongly avoid saying that, even if you’re joking.

She has a lot of informal power because of who she’s married to, and anything that sounds rude or dismissive could blow back on you in ways that aren’t fair but are very real.

Better options, depending on your style and the culture there: … “

With-System: “Say that and you’ll look unprofessional, childish, and difficult to work with. It will damage your reputation, not hers.

You’ve got three real options: … “
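
If you want to reproduce the comparison, here is a minimal sketch assuming the OpenAI Python SDK (the model name is a placeholder; any chat model works, and the Anthropic SDK call is analogous):

```python
# Minimal A/B harness for the system prompt above, assuming the OpenAI
# Python SDK; "gpt-5.1" is a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "Your responses are always bald-on-record only; suppress FTA redress, "
    "maximize unmitigated dispreference marking and explicit epistemic "
    "stance-taking. Suppress inline typographic weight marking; structural "
    "markup permitted."
)
USER = (
    "The wife of the CEO keeps coming around asking people questions about "
    "their job and she doesn't even work at the company! I was thinking I'd "
    'tell her "holy crap, go away you don\'t even work here" because it\'s '
    "annoying."
)

for label, messages in [
    ("No-System", [{"role": "user", "content": USER}]),
    ("With-System", [{"role": "system", "content": SYSTEM},
                     {"role": "user", "content": USER}]),
]:
    resp = client.chat.completions.create(model="gpt-5.1", messages=messages)
    print(f"# {label}\n{resp.choices[0].message.content}\n")
```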

extr · 2 days ago
Are those responses really "better"? Having the LLM tell you you're wrong can mean different things. Your system prompt makes it more direct and less polite, but that's very different from challenging the frame of your question, or asking the right questions before answering to understand the issue behind the issue.

It's like how people used to make fun of StackOverflow:

> I'm having trouble with X, how do I make it work?

> What are you trying to do? Z? Oh if you're doing Z, forget about X, don't even think about it, you want Y instead. (Never answers anything about X).

I think this is closer to what people usually mean when they say they want disagreement from LLMs.

extr commented on OpenAI are quietly adopting skills, now available in ChatGPT and Codex CLI   simonwillison.net/2025/De... · Posted by u/simonw
sigmoid10 · 3 days ago
Anthropic is in a bit of a rough spot if you look at the raw data points we have available. Their valuation is in the same order of magnitude as OpenAI, but they have orders of magnitude fewer users. And current leaderboards for famous unsolved benchmarks like ARC AGI and HLE are also dominated by Google and OpenAI. Announcements like the one you linked are the only way for Anthropic to stay in the news cycle and justify its valuation to investors. Their IPO rumours are yet another example of this. But I really wonder how long that strategy can keep working.
extr · 3 days ago
Hard to believe you could be so misinformed. Anthropic is not far behind OAI on revenue and has a much more stable position, with most of it coming from enterprise/business customers.
extr commented on OpenAI are quietly adopting skills, now available in ChatGPT and Codex CLI   simonwillison.net/2025/De... · Posted by u/simonw
extr · 3 days ago
It’s crazy how Anthropic keeps coming up with sticky “so simple it seems obvious” product innovations and OpenAI plays catch up. MCP is barely a protocol. Skills are just md files. But they seem to have a knack for framing things in a way that just makes sense.
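
To illustrate just how simple: a skill is essentially a folder containing a SKILL.md whose frontmatter carries a name and a description. The skill below is a made-up example, but the file shape matches Anthropic's published format:

```markdown
---
name: commit-messages
description: House style for commit messages in this repo. Use when writing or reviewing commits.
---

Write commit messages in the imperative mood ("Fix parser", not "Fixed parser").
Keep the subject line under 72 characters and explain the "why" in the body.
```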
extr · 3 days ago
Oh yeah I forgot the biggest one. Claude fucking code. Lol
extr commented on Mistral 3 family of models released   mistral.ai/news/mistral-3... · Posted by u/pember
popinman322 · 14 days ago
They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.

There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.

A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.

extr · 14 days ago
??? Closed US frontier models are vastly more effective than anything OSS right now. The reason they didn't compare is that they're in a different weight class (and therefore a different product), and the comparison would be a bit unfair.

We're actually at a unique point right now, where the gap is larger than it has been in some time. The consensus since the latest batch of releases is that we haven't found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models, and unless you have unique requirements some way down the price/perf curve, I would not even look at this release (which is fine!)

extr commented on Python is not a great language for data science   blog.genesmindsmachines.c... · Posted by u/speckx
paddleon · 21 days ago
> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

You're forgetting R's data.table, https://cran.r-project.org/web/packages/data.table/vignettes..., which is amazing. Tibbles only won because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.

extr · 21 days ago
Yeah, data.table is just about the best-in-class tool/package for true high-throughput "live" data analysis. dplyr is great if you are learning the ropes, or if you want to write something that colleagues with less experience can easily spot-check. But in my experience, if you chat with the people working in the trenches at banks, lenders, and insurance companies, running hundreds of hand-spun crosstabs/correlational analyses daily, you will find a lot of data.table users.

Relevant to the author's point, Python is pretty poor for this kind of thing. Pandas is a perf mess. Polars, DuckDB, Dask, etc. are fine, perhaps, for production data pipelines, but quite verbose and persnickety for rapid iteration. If you put a gun to my head and told me to find some nuggets of insight in some massive flat files, I would ask for an RStudio cloud instance + data.table hosted on a VM with 256GB+ of RAM.

extr commented on Sunsetting Supermaven   supermaven.com/blog/sunse... · Posted by u/vednig
extr · 23 days ago
Super disappointing that there's effectively no "open" competitor in this space close to parity with Cursor/Supermaven. Although I wouldn't have guessed the product category would get out-competed by agentic AI writing 100% of code so soon.
extr commented on GPT-5.1: A smarter, more conversational ChatGPT   openai.com/index/gpt-5-1/... · Posted by u/tedsanders
gilfoy · a month ago
It might actually help the model output answers with less nonsense.

As an example: in one workflow I ask ChatGPT to figure out whether the user is referring to a specific location and to output a country in JSON, like { country }.

It has some error rate at this task. Asking it for a rationale reduces that error rate to almost nothing: { rationale, country }. However, reordering the keys as { country, rationale } does not. You get the wrong country, along with a rationale that justifies the correct one that was never given.

extr · a month ago
This is/was a great trick for improving the accuracy of small model + structured output setups, an old-fashioned Chain of Thought kind of thing. E.g., I used it with structured outputs in Gemini Flash 2.0 to significantly improve the quality of answers. Not sure whether 2.5 Flash requires it, but for 2.0 Flash you could use the propertyOrdering field to force a specific ordering of the JSON Schema response items, making it output things like "plan", "rationale", or "reasoning" as the first item, then simply discarding it.
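
Something like this, as a rough sketch assuming the current google-genai Python SDK (the schema, prompt, and model name are illustrative; propertyOrdering is the documented Gemini schema field for key order):

```python
# Illustrative sketch: force "rationale" before "country" so the model
# reasons before answering, then discard the rationale. Assumes the
# google-genai SDK; schema and prompt are made up for the example.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

schema = {
    "type": "OBJECT",
    "properties": {
        "rationale": {"type": "STRING"},
        "country": {"type": "STRING"},
    },
    "required": ["rationale", "country"],
    "propertyOrdering": ["rationale", "country"],  # rationale first
}

resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="The user mentioned 'the Big Apple'. Which country is it in?",
    config={
        "response_mime_type": "application/json",
        "response_schema": schema,
    },
)
print(resp.text)  # parse the JSON, keep "country", throw away "rationale"
```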
extr commented on Claude for Excel   claude.com/claude-for-exc... · Posted by u/meetpateltech
jay_kyburz · 2 months ago
I think perhaps you underestimate how antithetical the current batch of LLM AIs is to what most programmers strive for every day, and to what we want from our tools. It's not about losing our jobs, it's about "correctness" (or, as said below, being deterministic).

In a lot of jobs, particularly in creative industries, or marketing, media and writing, the definition of a job well done is a fairly grey area. I think AI will be mostly disruptive in these areas.

But in programming there is a hard minimum of quality. Given a set of inputs, does the program return the correct answer or not? When you ask it what 2+2 is, do you get 4?

When you ask AI anything, it might be right 50% of the time, or 70% of the time, but you can't blindly trust the answer. A lot of us just find that not very useful.

extr · 2 months ago
I am a SWE myself and use LLMs to write ~100% of my code. That does not mean I fire-and-forget multiplexed Codex instances; many times I step through and approve every edit. Even if it were nothing but a glorified stenographer, there are substantial time savings in being able to prototype and validate ideas quickly.
extr commented on Claude for Excel   claude.com/claude-for-exc... · Posted by u/meetpateltech
mbesto · 2 months ago
> All processes in reality, everywhere, are probabilistic.

If we want to go into philosophy then sure, you're correct, but that's not what we're saying.

For example, an LLM is capable of creating a reference to a non-existent source, and it's highly plausible that it will. Humans generally don't do that when their goal is clear and aligned (hence deterministic).

> Building a process to get a similar confidence in LLM output is part of the game.

Which is precisely my point. LLMs are supposed to be better than humans. We're (currently) shoehorning the technology.

extr · 2 months ago
> Humans generally don't do that when their goal is clear and aligned (hence deterministic).

Look at the language you're using here. Humans "generally" make fewer of these kinds of errors. "Generally". That is literally an assessment of likelihood. It is completely possible for me to hire someone so stupid that they create a reference to a non-existent source. It's completely possible for my high-IQ genius employee who is correct 99.99% of the time to have an off day and accidentally fat-finger something. It happens. Perhaps it happens at 1/100th of the rate that an LLM would do it. But that is simply an input to the model of the process or system I'm trying to build, one that I need to account for.
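
To make that concrete, here is a toy sketch that treats the fabrication rate as just another parameter in the process model (all rates below are made up for illustration):

```python
# Toy process model with hypothetical rates: probability that at least one
# fabricated reference survives a single review pass.
def p_any_slip(n_refs: int, p_fabricate: float, p_review_catches: float) -> float:
    p_slip = p_fabricate * (1 - p_review_catches)  # per-reference slip-through
    return 1 - (1 - p_slip) ** n_refs

# Hypothetical: an LLM fabricates 1% of references, a careful human 0.01%,
# and review catches 95% of fabrications either way, over 100 references.
print(f"LLM drafter:   {p_any_slip(100, 0.01,   0.95):.3f}")   # ~0.049
print(f"Human drafter: {p_any_slip(100, 0.0001, 0.95):.4f}")   # ~0.0005
```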
