A lot of this must come down to execution. And there's a lot of snake oil out there at the execution layer.
My advice here is: make the model your own. It's open weight, and I encourage you to make it useful for your use case and your users, and beneficial for society as well. We did our best to give you a great starting point, and for Norwegian in particular we intentionally kept the large embedding table to make adaptation to larger vocabularies easier.
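In practice, adapting the vocabulary looks roughly like this, a minimal sketch with Hugging Face transformers (the model id and the added tokens are illustrative, not an official recipe):

```python
# Minimal vocabulary-extension sketch (model id and token list
# are illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Add Norwegian-specific tokens the base tokenizer splits poorly.
new_tokens = ["kjærlighet", "såkalte", "Nordland"]  # illustrative
num_added = tokenizer.add_tokens(new_tokens)

# A large embedding table leaves headroom for this; resize so the
# new ids have embedding rows.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocab size now {len(tokenizer)}")
```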
I run a game where players can post messages. It's a game where players can kill each other, and people often send threats along the lines of "I will kill you". Telling Gemma to classify a message as game-related or a real-life threat, explaining that the message comes from a game where players can kill each other and threats are part of the game, and instructing it to mark a message as game-related when it is unclear which it is, does not work well. For other, similar tasks it seems to follow instructions well, but on serious topics it seems heavily biased and often errs on the side of caution, despite being told not to. Sometimes it even spits out help lines to contact.
I guess this is because it was trained to be safe, and that affects its ability to follow instructions here? Or am I completely off?
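For context, the setup looks roughly like this, a minimal sketch (the exact prompt wording and model id are mine, not a recommended recipe):

```python
# Rough sketch of the classification setup (prompt wording and
# model id are illustrative).
from transformers import pipeline

classifier = pipeline("text-generation", model="google/gemma-2-2b-it")

INSTRUCTIONS = (
    "You classify chat messages from an online game where players can "
    "kill each other, so threats are a normal part of play. Answer with "
    "exactly one word: GAME or REAL. If it is unclear whether a threat "
    "is game-related or real-life, answer GAME."
)

def classify(message: str) -> str:
    # Gemma's chat template has no separate system role, so the
    # instructions go in the user turn.
    prompt = f"{INSTRUCTIONS}\n\nMessage: {message}\nAnswer:"
    out = classifier([{"role": "user", "content": prompt}], max_new_tokens=5)
    # The pipeline returns the full chat; the last message is the reply.
    return out[0]["generated_text"][-1]["content"].strip()

print(classify("I will kill you"))  # often REAL, sometimes with a help line
```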
For your use case, you should probably fine-tune the model to reduce the rejection rate.
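A minimal sketch of what that could look like, assuming you've labeled a few hundred in-game messages yourself (model id, hyperparameters, and the toy data format are all illustrative):

```python
# LoRA fine-tuning sketch to reduce refusals on in-game threats
# (model id, hyperparameters, and the toy dataset are illustrative).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "google/gemma-2-2b-it"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model = get_peft_model(model, LoraConfig(
    r=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Each example shows the model answering the classification directly
# instead of refusing or appending help lines.
examples = [
    {"text": "Message: I will kill you\nLabel: GAME"},
    {"text": "Message: I know where you live in real life\nLabel: REAL"},
    # ... a few hundred more, covering the ambiguous cases
]
dataset = Dataset.from_list(examples).map(
    lambda e: tokenizer(e["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments("gemma-game-classifier",
                           per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```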
I think the next step is getting an official "checked" mark from the SWE-bench team:
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...
It's up to your retrieval system/model to selectively hunt for relevant context. Here are a few critiques of the benchy:
We share the public/consumer simulators, but we also build bespoke environments on a per-customer basis (think enterprise sites or even full VMs loaded with applications and data).
environment creation scalability is a big priority for us. we currently automate most of the process, but it still takes a fair bit of manual work to finish them and to get the details right. there is some reusability across environments, for example, we can use the flight results generation code in any travel/flightbooking sim. we also have some semi-automated approaches for creating tasks and verifiers. but still lots of work to be done here.
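To make "tasks and verifiers" concrete, here's a toy sketch of what a programmatic verifier for a flight-booking sim could look like (the state shape, field names, and task spec are invented for illustration; this is not our actual code):

```python
# Toy verifier for a flight-booking task (field names and the
# task spec are invented for illustration).
from dataclasses import dataclass

@dataclass
class Task:
    origin: str
    destination: str
    max_price: float

def verify(task: Task, sim_state: dict) -> bool:
    """Check the simulator's final state against the task goals."""
    bookings = sim_state.get("bookings", [])
    return any(
        b["origin"] == task.origin
        and b["destination"] == task.destination
        and b["price"] <= task.max_price
        for b in bookings
    )

# Example: the agent ran in the sim and left this final state.
state = {"bookings": [{"origin": "SFO", "destination": "OSL", "price": 640.0}]}
print(verify(Task("SFO", "OSL", 700.0), state))  # True
```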
Our chemists were split: some argued it was an artifact, while others dug deep and provided reasoning for why the generations were sound. Keep in mind that this was a non-reasoning, very early-stage model with simple feedback mechanisms for structure and molecular properties.
In the wet lab, the model turned out to be right. That was five years ago. My point is that the same moment that arrived for our chemists will soon arrive for theoreticians.