slewis commented on Deep learning gets the glory, deep fact checking gets ignored   rachel.fast.ai/posts/2025... · Posted by u/chmaynard
amelius · 3 months ago
Before making AI do research, perhaps we should first let it __reproduce__ research. For example, give it a paper of some deep learning technique and make it produce an implementation of that paper. Before it can do that, I have no hope that it can produce novel ideas.
slewis · 3 months ago
OpenAI created a benchmark for this: https://openai.com/index/paperbench/
slewis commented on Run DeepSeek R1 Dynamic 1.58-bit   unsloth.ai/blog/deepseekr... · Posted by u/noch
slewis · 7 months ago
It would be really useful to see these evaluated across some of the same evals that the original R1 and deepseek's distills were evaluated on.
slewis commented on AI Programmer, from Weights and Biases   medium.com/@shawnup/the-b... · Posted by u/gwintrob
slewis · 7 months ago
Hey, that's me!

Happy to answer any questions about how this works if folks are interested.

slewis commented on OpenAI O3 breakthrough high score on ARC-AGI-PUB   arcprize.org/blog/oai-o3-... · Posted by u/maurycy
timabdulla · 8 months ago
What's your explanation for why it can only get ~70% on SWE-bench Verified?

I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can ask what GitHub issue #4145 in project foo was, and there's a decent chance it can tell you exactly what the issue was about!)

slewis · 8 months ago
I've spent tons of time evaluating o1-preview on SWEBench-Verified.

For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.

For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect it's more than on a simpler benchmark like MATH.
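
To make the "above and beyond" case concrete, here is a minimal sketch in Python. It is a made-up illustration, not the actual benchmark instance: the function names, the port-parsing scenario, and the stand-in test are all assumptions. It shows how a patch that fixes exactly what the issue describes (the TypeError) can still fail a golden test that also exercises the ValueError path.

```python
# Hypothetical illustration; not taken from any real SWE-bench instance.

def parse_port_minimal(value, default=8080):
    """Patch that handles only what the issue report mentions (TypeError)."""
    try:
        return int(value)
    except TypeError:
        return default


def parse_port_golden(value, default=8080):
    """Patch in the spirit of the maintainer's fix: also handles ValueError."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def golden_test(parse_port):
    """Stand-in for the hidden golden test: checks both error paths."""
    try:
        assert parse_port(None) == 8080    # TypeError path, named in the issue
        assert parse_port("http") == 8080  # ValueError path, never mentioned
    except Exception:
        return False
    return True


print(golden_test(parse_port_minimal))  # False
print(golden_test(parse_port_golden))   # True
```

The minimal patch is a reasonable reading of the issue text, yet it is scored as a failure, which is why instances like this can depress scores independently of model capability.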

slewis commented on Show HN: Finic – Open source platform for building browser automations   github.com/finic-ai/finic... · Posted by u/jasonwcfan
slewis · a year ago
Is it stateful? Like can I do a run, read the results, and then do another run from that point?
slewis commented on Founder Mode   paulgraham.com/foundermod... · Posted by u/bifftastic
slewis · a year ago
This 100% matches my experience.

I like to jokingly call founder mode "fine-grained multi-level oversight". Others might use the derogatory term "micromanagement".

That doesn't mean I control every decision, or that I don't give people space to be creative. What it means is: for whatever is most important for the business, I get involved with the details. The goal is that when I move out of that area, the team I worked with is able to operate closer to founder mode than when I started.

The issue is that vision fundamentally can't be communicated by telephone, or all at once. You're trying to get to a point on the map that most people can't see. The path to it is the integration of all of the tiny decisions everyone makes along the way.

If you only course correct from the highest level you'll never get there.

slewis commented on Sam Altman returns as CEO, OpenAI has a new initial board   openai.com/blog/sam-altma... · Posted by u/davidbarker
enraged_camel · 2 years ago
>> They overplayed their position. That's all there is to it.

They tried to enforce the non-profit's charter, as is their duty. I would hardly frame that as overplaying their hand.

slewis · 2 years ago
Overplay one’s hand: spoil one's chance of success through excessive confidence in one's position
slewis commented on ChatGPT with voice is now available to all free users   twitter.com/OpenAI/status... · Posted by u/Jimmc414
jmccarthy · 2 years ago
While walking the dog today, it talked me through some trade-offs between DBSCAN and isolation forests. Walking + verbalizing the problem is a very different and positive experience for me.

I've also used it several times on ~15-20min drives to memorize something I wanted to have available for immediate recall. I had it chunk & quiz me, and by the end of the drive I had it down pat. Fun use of drive time.

slewis · 2 years ago
The memorization use case is brilliant. Put your talk track for a presentation in and say “help me memorize this by quizzing me”. Thanks!
slewis commented on Do something, so we can change it   allenpike.com/2023/do-som... · Posted by u/ingve
slewis · 2 years ago
I call this "keep the fingers moving".

u/slewis

Karma: 1397
Cake day: November 12, 2010
About
shlewis at gmail

@shawnup on X

Founder/CTO @ Weights & Biases
