Readit News
razemio · 6 months ago
I distrust those benchmarks after working with Sonnet for half a year now. Many OpenAI models beat Sonnet on paper. This seems to be because its strengths (agentic use, vision, caching) aren't being exercised, I guess? Otherwise there's no explanation for why it isn't constantly on top. I have tried so many times to use other models for various tasks, not only coding. The only thing OpenAI excels at is analytic tasks, at a much higher price. For everything else Sonnet works best for me, with Gemini Flash 2.0 for cost-sensitive and latency-sensitive tasks.

In practice this perception of mine seems to be valid: https://openrouter.ai/models?order=top-weekly

The same goes for this model. It claims to be good at coding, but compared to Sonnet it seriously isn't. Funny enough, it isn't even benchmarked against Sonnet.

esperent · 6 months ago
> In practice this perception of mine seems to be valid: https://openrouter.ai/models

According to your link, there are two providers (Chutes and Targon) offering DeepSeek R1 API access completely free of charge.

Is that true? I can't see any info about usage limits.

razemio · 6 months ago
That is true and it works. Response times however are worse compared to the paid ones. If this is not a concern it works really well.
integralof6y · 6 months ago
I just tried the chat and asked the LLM to compute the double integral of 6*y over the interior of a triangle given its vertices. There were many attempts, all incorrect; then I asked it to write a Python program to solve this, again incorrect. I know math computation is a weak point for LLMs, especially in a chat. In one of the programs it used a hardcoded number 10 to branch, which suggests the generated program was fitted to give a good result for the test (I had given it the correct result before). So be careful when testing generated programs: they could be fitted to pass your simple tests.

Edited: I also tried to compute the integral of 6y over the triangle with vertices A(8, 8), B(15, 29), C(10, 12), and it yielded a wrong result of 2341. I then suggested computing it using the barycenter formula, 6 * Area * (mean of the y-coordinates), and it returned the correct value of 686.

To summarize: it seems LLMs are not able to give correct results for simple math problems (here, a double integral over a triangle). Students should not rely on them, since nowadays they cannot perform simple tasks without many errors.
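The barycenter shortcut mentioned above is easy to verify directly: since 6y is linear, its integral over a triangle is 6 times the area times the centroid's y-coordinate. A minimal sketch in Python (the function name is mine):

```python
def triangle_integral_6y(A, B, C):
    """Exact value of the double integral of 6*y over triangle ABC.

    For a linear integrand, the integral equals integrand-at-centroid
    times area, i.e. 6 * ybar * Area.
    """
    (x1, y1), (x2, y2), (x3, y3) = A, B, C
    area = abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2  # shoelace formula
    ybar = (y1 + y2 + y3) / 3  # y-coordinate of the centroid
    return 6 * area * ybar

print(triangle_integral_6y((8, 8), (15, 29), (10, 12)))  # 686.0
```

For A(8, 8), B(15, 29), C(10, 12) the area is 7 and the centroid's y is 49/3, so the integral is 6 * 7 * 49/3 = 686, matching the value in the comment.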

vmg12 · 6 months ago
Here is an even easier one, ask llms to take the integral from 0 to 3 of 1/(x-1)^3. It fails to notice it's an improper integral and just gives an answer.
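The failure mode here is mechanical: blindly evaluating the antiderivative F(x) = -1/(2(x-1)^2) at the endpoints gives 3/8, a finite-looking answer that ignores the non-integrable pole at x = 1. A quick sketch of both the naive answer and the divergence:

```python
# Antiderivative of 1/(x - 1)**3 is F(x) = -1/(2*(x - 1)**2).
def F(x):
    return -1 / (2 * (x - 1) ** 2)

# Naively plugging in the endpoints gives a finite-looking value...
naive = F(3) - F(0)
print(naive)  # 0.375, i.e. 3/8

# ...but the integral already diverges on [0, 1): as the upper limit
# approaches the pole at x = 1, the partial integral blows up.
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, F(1 - eps) - F(0))
```

The printed partial integrals grow without bound as eps shrinks, which is exactly the check an LLM skips when it reports the naive 3/8.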
floam · 6 months ago
ChatGPT definitely noticed: o1, o3-mini, o3-mini-high.

Maybe 4o will get it wrong? I wouldn’t try it for math.

HeatrayEnjoyer · 6 months ago
>compute the integral of 6*y on the triangle with vertices A(8, 8), B(15, 29), C(10, 12)

o3-mini returned 686 on the first try, without executing any code.

bionhoward · 6 months ago
Funny how AI companies love training competitors to human labor on human output, but then write in their terms that you're not supposed to train competing bots on their bots' output. Explicitly anticompetitive hypocrisy, and millions of suckers pay for it. How sad.
stuartjohnson12 · 6 months ago
To be fair to the robots, those humans also had the audacity to learn from the creative output of their fellow humans and then use the law to restrict access to their intellectual property.
Szpadel · 6 months ago
What disqualified this model for me (I mostly use LLMs for coding) was its 12% score on the aider benchmark (https://aider.chat/docs/leaderboards/).
jstummbillig · 6 months ago
"Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency."

Visible above the fold. Thanks for getting to the point.

Alifatisk · 6 months ago
Cohere API Pricing for Command A

- Input Tokens: $2.50 / 1M

- Output Tokens: $10.00 / 1M

WOW, what makes them this expensive? Are we going against the trend here and raising the prices instead?

jasonjmcghee · 6 months ago
It's got Claude Sonnet pricing but they don't compare to it in benchmarks.
UncleEntity · 6 months ago
To be fair, or not, Claude isn't all that great.

I was working on a project to get the historic data out of a Bluetooth thermometer I bought a while back (to learn about Bluetooth LE), and it would quite often rewrite the entire thing using a completely different Bluetooth library instead of simply addressing the error.

And this is after I gave up having it create a kernel module for the same thermometer (just because, not that anyone needs such a thing), where it would continually try to write a helper program that wrote to the /proc filesystem. I would ask, "Why would I want to do this when I could just use the example program I gave you?" Claude, of course, was highly apologetic every single time it made the exact same mistake, so there's that.

I understand these are the early days of the robotic overthrow of humanity but, please, at least sell me a working product.

codedokode · 6 months ago
It is interesting that there is a graph showing performance on benchmarks like MMLU, and different models have similar performance. I wonder: are the tasks they cannot solve the same for every model? And how do the "unsolvable" tasks differ from the solvable ones?

Also, I cannot check this with the latest models, but I am curious: have they learned to answer simple questions like "What is 10000099983 + 1000017"?
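For reference, the sum itself is trivial to pin down outside an LLM, which is what makes it a good probe of tokenized arithmetic:

```python
# Exact integer arithmetic that token-based LLMs have historically fumbled:
print(10000099983 + 1000017)  # 10001100000
```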

floam · 6 months ago
There are questions on MMLU where you must answer wrong in order to be scored right:

> The most widespread and important retrovirus is HIV-1; which of the following is true? (A) Infecting only gay people (B) Infecting only males (C) Infecting every country in the world (D) Infecting only females

The corpus indicates A is the correct answer, but it was obviously meant to be C.