throwup238 · 4 months ago
The leapfrogging at this point is getting insane (in a good way, I guess?). The amount of time each state-of-the-art feature gets before it's supplanted is down to a few weeks at this point.

LLMs were always a fun novelty for me until OpenAI Deep Research, which started to actually come up with useful results on more complex programming questions (where I needed to write all the code by hand but had to pull together lots of different libraries and APIs), but it was limited to 10 reports/month on the cheaper plan. Then Google Deep Research upgraded to 2.5 Pro with a paid usage limit of 20/day, which allowed me to just throw everything at it, to the point where I'm still working through reports that are a week or more old. Oh, and it searched up to 400 sources at a time, significantly more than OpenAI, which made it quite useful in historical research like identifying first-edition copies of books.

Now Claude is releasing the same research feature with integrations (excited to check out the Cloudflare MCP auth solution and hoping Val.town gets something similar), and a run time of up to 45 minutes. The pace of change was overwhelming half a year ago; now it's just getting ridiculous.

user_7832 · 4 months ago
I agree with your overall message - rapid growth appears to encourage competition and forces companies to put their best foot forward.

However, unfortunately, I cannot shower much praise on Claude 3.7. And if you (or anyone) asks why - 3.7 seems much better than 3.5, surely? - then I'm moderately sure that you use Claude much more for coding than for any kind of conversation. In my opinion, even 3.5 Haiku (which is available for free during high loads) is better than 3.7 Sonnet.

Here’s a simple test. Try asking 3.7 to intuitively explain anything technical - say, mass dominated vs spring dominated oscillations. I’m a mechanical engineer who studied this stuff and I could not understand 3.7’s analogies.
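(For reference, this is the standard picture I'd want an intuitive explanation to recover - my own summary, not Claude's output:)

    % Driven mass-spring-damper and its steady-state amplitude:
    m\ddot{x} + c\dot{x} + kx = F_0\cos(\omega t), \qquad
    X(\omega) = \frac{F_0}{\sqrt{(k - m\omega^2)^2 + (c\omega)^2}}
    % Spring-dominated regime (\omega \ll \sqrt{k/m}): X \approx F_0/k
    % Mass-dominated regime   (\omega \gg \sqrt{k/m}): X \approx F_0/(m\omega^2)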

I understand that coders are the largest single group of Claude's users, but Claude went from being my most-used app to being used only after both ChatGPT and Gemini, something that I absolutely regret.

garrickvanburen · 4 months ago
My current hypothesis: the more familiar you are with a topic the worse the results from any LLM.
tiberriver256 · 4 months ago
3.7 did score higher in coding benchmarks but in practice 3.5 is much better at coding. 3.7 ignores instructions and does things you didn't ask it to do.
csomar · 4 months ago
Plateauing overall, but apparently you can gain in certain directions while losing in others. I wrote an article a while back arguing that current models are not that far from GPT-3.5: https://omarabid.com/gpt3-now

3.7 is definitely better at coding, but it feels like it lost a bit of maneuverability in other domains. For someone who just wants code generated, that doesn't matter, but I've found myself going to DeepSeek first and then getting the code output from 3.7.

fastball · 4 months ago
Seems clear to me that Claude 3.7 suffers from overfitting, probably due to Anthropic seeing that 3.5 was a smash hit in the LLM coding space and deciding their North star for 3.7 should be coding benchmarks (which, like all benchmarks, do not properly capture the process of real-world coding).

If it were actually good they would've named it 4.0; the fact that they went from 3.5 to 3.7 (a weird jump) speaks volumes imo.

airstrike · 4 months ago
I too like 3.5 better than 3.7, and I use it pretty often. It's like 3.7 is better in 2 metrics but worse in 10 different ones.

joshstrange · 4 months ago
I use Claude mostly for coding/technical things and something about 3.7 does not feel like an upgrade. I haven't gone back to 3.5 (mostly started using Gemini Pro 2.5 instead).

I haven't been able to use Claude research yet (it hasn't rolled out to the Pro tier), but o1 -> o3 deep research was a massive jump IMHO. It still isn't perfect, but where o1 would often give me trash results, o3 deep research actually starts to be useful.

3.5->3.7 (even with extended thinking) felt like a nothingburger.

mattlutze · 4 months ago
The expectation that one model get top marks for all things is, imo, asking too much.
greymalik · 4 months ago
Out of curiosity - can you give any examples of the programming questions you are using deep research on? I’m having a hard time thinking of how it would be helpful and could use the inspiration.
dimitri-vs · 4 months ago
Easy: any research task that would take you five minutes to complete is worth firing off as a Deep Research request while you work on something else in parallel.

I use it a lot when documentation is vague or outdated. When Gemini/o3 can't figure something out after 2 tries. When I am working with a service/API/framework/whatever that I am very unfamiliar with and I don't even know what to Google search.

emorning3 · 4 months ago
I often use Chrome to validate what I think I know.

I recently asked Chrome to show me how to apply the Knuth-Bendix completion procedure to propositional logic, and I had already formed my own thoughts about how to proceed (I'm building a rewrite system that does automated reasoning).

The response convinced me that I'm not a total idiot.

I'm not an academic and I'm often wrong about theory so the validation is really useful to me.

itissid · 4 months ago
I've been using it for pre-scoping things I have no idea about and rapidly iterating by re-feeding it a version with guardrails and conditions from previous chats.

For example, I wanted to scope how to build a homemade TrueNAS Scale unit. It helped me avoid pitfalls, like knowing that I needed a minimum of two GPUs to run the OS and local LLMs, and it sped up configuring a CLI backup of my Dropbox locally (it told me to use the right filesystem format over ZFS to make the Dropbox client work).

It has researched everything from how to structure my web app for building a payment system on the web (something I knew nothing about) to writing small tools that talk to my document collection and index it into collections in Anki, all in one day.

iLoveOncall · 4 months ago
Calling some APIs is leap-frogging? You could do this with GPT-3; nothing has changed except that it's branded under a new name and tries to establish a (flawed) standard.

If there was truly any innovation still happening in OpenAI, Anthropic, etc., they would be working on models only, not on side features that someone could already develop over a weekend.

never_inline · 4 months ago
Why would you love on-call though?
risyachka · 4 months ago
What are you talking about?

It has literally stagnated for a year now.

All that's changed is that they connect more APIs.

And add a thinking loop with the same model powering it.

This is the reason it seems fast - nothing really happens except the easy things.

tymscar · 4 months ago
I totally agree with you, especially if you actually try using these models, not just looking at random hype posters on twitter or skewed benchmarks.

That being said, isn’t it strange how the community has polar opposite views about this? Did anything like this ever happen before?

apwell23 · 4 months ago
> Deep Research, which started to actually come up with useful results on more complex programming questions

Is there a YouTube video of people using this on complex open source projects like the Linux kernel, or maybe something like PyTorch?

How come none of the OSS projects (at least not the ones I follow) are progressing fast(er) thanks to AI like 'deep research'?

wilg · 4 months ago
o3, since it can web search while reasoning, is a really useful lighter-weight deep research.
ilrwbwrkhv · 4 months ago
None of those reports are any good though. Maybe for shallow research, but I haven't found them deep. Can you share what kind of research you have been trying where it has done a great job of actual deep research?
Balgair · 4 months ago
I'm echoing this sentiment.

Deep Research hasn't really been that good for me. Maybe I'm just using it wrong?

Example: I want the precipitation in mm and monthly high and low temperature in C for the top 250 most populous cities in North America.

To me, this prompt seems like a pretty anodyne and obvious task for Deep Research. It's long and tedious, but the data mostly comes from well-structured sources (Wikipedia) across two languages at most.

But when I put this into any of the various models, I mostly get back ways to go and find that data myself. Like, I know how to look at Wikipedia; it's that I don't want to comb through 250 pages manually or try to write a script to handle all the HTML boxes. I want the LLM/model to do this days-long tedious task for me.

xrdegen · 4 months ago
It is because you are just such a genius who already knows everything, unlike us stupid people who find these tools amazingly useful and informative.
spaceman_2020 · 4 months ago
Gemini 2.5 Pro was the moment for me where I really thought “this is where true adoption happens”

All that talk about AI replacing people seemed a little far-fetched in 2024. But in 2025, I really think models are getting good enough.

antupis · 4 months ago
You still need "human in the loop" because with simple tasks or some tasks that have lots of training material, models can one-shot answer and are like super good. But if the domain grows too complex, there are some not-so-obvious dependencies, or stuff that is in bleeding edge. Models fail pretty badly. So you need someone to split those complex tasks to more simpler familiar steps.
meander_water · 4 months ago
Looks like this is possible due to the relatively recent addition of OAuth 2.1 to the MCP spec [0] to allow secure comms with remote servers.

However, there's a major concern that server hosters are on the hook to implement authorization. Ongoing discussion here [1].

[0] https://modelcontextprotocol.io/specification/2025-03-26

[1] https://github.com/modelcontextprotocol/modelcontextprotocol...

marifjeren · 4 months ago
That GitHub issue is closed, but:

> major concern that server hosters are on the hook to implement authorization

Doesn't it make perfect sense for server hosters to implement that? If Claude wants access to my Jira instance on my behalf, and Jira hosts a remote MCP server that aids in exposing the resources I own, isn't it obvious Jira should be responsible for authorization?

How else would they do it?

halter73 · 4 months ago
That GitHub issue is closed because it's been mostly completed. As of https://github.com/modelcontextprotocol/modelcontextprotocol..., the latest draft specification does not require the resource server to act as, or proxy to, the IdP. It just hasn't made its way into a ratified spec yet, but SDKs are already implementing the draft.
cruffle_duffle · 4 months ago
The authorization server and resource server can be separate entities, meaning that the Jira instance can validate the token without being the one issuing it or handling credentials.
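A minimal sketch of what that looks like from the resource server's side (illustration only; the jose library, URLs, and audience value are placeholders I picked, not anything Jira or the spec mandates):

    // Resource server (e.g. a remote MCP server fronting Jira): it only
    // *verifies* tokens minted by a separate authorization server.
    import { createRemoteJWKSet, jwtVerify } from "jose";

    // Public keys come from the IdP's JWKS endpoint (placeholder URL).
    const JWKS = createRemoteJWKSet(
      new URL("https://auth.example.com/.well-known/jwks.json"),
    );

    export async function validateAccessToken(token: string) {
      // Check signature, issuer, and audience; user credentials are never
      // seen here, and no tokens are issued here.
      const { payload } = await jwtVerify(token, JWKS, {
        issuer: "https://auth.example.com/",
        audience: "https://mcp.example.com",
      });
      return payload; // claims/scopes the server can authorize requests against
    }
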
VSerge · 4 months ago
Ongoing demo of integrations with Claude by a bunch of A-list companies: Linear, Stripe, PayPal, Intercom, etc. It's live now at: https://www.youtube.com/watch?v=njBGqr-BU54

In case the above link doesn't work later on, the page for this demo day is here: https://demo-day.mcp.cloudflare.com/

n_ary · 4 months ago
Is this the beginning of the apps-for-everything era, where the SaaS for your LLM finally begins? Initially we had the internet, but the value came when web apps arrived to become SaaS instead of installed apps. Now if LLMs can use a specific remote MCP, which is another SaaS for your LLM, the remote-MCP-powered service can charge a subscription to do wonderful things, and voila! Let the new golden age of SaaS for LLMs begin, and the old fad (replace job XYZ with AI) die already.
insin · 4 months ago
It's perfect: nobody will have time to care about how many 9s your service has, because the nondeterministic failure mode now sitting slap-bang in the middle is their problem!
Manfred · 4 months ago
Imagine dynamic subscription rates based on vibes where you won't even notice price hikes because not even the supplier can explain what they are.
clvx · 4 months ago
I'm more excited that I can now run a custom site, hook up an MCP for it, and have all the cool intelligence I used to have to pay SaaS for, without having to integrate with them, plus govern my own data. It's a massive win. I see AI-assisted coding replicating the current SaaS services that I can then run internally. If my shop were on a specific stack, I could aim to have all my supporting apps in that stack using AI-assisted coding, simplifying operations, and hooking up MCPs to get intelligence from all of them.

Truly, OSS should be more interesting in the next decade for this alone.

heyheyhouhou · 4 months ago
We should all thank the Chinese companies for releasing so many incredible open-weight models. I hope they keep doing it; I don't want to rely on OpenAI, Anthropic, or Google for all my future computer interactions.
naravara · 4 months ago
On one hand, yes this is very cool for a whole host of personal uses. On the other hand giving any company this level of access to as many different personal data sources as are out there scares the shit out of me.

I’d feel a lot better if we had something resembling a comprehensive data privacy law in the United States because I don’t want it to basically be the Wild West for anyone handling whatever personal info doesn’t get covered under HIPAA.

falcor84 · 4 months ago
Absolutely agreed, but just wanted to mention that it's essentially the same level of access you would give to Zapier, which is one of their top examples of MCP integrations.
n_ary · 4 months ago
It took many years of online tracking, iframes, sticky cookies, and Cambridge Analytica before things like the GDPR came into existence. We will similarly have to wait a few years until major leaks happen through LLM pipelines/integrations. Sadly, that is the reality we live with.
OtherShrezzing · 4 months ago
I'd love a _tip jar_ MCP, where the LLM vendor can automatically tip my website for using its content/feature/service in a query's response. Even if the amount is absolutely minuscule, in aggregate, this might make up for ad revenue losses.
fredoliveira · 4 months ago
Not that exactly, but I just saw this on twitter a few minutes ago from Stripe: https://x.com/jeff_weinstein/status/1918029261430255626
donmcronald · 4 months ago
> Now if LLMs can use a specific remote MCP, which is another SaaS for your LLM, the remote-MCP-powered service can charge a subscription to do wonderful things, and voila!

I've always worked under the assumption that the best employees make themselves replaceable via well-defined processes and high-quality documentation. I have such a hard time understanding why there's so much willingness to integrate irreplaceable SaaS solutions into business processes.

I haven't used AI a ton, but everything I've done has focused on owning my own context, config, etc. How much are people going to be willing to pay if someone else owns 10+ years of their AI context?

Am I crazy or is owning the context massively valuable?

brumar · 4 months ago
Hello fellow context owner. I like my modules with their context.sh at their root level. If crafted with care, magic happens. Conversely, when the AI derails, it's most often due to bad context management, and it's fixed by improving it.
throwaway7783 · 4 months ago
MCP is yet another interface for an existing SaaS (like UI and APIs), but now magically "agent enabled". And $$$ of course
sebstefan · 4 months ago
An AI capable of responding to a "How do I do X" prompt with "Hey, this seems related to a ticket that was opened in your Jira two months ago" or "There is a document about this in SharePoint" would bring me such immense value, I think I might cry.

Edit: Actually right in the tickets themselves would probably be better and not require MCP... but still

MagicMoonlight · 4 months ago
Copilot can already be set up to use SharePoint etc., and you can set it up to only respond based on internal content.

So if you ask it "who is in charge of marketing", it will read it off SharePoint instead of answering generically.

conroy · 4 months ago
Remote MCP servers are still in a strange space. Anthropic updated the MCP spec about a month ago with a new Streamable HTTP transport, but it doesn't appear that Claude supports that transport yet.

When I hooked up our remote MCP server, Claude sent a GET request to the endpoint. According to the spec, clients that want to support both transports should first attempt to POST an InitializeRequest to the server URL. If that returns a 4xx, they should then fall back to the SSE integration.
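For reference, the fallback dance the spec describes looks roughly like this from the client's side (just a sketch; the protocol version string, headers, and client name are my assumptions, not lifted from the spec):

    // Probe for the Streamable HTTP transport, falling back to HTTP+SSE.
    async function detectTransport(
      endpoint: string,
    ): Promise<"streamable-http" | "sse"> {
      // 1. Try the newer transport: POST an InitializeRequest to the endpoint.
      const res = await fetch(endpoint, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Accept: "application/json, text/event-stream",
        },
        body: JSON.stringify({
          jsonrpc: "2.0",
          id: 1,
          method: "initialize",
          params: {
            protocolVersion: "2025-03-26",
            capabilities: {},
            clientInfo: { name: "example-client", version: "0.0.1" },
          },
        }),
      });

      // 2. A 4xx means the server only speaks the older HTTP+SSE transport,
      //    so the client should open a GET request for an SSE event stream.
      if (res.status >= 400 && res.status < 500) {
        return "sse";
      }
      return "streamable-http";
    }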

kordlessagain · 4 months ago
Claude Desktop doesn't support resources without directly importing them, and now they've taken away the button for that and for tools, so I had to build their status into a tool just to see what was loading and what wasn't.

Here's my tool for Desktop: https://github.com/kordless/EvolveMCP

joshwarwick15 · 4 months ago
Created a list of remote MCP servers here so people can keep track of new releases - https://github.com/jaw9c/awesome-remote-mcp-servers
tkgally · 4 months ago
For the past couple of months, I’ve been running occasional side-by-side tests of the deep research products from OpenAI, Google, Perplexity, DeepSeek, and others. Ever since Google upgraded its deep research model to Gemini 2.5 Pro Experimental, it has been the best for the tasks I give them, followed closely by OpenAI. The others were far behind.

I ran two of the same prompts just now through Anthropic’s new Advanced Research. The results for it and for ChatGPT and Gemini appear below. Opinions might vary, but for my purposes Gemini is still the best. Claude’s responses were too short and simple and they didn’t follow the prompt as closely as I would have liked.

Writing conventions in Japanese and English

https://claude.ai/public/artifacts/c883a9a5-7069-419b-808d-0...

https://docs.google.com/document/d/1V8Ae7xCkPNykhbfZuJnPtCMH...

https://chatgpt.com/share/680da37d-17e4-8011-b331-6d4f3f5ca7...

Overview of an industry in Japan

https://claude.ai/public/artifacts/ba88d1cb-57a0-4444-8668-e...

https://docs.google.com/document/d/1j1O-8bFP_M-vqJpCzDeBLJa3...

https://chatgpt.com/share/680da9b4-8b38-8011-8fb4-3d0a4ddcf7...

The second task, by the way, is just a hypothetical case. Though I have worked as a translator in Japan for many years, I am not the person described in the prompt.