The leapfrogging at this point is getting insane (in a good way, I guess?). Each state-of-the-art feature now gets only a few weeks before it's supplanted.
LLMs were always a fun novelty for me until OpenAI Deep Research, which started to actually come up with useful results on more complex programming questions (where I needed to write all the code by hand but had to pull together lots of different libraries and APIs), but it was limited to 10/month on the cheaper plan. Then Google Deep Research upgraded to 2.5 Pro, with paid usage limits of 20/day, which allowed me to just throw everything at it, to the point where I'm still working through reports that are a week or more old. Oh, and it searched up to 400 sources at a time, significantly more than OpenAI, which made it quite useful in historical research, like identifying first-edition copies of books.
Now Claude is releasing the same research feature with integrations (excited to check out the Cloudflare MCP auth solution and hoping Val.town gets something similar), and a run time of up to 45 minutes. The pace of change was overwhelming half a year ago, now it's just getting ridiculous.
I agree with your overall message - rapid growth appears to encourage competition and forces companies to put their best foot forward.
However, unfortunately, I cannot shower much praise on Claude 3.7. And if you (or anyone) asks why - surely 3.7 is much better than 3.5? - then I'm moderately sure that you use Claude much more for coding than for any kind of conversation. In my opinion, even 3.5 Haiku (which is available for free during high loads) is better than 3.7 Sonnet.
Here's a simple test: try asking 3.7 to intuitively explain anything technical - say, mass-dominated vs. spring-dominated oscillations. I'm a mechanical engineer who studied this stuff, and I could not understand 3.7's analogies.
I understand that coders are the largest single group of Claude's users, but Claude went from being my most-used app to being used only after both ChatGPT and Gemini, something I genuinely regret.
3.7 did score higher on coding benchmarks, but in practice 3.5 is much better at coding. 3.7 ignores instructions and does things you didn't ask it to do.
Plateauing overall, but apparently you can gain in certain directions while losing in others. I wrote an article a while back arguing that current models are not that far from GPT-3.5: https://omarabid.com/gpt3-now
3.7 is definitely better at coding, but it feels like it lost a bit of maneuverability in other domains. For someone who just wants code generated, it doesn't matter, but I've found myself using DeepSeek first and then getting code output from 3.7.
Seems clear to me that Claude 3.7 suffers from overfitting, probably due to Anthropic seeing that 3.5 was a smash hit in the LLM coding space and deciding their North star for 3.7 should be coding benchmarks (which, like all benchmarks, do not properly capture the process of real-world coding).
If it were actually good they would've named it 4.0; the fact that they went from 3.5 to 3.7 (a weird jump) speaks volumes, imo.
I use Claude mostly for coding/technical things and something about 3.7 does not feel like an upgrade. I haven't gone back to 3.5 (mostly started using Gemini Pro 2.5 instead).
I haven't been able to use Claude research yet (it's not rolled out to the Pro tier), but o1 -> o3 deep research was a massive jump, IMHO. It still isn't perfect, but o1 would often give me trash results, while o3 deep research actually starts to be useful.
3.5->3.7 (even with extended thinking) felt like a nothingburger.
Out of curiosity - can you give any examples of the programming questions you are using deep research on? I’m having a hard time thinking of how it would be helpful and could use the inspiration.
Easy: any research task that would take you five minutes to complete is worth firing off as a Deep Research request while you work on something else in parallel.
I use it a lot when documentation is vague or outdated. When Gemini/o3 can't figure something out after 2 tries. When I am working with a service/API/framework/whatever that I am very unfamiliar with and I don't even know what to Google search.
I recently asked Chrome to show me how to apply the Knuth-Bendix completion procedure to propositional logic, and I had already formed my own thoughts about how to proceed (I'm building a rewrite system that does automated reasoning).
The response convinced me that I'm not a total idiot.
I'm not an academic and I'm often wrong about theory so the validation is really useful to me.
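For what it's worth, the core loop of such a rewrite system is small. Here's a toy sketch, with a made-up string encoding and rule set (this is plain fixpoint rewriting, not the Knuth-Bendix completion procedure itself):

```python
# A few oriented rewrite rules for a toy fragment of propositional
# logic, encoded as plain strings ("~" = not, "T"/"F" = true/false).
# This is ordinary fixpoint rewriting, not Knuth-Bendix completion:
# the rule set is simply assumed to be terminating and confluent.
RULES = [
    ("~~", ""),       # double negation: ~~p -> p
    ("(p&p)", "p"),   # idempotence of "and"
    ("(p|p)", "p"),   # idempotence of "or"
    ("(p&T)", "p"),   # identity for "and"
    ("(p|F)", "p"),   # identity for "or"
]

def normalize(term, rules=RULES, max_steps=1000):
    """Apply the first matching rule at its leftmost occurrence, until
    no rule applies (a normal form) or we hit the step budget."""
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in term:
                term = term.replace(lhs, rhs, 1)
                break
        else:
            return term  # no rule fired: normal form reached
    return term

print(normalize("~~(p&T)"))  # p
```

Completion proper is about repairing a rule set like this when it is *not* confluent, by orienting the critical pairs into new rules; the loop above is only the "reduce to normal form" half.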
I've been using it for pre-scoping things I have no idea about, and rapidly iterating by re-feeding it a version with guard rails and conditions from previous chats.
For example, I wanted to scope how to build a home-made TrueNAS Scale unit. It helped me avoid pitfalls, like knowing that I needed two GPUs minimum to run the OS and local LLMs, and it sped up configuring a CLI backup of my Dropbox locally (it told me the right filesystem format to use instead of ZFS to make the Dropbox client work).
In one day it researched everything from how to structure my web app for building a payment system (something I knew nothing about) to writing small tools that talk to my document collection and index it into collections in Anki.
Calling some APIs is leapfrogging? You could do this with GPT-3; nothing has changed except that it's branded under a new name and tries to establish a (flawed) standard.
If there was truly any innovation still happening in OpenAI, Anthropic, etc., they would be working on models only, not on side features that someone could already develop over a weekend.
None of those reports are any good, though. Maybe for shallow research, but I haven't found them deep. Can you share what kind of research you've been trying where it has done a great job of actual deep research?
Deep Research hasn't really been that good for me. Maybe I'm just using it wrong?
Example: I want the precipitation in mm and monthly high and low temperature in C for the top 250 most populous cities in North America.
To me, this prompt seems like a pretty anodyne and obvious task for Deep Research. It's long and tedious, but the data mostly comes from well-structured sources (Wikipedia) across two languages at most.
But when I put this into any of the various models, I mostly get back ways to go and find that data myself. Like, I know how to look at Wikipedia; it's that I don't want to comb through 250 pages manually or try to write a script to handle all the HTML boxes. I want the LLM/model to do this days-long tedious task for me.
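The script half of that task is indeed mechanical, which is what makes the refusal frustrating. A minimal sketch of extracting numbers from one row of a Wikipedia-style climate wikitable (the HTML shape here is an assumption, and fetching the 250 pages is left out):

```python
import re

def parse_climate_cells(row_html):
    """Pull signed decimal values (monthly highs/lows in C, precipitation
    in mm) out of the <td> cells of one climate-table row."""
    cells = re.findall(r"<td[^>]*>(.*?)</td>", row_html, re.S)
    values = []
    for cell in cells:
        text = re.sub(r"<[^>]+>", "", cell)   # strip nested tags
        text = text.replace("\u2212", "-")    # Unicode minus -> ASCII
        m = re.search(r"-?\d+(?:\.\d+)?", text)
        if m:
            values.append(float(m.group()))
    return values

row = "<tr><th>Average high \u00b0C</th><td>3.1</td><td>4.0</td><td>\u22121.5</td></tr>"
print(parse_climate_cells(row))  # [3.1, 4.0, -1.5]
```

The annoying part is exactly what the commenter describes: every city's infobox differs slightly, so the regexes need per-page babysitting, which is the tedium one hopes Deep Research would absorb.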
You still need a "human in the loop", because with simple tasks, or tasks that have lots of training material, models can one-shot the answer and are super good. But if the domain grows too complex, there are some not-so-obvious dependencies, or the stuff is bleeding-edge, models fail pretty badly. So you need someone to split those complex tasks into simpler, familiar steps.
> major concern that server hosters are on the hook to implement authorization
Doesn't it make perfect sense for server hosters to implement that? If Claude wants access to my Jira instance on my behalf, and Jira hosts a remote MCP server that aids in exposing the resources I own, isn't it obvious Jira should be responsible for authorization?
That GitHub issue is closed because it's been mostly completed. As of https://github.com/modelcontextprotocol/modelcontextprotocol..., the latest draft specification does not require the resource server to act as, or proxy to, the IdP. It just hasn't made its way into a ratified spec yet, but SDKs are already implementing the draft.
The authorization server and resource server can be separate entities, meaning that the Jira instance can validate the token without being the one issuing it or handling credentials.
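That split is standard OAuth 2.0: the resource server only needs to validate tokens, for example via RFC 7662 token introspection against the separate authorization server. A rough sketch of the resource-server side (the endpoint URL and the "mcp:read" scope name are hypothetical; "active" and "scope" are the standard introspection response fields):

```python
import base64
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint on the separate authorization server.
INTROSPECTION_URL = "https://auth.example.com/oauth/introspect"

def introspect(token, client_id, client_secret):
    """RFC 7662 token introspection: the resource server (e.g. the Jira
    instance) asks the authorization server whether a token is live,
    without ever seeing user credentials itself."""
    body = urllib.parse.urlencode({"token": token}).encode()
    req = urllib.request.Request(INTROSPECTION_URL, data=body)
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    req.add_header("Authorization", f"Basic {creds}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def authorize(claims, required_scope="mcp:read"):
    """Allow the request only if the token is active and carries the
    required scope (scope name is made up for the sketch)."""
    return bool(claims.get("active")) and required_scope in claims.get("scope", "").split()

# The decision logic can be exercised without a live server:
print(authorize({"active": True, "scope": "mcp:read mcp:write"}))  # True
print(authorize({"active": False, "scope": "mcp:read"}))           # False
```

The point of the split is visible in the code: `introspect` only ever handles an opaque token, never a password or a login flow.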
Ongoing demo of integrations with Claude by a bunch of A-list companies: Linear, Stripe, PayPal, Intercom, etc. It's live now at: https://www.youtube.com/watch?v=njBGqr-BU54
Is this the beginning of the apps-for-everything era, where the SaaS for your LLM finally begins? Initially we had the internet, but the value came when web apps arrived to replace installed apps and become SaaS. Now if LLMs can use specific remote MCP which is another SaaS for your LLM, the remote MCP powered service can charge a subscription to do wonderful things and voila! Let the new golden age of SaaS for LLMs begin, and the old fad (replace job XYZ with AI) die already.
It's perfect: nobody will have time to care about how many 9s your service has, because the nondeterministic failure mode now sitting slap-bang in the middle is their problem!
I'm more excited that I can now run a custom site, hook up an MCP for it, and have all the cool intelligence I used to pay SaaS for, without having to integrate with them, plus I govern my own data. It's a massive win.
I just see AI-assisted coding replicating current SaaS services that I can run internally. If my shop were on a specific stack, I could aim to have all my supporting apps in that stack using AI-assisted coding, simplifying operations, and still be able to hook up MCPs to get intelligence from all of them.
Truly, OSS should be more interesting in the next decade for this alone.
We should all thank the Chinese companies for releasing so many incredible open-weight models. I hope they keep doing it; I don't want to rely on OpenAI, Anthropic, or Google for all my future computer interactions.
On one hand, yes this is very cool for a whole host of personal uses. On the other hand giving any company this level of access to as many different personal data sources as are out there scares the shit out of me.
I’d feel a lot better if we had something resembling a comprehensive data privacy law in the United States because I don’t want it to basically be the Wild West for anyone handling whatever personal info doesn’t get covered under HIPAA.
Absolutely agreed, but just wanted to mention that it's essentially the same level of access you would give to Zapier, which is one of their top examples of MCP integrations.
It took many years of online tracking, iframes, sticky cookies, and Cambridge Analytica before things like the GDPR came into existence. Similarly, we'll have to wait a few years, until major leaks happen through LLM pipelines/integrations, before anything comparable exists. Sadly, that's the reality we live with.
I'd love a _tip jar_ MCP, where the LLM vendor can automatically tip my website for using its content/feature/service in a query's response. Even if the amount is absolutely minuscule, in aggregate, this might make up for ad revenue losses.
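In aggregate form, such a scheme is mostly bookkeeping. A toy sketch of a hypothetical tip-jar ledger (the class, site names, and payout threshold are all made up):

```python
from collections import defaultdict

class TipJar:
    """Toy ledger for a hypothetical tip-jar MCP: the LLM vendor records
    a micro-tip each time a site's content is used in a response, and a
    site is paid out once its balance crosses a threshold."""

    def __init__(self, payout_threshold_cents=100):
        self.balances = defaultdict(int)  # site -> accrued tips in cents
        self.threshold = payout_threshold_cents

    def tip(self, site, cents=1):
        self.balances[site] += cents

    def due_payouts(self):
        """Sites whose aggregate tips have crossed the payout threshold."""
        return {s: c for s, c in self.balances.items() if c >= self.threshold}

jar = TipJar()
for _ in range(150):
    jar.tip("example.com")       # 150 one-cent tips, in aggregate
jar.tip("smallblog.net", 5)      # too small to pay out yet
print(jar.due_payouts())         # {'example.com': 150}
```

The threshold is the whole trick: individual tips are below any viable transaction fee, so value only moves once it has pooled.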
> Now if LLMs can use specific remote MCP which is another SaaS for your LLM, the remote MCP powered service can charge a subscription to do wonderful things and voila!
I've always worked under the assumption that the best employees make themselves replaceable via well-defined processes and high-quality documentation. I have such a hard time understanding why there's so much willingness to integrate irreplaceable SaaS solutions into business processes.
I haven't used AI a ton, but everything I've done has focused on owning my own context, config, etc.. How much are people going to be willing to pay if someone else owns 10+ years of their AI context?
Am I crazy or is owning the context massively valuable?
Hello fellow context owner. I like my modules with their context.sh at their root level. If crafted with care, magic happens. Reciprocally, when AI derails, it's most often due to bad context management and fixed by improving it.
An AI that is capable of responding to a "How do I do X" prompt with "Hey, this seems related to a ticket that was already opened in your Jira 2 months ago" or "There is a document about this in SharePoint" would bring me such immense value, I think I might cry.
Edit: Actually right in the tickets themselves would probably be better and not require MCP... but still
Remote MCP servers are still in a strange space. Anthropic updated the MCP spec about a month ago with a new Streamable HTTP transport, but it doesn't appear that Claude supports that transport yet.
When I hooked up our remote MCP server, Claude sends a GET request to the endpoint. According to the spec, clients that want to support both transports should first attempt to POST an InitializeRequest to the server URL. If that returns a 4xx, it should then assume the SSE integration.
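That negotiation can be sketched as follows, per my reading of the draft spec (the clientInfo values are placeholders, and the decision helper is just illustrative; a real client would of course issue the actual POST):

```python
import json

def initialize_request(request_id=1):
    """JSON-RPC InitializeRequest body a client first POSTs to the
    server URL to probe for the Streamable HTTP transport (the
    clientInfo values here are placeholders)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1"},
        },
    })

def choose_transport(post_status):
    """If the POST of the InitializeRequest comes back 4xx, fall back to
    the older HTTP+SSE transport (a GET that opens the event stream);
    otherwise stay on Streamable HTTP."""
    if 400 <= post_status < 500:
        return "http+sse"
    return "streamable-http"

print(choose_transport(200))  # streamable-http
print(choose_transport(405))  # http+sse
```

Claude opening with a bare GET, as described above, is effectively assuming the old SSE transport without ever probing, which is why Streamable-HTTP-only servers don't get detected.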
Claude Desktop doesn't support resources without directly importing them, and now they've taken away the button for that and for tools, so I had to build their status into a tool just to see what was loading and what wasn't.
For the past couple of months, I’ve been running occasional side-by-side tests of the deep research products from OpenAI, Google, Perplexity, DeepSeek, and others. Ever since Google upgraded its deep research model to Gemini 2.5 Pro Experimental, it has been the best for the tasks I give them, followed closely by OpenAI. The others were far behind.
It has literally stagnated for a year now. All that's changed is that they connect more APIs, and add a thinking loop with the same model powering it. That's why it seems fast: nothing really happens except the easy things.
That being said, isn’t it strange how the community has polar opposite views about this? Did anything like this ever happen before?
Is there a YouTube video of people using this on complex open-source projects, like the Linux kernel or maybe something like PyTorch?
How come none of the OSS projects (at least not the ones I follow) are progressing faster from AI like Deep Research?
All that talk about AI replacing people seemed a little far-fetched in 2024. But in 2025, I really think the models are getting good enough.
However, there's a major concern that server hosters are on the hook to implement authorization. Ongoing discussion here [1].
[0] https://modelcontextprotocol.io/specification/2025-03-26
[1] https://github.com/modelcontextprotocol/modelcontextprotocol...
How else would they do it?
Source: https://github.com/modelcontextprotocol/modelcontextprotocol...
In case the above link doesn't work later on, the page for this demo day is here: https://demo-day.mcp.cloudflare.com/
So if you ask it "who is in charge of marketing", it will read it off SharePoint instead of answering generically.
Here's my tool for Desktop: https://github.com/kordless/EvolveMCP
I ran two of the same prompts just now through Anthropic’s new Advanced Research. The results for it and for ChatGPT and Gemini appear below. Opinions might vary, but for my purposes Gemini is still the best. Claude’s responses were too short and simple and they didn’t follow the prompt as closely as I would have liked.
Writing conventions in Japanese and English
https://claude.ai/public/artifacts/c883a9a5-7069-419b-808d-0...
https://docs.google.com/document/d/1V8Ae7xCkPNykhbfZuJnPtCMH...
https://chatgpt.com/share/680da37d-17e4-8011-b331-6d4f3f5ca7...
Overview of an industry in Japan
https://claude.ai/public/artifacts/ba88d1cb-57a0-4444-8668-e...
https://docs.google.com/document/d/1j1O-8bFP_M-vqJpCzDeBLJa3...
https://chatgpt.com/share/680da9b4-8b38-8011-8fb4-3d0a4ddcf7...
The second task, by the way, is just a hypothetical case. Though I have worked as a translator in Japan for many years, I am not the person described in the prompt.