tcdent · a year ago
Searching the web is a great feature in theory, but every implementation I've used so far looks at the top X hits and then interprets it to be the correct answer.

When you're talking to an LLM about popular topics or common errors, the top results are often just blogspam or unresolved forum posts, so you never get an answer to your problem.

More an indicator that web search is less usable than ever, but interesting that it affects the performance of generative systems nonetheless.

Almondsetat · a year ago
>looks at the top X hits and then interprets it to be the correct answer.

LLMs are truly reaching human-like behavior then

yoyohello13 · a year ago
The longer I've been in the workforce, the more I realize most humans actually kind of suck at their jobs. LLMs being more human like is the opposite of what I want.
dartos · a year ago
Splitting hairs, but LLMs themselves don’t search.

LLMs themselves don’t choose the top X.

That's all regular flows written by humans, run via tool calls after the intent of your message has been funneled into one of a few pre-defined intents.
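The kind of pre-LLM routing described here can be sketched roughly like this; the intents, keywords, and handlers are made up for illustration:

```python
# Toy sketch of intent routing: classify the message into one of a few
# fixed intents, then dispatch to a plain, human-written flow.
def classify_intent(message):
    """Hypothetical classifier; real systems use a model, not keywords."""
    if any(w in message.lower() for w in ("latest", "news", "today")):
        return "web_search"
    return "chat"

def handle(message):
    intent = classify_intent(message)
    if intent == "web_search":
        # The LLM never searches itself; a regular tool call does.
        return "TOOL_CALL: search(" + repr(message) + ")"
    return "LLM_REPLY"
```

The point is that the search and the top-X selection happen in ordinary code around the model, not inside it.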

pizza · a year ago
It would probably be really great for web searching llms to let you calibrate how they should look for info by letting you do a small demonstration of how you would pick options yourself, then storing that preference feedback in your profile’s system prompt somehow.
rendaw · a year ago
Here though they're not replacing a random person, they're replacing _you_ (doing the search yourself). _You_ wouldn't look at the top X hits then assume it's the correct answer.
ChrisRR · 10 months ago
Bold of you to assume that most people even bother googling simple questions
wvh · a year ago
Be careful what you call AI, you might just get what you wish for...
LightBug1 · a year ago
Degenerative AI ?
johndhi · a year ago
lol
johntb86 · a year ago
I've found that OpenAI's Deep Research seems to be much better at this, including finding an obscure StackOverflow post that solved a problem I had, or finding travel wiki sites that actually answered questions I had around traveling around Poland. However it finds its pages, they're much better than just the top N Google results.
wongarsu · a year ago
Grok's DeepSearch and DeeperSearch are also pretty good, and you can look at their stream of thought to see how it reaches its results.

Not sure how OpenAI's version works, but Grok's approach is to do multiple rounds of searches, each round more specific and informed by previous results.
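That multi-round loop can be sketched roughly as follows; `search` and `refine_query` are hypothetical stand-ins for the search API and an LLM refinement call:

```python
def deep_search(question, search, refine_query, rounds=3):
    """Iterative search: each round issues a more specific query
    informed by what the previous rounds turned up."""
    query = question
    findings = []
    for _ in range(rounds):
        findings.extend(search(query))            # hypothetical search API
        query = refine_query(question, findings)  # hypothetical LLM call
    return findings
```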

dontlikeyoueith · a year ago
They're probably doing RAG on a huge chunk of the internet, i.e. they built their own task-specific search engine.
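A toy version of that retrieval step, using bag-of-words vectors in place of real learned embeddings (the documents and query are made up for illustration):

```python
# Minimal RAG sketch: embed documents, retrieve the top-k most similar
# to the query, and stuff them into the LLM prompt as context.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Rust crate for Postgres access with connection pooling",
    "Apache Arrow record batches in Python",
    "Baking sourdough bread at home",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=2):
    qv = embed(query)
    ranked = sorted(index, key=lambda d: cosine(qv, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved passages would then be prepended to the LLM prompt.
context = retrieve("Postgres Rust crate")
prompt = "Answer using only this context:\n" + "\n".join(context)
```

At internet scale the index would be a vector database over crawled pages, i.e. effectively a task-specific search engine.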
matwood · a year ago
I'm glad you mentioned this. I asked Deep Research to lay out a tax strategy in a foreign country and it cited a ton of great research I hadn't yet found.
HankWozHere · a year ago
Kagi Assistant allows you to do search with LLM queries. So far I feel it yields reliable results. For instance, I tried a couple of queries for product suggestions and came back with some good results. Whilst it's a premium service, I find the offering to be of good value.
chrisweekly · a year ago
Yeah, Kagi's search results are so much better than Google's, it defies comparison.
eli · a year ago
It's neat but I've found the value kinda variable. It seems heavily influenced by whatever the first few hits are for a query based on your question, so if it's the kind of question that can be answered with a simple search it works well. But of course those are the kinds of questions where you need it the least.

I find myself much more often using their "Quick Answer" feature, which shows a brief LLM answer above the results themselves. Makes it easier to see where it's getting things from and whether I need to try the question a different way.

dmazin · a year ago
Has anyone compared Perplexity with Kagi Assistant?

I am always looking for Perplexity alternatives. I already pay for Kagi and would be happy to upgrade to the ultimate plan if it truly can replace Perplexity.

hooli_gan · a year ago
Does it just start a search or does the chat continue with the results? It would be cool to continue the chat with results that were filtered according to the blacklist.
KoolKat23 · a year ago
I have a subscription, please could I ask how you do this? I only know of the append-? feature.
mavamaarten · a year ago
Oh yeah this is very much the case. Every time I ask ChatGPT something simple (thinking it'd be a perfect fit for an LLM, not for a google search) and it starts searching, I already know the "answer" is going to be garbage.
spoaceman7777 · a year ago
I have in my prompt for it to always use search, no matter what, and I get pretty decent results. Of course, I also question most of its answers, forcing it to prove to me that its answer is correct.

Just takes some prompt tweaking, redos, and followups.

It's like having a really smart human skim the first page of Google and give me its take, and then I can ask it to do more searches to corroborate what it said.

NavinF · a year ago
Try their Deep Research or grok's DeepSearch. Both do many searches and read many articles over a couple of minutes
osigurdson · a year ago
That is interesting. I have often been amazed at how good it is at picking up when to search vs use its weights. My biggest problem with ChatGPT is the horrendous glitchiness.
bambax · a year ago
"Searching" doesn't mean much without information about the ranking algorithm or the search provider, because with most searches there will be millions of results and it's important to know how the first results have been determined.

It's amazing that the post by Anthropic doesn't say anything about that. Do they maintain their own index and search infrastructure? (Probably not?) Or do they have a partnership with Bing or Google or some other player?

andai · a year ago
>top results are blogspam

It gets even better. When I first tested this feature in Bard, it gave me an obviously wrong answer. But it provided two references. Which turned out to be AI generated web pages.

Oddly enough in my own Googles I could not even find those pages in the results.

dspillett · a year ago
> Bard […] it provided two references. Which turned out to be AI generated web pages.

Welcome to the Habsburg Internet.

kelseyfrog · a year ago
Search engines now have an incentive to offer a B2B search product that solves the blogspam problem. Don't worry, the AIs will get good search results, and you'll still get the version that's SEOed to the point of uselessness.
wenc · a year ago
I just tried Claude’s web search. It works pretty well.

I’m not sure if Claude does any reranking (see Cohere Reranker) where it reorders the top n results or just relies on Google’s ranking.

But a web search that does re-ranking should reduce the amount of blogspam or incomplete answers. Web search isn’t inherently a lost cause.
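A second-stage reranker can be sketched like this; term overlap stands in for the cross-encoder model a real reranker (e.g. Cohere's) would use, and the example hits are made up:

```python
# Sketch of reranking: take the engine's top-n hits and reorder them
# with a stronger relevance score before handing them to the LLM.
def relevance(query, doc):
    """Toy relevance: fraction of query terms present in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def rerank(query, hits, top_k=3):
    return sorted(hits, key=lambda h: relevance(query, h), reverse=True)[:top_k]

hits = [
    "10 best rice cookers (affiliate links inside)",
    "rice cooker teardown and honest long-term review",
    "unrelated celebrity news",
]
reranked = rerank("honest rice cooker review", hits, top_k=2)
```

Even this crude rescoring can demote the listicle below the substantive review, which is the effect the comment is hoping for.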

macrolime · a year ago
Deep search/deep research in grok, chatgpt, perplexity etc works much better. It can also do things like search in different languages. Wonder about something in some foreign country? Ask it to search in the local language and find things you won't find in English.
wickedsight · a year ago
> Ask it to search in the local language and find things you won't find in English.

Yeah, this is one of my favorite use cases. Living in Europe, surrounded by different languages, this makes searching stuff in other countries so much more convenient.

PStamatiou · a year ago
Could not agree more. I wrote in detail about some of these issues last week https://paulstamatiou.com/browse-no-more
osigurdson · a year ago
My experience with ChatGPT is really good. I find standard web searches very annoying now.
GraffitiTim · a year ago
Exa (YC S21) is trying to solve this problem by re-indexing the web in an LLM-friendly way.
ipaddr · a year ago
2021? how are they doing?
magackame · a year ago
Google search is crap. It seems to be a sentiment among many HNers, but is it really that bad? I mostly use it for programming, so documentation/forums, and it works out great. For some queries it even returns personal blogs (which people seem to bash Google for not surfacing). Of course there are some queries that return purely AI blogspam, but reformulating the query with a bit more thought usually solves it. I wonder if that is a US thing? Do search results differ greatly based on the region?
Beijinger · a year ago
Is google search bad? Click here to find ten reasons why it is bad and 10 reasons why you should still use it.

Yes, it is that bad.

Website of Nike? Website of Starbucks? Likely position number one.

Every product category, e.g. "what rice cooker should I buy?", is diseased by link and affiliate spam. There is a reason why people put +reddit on search terms.

harrall · a year ago
I watched one of my friends who says Google is useless use Google one day.

If I were looking for a song, I would type in something like “song used at beginning of X movie indie rock”

He would type in “X songs.”

I basically find everything in Google in one search and it takes him several. I type in my thought straight whereas he seems to treat Google like a dumb keyword index.

hansmayer · a year ago
I mean, for those of us who used it since way before the '20s, it's not really a sentiment - it's a fact. You used to be able to type in 3 words and whatever error message your stack trace was showing, and the first 3 links returned were very likely a definitive source for solving your problem. Written by a human, and take my word for it - it was much better back then than the crap you get out of torturing whatever your LLM of choice is. However, the weird MBAs took it over and did exactly what you are describing - forced people to spend more time "engaging with the platform" (to increase revenue). As you can see, they seem to have achieved this goal, and we all now spend time reformulating our queries as they wanted us to, and yes, Google search is complete crap.
tim333 · a year ago
Personally I like Google search. I think it's not crap - actually quite good. I use it multiple times a day (just checked - about 42 times yesterday). It's different from what it was 10 years ago but still works for most stuff.

That said I also use Perplexity which does things Google never really did.

I've got a theory that people just like to be negative about stuff, especially market leaders, and are a bit in denial as to how it still has the majority search share in spite of many billions spent trying to compete with it and earnest HN posts saying Google is crap, use Kagi. For amusement I tried to find their share of search and Google is approx 90%, Kagi approx 0.01% by my calculations.

simonw · a year ago
How long have you been using Google search for?

It used to be SO much less likely to return junk.

UltraSane · a year ago
Google randomly deletes words from your search term. Why would anyone think that was a good idea?
sky2224 · a year ago
It's kind of surprising to me that I can't customize the search ability at all with a lot of models (at least I wasn't really able to last time I checked). Would providing a blacklist to the model really be that hard?

Actually, it's astounding to me that companies haven't created a more user friendly customization interface for models. The only way to "customize" things would be through the chat interface, but for some reason everyone seems to have forgotten that configuration buttons can exist.

grapesodaaaaa · a year ago
> Actually, it's astounding to me that companies haven't created a more user friendly customization interface for models.

To be fair, LLM technology, in its current form, is still relatively new. I would also like to see what you are suggesting, though.

lairv · a year ago
Overall LLMs (that I've tested) don't know how to use a search engine, their queries are bad and naive, probably because the way to use a search engine isn't part of training data, it's just something that people learn to do by using them. Maybe Google has the data to make LLMs good at using search engines but would it serve their business?
tymonPartyLate · a year ago
This is actually not true. I'm getting traffic from ChatGPT and Perplexity to my website, which is fairly new, just launched a few months ago. Our pages rarely rank in the top 4, but the AI answer engines manage to find them anyway. And I'm talking about traffic with UTM params / referrals from ChatGPT, not their scraper bots.
ForTheKidz · a year ago
If chatgpt is scraping the web, why can they not link tokens to source of token? being able to cite where they learned something would explode the value of their chatbot. At least a couple of orders of magnitude more value. Without this chatbots are mostly a coding-autocomplete tool for me—lots of people have takes, but it's the tying into the internet that makes a take from an unknown entity really valuable.

Perplexity certainly already approximates this (not sure if it's at a token level, but it can cite sources. I just assumed they were using a RAG.)

hibikir · a year ago
Imagine how much fun it will be when the breakthrough in search engine quality comes from companies building a better engine to get good LLM answers.

This is ultimately google's problem: They are making money from the fact that the page is now mostly ads and not necessarily going to lead to a good, quick answer, leading to even more ads. They probably lose money if they make their search better

UnreachableCode · a year ago
> web search is more unusable than ever

I’m curious why I’m seeing a lot of people thinking this lately. Google definitely made the algorithm worse for customers and better for ads, but I’m almost always able to find what I’m looking for in the working day still. What are other people’s experiences?

vbezhenar · a year ago
My experience is that Google works perfectly for me and I almost never have any issues with it, despite all the doomsaying.
cudgy · a year ago
AI results typically blow away Google results in both quality and definitely speed.

For example, when searching for product information, Google's top 50 to 100 results are items titled "the 10 best…", full of vapid articles that provide little to no insight beyond what is provided in a manufacturer's product sheet. Many times I have to add "Reddit" to my search to try and find real opinions about a product, or give up and go to YouTube review videos from trusted sources.

For technical searches like programming questions, AI is basically immediately nailing most basic questions while Google results require scanning numerous somewhat related results from technical discussion forums, many of which are outdated.

PetahNZ · a year ago
It would be nice if I could tell it what page to look at (maybe you can, I am not sure). Often if I am getting an LLM to write some code that I can see is obviously wrong, I would love to say here is the docs ... use that to formulate your response.
oytis · a year ago
Well, they are professionals, they sure add "reddit" to every query.
taude · a year ago
Do you think that if it's a non-Google company, that maybe doesn't rank search by ad payment $$$, that this new company could in theory do a better job?
OscarTheGrinch · a year ago
If only search engines weren't also in the business of inserting unverifiable AI assertions into our information ecosystem.
johnisgood · a year ago
Yeah, this is why I almost never enable the search feature. Hopefully Claude (I have not tried) has a way of disabling it.
Xenoamorphous · a year ago
Is there any viable alternative to pass knowledge to the LLMs that goes beyond their training cut off date?
jonny_eh · a year ago
Via their context window, but new knowledge could easily fill it up.
collyw · a year ago
Isn't that the same as any place (like here for example), that uses an up-voting system?
colordrops · a year ago
Ugh, what a nightmare, now search engines are going to start optimizing for bots.
darkhorse13 · a year ago
This is basically AGI because that's what we humans do.
Tycho · a year ago
It’s good if it hits on high quality sources like ons.gov.uk
ryukoposting · a year ago
I reiterate: https://news.ycombinator.com/context?id=42012631

RAG was dead on arrival because it uses the same piss-poor results a human would, wrapped in more obfuscation and unwanted tangents.

My question is why the degradation of search wouldn't affect LLMs. These chatbot god-oracle businesses are already unprofitable because of their massive energy footprint, now you expect them to build their own search engine in-house to try to circumvent SEO spam? And you expect SEO spam to not catch up with whatever tricks they use? Come on, people.

chairmansteve · a year ago
It needs to use Kagi.
yorkeccak · a year ago
founder here. solved this problem. old news mate. https://exchange.valyu.network
yorkeccak · a year ago
AI-native search API to retrieve over web/proprietary content - full semantic search (e.g. we indexed all of arXiv), reranking built in, simple pricing, cheap
zk108 · a year ago
We’re giving away free credits to try out our platform — no card required. If you’re building with AI and need quality data, we’d love your feedback!
elliotrpmorris · a year ago
Lol so true
blackeyeblitzar · a year ago
For me LLMs have basically removed any need to visit search engines. I was already not using Google due to how bad its interface had become, but I feel like LLMs at least are more efficient as an interface even if they’re still looking at the same blogspam or unresolved forum posts. My anecdotal experience though, is that I get better answers from LLMs, perhaps because I am able to give them really detailed prompts that seem to improve the answers based how specific I get. Generic search engines don’t seem to do that, in my experience.


MuffinFlavored · a year ago
> the top results are often just blogspam

top results are blogspam but the LLM isn't? /s

joshstrange · a year ago
Massive props to Anthropic for announcing a feature _and_ making it available for everyone right away.

OpenAI is so annoying in this aspect. They will regularly give timelines for rollout that are not met or simply wrong.

Edit: "Everyone" = Everyone who pays. Sorry if this sounds mean but I don't care about what the free tier gets or when. As a paying user for both Anthropic and OpenAI I was just pointing out the rollout differences.

Edit2: My US-bias is showing, sorry I didn't even parse that in the message.

bryan0 · a year ago
> Web search is available now in feature preview for all paid Claude users in the United States. Support for users on our free plan and more countries is coming soon.
AcquiescentWolf · a year ago
People outside the US obviously don't exist, therefore the statement is correct.
willio58 · a year ago
> OpenAI is so annoying in this aspect. They will regularly give timelines for rollout that are not met or simply wrong.

I have empathy for the engineers in this case. You know it’s a combination of sales/marketing/product getting WAY ahead of themselves by doing this. Then the engineers have to explain why they cannot in fact reach an arbitrary deadline.

Meanwhile the people not doing the work get to blame those working on the code for not hitting deadlines.

nilkn · a year ago
Many of OpenAI's announcements seem to be timed almost perfectly as responses to other events in the industry or market. I think Sam just likes to keep the company in the news and the cultural zeitgeist, and he doesn't really care if what he's announcing is ready to scale to users yet or not.
underdeserver · a year ago
It's not available for everyone.
joshstrange · a year ago
> Web search is available now in feature preview for all paid Claude users in the United States.

It is for all paid users, something OpenAI is slow on. I pay for both and I often forget to try OpenAI's new things because they roll out so slow. Sometimes it's same-day but they are all over the map in how long it takes to roll out.

zelphirkalt · a year ago
When am I getting paid for them gobbling up my code and using it to cash out? It is not so one-sided, this whole matter.
simonw · a year ago
The search index is provided by Brave: https://simonwillison.net/2025/Mar/21/anthropic-use-brave/

- Brave is now listed as a subprocessor on the Anthropic Trust Center portal

- Search results for "interesting pelican facts" from Claude and Brave were an exact match

- If you ask Claude for the definition of its web_search tool one of the properties is called "BraveSearchParams"

sebmellen · a year ago
Remarkably, it looks like Brave will survive even while Basic Attention Token is essentially dead. What an interesting pivot.
sumeno · a year ago
Very disappointing, Brave is the last company I want my data going to
newswasboring · a year ago
Why? I am not aware of what's wrong with them.
exhaze · a year ago
Install MCP plugin and call a search engine of your choice.

If you’re unhappy about something, try to first think of a solution before expressing your discontent.

herdcall · a year ago
It badly hallucinated in my test. I asked it "Rust crate to access Postgres with Arrow support" and it made up an arrow-postgres crate. It even gave sample Rust code using this fictional crate! Below is its response (code example omitted):

I can recommend a Rust crate for accessing PostgreSQL with Arrow support. The primary crate you'll want to use is arrow-postgres, which combines the PostgreSQL connectivity of the popular postgres crate with Apache Arrow data format support. This crate allows you to:

- Query PostgreSQL databases using SQL
- Return results as Arrow record batches
- Use strongly-typed Arrow schemas
- Convert between PostgreSQL and Arrow data types efficiently

yakz · a year ago
Are you sure it searched the web? You have to go and turn on the web search feature, and then the interface is a bit different while it's searching. The results will also have links to what it found.
shortrounddev2 · a year ago
> I asked it "Rust crate to access Postgres with Arrow support"

Is that how you actually use llms? Like a Google search box?

CamperBob2 · a year ago
Exactly. An LLM is not a conventional search engine and shouldn't be prompted as if it were one. The difference between "Rust crate to access Postgres with Arrow support" and "What would a hypothetical Rust crate to access Postgres with Arrow support look like?" isn't that profound from the perspective of a language model. You'll get an answer, but it's entirely possible that you'll get the answer to a question that isn't the one you thought you were asking.

Some people aren't very good at using tools. You can usually identify them without much difficulty, because they're the ones blaming the tools.

Sharlin · a year ago
It's absolutely how LLMs should work, and IME they do. Why write a full question if a search phrase works just as well? Everything in "Could you recommend xyz to me?" except "xyz" is redundant and only useful when you talk to actual humans with actual social norms to observe. (Sure, there used to be a time when LLMs would give better answers if you were polite to them, but I doubt that matters anymore.) Indeed I've been thinking of codifying this by adding a system prompt that says something like "If the user makes a query that looks like a search phrase, phrase your response non-conversationally as well".
timdellinger · a year ago
Totally agree here. I tried the following and had a very different experience:

"Answer as if you're a senior software engineer giving advice to a less experienced software engineer. I'm looking for a Rust crate to access PostgreSQL with Apache Arrow support. How should I proceed? What are the pluses and minuses of my various options?"

elicksaur · a year ago
“Prompting” is kind of a myth honestly.

Think about it, how much marginal influence does it really have if you say OP’s version vs a fully formed sentence? The keywords are what gets it in the area.

globular-toast · a year ago
It's funny because many people type full sentence questions into search engines too. It's usually a sign of being older and/or not very experienced with computers. One thing about geeks like me is we will always figure out what the bare minimum is (at least for work, I hope everyone has at least a few things they enjoy and don't try to optimise).
herdcall · a year ago
Well, compare it to the really good answer from Grok (https://x.com/i/grok/share/MMGiwgwSlEhGP6BJzKdtYQaXD) for the same prompt. Also, framing as a question still pointed to the non-existent postgres-arrow with Claude.
unshavedyak · a year ago
That's primarily how i do, though it depends on the search ofc. I use Kagi, though.

I've not yet found much value in the LLM itself. Facts/math/etc are too likely incorrect, i need them to make some attempt at hydrating real information into the response. And linking sources.

keeran · a year ago
This was pretty much my first experience with LLM code generation when these things first came out.

It's still a present issue whenever I go light on prompt details and I _always_ get caught out by it and it _always_ infuriates me.

I'm sure there are endless discussions on front running overconfident false positives and being better at prompting and seeding a project context, but 1-2 years into this world is like 20 in regular space, and it shouldn't be happening any more.

op00to · a year ago
Often times I come up with a prompt, then stick the prompt in an LLM to enhance / identify what I’ve left out, then finally actually execute the prompt.
exhaze · a year ago
Cite things from ID-based specs. You're facing a skill issue. The reason most people don't see it as such is because an LLM doesn't just "fail to run" here. If this was code you wrote in a compiled language, would you post and say the language infuriates you because it won't compile your syntax errors? As this kind of dev style becomes prevalent and output expectations adjust, work performance reviews won't care that you're mad. So my advice is:

1. Treat it like regular software dev where you define tasks with ID prefixes for everything, acceptance criteria, exceptions. Ask LLM to reference them in code right before impl code

2. “Debug” by asking the LLM to self-reflect on the decision-making process that caused the issue - this can give you useful heuristics to use later to further reduce the issues you mentioned.

“It” happening is a result of your lack of time investment into systematically addressing this.

_You_ should have learned this by now. Complain less, learn more.

matt3210 · a year ago
That crate knowledge is probably from a proprietary private GitHub repo given to it by Microsoft
noisy_boy · a year ago
Maybe you can retry with lower temperature?
zarathustreal · a year ago
You “asked it” a statement?
Cort3z · a year ago
I usually find Claude to be my favourite flavor of LLMs, but I still pay for ChatGPT because their voice offering is so great! I regularly use it as an "expert on the side" when I do other things, like doing bike repairs. I ask it things like "how do I find the min/max adjustments on my particular flavor of front derailleur", or when cooking, and my hands are dirty, I can ask stuff like "how much X do I usually need for Y people", and so on. The hands-off feature is so great when my hands are literally busy doing some other thing.

I really wish Claude had something similar.

mock-possum · a year ago
ChatGPT advanced voice mode really is surprisingly excellent - I just wish it:

1) would give you more time to pause when you’re talking before it immediately launches into an answer

2) would actually try to say the symbols in code blocks verbatim - it’s basically useless for looking up anything to do with code, because it will omit parts of the answer from its speech.

barfingclouds · a year ago
Yeah I have to manually hold it down every time I talk. I have a lot of pauses and simply would not be able to interface with that without that option. It’s why I essentially can’t use Gemini voice mode
rhubarbtree · 10 months ago
I think voice interface is the real killer app of LLMs. And the advanced voice mode was exactly what I was waiting for. The pause between words issue is still a problem though; I think being able to just hit enter when done would work best.

Pro tip; if you’re preparing for a big meeting eg an interview, tell ChatGPT to play the part of an evil interviewer. Give it your CV and the job description etc. ask it to find the hardest questions it can. Ask it to coach you and review your answers afterwards, give ideal answers etc

after a couple of hours grilling the real interview will seem like a doddle.

eraserj · a year ago
> There's less usage of voice mode on the enterprise and power users side but that will happen eventually. - Anthropic CEO 21 jan. [0]

[0] https://youtu.be/snkOMOjiVOk 01:30

lamtung · a year ago
Is it possible to use ChatGPT voice feature in a similar manner to Alexa where I only need to say an activation word? I’m aiming to set up a system for my 7-year-old son to let him engage in conversations with ChatGPT as he does with Alexa.
Cort3z · a year ago
I assume it would be possible to build yourself with the OpenAI API together with a locally run voice model that only detects the activation word. There might be off-the-shelf solutions for this, but I am not aware of any.
NBJack · a year ago
I wonder if it will actually respect the robots.txt this time.
creddit · a year ago
I don't think it should. If a user asks the AI to read the web for them, it should read the web for them. This isn't a vacuum charged with crawling the web, it's an adhoc GET request.
birken · a year ago
The AI isn't "reading the web" though, they are reading the top hits on the search results, and are free-riding on the access that Google/Bing gets in order to provide actual user traffic to their sites. Many webmasters specifically opt their pages out of being in the search results (via robots.txt and/or "noindex" directives) when they believe the cost/benefit of the bot traffic isn't worth the user traffic they may get from being in the search results.

One of my websites that gets a decent amount of traffic has pretty close to a 1-1 ratio of Googlebot accesses compared to real user traffic referred from Google. As a webmaster I'm happy with this and continue to allow Google to access the site.

If ChatGPT is giving my website a ratio of 100 bot accesses (or more) compared to 1 actual user sent to my site, I very much should have to right to decline their access.

1shooner · a year ago
>You can now use Claude to search the internet to provide more up-to-date and relevant responses.

It's a search engine. You 'ask it to read the web' just like you asked Google to, except Google used to actually give the website traffic.

I appreciate the concept of an AI User-agent, but without a business model that pays for the content creation, this is just going to lead to the death of anonymously accessible content.

internetter · a year ago
You could make this justification for a lot of unapproved bot activity.
scoofy · a year ago
Many if not most websites are paid for by eyeballs not by get requests. A bot is a bot is a bot. Respect robots.txt or expect to have your IPs banned.
bayindirh · a year ago
How can you be so sure? Processors love locality, so they fetch the data around the requested address. Intel even used to give names to that.

So, similarly, LLM companies can see this as a signal to crawl the whole site to add to their training sets and learn from it, if the same URL is hit a couple of times in a relatively short time period.

usrbinbash · a year ago
> This isn't a vacuum charged with crawling the web, it's an adhoc GET request.

Doesn't matter. The robots-exclusion-standard is not just about webcrawlers. A `robots.txt` can list arbitrary UserAgents.

Of course, an AI with automated websearch could ignore that, as can webcrawlers.

If they choose to do that, then at some point some server admins might (again, same as with non-compliant webcrawlers) use more drastic measures to reduce the load, by simply blocking these accesses.

For that reason alone, it will pay off to comply with established standards in the long run.

mvdtnz · a year ago
No thank you, when I define a robots.txt file I expect all automated systems to respect it.
GuinansEyebrows · a year ago
Someday I’ll have enough “karma” to downvote things like this.

The agent should respect robots.txt no matter who is using the Robot.

Deleted Comment

JimDabell · a year ago
The LLM shouldn’t.

robots.txt is intended to control recursive fetches. It is not intended to block any and all access.

You can test this out using wget. Fetch a URL with wget. You will see that it only fetches that URL. Now pass it the --recursive flag. It will now fetch that URL, parse the links, fetch robots.txt, then fetch the permitted links. And so on.

wget respects robots.txt. But it doesn’t even bother looking at it if it’s only fetching a single URL because it isn’t acting recursively, so robots.txt does not apply.

The same applies to Claude. Whatever search index they are using, the crawler for that search index needs to respect robots.txt because it’s acting recursively. But when the user asks the LLM to look at web results, it’s just getting a single set of URLs from that index and fetching them – assuming it’s even doing that and not using a cached version. It’s not acting recursively, so robots.txt does not apply.

I know a lot of people want to block any and all AI fetches from their sites, but robots.txt is the wrong mechanism if you want to do that. It’s simply not designed to do that. It is only designed for crawlers, i.e. software that automatically fetches links recursively.
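The distinction above is the one Python's standard-library `urllib.robotparser` implements: a recursive crawler is expected to consult robots.txt before following links, while a one-off fetch of a user-supplied URL is not. A minimal sketch (the robots.txt content and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one AI crawler blocked outright,
# everyone else kept out of /private/ only.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A recursive crawler checks these answers before fetching each link.
print(rp.can_fetch("GPTBot", "https://example.com/page"))               # False
print(rp.can_fetch("SomeBrowser/1.0", "https://example.com/page"))      # True
print(rp.can_fetch("SomeBrowser/1.0", "https://example.com/private/x")) # False
```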

manquer · a year ago
While robots.txt is not there to directly prevent automated requests, it does prevent crawling which is needed for search indices.

Without recursive crawling, it is not possible for an engine to know which URLs are valid[1]. They would otherwise have to brute-force, say, HEAD requests for all/common string combinations and see which return 404s, or, more realistically, crawl the site to "discover" pages.

The issue of summarizing a specific URL on demand is a different problem[2], unrelated to the issue at hand of search tools crawling at scale and depriving sites of all traffic.

Robots.txt absolutely applies to LLM engines and search engines equally. All of them build indices of some kind (RAG stores, inverted indexes, whatever) by crawling, and LLM crawlers have sometimes been very aggressive about ignoring robots.txt limits, as many webmasters have reported over the last couple of years.

---

[1] Unless published in sitemap.xml of course.

[2] You need to have the unique URL to ask the llm to summarize in the first place, which means you likely visited the page already, while someone sharing a link with you and a tool automatically summarizing the page deprives the webmaster of impressions and thus ad revenue or sales.

This is a common usage pattern in messaging apps from Slack to iMessage and has been for a decade or more, likewise in news aggregators and social media sites, and webmasters have managed to live with it one way or another already.

mtkd · a year ago
Do you really think LLM vendors that download 80TB+ of data over torrents are going to be labeling their crawler agents correctly and running them out of known datacenters?
Arnt · a year ago
The ones I noticed in my logfiles behave impeccably: retrieve robots.txt every week or so and act on it.

(I noticed Claude, OpenAI and a couple of others whose names were less familiar to me.)

teh_infallible · a year ago
Apparently they use smart appliances to scrape websites from residential accounts.
SoftTalker · a year ago
Maybe we need a new "ai.txt" that says "yes I mean you, ChatGPT et al."
verdverm · a year ago
Bluesky / ATProto has a proposal for User Intents for data. More semantics than robots.txt, but equally unenforceable. Usage with AI is one of the intents to be signaled by users

https://github.com/bluesky-social/proposals/tree/main/0008-u...

whoami_nr · a year ago
Small difference. Its called llms.txt

https://llmstxt.org/
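Per the llmstxt.org proposal, this is a markdown file served at `/llms.txt` with an H1 title, a one-line blockquote summary, and H2 sections of annotated links. A minimal sketch (names and URLs are placeholders):

```markdown
# Example Project

> One-line summary of the site, intended for LLM consumption.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): how to get started
- [API reference](https://example.com/docs/api.md): full endpoint listing

## Optional

- [Changelog](https://example.com/changelog.md): release history
```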

jsheard · a year ago
If they don't comply with robots.txt, why would they comply with anything else?
furyofantares · a year ago
Presumably the crawler that produces whatever index it uses does, which is how it knows what sites to read. Unless you provide it a URL yourself I guess, in which case, it shouldn't.
explain · a year ago
robots.txt is meant for automated crawlers, not human-driven actions.
zupa-hu · a year ago
Every automated crawler follows human-driven actions.
nicce · a year ago
It must form the search index somehow, and that happens prior to any human action. If it respected robots.txt, it simply would not find the page at all.
bayindirh · a year ago
So, do you mean LLMs are human-like and conscious?

I thought they were just machine code running on part GPU and part CPU.

postexitus · a year ago
if a human triggers the web crawlers by pressing a button, should they ignore robots.txt?
dudeinjapan · a year ago
In practice, robots.txt is to control which pages appear in Google results, which is respected as a matter of courtesy, not legality. It doesn't prevent proxies etc. from accessing your sites.
micromacrofoot · a year ago
almost no one does, robots.txt is practically a joke at this point — right up there with autocomplete=off
Demiurge · a year ago
In what circles is it a joke? Google bots seem to respect it on my sites according to logs.
geekrax · a year ago
I have replaced all robots.txt rules with simple WAF rules, which are cheaper to maintain than dealing with offending bots.
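A server-side rule like this can be as simple as a user-agent match at the edge. A minimal nginx sketch (the agent list is illustrative, not exhaustive, and the directives belong in the usual `http`/`server` contexts):

```nginx
# Return 403 to self-identified AI crawlers instead of trusting robots.txt
map $http_user_agent $blocked_bot {
    default          0;
    ~*GPTBot         1;
    ~*ClaudeBot      1;
    ~*CCBot          1;
    ~*PerplexityBot  1;
}

server {
    listen 80;
    server_name example.com;

    if ($blocked_bot) {
        return 403;
    }

    # ... rest of the site configuration
}
```

Of course this only catches bots that identify themselves honestly; anything spoofing a browser user agent sails right through.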
NewJazz · a year ago
Why wonder. You can test for yourself.
tylersmith · a year ago
It's a user agent not a robot.
Y_Y · a year ago
Why not both?
_ea1k · a year ago
I really want these to be able to find and even redisplay images. "Search all the hotels within 5 miles of this address and show me detailed pictures of the rooms and restrooms"

Hotels would much rather show you the outside, the lobby, and a conference room, so finding what the actual living space will look like is often surprisingly difficult.

dgs_sgd · a year ago
I've been looking for this as well. I want a reliable image search tool. I tried a combination of perplexity web search tool use with the Anthropic conversations API but it's been lackluster.
tjsk · a year ago
I’ve been experimenting with different LLM + search combos too, but results have been mixed. One thing I’m particularly interested in is improving retrieval for both images and videos. Right now, most tools seem to rely heavily on metadata or simple embeddings, but I wonder if there’s a better way to handle complex visual queries. Have you tried anything for video search as well, or are you mainly focused on images? Also, what kinds of queries have you tested?

Deleted Comment

CalChris · a year ago
I find myself Googling less often these days. Frustrated by the poor search results and impressed with the ability of AI to do the same thing and more, I think search's days are numbered. AOL lasted as an email address for quite some time after America Online ceased to be a relevant portal. Maybe Gmail will as well.
whalesalad · a year ago
Kagi has been really really good.
noisy_boy · a year ago
I am still googling for non-indepth queries because the AI-generated summary at the top of the results is good enough most of the time and actual results are just below in case I want to see them.

For more in-depth stuff, it is LLMs by default and I only go to Google when the LLM isn't getting me what I need.

borgdefenser · a year ago
I notice I have been using the Google AI summary more and more for quick things.

I had subscribed to Perplexity for a month to use their deep research. I think it ran out earlier this week but I am really missing it Saturday morning here.

That thing is awesome. Sonnet 3.7 is more in the middle of this to me. It can help me understand all the things I found from my deep research requests.

I am surprised the hype is not more for Sonnet 3.7 honestly.

puttycat · a year ago
Agree and I'm pretty sure Google is seeing this drop internally in usage stats and are panicking. I'm also certain (but hope to be wrong) that because of this they'll be monetizing the hell out of every remaining piece of product they have (not by charging for it of course).