jhpacker commented on Yes, OpenAI Scrapes Google Search   quantable.com/ai/openai-s... · Posted by u/jhpacker
n1xis10t · 2 months ago
Oh, yes you’re right. I didn’t read the whole thing, which was clearly a mistake. I apologize. Good job with the article, and thank you. It’s pretty funny that they only get the excerpts.
jhpacker · 2 months ago
No worries & thanks! Yea I didn't expect to get them with the excerpts too, that was a surprise bonus.
jhpacker commented on Yes, OpenAI Scrapes Google Search   quantable.com/ai/openai-s... · Posted by u/jhpacker
n1xis10t · 2 months ago
I’m not sure that your test is conclusive. It would be if the only way that OpenAI indexed new websites was to find them linked from other places, but I think they have other methods as well.

This article from Cloudflare is about the behavior of Perplexity’s crawler: https://blog.cloudflare.com/perplexity-is-using-stealth-unde...

Basically Cloudflare put up a website with a robots.txt file that banned Perplexity’s crawler, and Perplexity crawled it anyway. The article was focussed on the rudeness of the crawler, but there is something else interesting here too. About their test domains they said, “These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way.” I think that this would mean they didn’t do anything to have other websites link to them.
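
For readers unfamiliar with how a robots.txt ban is supposed to work, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page. It uses Python's standard `urllib.robotparser`. The robots.txt content, the `PerplexityBot` user-agent token, and the `example.com` URLs are assumptions for illustration; they are not taken from Cloudflare's actual test setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans one named crawler and allows everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/")` is false while the same call for `"Googlebot"` is true. Nothing enforces the result: robots.txt is a convention, so a crawler that skips this check, or sends a different user-agent string, can still fetch the page.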

One way that you could find websites that aren’t linked to by anyone would be to use something like zmap, which can scan the whole IPv4 address space in about 45 minutes with a good internet connection. You would tell it to send packets to ports commonly used by HTTPS servers, and then after a scan you would have a list of all homepages, whether or not they are linked to from anyone. In fact, you’d even get ones that don’t have a domain name, but are just IP addresses. It is harder to scan IPv6 because it’s so much bigger, but most things are still IPv4 I believe, and I think there are still supposed to be ways to do it. I think there are blocks of IP addresses that are more likely to be used. There is a search engine called Shodan that lets you search for servers, and they do scans like this. Scanning in this fashion might be illegal in some places, but I think it’s legal in America, where Shodan and Perplexity and OpenAI all are.

So that’s one way you could do it. There might be another way too though. I own a couple domains that I haven’t done anything with yet, and I used Ahrefs’ backlink checker thing on them, and even though I hadn’t done anything they were actually linked to by some websites that were like “List of newly registered domains for this month”. I don’t think these people found my domains by scanning, because I didn’t have servers running so a scan shouldn’t have picked them up unless Cloudflare had some response, but I don’t think they did. They may have gotten the information somehow from the registrar, or maybe from a higher up like ICANN. It’s public information what is registered, and they might have gotten a list somehow.

jhpacker · 2 months ago
Unlinked domains can definitely be found in a lot of ways, but as I show in the article, nothing fetched the page except Googlebot. So even if the hostname was leaked somehow, getting the contents of the page requires fetching the page, and only Google did that. Also, as I show in the article, the content that ChatGPT knows exactly matches what's in a Google search snippet, down to where the word break falls.
jhpacker commented on Oddest ChatGPT leaks yet: Cringey chat logs found in Google Analytics tool   arstechnica.com/tech-poli... · Posted by u/vlod
Legend2440 · 2 months ago
>Packer thinks his testing of ChatGPT leaks may be evidence that OpenAI not only scrapes “SERPs in general to acquire data,” but also sends user prompts to Google Search.

Seems more likely that this is an erroneous call to the search tool? It certainly does call Google Search when it thinks it will help answer a question, although it does not normally send the entire prompt.

jhpacker · 2 months ago
What I am saying is that this was a glitch where the full prompt rather than a translated prompt was sent to Google Search. OpenAI says they fixed the glitch, so yes it was definitely an error on their part. My research doesn't show how to repro that error, just that it existed.
jhpacker commented on Oddest ChatGPT leaks yet: Cringey chat logs found in Google Analytics tool   arstechnica.com/tech-poli... · Posted by u/vlod
thesumofall · 2 months ago
I found it really difficult to figure out if it is safe to turn on web search in a corporate (or privacy critical) environment. The AI companies generally seem to claim it’s safe, but looking at this + the general ability of Google Search Console to display search queries, the answer seems to be no? Has anyone already found definitive proof either way (beyond the bug(?) in this article)?
jhpacker · 2 months ago
I definitely wouldn't! Beyond the "glitch" I'm reporting here where full prompts are seemingly sent to GSC, it may still be that searches are scraped... meaning that while it's less obviously personal than a raw prompt it still could leak user intent.
jhpacker commented on Oddest ChatGPT leaks yet: Cringey chat logs found in Google Analytics tool   arstechnica.com/tech-poli... · Posted by u/vlod
cortesoft · 2 months ago
My bigger takeaway here is that I didn’t realize that Google shared the full text of search queries that users used that resulted in a site being returned.

That seems like a bigger personal data leak than OpenAI doing anything here.

jhpacker · 2 months ago
GSC does filter and threshold what shows, but that doesn't always work 100%. Also those filters are built to work against traditional keyword searches, not prompts. It's also supposed to threshold low volume queries, which should have kept a lot of prompts out of GSC, but for whatever reason that wasn't very effective.
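
To make "threshold low volume queries" concrete, here is a minimal sketch of the general idea: only report a query once enough searches have used it. This illustrates the concept and is not Google's actual implementation, which isn't public. The function name `visible_queries`, the normalization step, and the `min_count` value are all assumptions.

```python
from collections import Counter


def visible_queries(query_log, min_count=3):
    """Return {query: count} for queries seen at least min_count times.

    Hypothetical illustration of a low-volume threshold; the cutoff and the
    normalization (trim + lowercase) are assumed, not documented behavior.
    """
    counts = Counter(q.strip().lower() for q in query_log)
    return {q: n for q, n in counts.items() if n >= min_count}
```

Under a rule like this, a common keyword search repeated by many people clears the cutoff, while a long one-off prompt appears once and should be dropped. That is why it was surprising to see prompts show up at all.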

I've worked in many GSC consoles over the years, and I've never seen anything like what I saw in this case. (I'm the original author)

jhpacker commented on The dueling truths of AI, both grift and revolution   quantable.com/ai/the-dual... · Posted by u/jhpacker
jhpacker · 6 months ago
Ok this is pretty funny. Just like any good internet commenter this bot didn't actually read the article... which doesn't actually say "AI is a tool, not a panacea" anywhere.

u/jhpacker

Karma: 162 · Cake day: March 11, 2016
About
Analytics architect at quantable.com