Thousands of systems, from Google to script kiddies to OpenAI to nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.
Considering how it must be getting hammered what with the "AI" nonsense, it's interesting how crt.sh continues to remain usable, particularly the (limited) direct PostgresSQL db access
To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet
crt.sh seems to be more accessible at certain times of the day. I can remember when it had no such accessibility issues
With that said, given that (1) pre-certificates in the log are big and (2) lifetimes are shortening and so there will be a lot of duplicates, it seems like it would be good for someone to make a feed that was just new domain names.
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)
EDIT: that's the flip side of supporting HTTPS that's not well-known among developers - by acquiring a legitimate certificate for your service to enable HTTPS, you also announce to the entire world, through a public log, that your service exists.
"I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:"
The reason presented by the blog post is "for what I assume are things to scrape from"
Putting aside the "assume" part (see below^1), is this also the reason that the other "systems" are "scraping" CT logs
After OpenAI "scrapes" then what does OpenAI do with the data (readers can guess)
But what about all the other "systems", i.e., parties that may use CT logs. If the logs are public then that's potentially a lot of different parties
Imagine in an age before the internet, telephone subscriber X sets up a new telephone line, the number is listed in a local telephone directory ("the phone book") and X immediately receives a phone call from telephone subscriber Z^2
X then writes an op-ed that suggests Z is using the phone book "for who to call"
This is only interesting if X explains why Z was calling or if the reader can guess why Z was calling
Anyone can use the phone book, anyone can use ICANN DNS, anyone can use CT logs, etc.
Why does someone use these public resources. Online commenter: "To look up names and numbers"
Correct. But that alone is not very interesting. Why are they looking up the names and numbers
1.
We can make assumptions about why someone is using a public resource, i.e., what they will use the data for. But that's all they are: assumptions
With the telephone, X could ask "Why are you calling?"
With the internet, that's not possible.^3 This leads to speculation and assumptions. Online commenters love to speculate, and often to make conclusions without evidence
No one knows _everything_ that OpenAI does with the data it collects except OpenAi employees. The public only knows about what OpenAi chooses to share
Similarly no one knows what OpenAI will do with the data in the future
One could speculate that it's naive to think that, in the longterm, data collected by "AI" companies will only be used for "AI"
2. The telephone service also had the notion of "unlisted numbers", but that's another tangent for discussion
3. Hence for example people who do port scans of the IPv4 address space will try to prevent the public from accessing them by restricting access to "researchers", etc. Getting access always involves contacting the people with the scans and explaining what the requester will do with the data. In other words, removing speculation
Presumably this is well-known among people that already know about this.
P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]
> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.
Certificate transparency log is a Google project. They don’t need to scrape it. They host all the data. It’s one of those projects where Google hosts it because it thinks it genuinely improves the internet, by reducing certificate authority abuse.
This could be OpenAI, or it could be another company using their header pattern.
It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
I don't know if imitating a major crawler is really worth it, it may work against very naive filters, but it's easy to definitively check whether you're faking so it's just handing ammo to more advanced filters which do check.
I don't have a statistic here, but I'm always surprised how many websites I come across that do limited user-agent and origin/referrer checks, but don't maintain any kind of active IP based tracking. If you're trying to build a site-specific scraper and are getting blocked, mimicking headers is an easy and often sufficient step.
> Some search engines provide a list of their scraper IP ranges
Common Crawl's CCBot has published IP ranges. We aren't a search engine (although there are search engines using our data) and we like to describe our crawler as a crawler, not a "scraper".
Yep, but this comes with a tradeoff: all of your services now have a valid key/cert for your whole domain, significantly increasing the blast radius if one service is compromised.
Is it technically possible to obtain a wildcard cert from LetsEncrypt, but then use OpenSSL / X.509 tooling to derive a restricted cert/key to be deployed on servers, which only works for specific domains under the wildcard?
Unfortunately they are a bit extra bothersome to automate (depending on your DNS provider/setup) because of the DNS CNAME-method validation requirement.
Yep, but next year they intend to launch an alternative DNS challenge which doesn't require changing DNS records with every renewal. Instead you'll create a persistent TXT record containing a public key, and then any ACME client which has the private key can keep requesting new certs forever.
If you are using a non-standard DNS provider that doesn’t have integration with certbot or cert-manager or whatever you are using, it is pretty easy to set up an acme-dns server to handle it
May I know does Caddy automatically update with apt if you built custom Caddy binaries for the DNS provider plugin?
Also, may I know which DNS provider you went with? The GitHub issues pages with some of the DNS provider plugins seems to suggest some are more frequently maintained, while some less so.
I don't understand the outrage in some of the comments. The certificate transparency logs are literally meant to be read by absolutely whoever wants to read them. The clue is right in the name. It's transparency logs! Transparency!
I just don't understand how people with no clue whatsoever about what's going on feel so confident to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?
Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!
People are just raging and want an outlet. They aren’t thinking logically.
It’s been going on forever (remember how companies were reading files off your computer aka cookies in 1999?)
This seems like a total non-issue and expect that any public files are scraped by OpenAI and tons others. If I don’t want something scraped, I don’t make it public.
The ostensible purpose of the certificate transparency logs is to allow validation of a certificate you're looking at - I browse to https://poormathskills.com and want to figure out the details of when its cert was issued.
The (presumably) unintended, unexpected purpose of the logs is to provide public notification of a website coming online for scrapers, search engines, and script kiddies to attack it: I could register https://verylongrandomdomainnameyoucantguess7184058382940052... and unwisely expect it to be unguessable, but as it turns out OpenAI is going to scrape it seconds after the certificate is issued.
The main thing isn't validating the cert you're looking at, per se, it's to validate the activities of the issuers. Mainly that they aren't issuing certificates they aren't supposed to (i.e. you can monitor the logs for your domain to check some random CA you've never done business hasn't issued a cert for it). This is mainly enforced by any violations (i.e. any certificates found that don't show up in the logs) being grounds for being removed from browser's trusted list.
For many years now. The crawlers, scanners, and bots start hammering a website within a minute of a certificate being issued. Remember to get your garbage WCM installed and secured before installing the real certificate as you have about a 15 second window before they're hammering around for fresh wordpress installs. Granted, you people are all smart enough to have all that automated using a CI/CD pipeline so that you just commit a single file with the domain name to a git repo and all that magic happens.
Yet they provide the user agents and IP address ranges which they scrape from, and say they respect robots.txt
I run a web server and so see a lot of scrapers, but OpenAI is one of the ones that appear to respect limits that you set. A lot of (if not most) others don't even have that ethics standard so I'd not say that "OpenAI scrapes everything they can access. Everything" without qualification, as that doesn't seem to be true, at least not until someone puts a file behind a robots deny page and finds that chatgpt (or another of openai's products) has knowledge of it
There's no evidence the barrage of residentially-proxied bot accesses hitting every public website have anything to do with OpenAI, but then again, there's also no evidence they don't.
Looking around at the comments, I have a birds-eye view. People are quite skilled at jumping to conclusions or assuming their POV is the only one. Consider this simplified scenario to illustrate:
- X happened
- Person P says "Ah, X happened."
- Person Q interprets this in a particular way
and says "Stop saying X is BAD!"
- Person R, who already knows about X...
(and indifferent to what others notice
or might know or be interested in)
...says "(yawn)".
- Person S narrowly looks at Person R and says
"Oh, so you think Repugnant-X is ok?"
What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.
See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum
* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point at blaming individuals when such failures are a near statistical certainty.
I agree with your analysis but try to not agree with your conclusion, purely for my own metal hygiene: I believe one can retrain the pattern matching of one’s brain for happier outcomes. If I let my brain judge this as a “failure“ (judgment “it is wrong“), I will either get sad about it (judgment “… and I can’t change it“) or angry (… and I can do something about it“). In cases such as this I prefer to accept it as is, so I try to rewrite my brain rule to consider it a necessary part of life (judgment “true/good/correct“).
Ah, in case it didn't come across clearly, my conclusion isn't to blame the individuals. My assessment is to seek out better communication patterns, which is partly about "technology" and partly about culture (expectations). People could indeed learn not to act this way with a bit of subtle nudging, feedback, and mechanism design.
I'm also pretty content accepting the unpleasant parts of reality without spin or optimism. Sometimes the better choice is still crappy, after all ;) I think Oliver Burkeman makes a fun and thoughtful case in "The Antidote: Happiness for People Who Can't Stand Positive Thinking" https://www.goodreads.com/book/show/13721709-the-antidote
(the site may occasionally fail to load)
https://www.merklemap.com/search?query=ycombinator.com&page=...
Entries are indexed by subdomain instead of by certificate (click an entry to see all certificates for that subdomain).
Also, you can search for any substring (that was quite the journey to implement so it's fast enough across almost 5B entries):
https://www.merklemap.com/search?query=ycombi&page=0
To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet
crt.sh seems to be more accessible at certain times of the day. I can remember when it had no such accessibility issues
For example:
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)Needs clarification. What reason
EDIT: that's the flip side of supporting HTTPS that's not well-known among developers - by acquiring a legitimate certificate for your service to enable HTTPS, you also announce to the entire world, through a public log, that your service exists.
"I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:"
The reason presented by the blog post is "for what I assume are things to scrape from"
Putting aside the "assume" part (see below^1), is this also the reason that the other "systems" are "scraping" CT logs
After OpenAI "scrapes" then what does OpenAI do with the data (readers can guess)
But what about all the other "systems", i.e., parties that may use CT logs. If the logs are public then that's potentially a lot of different parties
Imagine in an age before the internet, telephone subscriber X sets up a new telephone line, the number is listed in a local telephone directory ("the phone book") and X immediately receives a phone call from telephone subscriber Z^2
X then writes an op-ed that suggests Z is using the phone book "for who to call"
This is only interesting if X explains why Z was calling or if the reader can guess why Z was calling
Anyone can use the phone book, anyone can use ICANN DNS, anyone can use CT logs, etc.
Why does someone use these public resources. Online commenter: "To look up names and numbers"
Correct. But that alone is not very interesting. Why are they looking up the names and numbers
1.
We can make assumptions about why someone is using a public resource, i.e., what they will use the data for. But that's all they are: assumptions
With the telephone, X could ask "Why are you calling?"
With the internet, that's not possible.^3 This leads to speculation and assumptions. Online commenters love to speculate, and often to make conclusions without evidence
No one knows _everything_ that OpenAI does with the data it collects except OpenAi employees. The public only knows about what OpenAi chooses to share
Similarly no one knows what OpenAI will do with the data in the future
One could speculate that it's naive to think that, in the longterm, data collected by "AI" companies will only be used for "AI"
2. The telephone service also had the notion of "unlisted numbers", but that's another tangent for discussion
3. Hence for example people who do port scans of the IPv4 address space will try to prevent the public from accessing them by restricting access to "researchers", etc. Getting access always involves contacting the people with the scans and explaining what the requester will do with the data. In other words, removing speculation
Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.
P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]
> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.
[1]: https://www.nature.com/articles/s41562-023-01719-1
Dead Comment
It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
https://openai.com/searchbot.json
I don't know if imitating a major crawler is really worth it, it may work against very naive filters, but it's easy to definitively check whether you're faking so it's just handing ammo to more advanced filters which do check.
We think we're so different from animals https://en.wikipedia.org/wiki/Mimicry
Common Crawl's CCBot has published IP ranges. We aren't a search engine (although there are search engines using our data) and we like to describe our crawler as a crawler, not a "scraper".
Then all they know is the main domain, and you can somewhat hide in obscurity.
https://letsencrypt.org/2025/12/02/from-90-to-45#making-auto...
https://github.com/joohoi/acme-dns
Also, may I know which DNS provider you went with? The GitHub issues pages with some of the DNS provider plugins seems to suggest some are more frequently maintained, while some less so.
I just don't understand how people with no clue whatsoever about what's going on feel so confident to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?
Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!
It’s been going on forever (remember how companies were reading files off your computer aka cookies in 1999?)
This seems like a total non-issue and expect that any public files are scraped by OpenAI and tons others. If I don’t want something scraped, I don’t make it public.
The (presumably) unintended, unexpected purpose of the logs is to provide public notification of a website coming online for scrapers, search engines, and script kiddies to attack it: I could register https://verylongrandomdomainnameyoucantguess7184058382940052... and unwisely expect it to be unguessable, but as it turns out OpenAI is going to scrape it seconds after the certificate is issued.
I run a web server and so see a lot of scrapers, but OpenAI is one of the ones that appear to respect limits that you set. A lot of (if not most) others don't even have that ethics standard so I'd not say that "OpenAI scrapes everything they can access. Everything" without qualification, as that doesn't seem to be true, at least not until someone puts a file behind a robots deny page and finds that chatgpt (or another of openai's products) has knowledge of it
See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum
* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point at blaming individuals when such failures are a near statistical certainty.
I'm also pretty content accepting the unpleasant parts of reality without spin or optimism. Sometimes the better choice is still crappy, after all ;) I think Oliver Burkeman makes a fun and thoughtful case in "The Antidote: Happiness for People Who Can't Stand Positive Thinking" https://www.goodreads.com/book/show/13721709-the-antidote