If this is talking about a "placeholder" index.htm that comes with the server, I think the answer is obvious: If you request that page, it will successfully find and serve it in the same way as any other page. Thus 200 is the expected response.
On the other hand, if there's no page to be served, not even a default one, then it should 404. (Unless the default config is to list the (empty) directory, and it exists, in which case 200.)
There's no need to complicate things beyond that. One thing that I've learned over many years of experience with software is that if at all possible you should never add additional conditions or edge-cases, because they will tend to create more problems than they solve. The server is behaving most consistently if it treats the placeholder page the same as any other.
Would you want a compiler that specifically detects "hello world" programs and compiles them to always return failure, under the similar argument that it's not a "real" program? Because that's the logical conclusion of this sort of inane overthinking.
It is about a fallback placeholder, which is not the same as a default placeholder: it indicates that the user did not configure their website at all. I'd say that 5xx is appropriate here. For example, you don't want Google to remember your website as "Apache installed" if its crawler happens to pass by.
You can serve a body with a 404: the user will see the appropriate message, and robots can safely ignore the page until further notice. Search engines will often retry a 404 later and slowly reduce crawling if it stays not found.
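As a sketch of what "a 404 with a body" could look like in an nginx server block (the file name and paths here are made up for illustration, not from the thread):

```nginx
# Serve the placeholder body, but with a 404 status: humans still see
# the message, while crawlers treat the site as not-yet-existing.
server {
    listen 80 default_server;
    root /var/www/placeholder;          # assumed location of placeholder.html
    error_page 404 /placeholder.html;   # body attached to every 404
    location = /placeholder.html { internal; }  # not reachable directly
    location / { return 404; }          # everything 404s until configured
}
```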
You’re not wrong, but I also believe that a lot of these problems are inherently complicated. Trying to encapsulate all the cases for a given problem will inherently have edge cases. Yes they suck and yes we should try to prevent them when possible, but I feel like ignoring them is also a huge foot-gun.
Agreed. Anything other than 200 is going to require special case handling by the server. How else would it know whether it's serving up the default index? You could add a .htaccess (or equivalent) rule but that increases the likelihood of somebody forgetting to remove it.
OTOH, if you monitor whether your web servers are working, you wouldn't want your server to answer with 200 when the configuration has been reset by mistake.
I mean, the real point is here: nobody cares what your server answers just after being set up. However, I can see how it's a problem if, for any reason, a server loses its configuration and still acts like everything is fine.
(ofc you can bypass this by monitoring a more specific url but it’s not always possible if you are not the one deciding what the server serves)
For your analogy, it’s more like you’d want your compiler to fail if you fed it with no source code.
I run Nginx Proxy Manager as a reverse proxy on my home server that has NextCloud, Mastodon, and a few other things on it.
I kept hitting an issue where random things on the internet would stop working: USPS's login page, my WiFi garage door opener, etc.
I finally tracked the problem down to BrightCloud flagging my home IP as a "proxy". We went back and forth for a while, and the one thing they eventually showed me was a number of subdomains for things I no longer had online (such as a GitLab instance) that now got the default Nginx Proxy Manager "success" page. (I have a wildcard subdomain set up, so it continues to resolve even after I take something down.)
It turns out that BrightCloud's crawler just flags any page with the word "proxy" on it - it doesn't distinguish between a reverse proxy and the open forward proxies that their customers actually care about.
I switched the configuration to serve up a 404 for unrecognized domains/subdomains and haven't had a problem since.
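In plain nginx terms (outside of Nginx Proxy Manager), the equivalent fix is a catch-all default server block, roughly like this sketch:

```nginx
# Default vhost: any Host header not claimed by another server block
# gets a bare 404 instead of a placeholder or "success" page.
server {
    listen 80 default_server;
    server_name _;
    return 404;
}
```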
Depending on how public you want your home server to be, I'd recommend either blocking IPs you don't want touching it (yes, this includes those "security scanner" services) or allowing only the ones you do.
I usually set up HTTP basic authentication for these types of things. It also prevents exploitation by bots when a zero-day is out and you haven't patched yet. The username/password pair can be trivial; even something like `foo/bar` stops pretty much all automated scanning.
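In nginx, for instance, that's just two directives in front of the proxied service (a sketch; the credentials path and backend address are assumptions):

```nginx
# Inside an existing server block. Create the credentials file first with,
# e.g., `htpasswd -cb /etc/nginx/.htpasswd foo bar` (from apache2-utils).
location / {
    auth_basic           "private";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass           http://127.0.0.1:8080;   # assumed backend
}
```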
> On the other hand, the HTTP status code does matter (sometimes a lot) to programs that hit the URL, including status monitoring programs; these will probably consider their checks to fail if the web server returns a 404 and succeed if it returns a 200. If you're pointing status checking programs at the front page of your just set up web server to make sure it's up, probably you want a HTTP 200 code (although not if the real thing you're checking is whether or not the web server and the site have been fully set up).
This is a subtle but important distinction…
There are so many layers now between a user and the application code. What if due to some misconfiguration or new image push or ___ the web server or load balancer or PaaS router or CDN or Cloudflare or whatever starts serving some default placeholder, or error message, or its own content up on my URL?
That’s why I’d argue for a non-200 status code for the default “hello” page.
And in production monitoring I’d use something like https://heiioncall.com/blog/enhanced-api-monitoring-with-exp... to verify the presence of some special header set only by your application, so you know that your desired code is actually being called. (In addition to asserting the HTTP status code.)
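The idea can be sketched with nothing but the Python standard library. The header name `X-My-App` and the in-process test server are assumptions for illustration, not something from the linked post:

```python
# Sketch: a monitor that only reports healthy when the response has BOTH
# a 200 status AND a marker header that only our application code sets.
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("X-My-App", "v1")  # set only by "our" app code
        self.end_headers()
        self.wfile.write(b"hello")

    def log_message(self, *args):  # silence request logging
        pass

# Stand-in for the real site: a local server on a random free port.
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def check(url):
    """Healthy only if status is 200 and the app marker header is present."""
    with urllib.request.urlopen(url) as resp:
        return resp.status == 200 and resp.headers.get("X-My-App") is not None

ok = check(f"http://127.0.0.1:{port}/")
server.shutdown()
print(ok)  # prints True when both checks pass
```

A CDN or load balancer serving its own default page would fail the header check even while returning 200, which is exactly the failure mode described above.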
But that is also an argument for 200. Because if you want to test your load balancer against your new web server you will want it to serve a 200 or else you will just see an error from the load balancer.
> As with other HTTP error codes, the real answer is that one should probably use whatever status code is most convenient.
That's all you need to know. HTTP, like the other components of the web stack, is an organically grown monstrosity that resembles what you would get if a thousand random people shat on a pile. Any attempt to extract philosophical purity and/or rigorous discussion from it is a massive waste of time. Just use what you feel makes sense at the current moment, and move on.
OP suggests 404 Not Found, but there's also an argument to be made for a 5xx server error. After all, this "hello" front page pops up not because of a client error, but because the server has not been configured properly.
Now you'd have to argue the semantics of what it means to be misconfigured. Is it misconfigured just because the placeholder index hasn't yet been replaced? How does the server know?
The probable reason they used an error handler for that "welcome" page is so that it would keep /var/www/html empty and any upgrades wouldn't try to replace an index.html or whatever you put there yourself. So it's a "hack" to serve a welcome page from outside the default DocumentRoot, not to force some kind of status code. That status code is just a side effect of this hack, and not really important, because whoever made it also knew the page would be the first thing you remove when you actually want to use the web server.
I build my own Apache container images (long story, Nginx and Caddy are okay too for most purposes) and I need to do health checks, so I also had to think about this.
When I launch a container that's supposed to sit in front of multiple other containers as a reverse proxy or just serve static files, I need to know whether the Apache process in it is working and actually serving files. This is regardless of the rest of the configuration and whether every site is up: for example, if 19 out of 20 sites are configured correctly and are served, the failing one can be addressed separately later.
In my case, that's as easy as the following:
healthcheck:
  test: "curl http://127.0.0.1:80 | grep 'Apache2 is up and running' || exit 1"
  interval: 10s
  timeout: 10s
  retries: 6
  start_period: 5s
(there are also separate external uptime checks for the actual sites with domains and HTTPS, too)
I serve some HTML files by default in every container with specific contents, in addition to any domains that the web server has configured. If I can access these default files, that means that the web server is up and I can then think about testing the rest of the configuration myself. In this case, I decided to check the file contents instead of the status code.
Frankly, you can get the same result with just status codes (probably a 200, IMO, because of how simple it is). You could also put some specific HTML contents in the default page to identify whether you're talking to the correct web server instance and have deployed it in the right place - maybe even page contents generated by something like PHP-FPM, to check whether scripts get executed correctly (similar to testing OpenResty with Lua), if you need that sort of thing.
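For the status-code flavor, the same compose healthcheck could look like this sketch (`-f` makes curl exit non-zero on any 4xx/5xx response):

```yaml
healthcheck:
  # Pass only if the default page answers with a non-error status.
  test: "curl -fs http://127.0.0.1:80/ > /dev/null || exit 1"
  interval: 10s
  timeout: 10s
  retries: 6
  start_period: 5s
```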
5xx is a failure to serve the page, which is not the case if a placeholder is served
No code is faster than no code
What about "518 - teapot not configured" ?
Whatever works for you.