In particular, it gets into the heart of the matter: What does the user want to happen when they click the Refresh button?
It does seem worthwhile to try to change the default behavior of the Refresh button to mean "refresh the page" instead of "fix the page" (what it currently does), which would make this "immutable" proposal unnecessary, AFAICT.
IIRC this is exactly what the reload button used to do. You had to hold down what I believe was the Control key while pressing the reload button to do a "force refresh". Now it would seem it's the default behaviour. That, or maybe a normal refresh does the revalidation checks (which return 304), while a Control-refresh does a full download of all resources?
Generally speaking, browsers behave as they always have, though they've acquired additional nuance since the HTTP/1.0 and earlier days. IE had a cache-control bug for many years that made it impossible to force a reload in some circumstances, but it was fixed in IE 6.
The change is on the server side, not the browser. Modern single page applications do all kinds of janky things, and a lot of them break caching, either explicitly (with cache-control headers) or accidentally (with uncacheable URLs).
As far as I know, every major browser has standards compliant cache-control implementations, and all have some way to force a full reload.
Source: I worked on cache-control browser compatibility in Squid many years ago. The browsers took a while, but did get it right eventually.
In at least recent versions of Firefox and Chrome, a reload
includes a `Cache-Control:max-age=0` with the request. During a forced reload (e.g. Shift-Reload), both the legacy `Pragma:no-cache` (HTTP/1.0) and the more modern `Cache-Control:no-cache` (HTTP/1.1) headers are sent.
Judging by the comments here, there seems to be some confusion.
This is exactly like long-lived cache settings today. Right now browsers send a request on basic reloads and get back a 304 from the server which states that nothing has changed. All this setting does is let the server tell the browser to skip that check/roundtrip instead of wasting the time/bandwidth on confirming with a 304 after the initial load.
The browser is still completely in control here and can do a full reload or just reload all the time if it wants to. Web scrapers and other HTTP clients are unaffected.
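The 304 round trip that immutable lets the server skip can be sketched as server-side logic. A minimal illustration with hypothetical names, not any particular framework:

```python
from email.utils import parsedate_to_datetime

def respond(last_modified, if_modified_since=None):
    """Decide between a full 200 response and a cheap 304 revalidation.

    Both arguments are HTTP-date strings, e.g. "Sat, 29 Oct 1994 19:43:31 GMT".
    """
    if if_modified_since is not None:
        try:
            if parsedate_to_datetime(if_modified_since) >= parsedate_to_datetime(last_modified):
                return 304  # nothing changed: headers only, no body
        except (TypeError, ValueError):
            pass  # malformed validator: fall through to a full response
    return 200  # send the whole resource

# First load: no validator, full download.
assert respond("Sat, 29 Oct 1994 19:43:31 GMT") == 200
# Reload with an up-to-date validator: the round trip still happens,
# but the answer is a tiny 304 -- this is the check immutable removes.
assert respond("Sat, 29 Oct 1994 19:43:31 GMT",
               "Sat, 29 Oct 1994 19:43:31 GMT") == 304
```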
We use webpack and all filenames are just SHA hashes of the contents in the production builds. There is no need for the browser to ever ask anything about that file again (unless it's purged from the cache...).
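That content-hashed naming scheme boils down to putting a digest of the file's bytes in its name. A minimal sketch; the SHA-256 choice and 12-character truncation are illustrative, not necessarily what webpack does:

```python
import hashlib

def hashed_name(content: bytes, basename: str, ext: str) -> str:
    # A short prefix of the SHA-256 digest is enough to make the URL
    # change whenever the content changes.
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"{basename}.{digest}.{ext}"

# Same content -> same name, so the browser can cache it forever;
# any edit -> new name, so there is nothing to revalidate.
assert hashed_name(b"body { color: red }", "app", "css") == \
       hashed_name(b"body { color: red }", "app", "css")
assert hashed_name(b"body { color: red }", "app", "css") != \
       hashed_name(b"body { color: blue }", "app", "css")
```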
This has been the standard way of serving assets in Rails for the last 5+ years. I don't think it was invented there; if you are using a CDN it's basically required.
Invalidating edge caches takes time and/or is expensive (e.g. CloudFront), so adding the content hash to the file name is a good trick to ensure users always get the correct version of the asset.
What happens if a random WordPress blog's front page (/) is compromised and has malware injected, setting the immutable keyword? Cloudflare and Let's Encrypt mean most sites will be HTTPS sooner or later, so the HTTPS part will be "taken care of". (At least that's better than nothing; imagine the power granted to captive wifi portals otherwise!)
I think it would be bad practice to use this keyword on endpoint URLs that are advertised in search engines or API documentation.
You would want to use it for resources to which base pages and manifests point; such as JS, CSS, JPG, PNG, etc.
The browser could enforce that, sort of. It could ignore immutable cache status on the object that is actually in the browser location bar and revalidate it with If-Modified-Since, but it could allow it on referenced objects.
The idea is that referenced objects can simply stop being referenced, and a fresh object is referenced.
There's max-age support, the ability to preload pins in the browser, and Certificate Transparency to work around this; see section 4.5: https://tools.ietf.org/html/rfc7469#page-21
As to this original point, it would be best if this didn't apply to the address bar URL / main document request. But it's a good point, worth considering. Perhaps the UA should set a timer, and two or three refreshes in a row would be the equivalent to the prior refresh behaviour.
Or simply the domain is resold, but old visitors still see a page from a year back. Immutable is useful, but max-age should be capped at a few hours, which is an acceptable timeframe for internet disruptions (e.g. DNS).
With these, you fetch data by its hash. You provide a primary URL where the item is known to exist, but the browser is free to fetch the data from any proxy (or local cache) with the same hash.
This could replace package mirroring, git clones, parts of bittorrent, CDNs and more.
It does assume that you use a hash with enough bits that collisions are extremely unlikely, and also that your hash is cryptographically strong (else a rogue proxy can inject data).
Exactly! IPFS already has a web proxy to their content addressed network. (e.g https://ipfs.io/ipfs/QmXZnH2WVmFoiE7tRJQk9QstLGhSKpVyEQ4Rywx... ) And hopefully browsers will learn to speak the protocol natively so then there's no need for a http proxy at all.
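The property that makes fetching from arbitrary mirrors safe is that the client verifies the digest itself. A sketch assuming plain SHA-256 addresses (IPFS actually uses multihash-encoded addresses):

```python
import hashlib

def verify(data: bytes, expected_hex: str) -> bool:
    # Any proxy or mirror may serve the bytes; only the hash is trusted.
    return hashlib.sha256(data).hexdigest() == expected_hex

payload = b"cached resource"
addr = hashlib.sha256(payload).hexdigest()  # the content address
assert verify(payload, addr)            # honest mirror: accepted
assert not verify(b"tampered", addr)    # rogue proxy: detected and rejected
```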
"Immutable" with "max age" is an oxymoron. If it expires it isn't immutable. Use another word.
What you need is an absolute date and time in the cache header which says "we promise this page does not change before this date and time". This could be treated as a "lease" and automatically extended in some configurable intervals. For instance, if it is 30 days, then the file is good for 30 days since its modification time stamp. When that time passes, this is renewed automatically: it is now good for 60 days since its modification time stamp. Basically, it is always good for N*30 days since its modification time stamp, where N is the smallest N required for that time to be in the future.
When the webmaster publishes a new version of the file, he or she knows precisely when browsers that have cached the previous version will start picking up the new one. Changes can be co-ordinated with the expiry time to minimize the refresh lag: the time between when the earliest new client sees the new page and the last old client stops seeing the old one. If we know that a page expires for everyone on June 1, 2016 at noon, we can update that page in the morning on June 1. By afternoon, everyone sees the new one.
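The lease arithmetic described above (good for N*30 days since modification, for the smallest N that puts expiry in the future) can be written out directly. A sketch with illustrative names:

```python
import math
from datetime import datetime, timedelta, timezone

def lease_expiry(modified: datetime, now: datetime,
                 period: timedelta = timedelta(days=30)) -> datetime:
    """Smallest multiple of `period` after `modified` that is in the future."""
    n = math.floor((now - modified) / period) + 1
    return modified + max(n, 1) * period

mtime = datetime(2016, 1, 1, tzinfo=timezone.utc)
now = datetime(2016, 2, 15, tzinfo=timezone.utc)   # 45 days later
# 45 days since modification -> the lease is renewed out to 60 days (N = 2).
assert lease_expiry(mtime, now) == mtime + timedelta(days=60)
```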
Yes I think it would still be a good idea to require expiry date for "immutable" content, just as a safety net if something is misconfigured somewhere etc - then when you fix a bug, you will know the precise time at which it will be gone for everyone (hopefully the expiry date was not set to 10 years).
However, I wonder what the typical cache lifetime of resources on the current web is. IIRC someone on HN posted about a week ago that, according to their study, it's rather short: stuff is evicted from the cache quite rapidly if not used, so fast that getting a cache hit for jQuery from a CDN is quite unlikely.
> I've learned to press Enter on the URL line
> instead of reload for exactly this reason.
Yes, this is what I do too.
The browser can do one of three things:
1. Serve the file from cache. This is what happens when you put the cursor in the address bar and hit Enter. Well, at least for ancillary files. It likely still will ask whether the main HTML document has changed. But it will load CSS and JavaScript files from its local cache --- if the webmaster properly set the HTTP headers, like Expires, to tell the browser that it can cache the files.
2. Ask the server whether the file has changed. This is what happens when you click the Reload button. This is the area of dispute. The article is saying it would probably be better if the browser acted just like it does when you put the cursor in the address bar and hit Enter. Instead the browser seems to check not only the main document file but also every single CSS, JavaScript, image, whatever, file. It doesn't redownload them all, but it sends an If-Modified-Since header, to ask whether they have changed, and then requests the whole file only for ones that the server says have changed. The payload back and forth is usually just a few hundred bytes for files that have not changed. But the network requests take a noticeable slice of time, because it's one request per file.
3. Ask the server for the whole file, regardless. This is usually when you hit Shift and Reload.
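The three behaviours map onto the request headers mentioned earlier in the thread. A sketch where the mode names are mine; real browsers also attach validators (If-Modified-Since / If-None-Match) in the revalidating case:

```python
def reload_headers(mode: str) -> dict:
    """Cache-related request headers per reload mode, per the thread above."""
    if mode == "address-bar":   # 1. serve fresh files straight from cache
        return {}
    if mode == "reload":        # 2. revalidate everything (expect many 304s)
        return {"Cache-Control": "max-age=0"}
    if mode == "force-reload":  # 3. bypass caches and redownload everything
        return {"Cache-Control": "no-cache", "Pragma": "no-cache"}
    raise ValueError(mode)

assert reload_headers("address-bar") == {}
assert reload_headers("reload") == {"Cache-Control": "max-age=0"}
# The legacy HTTP/1.0 Pragma header rides along on a forced reload.
assert "Pragma" in reload_headers("force-reload")
```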
I apologize if someone already mentioned this, but there's a way to eliminate the penalty to the user without changing HTTP at all. Browsers can simply check all the non-expired resources after all the rest. The latency is the same, but we still do the checks, just after everything else, while we're already rendering the page. Only if one of those resources actually changed do we re-render the page.
The immutable solution is cleaner, and doesn't load the server as much, but it's not backward compatible and requires the people who run the server to know what they're doing. Maybe the two solutions could be combined?
The biggest potential drawback I see is that maybe most resources, including the HTML, don't expire, so every page will be rendered and then re-rendered, giving little benefit and making the rendering choppier. Some of that could maybe be mitigated by starting the rendering in the background and not displaying it until a certain percentage of requests return, or by special-casing the "page" itself as opposed to page resources.
What you're suggesting won't work. The problem is that a lot of pages require their resources to be loaded in a specific order. The C in CSS stands for cascading, which means that rules loaded later override earlier ones (if selector specificity matches). The same goes for JS, since later scripts might depend on frameworks or libraries loaded earlier, unless the script has the async attribute. And then there is the content loaded by the CSS and JS themselves, which in most modern web apps makes up the majority of the content.
Nonsense. You have the CSS and JavaScript files; you just aren't certain that they're the most recent version. So you go ahead and render the page using either the version you have, or requesting it from the server, still doing everything in order. Then you validate in the background your assumption that the stale versions you used are still current. If your assumption is right (and mostly it will be) nothing happens. If it's wrong, you re-render the page, again all in order.
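That render-then-revalidate flow can be modelled in a few lines. A toy sketch with hypothetical names, not browser internals:

```python
def load_page(cache, fetch):
    """Render from cache immediately, revalidate afterwards, and
    re-render only if a resource actually changed."""
    rendered = dict(cache)              # 1. render with what we have, in order
    changed = False
    for url, cached_body in cache.items():
        fresh = fetch(url)              # 2. background validation
        if fresh != cached_body:
            rendered[url] = fresh       # 3. re-render only on a real change
            changed = True
    return rendered, changed

cache = {"/app.css": b"old"}
page, rerendered = load_page(cache, lambda url: b"old")
assert not rerendered                   # assumption held: nothing happens
page, rerendered = load_page(cache, lambda url: b"new")
assert rerendered and page["/app.css"] == b"new"
```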
Is there a danger here of getting a corrupt resource, then no matter how many times you mash reload it never gets fixed? What do we have to stop this... I don't think CSS files have a SHA checksum header by default do they?
Correcting possible corruption (e.g. shift reload in Firefox) never uses conditional revalidation and still makes sense to do with immutable objects if you're concerned they are corrupted.
And Subresource Integrity would ensure that the value in the URL matches the hash of the content. Files named by their hash can be treated as immutable with confidence.
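For reference, an `integrity` attribute value is just an algorithm label plus a base64 digest of the file's bytes. A sketch using SHA-384, a commonly recommended SRI algorithm:

```python
import base64
import hashlib

def sri_hash(content: bytes, alg: str = "sha384") -> str:
    # Format used by the `integrity` attribute: "<alg>-<base64 digest>"
    digest = getattr(hashlib, alg)(content).digest()
    return f"{alg}-{base64.b64encode(digest).decode('ascii')}"

tag_value = sri_hash(b"alert('hi');")
assert tag_value.startswith("sha384-")
# The browser recomputes the digest over the downloaded bytes and
# refuses to use the file if it does not match, so a corrupt or
# tampered cached copy cannot silently survive.
assert sri_hash(b"alert('hi');") != sri_hash(b"alert('bye');")
```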
How do I do shift reload on my smartphone browser? And how do I even know what shift reload is (most people won't) or that the site is corrupted (it still shows something - how do I know that it isn't the most recent stuff)?
And for my general understanding of this proposal: even if the current domain owner might guarantee that the content never changes, the domain can switch to another owner who might reuse the same paths but of course wants to put different content there. Is this somehow covered?
If the browser is still in control anyway, wouldn't a "strict" cache control make a whole lot more sense?
And yes, having a more uniform and standardized approach to caching would be better.
A possible solution would be to only use the immutable header together with the `integrity` HTML attribute.
> Immutable in Firefox is only honored on https:// transactions.
https://rwmj.wordpress.com/2013/09/09/half-baked-idea-conten...
Also, there is Subresource Integrity, which adds a hash to the including tags and which, if integrated correctly with the caching logic, could catch this: https://developer.mozilla.org/en-US/docs/Web/Security/Subres...