Feature #3055

closed

Load balancer friendly ETag

Added by stokito over 3 years ago. Updated over 3 years ago.

Status:
Invalid
Priority:
Low
Category:
core
Target version:
-
ASK QUESTIONS IN Forums:
No

Description

Different web servers generate different ETag headers, and this creates a problem when load balancing between them.
For example, if a website ran on LightHttpd and then switched to Nginx, all files will be re-downloaded because the ETag changed.
Lighttpd uses INode-Size-MTime, which is then dekhashed: https://redmine.lighttpd.net/projects/lighttpd/repository/14/revisions/b700a8ca09b31cfc00ea6a3b6592b233761d5643/entry/src/http_etag.c
It may also use nanoseconds, which are not supported on all platforms (such as embedded OSes or Java/Tomcat/Jetty), i.e. theoretically ETags may differ even between two lighttpd servers.
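
For illustration only, here is a rough sketch of the kind of scheme described above: the file's inode, size, and mtime are joined into a string and folded through a DEK-style rotate-and-xor hash. This is not the lighttpd source. The function names, the field order, the "-" separator, and the 32-bit width are all assumptions made for this example.

```python
def dek_style_hash(data: bytes) -> int:
    """Fold bytes into a 32-bit value with a DEK-style rotate-and-xor step."""
    h = 0
    for byte in data:
        # Rotate the 32-bit state left by 5 bits, then mix in the next byte.
        h = (((h << 5) ^ (h >> 27)) ^ byte) & 0xFFFFFFFF
    return h


def hashed_file_etag(inode: int, size: int, mtime: int) -> str:
    """Build a quoted ETag from hashed file metadata (hypothetical helper)."""
    raw = f"{inode}-{size}-{mtime}".encode("ascii")
    return f'"{dek_style_hash(raw)}"'
```

Because the inode number is part of the input, two servers holding identical copies of a file would normally produce different values under a scheme like this.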

Nginx is probably the most popular balancer and is often used as a lighter replacement for Apache, especially for static files (almost all CDNs use Nginx).
It uses an ETag of the form "hex(MTime)-hex(Size)", and this can't be changed by configuration. The same ETag is used in BusyBox httpd, which is an extremely small server for embedded devices. Such devices may even be solar/potato powered, and it would be nice to have a simple ETag scheme (i.e. without hashing). So it makes sense to support at least this kind of ETag even if it's not ideal.
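
A minimal sketch of the "hex(MTime)-hex(Size)" form described above, assuming whole-second mtimes and lowercase hex. The function names are hypothetical, and the code is not taken from Nginx or BusyBox.

```python
import os


def mtime_size_etag(mtime_seconds: int, size: int) -> str:
    """Format a quoted ETag as hex(mtime)-hex(size)."""
    return f'"{mtime_seconds:x}-{size:x}"'


def etag_for_path(path: str) -> str:
    """Derive the same kind of ETag from a file on disk."""
    st = os.stat(path)
    # Sub-second precision is deliberately dropped here.
    return mtime_size_etag(int(st.st_mtime), st.st_size)


# mtime_size_etag(255, 16) returns '"ff-10"'
```

Two servers would agree on this value only if they also agree on the file's modification time and size, for example when files are synced with timestamps preserved.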

Could you please add a new config option like ETAG_LOAD_BALANCED that forces lighttpd to generate an Nginx/BusyBox-style ETag?
This option should not discard native lighty ETags on comparison, to avoid re-downloading.
I can try to create a patch and send it to you if you accept this issue.

Actions #2

Updated by gstrauss over 3 years ago

  • Status changed from New to Invalid
  • Priority changed from Normal to Low
  • Target version deleted (1.4.x)

I think that you may be overly focused on the ETag format and might not be taking into account the importance of ETag (intended) uniqueness and opaqueness.

https://tools.ietf.org/html/rfc7232
https://tools.ietf.org/html/rfc7232#section-2.3

An entity-tag consists of an opaque quoted string

You may be misunderstanding the specification (opaque quoted string) with your recommendation that the ETag be well-defined and without hashing, and therefore not opaque.

Some OS and some filesystems have high-precision timestamps, and lighttpd tries to use those when available, in order to make a stronger (more unique) ETag. With entity-caching, the goal is to uniquely identify the resource (without making it onerous and expensive) so that changes can be detected. For correctness, it is better to accidentally re-download something that has not changed than it is to incorrectly believe that a cached copy is still fresh when the resource on the origin server has changed. Some more expensive ETag generation schemes involve MD5 or SHA256 of the content, and are appropriate when even stronger uniqueness is required.
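
The trade-off above can be sketched as follows. The example assumes a file rewritten twice within the same second with content of the same length: a validator built from whole-second mtime and size cannot tell the two versions apart, while a content digest can. The function names and sample data are hypothetical.

```python
import hashlib


def mtime_size_etag(mtime_seconds: int, size: int) -> str:
    """Cheap validator: whole-second mtime and size only."""
    return f'"{mtime_seconds:x}-{size:x}"'


def content_etag(body: bytes) -> str:
    """More expensive validator: SHA-256 digest of the content."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


old_body = b"color: red;"
new_body = b"color: tan;"  # same length, written within the same second
mtime = 1_595_000_000      # whole-second timestamp shared by both writes

same_cheap = mtime_size_etag(mtime, len(old_body)) == mtime_size_etag(mtime, len(new_body))
same_digest = content_etag(old_body) == content_etag(new_body)
# same_cheap is True (change goes undetected); same_digest is False
```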

Meta: professional websites of large companies often uniquely name resources such as images, and then add a Cache-Control header to allow caching for a week or more, being more explicit than a simple ETag. If a new version of an image is created, it is given a new name, and the pages (or templates) referring to the image are updated to use the new name for the image.
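
One way to picture the naming approach described above is the sketch below, which embeds a short content digest in the file name and pairs it with a one-week Cache-Control value. The helper names, the 12-character digest length, and the "public" directive are assumptions chosen for the example.

```python
import hashlib
import os.path

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


def fingerprinted_name(filename: str, body: bytes, digest_chars: int = 12) -> str:
    """Return e.g. 'logo.<digest>.png' so new content gets a new name."""
    stem, ext = os.path.splitext(filename)
    digest = hashlib.sha256(body).hexdigest()[:digest_chars]
    return f"{stem}.{digest}{ext}"


def long_lived_cache_headers() -> dict:
    """Headers allowing caches to keep the uniquely named file for a week."""
    return {"Cache-Control": f"public, max-age={ONE_WEEK_SECONDS}"}
```

With names like these, a changed image is simply a different URL, so clients do not need to revalidate the old one.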

Given this general information, I do not clearly understand the problem that you think you are solving by "unifying" the ETag format, other than directly contravening the recommendations in the RFC 7232 specification.

In other words, if you need better control over resource caching, you should be employing Cache-Control in addition to ETag. Cache-Control is more powerful than ETag and this probably makes your focus on ETag unnecessary.

Important: the ETag is supposed to change for different encodings of a resource, e.g. different languages or different compression (gzip, deflate, brotli). Your suggested "standard" does not take this into account. Similarly, lighttpd generates an ETag for server-side-includes, which combines information from multiple files into an opaque and (attempted) unique token. Your suggested format does not provide for any of these (or other) cases.
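
As a hedged illustration of the encoding point, the sketch below derives a distinct ETag per content encoding by appending a suffix inside the quotes. The suffix convention and the function name are assumptions for this example, not a description of what lighttpd does.

```python
from typing import Optional


def etag_for_encoding(base_etag: str, content_encoding: Optional[str]) -> str:
    """Return a per-encoding variant of a strong, quoted ETag."""
    if not content_encoding or content_encoding == "identity":
        return base_etag
    # Insert the suffix before the closing quote: "abc" -> "abc-gzip"
    return f'{base_etag[:-1]}-{content_encoding}"'
```

A fixed mtime-and-size format has no slot for this kind of variation, which is the gap the paragraph above points out.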

new config option like ETAG_LOAD_BALANCED

That is very presumptuous of you to name your personal preference as a "standard", especially with such a weak and coarse one-second precision and file size.

Since your focus is on static files and does not take into account any of the cases mentioned above, I am sorry, but I do not see a path to accepting any patches related to what you have described.

Instead, I suggest
a) use Cache-Control
b) if you for some reason must control the ETag, you can disable lighttpd generation of the ETag and can create an ETag of any format you like by writing some simple Lua code.
See mod_magnet or AbsoLUAtion with Lua code examples.

Aside: Similar to Busybox httpd, lighttpd is also lightweight and runs quite well on embedded devices, so I do not understand why you mentioned that in your post.

Actions #3

Updated by gstrauss over 3 years ago

The post https://stackoverflow.com/questions/47512043/how-etags-are-generated-and-configured/62996104#62996104 is incorrect about lighttpd.
ETag generation for static files can be disabled in lighttpd with

etag.use-inode = "disable" 
etag.use-mtime = "disable" 
etag.use-size  = "disable" 

Actions #4

Updated by gstrauss over 3 years ago

I have read your post to https://lists.w3.org/Archives/Public/ietf-http-wg/2020JulSep/0041.html and see that you are familiar with RFC 7232.

What you have not communicated to that mailing list or to here is why and how (with specific description of impact) this affects load balancers and is an established problem with sufficient impact that needs a solution. You have not made any good arguments for why there are no better alternatives -- (hint: there are) -- than to rewrite RFC 7232 with a "structured" ETag (your words) instead of an opaque token, and then weakening the ETag to the lowest-common-denominator of easily supported modification time and file size, which are very specific to static files and do not necessarily apply to other resources. ETag can be applied to all HTTP resources, and static files are a subset of those resources.

Actions #5

Updated by stokito over 3 years ago

Thank you for the quick reply.
"Opaque" means that the client shouldn't rely on its structure. In the same way, opaque tokens in OAuth mean that the client can only send them to a server. But tokens in OAuth OpenID Connect are no longer opaque and may be a structured JWT. So it doesn't mean that we must hide the info and hash it: the client will get the size and Last-Modified anyway.

Cache-Control is a different thing. ETag is used for cache revalidation, and when the time specified in Cache-Control expires the client will re-download the resource anyway.
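
For reference, revalidation with an ETag can be sketched roughly as below: the client echoes the stored value in If-None-Match, and the server answers 304 when it matches the current validator, so the body is not sent again. This is a simplified sketch that assumes the resource exists, splits the header on commas, and uses weak comparison; the function names are hypothetical.

```python
from typing import Optional


def strip_weak_prefix(etag: str) -> str:
    """Drop a leading W/ so weak and strong forms compare equal."""
    return etag[2:] if etag.startswith("W/") else etag


def revalidate(if_none_match: Optional[str], current_etag: str) -> int:
    """Return the status a server might send for a conditional GET."""
    if if_none_match is None:
        return 200  # unconditional request: send the full body
    if if_none_match.strip() == "*":
        return 304  # "*" matches any existing representation
    current = strip_weak_prefix(current_etag)
    for tag in if_none_match.split(","):
        # The value is compared as an opaque string; it is never parsed.
        if strip_weak_prefix(tag.strip()) == current:
            return 304
    return 200
```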

You are absolutely right that an ETag should be as unique as possible. At the same time, any hash code will decrease uniqueness by design. Nanoseconds also can't be a true source of uniqueness because 1) we may have two changes in the same nanosecond, and 2) it's not guaranteed that the last change will have a larger nanosecond value (this is not a TIMESTAMP).
As far as I remember, when Apache sees a request with a Date in the same second as the file's mtime, it just sends a weak ETag.
"Correctness" is something that client should decide. Maybe the client is not interested in such a level of correctness. Common sense here is that a client shouldn't request the same resource twice in the same second. Otherwise this is DDoS, a bug, or abuse. This may even be technically hard: send a request, download the resource, parse it, put it in the local cache, send another request. If a client wants to be sure that the "static" file didn't change then yes, it should use a digest ETag or just not include the ETag and always fetch the resource.

On the server side we must clearly specify how the ETag is generated and how unique it can be. Ideally, the client should not have to revisit this question after the web server changes.
I don't insist on rewriting RFC 7232 and making ETags non-opaque, only on adding a paragraph with a recommended default scheme for new authors.
Having a default scheme (even a non-ideal common denominator) allows better predictability, avoids unexpected pitfalls, transfers knowledge, avoids vendor lock-in and, more importantly, avoids real bugs and problems that we already have. Think about this issue as a bug for all web servers.

Nginx users are fine with one-second precision. If anybody needs something different then, as you said, they may use Lua and other configuration options. Some web servers like BB HTTPD and uhttpd (OpenWrt) just can't add configuration options, for performance and code size reasons.

What I'm asking is not to change the way Lighty generates the ETag, but to add a new flag for interoperability.

Actions #6

Updated by gstrauss over 3 years ago

Your response is so full of opinions and projected statements without evidence that it is impossible to continue this conversation.

Citations or GTFO.

You have no credibility to attempt to speak for so many different entities. Also, I find it hard to believe a large number of your projections which you have written as statements.

For example:

"Correctness" is something that client should decide.

First, that is an opinion. Second, its presumption is baseless.

RFC 7232 Introduction

   The conditional request mechanisms
   assume that the mapping of requests to a "selected representation" 
   (Section 3 of [RFC7231]) will be consistent over time if the server
   intends to take advantage of conditionals.

An origin server should communicate to the client what is acceptable (or not) for caching if the server intends to support conditional requests. Cache-Control is one way to do this. ETag strength is something that the origin server chooses for the given resource, which could be a file or could be a generated resource. The origin server chooses if and how to construct an ETag for a given resource, not the client. The ETag is opaque. The server can generate it differently for different resources depending on what is appropriate for validating that specific resource.

If a server does not provide ETag, a client might use Last-Modified if available, which has one-second precision. A client might make a HEAD request and check the size. Oh look, those are nearly equivalent to your poor ETag suggestion. If you want such a poor validator, you might consider stripping the ETag from responses.
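
To make the one-second precision concrete, here is a small sketch that formats two hypothetical modification times within the same second as Last-Modified values. It relies on the standard library's `email.utils.formatdate`; the helper name and timestamps are made up for the example.

```python
from email.utils import formatdate


def last_modified_header(mtime: float) -> str:
    """Format an mtime as an HTTP date; the fractional second is dropped."""
    return formatdate(int(mtime), usegmt=True)


first_write = 1_595_000_000.10
second_write = 1_595_000_000.90  # later write within the same second

indistinguishable = last_modified_header(first_write) == last_modified_header(second_write)
# indistinguishable is True: both writes yield the same Last-Modified value
```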

If the client does not receive cacheability information from the server, the client has a large blind spot in choosing whether or not a resource is still fresh, or how and when to revalidate.

Your "research" is poor. You have made numerous incorrect statements about lighttpd, including how to control or disable ETag generation in lighttpd, and even how to spell lighttpd.

Your opinions are worthless and your projected statements are misleading. Provide evidence to back up your claims.
