The other day I was pairing with a colleague on a confusing bug in an nginx deployment. One of our proxy_pass directives pointed to the hostname of an AWS Application Load Balancer (ALB). After behaving normally in production for a few hours, the nginx server started returning 502s for all requests to one particular location block. Since the IP addresses behind an ALB hostname can change over time, stale DNS was the obvious suspect. It is documented that nginx does not re-resolve hostnames for proxy_pass entries set to static string values, so the first thing we confirmed was that the proxy_pass directive in question was configured with a variable value. The relevant parts of our nginx config looked something like this:

http {
  include /etc/nginx/logging.conf;

  server {
    listen 8000;
    resolver local=on; # openresty: this makes nginx use /etc/resolv.conf

    set $api_host      https://api.server.com;
    set $web_host      https://web.other-server.com;

    include /etc/nginx/conf.d/api-gateway.conf;
    include /etc/nginx/conf.d/errors.conf;
    include /etc/nginx/conf.d/healthcheck.conf;
    include /etc/nginx/conf.d/something.conf;
    include /etc/nginx/conf.d/something_else.conf;
    include /etc/nginx/conf.d/another_thing.conf;

    location /api {
      include /etc/nginx/conf.d/headers.conf;
      proxy_pass       $api_host;
      proxy_buffering  off;
      proxy_set_header Host $http_host;
    }
  }
}

After scratching our heads for a bit, we decided to reproduce the problem with a simple test setup. We created a new A record in development, pointed it at a known IP, and gave it a 1-second TTL. We set the nginx server to proxy to this subdomain and booted it up. With nginx running, we updated the DNS record. As the earlier incident suggested, nginx failed to pick up the change and stayed stuck on the old IP.

We then commented out most of the config and ended up with something like this:

http {
  # include /etc/nginx/logging.conf;

  server {
    listen 8000;
    resolver local=on; # openresty: this makes nginx use /etc/resolv.conf

    set $api_host      https://api.server.com;
    # set $web_host      https://web.other-server.com;

    # include /etc/nginx/conf.d/api-gateway.conf;
    # include /etc/nginx/conf.d/errors.conf;
    # include /etc/nginx/conf.d/healthcheck.conf;
    # include /etc/nginx/conf.d/something.conf;
    # include /etc/nginx/conf.d/something_else.conf;
    # include /etc/nginx/conf.d/another_thing.conf;

    location /api {
      include /etc/nginx/conf.d/headers.conf;
      proxy_pass       $api_host;
      proxy_buffering  off;
      proxy_set_header Host $http_host;
    }
  }
}

To our surprise, the test scenario now succeeded: nginx correctly followed the DNS record change after a few seconds. We then binary searched our way to the offending configuration by gradually reintroducing the includes, and landed on the following location block as the cause.

location /some-other-path {
  proxy_pass       https://api.server.com/other-path;
  proxy_buffering  off;
}

This was a second problematic use of proxy_pass in its own right, since it was set to a static string rather than a variable. What surprised us was that this directive, which pointed to the same subdomain as the one above, appeared to have “tainted” the other one. We confirmed that switching this second proxy_pass directive to a variable value fixes the DNS issue for both directives.
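A minimal sketch of that change, assuming the $api_host variable from the server block above is in scope for this location:

```nginx
location /some-other-path {
  # Any variable in proxy_pass makes nginx resolve the hostname at request
  # time through the configured resolver, rather than once at startup.
  # $api_host is the variable set in the server block shown earlier.
  proxy_pass       $api_host/other-path;
  proxy_buffering  off;
}
```

One caveat from the proxy_pass documentation: when the directive contains variables and also specifies a URI, that URI is passed to the upstream as is, replacing the original request URI. The sketch therefore sends every matching request to /other-path; anything after the /some-other-path prefix would have to be appended explicitly, for example with a regex location and a capture.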

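Our best hypothesis is that nginx decides how to resolve a hostname the first time it encounters that hostname in any proxy_pass directive. If the hostname appears as a static string, it is resolved only once and never again, regardless of whether it also shows up in a variable elsewhere. That would be consistent with the proxy_pass documentation, which says that a domain name supplied through a variable is first searched among the described server groups and only handed to the resolver if no match is found, but we have not verified that this is the mechanism at work. For now, this is just a guess.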
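For anyone who wants to poke at this, the setup boils down to something like the sketch below. The hostname dns-test.example.com and the two location paths are placeholders rather than our real config, and resolver local=on assumes OpenResty as above (stock nginx needs an explicit resolver address instead).

```nginx
events {}

http {
  server {
    listen 8000;
    resolver local=on; # OpenResty only: use the nameservers in /etc/resolv.conf

    # Placeholder test record with a short TTL.
    set $test_host https://dns-test.example.com;

    # Variable form: expected to be re-resolved at request time.
    location /variable {
      proxy_pass $test_host;
    }

    # Static form with the same hostname: resolved when the config is loaded.
    # In our testing, the presence of a block like this is what kept the
    # variable form above stuck on the old IP.
    location /static {
      proxy_pass https://dns-test.example.com;
    }
  }
}
```

If the hypothesis holds, /variable should only start following DNS changes once the /static block is removed or switched to a variable.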

Whatever the case, this is further evidence that setting proxy_pass to a static string is pretty much never a good idea.