ALB is a layer7 load balancer on AWS. It offers more features than ELB. You can find out more about it on https://aws.amazon.com/elasticloadbalancing/applicationloadbalancer/
Recently, I was asked to look into a random 504 issue with ELB. Because there is not a lot to configure on ELB and it’s pretty much a black box, I wasn’t making any headway there. The client believes their code is solid and this is an ELB issue. In fact, others have reported random 504 problems and they believe when ELB reloads, states are lost.
In this post, I want to demonstrate the request tracing feature which will be handy for troubleshooting timeout errors. ALB inserts a header (X-Amzn-Trace-Id) into each request with a unique id. This stamp allows admins or developers to trace the request to web servers.
To enable it, first we need to enable ALB access log. It store logs to s3 bucket periodically. Next, configure apache’s access log to write the trace id. I add that to the combined LogFormat
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"TRACE_%{X-Amzn-Trace-Id}i\"" combined
I put a simple php file on apache, which sleeps 80 seconds before printing phpinfo().
My ALB is configured with an idle timeout of 60 seconds (default). When I open that test page, I get a 504 error after 1 minute as expected:
$ time curl -w %{http_code} -s -o /dev/null http://blah-alb-test-1234567.ap-southeast-1.elb.amazonaws.com/sleep.php 504 real 1m0.147s user 0m0.006s sys 0m0.009s
Now let’s look at the logs. ALB’s log for this request:
"Root=1-58668f05-6aa993624007fa8e279482c0"http 2016-12-30T16:47:06.403348Z app/blah-alb-test/f478d01b63e4d7df 1.2.3.4:49806 10.0.7.229:80 0.001 -1 -1 504 504 131 341 "GET http://blah-alb-test-1234567.ap-southeast-1.elb.amazonaws.com:80/sleep.php HTTP/1.1" "curl/7.47.0" - - arn:aws:elasticloadbalancing:ap-southeast-1:531700196402:targetgroup/blah-alb-targets/5889498795ae58e4 "Root=1-58668f4e-0eb5767b225ce20e2dffacd2"
Lots of fields but what I’m interested in are the response code (504) and the trace id (Root=1-58668f4e-0eb5767b225ce20e2dffacd2). Grepping that id in apache’s access log tells me the page returned normally:
10.0.40.163 - - [30/Dec/2016:16:46:06 +0000] "GET /sleep.php HTTP/1.1" 200 43132 "-" "TRACE_Root=1-58668f4e-0eb5767b225ce20e2dffacd2"
If this is a real-world application, it probably means the web server or the code is too slow. ALB sent timeout to the client before it received a reply from the backend web server. If the response code on apache is not 200, well you know what that means. Fix your code before blaming the load balancers!