Let’s examine the retry pattern. Use the retry pattern to handle transient failures when calling a service or network resource. Requests fail temporarily for many reasons: a flaky network connection, a site reloading after a deployment, or data that hasn’t yet propagated to all instances.
The retry pattern encapsulates more than simply retrying: it also covers categorizing failures and choosing how to respond to each. If the failure is a rare error, retry the request immediately. If it’s a more common error, retry the request after some delay. If the issue indicates that the failure isn’t transient, for example invalid credentials that are unlikely to succeed on subsequent requests, cancel the request. Regardless of the type of issue, track attempts so that retries don’t continue endlessly. Implementing a circuit breaker in this situation can help limit the impact of a retry storm on a failed or recovering service.
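The categorization above can be sketched as a small retry loop. This is a minimal illustration, not a production implementation: the category names, the `classify` callback, and the parameter defaults are all assumptions, and a real system would plug in its own mapping from exceptions or status codes to categories.

```python
import random
import time

# Hypothetical failure categories; a real service maps its own
# exceptions or status codes into these buckets.
RETRY_NOW = "retry_now"        # rare, likely one-off error: retry immediately
RETRY_LATER = "retry_later"    # common transient error: back off first
DO_NOT_RETRY = "do_not_retry"  # not transient, e.g. invalid credentials

def call_with_retries(request, classify, max_attempts=4, base_delay=0.1):
    """Retry `request` according to how `classify` buckets each failure.

    `request` is a zero-argument callable; `classify` maps an exception
    to one of the categories above. Both names are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception as exc:
            category = classify(exc)
            if category == DO_NOT_RETRY or attempt == max_attempts:
                raise  # permanent failure, or retry budget exhausted
            if category == RETRY_LATER:
                # Exponential backoff with jitter, which spreads out
                # retries and helps avoid retry storms.
                time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())
            # RETRY_NOW falls through the loop and retries immediately.
```

Capping `max_attempts` is the tracking mentioned above; a full circuit breaker would additionally stop issuing requests entirely once failures cross a threshold.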
When operating a service, it’s important to track metrics around retries. Metrics can tell you a lot about your customers’ experiences with your service. It might be the case that the service needs to be scaled out or up. And even when the problem is on the client’s end, customers who repeatedly experience degraded service will attribute the problem to your service.
I managed a service that used Apache Traffic Server at one point. Apache Traffic Server did not handle Expect: request headers as specified by the HTTP/1.1 spec. Requests of 1 KB or larger from applications built with the cURL library incurred 1-second delays unless customers explicitly disabled the Expect: header. Compounding the issue, clients retried when they didn’t receive a response within the expected 50 ms, which led to retry storms. When I saw this pattern in my logs, I wrote up documentation and set up tools to catch the issue so that I could reach out to customers proactively.
When using a third-party service or external tool, make sure that you’re not layering retry logic. Retries layered at different levels, or cascading retries, can increase latency in situations where failing fast would have been the preferred behavior.
Many SDKs include retry configurations. Generally, there will be a maximum number of retries and a delay strategy: the interval before each subsequent retry may grow exponentially, grow incrementally, or be randomized.
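The three delay strategies can be compared side by side. This is a hedged sketch of the common formulas rather than any particular SDK’s implementation; the function names and the choice of full jitter for the randomized variant are assumptions.

```python
import random

def exponential(base, attempt):
    # Doubles each attempt: base, 2*base, 4*base, 8*base, ...
    return base * (2 ** (attempt - 1))

def incremental(base, attempt):
    # Grows linearly: base, 2*base, 3*base, ...
    return base * attempt

def randomized(base, attempt):
    # "Full jitter": a random delay between 0 and the exponential
    # value, which spreads out simultaneous retries from many clients.
    return random.uniform(0, exponential(base, attempt))
```

Whichever strategy the SDK uses, the maximum-retry count still bounds total wait time, so it’s worth computing the worst-case sum of delays when tuning these settings.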
Separate transient issues in logging so that they are warnings rather than errors. While they should be monitored, paging on something that self-resolves is a recipe for alert fatigue, especially when it pages folks during off-hours. There is nothing so frustrating as losing sleep over something that fixed itself.
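One way to enforce that separation is at the point where failures are logged. This is a minimal sketch using Python’s standard logging module; the logger name and the `transient` flag are assumptions, and in practice the flag would come from the same failure categorization used for retries.

```python
import logging

logger = logging.getLogger("my_service")  # hypothetical logger name

def log_failure(exc, transient):
    # Transient, self-resolving failures are warnings: visible on
    # dashboards but not worth a page. Only failures that need human
    # attention are logged as errors.
    if transient:
        logger.warning("transient failure, will retry: %s", exc)
    else:
        logger.error("permanent failure, needs attention: %s", exc)
```

Alerting can then page only on the error level, while warning volume feeds the retry metrics discussed above.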