Delayed processing in RabbitMQ with Dead Letter Exchanges
In 2014 we were in the process of migrating from our Rails monolith to a service-oriented architecture (or microservices). In Rails we had used DelayedJob to schedule code to be executed at a later time and to run asynchronously from the current request context. For our microservices architecture we needed something that served the same purpose. We evaluated several solutions, with RabbitMQ winning out in the end - mainly because operating it seemed simpler than the other options and it covered all our other requirements.
During the evaluation we noticed very quickly that a message that failed to be processed would immediately be retried. When a message gets rejected and requeued, it is put back into the queue. If the queue is otherwise empty, the message is consumed again immediately. If the cause of the processing failure still exists, it will fail again... and again... and again...
In the testing environment this created a lot of log output and a small self-inflicted denial of service.
With DelayedJob we could catch the errors and simply schedule the processing to be retried a few minutes later. We knew we needed similar behaviour to better manage the processing of messages in our system: we needed to be able to process messages not immediately, but after a small delay.
We came across dead letter exchanges - which I also referenced in a previous blog post about RabbitMQ cluster migration.
Dead letter exchanges can be configured per queue to define where a message goes when it is rejected (without requeueing), when its time-to-live (TTL) has expired, or when the queue is full. (For more details check out the RabbitMQ documentation on the subject.)
The name 'dead letter exchange' somewhat suggests that it should be used for messages that should no longer be processed automatically but might need to be inspected by a human.
Delayed Processing
Having something happen not immediately but rather after a certain delay was exactly what we were looking for. The behaviour we had in mind was the following: if a message could not be processed due to an error, try it again after e.g. 10 seconds, 1 minute, or 5 minutes.
This can be solved by using a dead letter exchange:
When the processing of a message failed, we decided to publish the message again into another queue. We call these queues "delay queues", since that is exactly their purpose for us. A delay queue has a message TTL configured on it, so that messages will not wait in there forever. It also has a dead letter exchange configured, through which expired messages are passed back into the original queue, which has the consumer attached to it. This in effect allows us to delay processing the message again until it has expired in the delay queue.
Note that any exchange can be utilized as a dead letter exchange. We tend to just use the default exchange, because we have a specific queue in mind into which the message should be passed again.
The dead letter exchange on the delay queue settings would then look like this:
x-dead-letter-exchange: "" // default exchange
x-dead-letter-routing-key: <original queue name>
x-message-ttl: <delay in milliseconds>
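
To make this a bit more concrete, here is a rough sketch of how such a delay queue could be declared with the Python pika client. The queue names and the 60 second delay are placeholders for this example, not our actual setup.

import pika

# Placeholder names for this sketch.
PROCESSING_QUEUE = "entry_import"
DELAY_QUEUE = "entry_import_dlq_60"

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# The processing queue itself needs no special arguments.
channel.queue_declare(queue=PROCESSING_QUEUE, durable=True)

# The delay queue: expired messages are dead-lettered through the
# default exchange ("") back into the processing queue.
channel.queue_declare(
    queue=DELAY_QUEUE,
    durable=True,
    arguments={
        "x-dead-letter-exchange": "",                   # default exchange
        "x-dead-letter-routing-key": PROCESSING_QUEUE,  # original queue name
        "x-message-ttl": 60000,                         # delay in milliseconds
    },
)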
Above I also mentioned that rejecting a message can be used to pass the message through a dead letter exchange. We decided against using this because we did not want to have to configure this on the processing queue.
In RabbitMQ a queue has to be deleted and recreated to change its settings while keeping the same name. To avoid adding versioning to our queue names, we went with the simplest setup for the processing queue - one that was unlikely to change.
One concrete reason for this was that we found out that individual per-message TTLs do not work the way we thought they would: RabbitMQ only checks the message at the head of a queue for expiry. If the first message has a 50 second TTL and the second message has a 10 second TTL, the second message still has to wait the full 50 seconds before it actually leaves the queue.
We therefore went for a setup that was easier for us to reason about: a queue-wide message TTL. In order to accommodate changing the delay time later on, we decided to add the number of seconds to be delayed into the queue name.
The naming pattern this resulted in was then <original_queue_name>_dlq_<seconds of delay>. This also makes it easy to understand how long a message will wait without having to check the queue properties.
With this setup, we can easily retry e.g. importing an entry a minute later without much effort. For cases where a serious problem cannot be resolved automatically through a retry, it is still recommended to have some logic in place to eventually abort processing and inform someone to take a look at it.
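
As an illustration of what that can look like, here is a sketch of a consumer that re-publishes a failed message into the delay queue and gives up after a number of attempts. The header name x-retry-count, the retry limit and the process/notify_someone placeholders are assumptions for this example, not our exact code.

import pika

PROCESSING_QUEUE = "entry_import"   # placeholder names as above
DELAY_QUEUE = "entry_import_dlq_60"
MAX_RETRIES = 5                     # assumed limit for this sketch

def process(body):
    ...  # the actual work for this message (placeholder)

def notify_someone(body):
    ...  # alert a human to take a look (placeholder)

def handle(channel, method, properties, body):
    headers = properties.headers or {}
    retries = headers.get("x-retry-count", 0)
    try:
        process(body)
    except Exception:
        if retries >= MAX_RETRIES:
            # Stop retrying and have someone take a look instead.
            notify_someone(body)
        else:
            # Publish into the delay queue; the TTL plus dead letter
            # exchange bring the message back to the processing queue later.
            channel.basic_publish(
                exchange="",
                routing_key=DELAY_QUEUE,
                body=body,
                properties=pika.BasicProperties(
                    headers={"x-retry-count": retries + 1},
                    delivery_mode=2,  # persistent
                ),
            )
    # Acknowledge the original delivery either way; the retry now lives
    # in the delay queue.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue=PROCESSING_QUEUE, on_message_callback=handle)
channel.start_consuming()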
Differing delay periods
Handling delayed processing with a single delay queue works well for use cases where the delay is always the same. For cases where the delay differs per message it gets a bit more complicated.
As mentioned above, per-message TTLs have some caveats to consider when working with significantly differing delay periods. An example where this can cause issues is webhooks with automatic retries and exponential backoff, where the delay automatically increases with each subsequent retry.
When a message that has already failed more often - and therefore has a longer delay - ends up in the delay queue, it blocks the messages behind it from being processed at the time they should be processed.
One solution for this problem is to have multiple delay queues, each passing messages back into the processing queue after a different amount of time.
This would work and might be the simplest solution in code. It would also create a lot of queues in the RabbitMQ cluster, depending on the number of different delays that need to be covered.
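
For completeness, this is roughly what that multi-queue variant could look like, reusing the naming pattern from above. The queue name and the set of delays are just examples.

import pika

PROCESSING_QUEUE = "webhook_delivery"     # placeholder name
DELAYS_IN_SECONDS = [60, 120, 240, 480]   # example backoff steps

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=PROCESSING_QUEUE, durable=True)

# One delay queue per backoff step, each dead-lettering back into the
# processing queue after its own TTL.
for delay in DELAYS_IN_SECONDS:
    channel.queue_declare(
        queue=f"{PROCESSING_QUEUE}_dlq_{delay}",
        durable=True,
        arguments={
            "x-dead-letter-exchange": "",
            "x-dead-letter-routing-key": PROCESSING_QUEUE,
            "x-message-ttl": delay * 1000,
        },
    )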
We did not want to have all those queues around and instead decided on an approach that required a bit more programming on our part.
Instead of multiple delay queues we went with three queues: the processing queue, a delay queue and a busy waiting queue.
The processing queue, as above, is where the consumer is attached that executes the action we want to have executed for this message.
The delay queue is again responsible for delaying the processing. One important detail is that its delay period is the greatest common divisor of all the different delays you want to be able to handle. For example, if you want to handle 1 minute, 2 minutes, 4 minutes, 8 minutes, ..., then the greatest common divisor is 1 minute. The other important detail is that this queue does not pass messages back into the processing queue, but into the busy waiting queue instead.
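
Declared with the same assumed names as before, the topology could look roughly like this. The important difference to the earlier sketch is that the delay queue now dead-letters into the busy waiting queue rather than into the processing queue.

import pika

PROCESSING_QUEUE = "webhook_delivery"       # placeholder names
WAITING_QUEUE = "webhook_delivery_waiting"
DELAY_QUEUE = "webhook_delivery_dlq_60"     # TTL = greatest common divisor, here 1 minute

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.queue_declare(queue=PROCESSING_QUEUE, durable=True)
channel.queue_declare(queue=WAITING_QUEUE, durable=True)

# After one GCD period the message is dead-lettered into the busy
# waiting queue, where a consumer decides where it goes next.
channel.queue_declare(
    queue=DELAY_QUEUE,
    durable=True,
    arguments={
        "x-dead-letter-exchange": "",
        "x-dead-letter-routing-key": WAITING_QUEUE,
        "x-message-ttl": 60000,
    },
)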
The busy waiting queue has another consumer attached to it, whose sole purpose is to check whether the message should wait for another period in the delay queue or whether it should be passed back into the processing queue. To achieve this we added a property to the message with the expiration timestamp.
We specifically decided to use a timestamp here, as for a really busy queue it might take more than 1 minute to process each message. We did not want to add additional delay/wait time in the delay queue if the message had already spent its time waiting in the busy waiting queue.
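
The busy waiting consumer itself can then stay very small. The sketch below assumes the due time is stored in a header called x-process-at as a Unix timestamp; the header name is an assumption for this example.

import time

import pika

PROCESSING_QUEUE = "webhook_delivery"       # placeholder names as above
WAITING_QUEUE = "webhook_delivery_waiting"
DELAY_QUEUE = "webhook_delivery_dlq_60"

def route(channel, method, properties, body):
    headers = properties.headers or {}
    # Unix timestamp at which the message is due (assumed header name).
    process_at = headers.get("x-process-at", 0)
    if time.time() >= process_at:
        # The full delay has passed: hand the message back for processing.
        target = PROCESSING_QUEUE
    else:
        # Not due yet: send it through the delay queue for another period.
        target = DELAY_QUEUE
    channel.basic_publish(
        exchange="",
        routing_key=target,
        body=body,
        properties=pika.BasicProperties(headers=headers, delivery_mode=2),
    )
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue=WAITING_QUEUE, on_message_callback=route)
channel.start_consuming()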
With this setup, we can easily change the delay as long as it is a multiple of the delay queue's wait time used for the exponential backoff. There is nothing we have to adapt in the fixed-delay consumer. Additionally, the processing overhead is minimal, as we have all the information in the message and its headers; we do not have to load any additional data.
This approach has been running nicely in our system for a few years now and we have never had issues with it.