Sunday, March 24, 2019

MessageBus – Automated Dead Letter Handling

Written by Martin Führlinger, Lead Engineer Backend

Introduction

In my previous posts about the message bus, I wrote about using RabbitMQ for decoupling our services and how we defined our message content, followed by an explanation of how we keep our receivers fast and resilient. Most recently in this series, I described how we handle dead letters with the dead letter exchange. That post already gave insight into how we manually cleaned up the dead letter queue at the time, and it contained some hints about a more sophisticated handling of dead letters. As some time has passed already, I am happy to announce that we have actually improved that handling quite a bit.

RabbitMQ Management Interface

The RabbitMQ management interface offers only a very basic overview of some RabbitMQ statistics. It shows just the headers and the encoded payload of the queued dead letters, without any ordering or grouping, so it is pretty difficult to find out which messages are currently in the dead letter queue. Also, requeueing a dead letter into a specific queue is not possible in that interface. So we decided to improve that and write our own small service to deal with dead letters.

Dead Letter Service

As mentioned in the last blog post, we wrote a simple script which requeues the messages of the dead letter queue. But because backend developers normally try to automate things, we wrote this new service to get rid of the script and the manual step.

This service basically just implements another consumer, listening to the dead letter queue and storing all received dead letters in a MongoDB database. Besides storing the message payload and all essential metadata like the headers or the routing key, we also store:

  • a unique ID for that message, containing the message ID and the queue it failed in
  • a dead flag (true/false)
  • a date-time when it was marked as dead
  • a redelivery count value

As the message ID and the failing queue do not change when republishing a message, they can be used to identify a message in our database.
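
To make the stored record more concrete, here is a minimal sketch of how such a document could be built. The field names, the ":" separator in the ID and the `build_dead_letter` helper are assumptions for illustration, not the actual schema of our service.

```ruby
# Hypothetical helper: combines message ID and failing queue into one key.
# Both values stay the same when a message is republished.
def dead_letter_id(message_id, queue)
  "#{message_id}:#{queue}"
end

# Builds the document to store. `existing` is the record already stored
# under the same ID (if any), so the redelivery count is carried over.
def build_dead_letter(message_id:, queue:, payload:, headers:, routing_key:,
                      existing: nil, now: Time.now.utc)
  {
    _id: dead_letter_id(message_id, queue),
    queue: queue,
    payload: payload,
    headers: headers,
    routing_key: routing_key,
    dead: true,                 # it just arrived in the dead letter queue
    dead_at: now,               # when it was marked as dead
    redelivery_count: existing ? existing[:redelivery_count] : 0
  }
end
```

Because the ID is stable, a message that fails again after a retry updates the existing record instead of creating a new one.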

Retrying

A dead letter is automatically retried by this service, which means it is pushed a few times to the specific queue in which it failed before. Pushing to the specific queue is essential, as publishing the same message with the same topic would cause all consumers of that topic to get that message again (see using RabbitMQ for decoupling our services about topic/routing_key usage). If the message still cannot be consumed, it is not-acknowledged and ends up in the dead letter queue again, which causes it to be received by our service once more. To be able to push to a queue directly, without using the topic, you need to connect to the default_exchange instead of the normal exchange you may use for message receiving (e.g. the one named production exchange in our case).

# Default direct exchange: it can route to every queue
# by using routing_key == queue_name.
def exchange
  @exchange ||= channel.default_exchange
end

# One channel, memoized and reused for all republished messages.
def channel
  @channel ||= connection.create_channel
end

# Connection to the RabbitMQ broker via March Hare (JRuby client).
def connection
  @connection ||= MarchHare.connect(
    host:      config[:host],
    ...
  )
end
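
The retry step itself could look roughly like the following sketch. The limit of 3, the record fields and the `retry_dead_letter` helper are assumptions (the service only retries "a few times"), and the `routing_key:` and `properties:` options reflect my understanding of March Hare's publish call, so please check them against the client version you use. The exchange is passed in, so any object responding to `publish` works.

```ruby
MAX_REDELIVERIES = 3 # assumed limit, not the service's real value

# record:   hash with :message_id, :queue, :payload, :redelivery_count, :dead
# exchange: the default exchange (or any object responding to #publish)
def retry_dead_letter(record, exchange, max: MAX_REDELIVERIES)
  # Give up once the limit is reached: the message stays marked as dead.
  return record.merge(dead: true) if record[:redelivery_count] >= max

  exchange.publish(
    record[:payload],
    routing_key: record[:queue], # default exchange: routing_key == queue name
    properties: { message_id: record[:message_id] }
  )

  # Assume success for now. If the consumer rejects the message again,
  # it comes back through the dead letter queue and is marked dead again.
  record.merge(dead: false, redelivery_count: record[:redelivery_count] + 1)
end
```

Keeping the original message ID on the republished message is what lets the service match a repeated failure to the record it already stored.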

Querying

Storing dead messages in a database also enables us to query on various attributes. We can, for example, list all dead messages from a single queue/topic, or check which messages died within the last 3 days. You can imagine that a variety of interesting groupings and filters are possible. With that in mind, we implemented two views of the data. One of them imitates the sidekiq-cleaner or resque-cleaner views. It shows the number of dead messages grouped by queue and time period.
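
As an illustration of such queries, the following sketch filters and groups records in plain Ruby. In the service these would be database queries; the in-memory array, the field names and the helper names are assumptions made to keep the example self-contained.

```ruby
DAY = 24 * 60 * 60 # seconds

# All messages that are still dead and died at or after `since`.
def dead_since(records, since)
  records.select { |r| r[:dead] && r[:dead_at] >= since }
end

# Cleaner-style overview: number of dead messages per [queue, day].
def dead_counts_by_queue_and_day(records)
  records
    .select { |r| r[:dead] }
    .group_by { |r| [r[:queue], r[:dead_at].strftime("%Y-%m-%d")] }
    .transform_values(&:size)
end
```

Calling `dead_since(records, Time.now.utc - 3 * DAY)` would answer the "died within the last 3 days" question from above.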

The second view lists all queues and their respective dead and undead (successfully redelivered) message counts.
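
The per-queue counts of this view can be sketched in a few lines. Again this works on an in-memory array with assumed field names, purely for illustration, while the service reads the data from the database.

```ruby
# Returns e.g. { "q1" => { dead: 2, undead: 1 } }
def counts_per_queue(records)
  empty = Hash.new { |hash, queue| hash[queue] = { dead: 0, undead: 0 } }
  records.each_with_object(empty) do |record, counts|
    counts[record[:queue]][record[:dead] ? :dead : :undead] += 1
  end
end
```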

Clicking on the counts opens the corresponding index page listing all dead letters which are in that queue.

This page also enables us to requeue a single message. Clicking on the ID of that entry, which is a combination of the message ID (a UUID) and the queue the message was queued in, opens the detail page of the dead letter.

This page shows the detailed information about the dead letter: the headers, routing key, flags and, possibly most important, the payload of the message, which is normally the main reason why a message cannot be processed and is rejected.

Further Improvements

The current implementation already helps a lot in day-to-day business, but of course there are many possible improvements. As already mentioned, a more complex check for the retry would make sense, for example to allow retrying messages of some queues more often than others. The following are just some of the many possible improvements:

  • More flexible retry conditions
  • Delayed retry using an asynchronous worker (e.g. sidekiq)
  • Automatic cleanup of redelivered messages (e.g. delete all successfully redelivered messages after a certain amount of time)
  • Bulk retry (e.g. retry all dead messages of a specific queue or topic)
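
To give an idea of what more flexible retry conditions might look like, here is a small sketch of a per-queue retry policy. None of this exists in the service yet; the queue name "payments", the limits and the delay are made up for the example.

```ruby
RetryPolicy = Struct.new(:max_retries, :delay_seconds, keyword_init: true)

# Fallback for every queue without its own entry (assumed values).
DEFAULT_POLICY = RetryPolicy.new(max_retries: 3, delay_seconds: 0)

# Hypothetical per-queue overrides.
POLICIES = {
  "payments" => RetryPolicy.new(max_retries: 10, delay_seconds: 60)
}.freeze

def policy_for(queue)
  POLICIES.fetch(queue, DEFAULT_POLICY)
end

# True while the message may still be retried according to its queue's policy.
def retry?(record)
  record[:redelivery_count] < policy_for(record[:queue]).max_retries
end
```

The `delay_seconds` value could later be handed to an asynchronous worker to implement the delayed retry.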

Conclusion

Introducing this service enabled us to check how many messages haven’t been delivered over time, as well as to inspect the content of these messages. It also allows us to requeue single dead letters with a few clicks instead of using a script. Looking toward the future, the new service is a great foundation for further improvements.
