Picture this: You’ve just shipped a shiny new feature in your product, and you’re waiting eagerly for users to engage with it. What’s the magic ingredient that makes sure they don’t forget about it? Yep, notifications! Be it push notifications, emails, or instant messages, these little nudges are key to keeping your customers hooked and informed. But what happens when these notifications fail? (Spoiler: it’s not pretty.) That’s why having a robust notification retry system is an absolute must-have.

Why Notifications Matter (and Why They Sometimes Don’t Show Up)

Notifications are like your product’s personal assistant—delivering messages swiftly and efficiently. Whether it’s to remind users of something important, lure them back into your app, or just keep them in the loop, notifications do the heavy lifting.

Now, imagine a world where notifications fail to deliver. (I know, terrifying!) Whether it’s due to a network glitch, an overwhelmed server, or just one of those mysterious tech gremlins, failed notifications are a reality we all have to face. And as a product owner, you’re left asking, “Now what?”

Welcome to the Wonderful World of Retrying Notifications

When it comes to handling failed notifications, you’re in luck if you’re using cloud services like AWS, Azure, or GCP. AWS, for instance, offers Dead Letter Queues (DLQs) for SQS and EventBridge, which set aside messages that could not be processed or delivered so you can deal with them later. But here’s the catch: these solutions are cloud-specific. So, if you decide to switch clouds or use a different platform, you might find yourself starting from scratch. Ouch!

Enter: a solid, cloud-agnostic system design. This is your golden ticket to ensuring that notifications get retried no matter where your app is hosted. Let’s break it down.

Step 1: Detecting a Failed Notification (The “Uh-Oh” Moment)

Before you can retry a notification, you need to know that it failed in the first place. Luckily, most notification modules give you a heads-up. Typically, they’ll return a response: sometimes an ID you can check on later, and sometimes an immediate thumbs up or down (success or failure). Handling the latter case is fairly straightforward, so that’s where we’ll start. (Don’t call me a slacker, LOL; we’ll probably cover the former case in another blog post.)
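Here’s a minimal sketch of that “uh-oh” detection for the immediate thumbs-up-or-down case. Everything in it is an assumption for illustration: the names SendStatus, SendResult, and attempt_send are made up, and the provider is modeled as a plain callable that returns True or False (or raises). Your real notification module will look different.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class SendStatus(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    PENDING = "pending"  # provider only returned a tracking ID (not handled here)


@dataclass
class SendResult:
    status: SendStatus
    tracking_id: Optional[str] = None
    error: Optional[str] = None


def attempt_send(send: Callable[[str, str], bool], recipient: str, body: str) -> SendResult:
    """Call a provider that answers immediately and normalise the outcome.

    `send` is a hypothetical stand-in for your notification module.
    """
    try:
        ok = send(recipient, body)
    except Exception as exc:  # network glitch, timeout, provider error...
        return SendResult(SendStatus.FAILED, error=str(exc))
    return SendResult(SendStatus.SUCCESS if ok else SendStatus.FAILED)
```

Treating an exception the same as an explicit “no” keeps the rest of the system simple: whatever went wrong, the record ends up marked as failed and becomes a candidate for a retry.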

Step 2: Crafting the Perfect Retry Mechanism (Because One Size Doesn’t Fit All)

Not all retries are created equal. You might have specific requirements like “retry up to 5 times” or “use exponential backoff.” (No, that’s not a term from a sci-fi movie—it’s a fancy way of saying you wait a little longer between each retry.) Your system should be flexible enough to handle these nuances.
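To make “exponential backoff with a cap on attempts” concrete, here’s a small sketch. The numbers are assumptions picked for illustration (a 60-second base delay, a one-hour ceiling, and 5 attempts in total), and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_TRIES = 5            # assumption: total attempts, including the first one
BASE_DELAY_SECONDS = 60  # assumption: wait one minute after the first failure


def backoff_delay(num_try: int, base: int = BASE_DELAY_SECONDS, cap: int = 3600) -> int:
    """Seconds to wait after the num_try-th failed attempt (1-based).

    The delay doubles with each attempt and never exceeds `cap`.
    """
    return min(cap, base * 2 ** (num_try - 1))


def next_retry_at(num_try: int, last_sent_at: datetime) -> Optional[datetime]:
    """When to try again, or None once the attempts are used up."""
    if num_try >= MAX_TRIES:
        return None
    return last_sent_at + timedelta(seconds=backoff_delay(num_try))
```

With these values the waits after attempts 1 through 5 would be 60, 120, 240, 480, and 960 seconds. In practice you might also add a bit of random jitter so that many failed notifications don’t all retry at the same instant.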

Step 3: Keeping Track of Everything (Because Auditing is King)

Once you start retrying notifications, you’ll want to know what’s going on. An audit system lets you inspect which notifications got sent successfully and which ones didn’t make it. For the failed ones, especially if they’re urgent notifications—like password reset emails or critical account alerts—you might need to debug the issue quickly or even manually reach out to your customers. After all, nobody likes being left in the dark when it comes to important stuff. (On the other hand, if it’s just a promotional message, you might be able to breathe easier and just let it go.)
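One way to keep that history is a table of events that you only ever add to. The sketch below uses SQLite purely because it ships with Python; the table name, the columns, and the helper functions are all assumptions, and the triggers that block updates and deletes are just one possible way to enforce “append-only” at the database level.

```python
import sqlite3
from datetime import datetime, timezone

AUDIT_SCHEMA = """
CREATE TABLE notification_events (
    id              INTEGER PRIMARY KEY,
    notification_id INTEGER NOT NULL,
    event           TEXT NOT NULL,     -- e.g. 'sent', 'failed', 'dead'
    urgent          INTEGER NOT NULL,  -- 1 for password resets, critical alerts
    detail          TEXT,
    created_at      TEXT NOT NULL
);
CREATE TRIGGER notification_events_no_update
BEFORE UPDATE ON notification_events
BEGIN
    SELECT RAISE(ABORT, 'notification_events is append-only');
END;
CREATE TRIGGER notification_events_no_delete
BEFORE DELETE ON notification_events
BEGIN
    SELECT RAISE(ABORT, 'notification_events is append-only');
END;
"""


def record_event(conn, notification_id, event, urgent, detail=None):
    """Append one event; existing rows are never modified."""
    conn.execute(
        "INSERT INTO notification_events "
        "(notification_id, event, urgent, detail, created_at) VALUES (?, ?, ?, ?, ?)",
        (notification_id, event, int(urgent), detail,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def urgent_failure_events(conn):
    """Failure events for urgent notifications, oldest first."""
    return conn.execute(
        "SELECT notification_id, event, detail FROM notification_events "
        "WHERE urgent = 1 AND event IN ('failed', 'dead') ORDER BY id"
    ).fetchall()
```

A query like urgent_failure_events is the kind of thing you’d wire into a dashboard or an alert, so the urgent failures get a human’s attention while the promotional ones can quietly rest.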

The Secret Sauce: Putting It All Together

So, how do you design this notification retry system? Here’s a recipe:

Maintain State: You need to keep a record of what’s happening with each notification. This is where a good ol’ database comes into play. Store who the notification was sent to, when it was sent, and whether it succeeded, failed, or is still pending. (Of course, you might want to add more fields based on your exact use case, but I believe the ones mentioned are absolutely critical.)
Automate with Cron Jobs: Set up a cron job (or a scheduled task) that regularly checks your database. It should look for records that have failed or are pending and try sending those notifications again. (You can skip the successful ones—they’ve already lived their best lives.)
Handle Definite Statuses: If the notification module gives you a clear success or failure response, life is easy. Your cron job simply updates the status in the database accordingly. For failures, it retries; for successes, it moves on.
Implement Retry Logic: If your retry mechanism involves fancy algorithms (like exponential backoff with a max of 5 retries), store that information in the database too. Include fields like numTry and lastSentAt, so your cron job knows when to retry. After the max retries, if it still fails, give it a “dead” status. (RIP, notification.) Dead records are left alone.
Audit Like a Pro: By using a database, you automatically get a ‘sort-of’ audit trail. But if you want to level up, make it a proper append-only audit system (like a ledger), so you never lose track of what happened at each event. Depending on your database, you might use something like pgAudit for Postgres or the auditing features built into some NoSQL databases like MongoDB (check which editions include them).
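The recipe above can be sketched in a few dozen lines. Treat this as an illustration under assumptions, not a finished implementation: it uses SQLite so it runs anywhere, the column names num_try and last_sent_at are snake_case versions of the numTry and lastSentAt fields mentioned above, the send callable stands in for your notification module, and the 5-attempt cap and one-minute base delay are example values. The function run_retry_pass is what your cron job (or scheduled task) would call on each run.

```python
import sqlite3
from datetime import datetime, timedelta

MAX_TRIES = 5                      # assumption: total attempts before "dead"
BASE_DELAY = timedelta(minutes=1)  # assumption: first wait, doubled each time

SCHEMA = """
CREATE TABLE IF NOT EXISTS notifications (
    id           INTEGER PRIMARY KEY,
    recipient    TEXT NOT NULL,
    body         TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',  -- pending | failed | success | dead
    num_try      INTEGER NOT NULL DEFAULT 0,
    last_sent_at TEXT                              -- ISO-8601, NULL until first attempt
);
"""


def is_due(num_try, last_sent_at, now):
    """True if the record was never attempted or its backoff wait has elapsed."""
    if last_sent_at is None:
        return True
    wait = BASE_DELAY * 2 ** (num_try - 1)
    return now >= datetime.fromisoformat(last_sent_at) + wait


def run_retry_pass(conn, send, now):
    """One cron run: attempt every pending/failed record that is due."""
    rows = conn.execute(
        "SELECT id, recipient, body, num_try, last_sent_at FROM notifications "
        "WHERE status IN ('pending', 'failed')"
    ).fetchall()
    for id_, recipient, body, num_try, last_sent_at in rows:
        if not is_due(num_try, last_sent_at, now):
            continue
        try:
            ok = send(recipient, body)
        except Exception:
            ok = False  # treat provider errors as a failed attempt
        num_try += 1
        if ok:
            status = "success"
        elif num_try >= MAX_TRIES:
            status = "dead"  # RIP, notification: never selected again
        else:
            status = "failed"
        conn.execute(
            "UPDATE notifications SET status = ?, num_try = ?, last_sent_at = ? "
            "WHERE id = ?",
            (status, num_try, now.isoformat(), id_),
        )
    conn.commit()
```

Successful and dead records drop out of the SELECT on their own, so the job never touches them again. If more than one copy of the job can run at the same time, you’d also need some form of locking or row claiming so the same notification isn’t sent twice; that’s left out here to keep the sketch short.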

Pros and Cons of the Notification Retry Approach (Because Every System Has Its Quirks)

So, you’ve got your shiny new notification retry system all set up. It’s sleek, it’s smart, but like everything in life, it’s not perfect. Let’s take a quick look at the pros and cons of this approach.

Pros (The “Yes, We’re Winning!” Moments)

Cloud-Agnostic Flexibility: Whether you’re on AWS today, Azure tomorrow, or GCP next week, this approach doesn’t tie you down. You’re not stuck with cloud-specific tools, giving you the freedom to move around without redoing your entire notification system. (Freedom feels good, right?)
Audit-Trail Awesomeness: By maintaining a database, you automatically get a record of every notification that was sent, failed, retried, or, well, died. This makes troubleshooting easier and helps you keep track of what’s going on under the hood. Plus, your auditors will love you.
Custom Retry Logic: Need to retry notifications with exponential backoff? No problem. Want to cap retries at 5 times? Easy peasy. This system is built to handle whatever retry logic you throw at it—no rigid rules here.
No Message Queues Needed: You can skip the complexity of message brokers or queues because the cron job handles the retry logic asynchronously. Fewer moving parts mean fewer headaches (and fewer things that can break at 2 a.m.).
Scalable: As your product grows and you need to send more notifications, this approach scales. Just beef up your database and cron job processing power! There is a limit to this beefing up, though, which we cover in the ‘Cons’ section.

Cons (The “Oops, Maybe We Should’ve Thought About That” Moments)

Cron Job Load: While cron jobs are great, they can become a burden if you’re handling a high volume of notifications. If your database gets too large, the cron job might struggle to keep up, potentially causing delays. (Nobody likes a lagging system, especially at scale.)
Database Management: Maintaining state in a database means more database maintenance—indexing, pruning old records, optimizing queries. It’s manageable, but it’s an extra task on your plate. And yes, DBAs will have opinions on this.
Retry Logic Complexity: The more complex your retry requirements, the more logic you need to implement. While the system can handle it, you’ll need to ensure you’re testing all the edge cases. Complex logic = potential for bugs. (And bugs are never fun.)
Manual Interventions: When a notification fails and it’s urgent, someone may need to step in manually. While this is better than leaving customers in the dark, it does introduce a manual component to your otherwise automated system. It’s a necessary evil, but still, it’s not ideal.
Latency: Since the cron job runs at set intervals, there could be a slight delay between when a notification fails and when it’s retried. If near-instant retries are crucial for your use case, you might need to tweak the cron job frequency—or consider adding some real-time processing.
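A couple of these cons (cron job load and query cost in particular) can be softened with two small habits: index the columns the job filters on, and process due records in bounded batches instead of scanning everything. Here’s a hedged sketch; the retry_queue table, the precomputed next_retry_at column, and the batch size of 100 are assumptions for illustration, not part of the design described above.

```python
import sqlite3

QUEUE_SCHEMA = """
CREATE TABLE retry_queue (
    id            INTEGER PRIMARY KEY,
    status        TEXT NOT NULL,   -- pending | failed | success | dead
    next_retry_at TEXT NOT NULL    -- ISO-8601 UTC, precomputed after each attempt
);
CREATE INDEX idx_retry_queue_due ON retry_queue (status, next_retry_at);
"""


def fetch_due_batch(conn, now_iso, batch_size=100):
    """IDs of at most batch_size records that are due, most overdue first.

    ISO-8601 timestamps in one consistent format sort correctly as text,
    so a plain string comparison is enough here.
    """
    rows = conn.execute(
        "SELECT id FROM retry_queue "
        "WHERE status IN ('pending', 'failed') AND next_retry_at <= ? "
        "ORDER BY next_retry_at LIMIT ?",
        (now_iso, batch_size),
    ).fetchall()
    return [row[0] for row in rows]
```

Storing the next retry time up front means the database does the “is it due yet?” filtering, so each run only pulls the rows it will actually work on. It won’t remove the latency of a scheduled job, but it does keep each run short enough that you can afford to schedule it more often.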

In the end, no system is without its trade-offs, but understanding these pros and cons helps you make informed decisions. With this approach, you get flexibility and robustness, but be prepared for a bit of database management and cron job care. It’s all part of the deal, and if done right, your notifications will be flying high, retrying like pros! 🎉

In Conclusion: Retrying Notifications Doesn’t Have to Be Hard (But It Should Be Smart)

Failed notifications are inevitable, but with a robust retry system in place, they don’t have to be a headache. Whether you’re using AWS, Azure, GCP, or something else entirely, designing your system to be flexible, reliable, and audit-friendly will save you countless hours (and a lot of frustration).

So go forth, and let your notifications shine—whether they get there on the first try or the fifth!