How to solve a problem like locking

The Duolingo app is famous for encouraging users to do their lessons. While Duo the owl’s methods vary, the most effective method for reminding users about their learning goals is via notifications. In order to schedule and send hundreds of millions of notifications each day, we use a system known as comeback.

When a user does things like finishing a lesson, we will calculate when we should send them reminders for the following days and store that schedule of reminders as a row in our database (the DB). At the same time, we’re also scanning the database for notifications that are ready to be dispatched and sending them to users across the world! Interactions with a DB, like updating or reading a row, are known as transactions.

A diagram of a transaction (represented by Duo in a hardhat) adding a new reminder, Do your Duolingo, to a table of reminders. Reminders have various messages and small clocks to represent when they should be sent. Another transaction is checking for reminders that are ready to be sent.
Representation of adding reminders to a database and checking ones to send

A classic problem with DBs, though, is that you need to make sure that if one transaction is updating a row, nobody else tries to modify it at the same time. Otherwise, reminders could get overwritten if transactions interleave each other. (See more in the appendix below.) So, before writing to the DB, we would lock a specific row so that anybody else trying to touch it had to wait.

A database table of scheduled reminders. One transaction has locked a row and is currently sending the row's reminder. Another transaction is waiting for that row to be unlocked so it can access the row.
A transaction has to wait for the previous one to finish before it can access the row

Unfortunately, all of this traffic on the DB led to a lot of locks being waited on. And when the DB had to check whether a row was locked and who was holding it, it took some time. As the DB spent more time checking locks, the last time it had to spend on actually updating the rows, leading to delayed notifications. The extra load from lock contention got so bad that notifications could be delayed by more than 20 minutes, and those minutes can be critical if it’s the last reminder to save your streak!

A representation of Duo the owl surrounded by lock icons in speech bubbles, looking defeated and on the verge of tears
Our system being overwhelmed by lock requests

As our user base grew, we started to see this behavior more, especially during peak notification traffic on Saturday at about 10 AM Eastern. Our notifications team would often have to monitor metrics at these times and be ready to intervene if necessary, a practice lovingly referred to as “watching the Saturday morning cartoons.” But this practice was not sustainable, so we looked to address the locking issue at the source.

A person looking worried with a bead of sweat running down their face seated in front of a computer screen
An engineer watching notifications break

Taking a look under the hood

To identify which transactions were holding the locks, we used SQL queries to profile the DB at different amounts of traffic.

What we found was that updating the schedules could take upwards of ten seconds!

When we updated a schedule, we would first lock the user’s row. We would read the current schedule and add any new reminders based on the user’s current state. Then we’d write back the new schedule and release the lock. And a lot of this time—over half!—was spent making network calls to get a user’s details (like the length of their streak) so that we could schedule the right reminders.

A timeline made up of four steps: A user does a lesson, illustrated by Duo waving a phone. Next, we lock their row and get the current reminders. Next, we get relevant data (illustrated by a picture of a server). Finally, we write the new reminders and unlock the row. The last three steps are highlighted in red, indicating that the lock is held by the transaction during these steps.
How transactions used to work when updating reminders

If we could reduce the amount of time each transaction is holding their locks, then there would be less waiting, so the solution would be to do these time-consuming network calls before getting any locks—and finally, this is where optimism comes into play.

The power of positive thinking

With optimistic locking, we read the current schedule without locking anything! We create the new schedule of reminders, but before we write back, we check if anyone else has updated the row in the meantime. If the row is unchanged, then it’s effectively as if we had locked the row, and so it’s safe to update it with our changes.

But if somebody else has updated the row, then we don’t want to overwrite their work, so we try from the very beginning, reading the now-current schedule and doing it all over again.

A timeline made up of four steps: A user does a lesson, illustrated by Duo waving a phone. Next, we lock their row and get the current reminders. Next, we get relevant data (illustrated by a picture of a server). Finally, we write the new reminders and unlock the row. Only the last step is highlighted in red, indicating that the lock is held by the transaction only during this step.
A transaction using optimistic locking

To enable checking if the row had been edited, we needed a version ID for each row—basically a number saying “this row is on version 5.” And then each transaction would only commit its changes if the version ID was the same as what we saw at the beginning of the transaction.

One caveat of optimistic locking is this “try again” business. If there are a lot of ongoing transactions, then won’t this cause a lot of this work to be wasted when the transaction has to start over? And the answer is: yes! Fortunately, though, we can check how often this happens and implement improvements to prevent eating too much work. Monitoring this failure mode and others is almost as important as implementing the fix itself.

Test, Monitor, Safeguard

While we felt bullish about this plan, it would be foolhardy to jump right in without testing it first. It’s good to be optimistic, but dangerous to be naive! With service as critical as notifications (see the appendix below), we needed to prove this would work before touching any code.

To test our plan, we created a replica of the DB with simulated mock traffic on it so that we could assess the impacts of the changes. For example, we were able to test adding a new column for the version ID and found that it would take 22 milliseconds, which is well within the acceptable downtime. We also tested various approaches for storing the version ID and checking if it had been updated, allowing us to find the most efficient setup. 

We were concerned that a lot of retried transactions could add load to our service, so we added tracking for how a transaction had to be retried because the version ID had been updated before it could write back to the DB. We found this to be a fairly rare occurrence, so we proceeded with the change.

For migrating from the old system to using optimistic locking, we added a feature flag, allowing us to instantly change which part of code is used without having to deploy any changes. This was useful so we could slowly move users over to the new system while checking for errors, but also very handy in case we needed to quickly revert the migration. (Fortunately, that didn’t happen!)

Ending on a High Note

After we switched to optimistic locking, we found that the load on the DB due to locking was reduced to negligible amounts! Without needing to wait on locks, our transactions to update reminders fell from about 16 milliseconds to <1 millisecond!

A graph of DB Load with big pink and red lines representing load from locks, yellow lines for load from commits, and green lines for CPU. On the right hand side of the graph, the pink and red lines go to zero as we add optimistic locking.
With optimistic locking, our DB load from locks (pink/red lines) went to zero!

Good News

  • We reduced the amount of resources used for the database, saving over $100k per year!
  • We stopped having delayed notification incidents, and we expect the current database setup to handle user traffic growth for several years

Other News (not bad, but good to note)

  • Optimistic locking did lead to heightened commit load on the DB since it can’t batch transactions, but overall load is still down and we identified strategies for reducing commit load.

Takeaways 

  • The main lesson of optimistic locking is to try to limit the scope of locks to only cover work that directly involves the DB.
  • Testing every step on a mock DB took extra time, but led to a smooth transition to optimistic locking with only a few seconds of downtime.
  • Proactively determining failure modes and adding monitoring in advance made it easy and fast to determine whether the migration was working correctly.
  • Most importantly, Saturdays can now once again be spent watching actual cartoons!

Like solving tricky problems with clever, inventive ideas? We’re hiring! 

An engineer having a chill Saturday morning!Image of a person smiling while sitting in front of a TV screen displaying two cartoon characters.
An engineer having a chill Saturday morning!


Appendix

Below is an example of interleaved transactions. Say we have transaction A that is supposed to add a practice reminder and transaction B which is supposed to add a streak reminder. Without locks, it’s possible for the transactions to both read an empty schedule, and then the last transaction to write will overwrite the first transaction’s changes!

Two diagrams stacked on top of one another. The top diagram has a timeline where Transaction A reads current schedule (which is empty). Then Transaction B reads current schedule (which is empty). Transaction A writes new schedule (now a single practice reminder). Transaction B writes new schedule (now a single streak reminder). The bottom diagram has a timeline where Transaction A locks & reads current schedule (which is empty). Then Transaction A writes new schedule and unlocks the row (now a single practice reminder). Transaction B locks & reads current schedule (a single practice reminder). Transaction B writes new schedule and unlocks the row (now a practice reminder and a streak reminder).
Example of interleaving transactions when locks aren’t versus are used