At Duolingo, we are big believers in making data-driven decisions. It’s not just people with statistics backgrounds who work with data here. On any given day, a Data Scientist, Product Manager, Learning Scientist, Marketing Analyst, or Software Engineer might be performing complex analyses to understand how to improve our learners’ app experience.

We collect various pieces of data about how learners interact with the app (e.g., when a learner completes an exercise, or when a learner accepts a Friend Streak invite), and with tens of millions of learners using the app every day, we have large amounts of data on our hands! At Duolingo, the Data Refinery team builds tooling and infrastructure for data modeling, helping take raw data and shape it into a cleaner structure that makes it easier to gather meaningful metrics and insights. In addition to Data Scientists, we staffed this team with Software Engineers who were brand new to the data analytics space. This naturally led us to ask “What lessons from developing backend services can we apply to developing datasets?” It turns out, quite a lot!

Modeling data isn’t so different from designing an API – you set up code to take in some input data, perform some validation and computation, and produce output data in a form that’s easy to work with.

A diagram representing the data modeling process, with an example. The diagram shows that two tables of "Raw Data" can be combined through the "Data Modeling Layer" to form a single, cleaned table that is "Modeled Data."
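To make that parallel concrete, here’s a minimal sketch of what a modeling-layer query might look like, joining two raw tables into one cleaned table as in the diagram above. The table and column names (`raw.lesson_completed_events`, `modeled.daily_lesson_completions`, and so on) are hypothetical, not our actual schema.

```sql
-- A minimal, hypothetical modeling-layer query: two raw tables are cleaned
-- and joined into one modeled table, as in the diagram above.
-- All table and column names here are illustrative, not Duolingo's schema.
CREATE TABLE modeled.daily_lesson_completions AS
SELECT
    e.user_id,
    CAST(e.event_time AS DATE)   AS activity_date,
    COUNT(*)                     AS lessons_completed,
    MAX(p.subscription_tier)     AS subscription_tier
FROM raw.lesson_completed_events AS e
LEFT JOIN raw.user_profile_snapshots AS p
       ON p.user_id = e.user_id
      AND p.snapshot_date = CAST(e.event_time AS DATE)
GROUP BY
    e.user_id,
    CAST(e.event_time AS DATE);
```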

This understanding led us to build out a data modeling developer experience that mirrors the process of building an engineering system.

Conventions should be automatically enforceable

Code linting comes in handy for setting and enforcing standards, especially when multiple people contribute to a codebase. In the world of dataset development, linting can be used to enforce not only SQL code formatting, but also consistent and intuitive outputs. Without some conventions, different dataset developers might structure similar data in dramatically different ways (e.g., one table might store user IDs in a `user_id` column, while another might store the same IDs in a `user_logged_in_id` column). More importantly, data consumers might struggle to navigate a disorganized data ecosystem like this.

In order to avoid conflicting patterns, we incorporate these linting checks into our Continuous Integration process. For example, our linters flag situations such as the following: a code change introduces a column named `event_timestamp` (with the name suggesting that the data type would be a `TIMESTAMP`), but the column is actually stored as a `DATE` type. We don’t make our dataset developers guess what these standards are – we just surface them in pull requests when relevant.
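Our linters actually run against the SQL source in CI, but to give a rough sense of the rule involved, here’s a hedged sketch of the same check expressed as a query over a warehouse’s `information_schema` (exact type names vary by warehouse, and the convention shown is illustrative):

```sql
-- Illustrative convention check: columns whose names end in "_timestamp"
-- should actually be stored as a TIMESTAMP type.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE column_name LIKE '%timestamp'
  AND data_type NOT LIKE 'TIMESTAMP%';
```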

An automated GitHub PR comment that points out that a table name does not meet the naming and typing conventions.

Verifying changes should be straightforward and encouraged

Making changes to complex systems can be daunting – when writing microservices, we rely on unit, integration, and end-to-end tests, but for datasets it can be hard to replicate the nature of real, raw input data in a test. In the past, Duos (our employees) developing data models would have to manually run the new version of their code and store it in a “dev” table in order to compare it to production. The process always had the same steps:

  • Create a dev version of the table based on the new code
  • Perform some queries comparing this dev table to the production table
  • Summarize the results of these comparison queries

Now, we have tooling that lets anyone run “data diffs” automatically in their pull requests. Think of this not as a pass/fail CI check but as a verification step: in some cases a big difference makes sense (e.g., changing the logic behind a computation), and sometimes you expect no change at all (e.g., a refactor).
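Under the hood, the comparison boils down to queries like this simplified sketch, which summarizes the dev and production versions of a table side by side (table names and metrics are illustrative, not our exact tooling):

```sql
-- A simplified sketch of the comparison queries behind a "data diff":
-- the dev and production versions of a table are summarized side by side.
-- Table names and metrics are illustrative.
SELECT
    'prod' AS version,
    COUNT(*)                AS row_count,
    COUNT(DISTINCT user_id) AS distinct_users,
    SUM(lessons_completed)  AS total_lessons
FROM modeled.daily_lesson_completions
UNION ALL
SELECT
    'dev' AS version,
    COUNT(*),
    COUNT(DISTINCT user_id),
    SUM(lessons_completed)
FROM dev.daily_lesson_completions;
```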

An automated GitHub PR comment that starts a "diff task" and then shows the results of the diff.

Making big changes shouldn’t require downtime

Once a code change is approved and verified, the next big hurdle is deployment – in particular, getting the change out while minimizing downtime. Blue-green deployment is a popular method for deploying code changes, and we can apply similar principles to modeled data.

“Downtime” looks a little different when it comes to data. Instead of system errors, the concerns tend to be around incomplete or misleading data. Imagine querying a metric, seeing a chart like the one below, and launching into an investigation of the spike, only to learn that you happened to run the query while the data was in the middle of an update.

A line chart that is mostly level, with 4 data points in the middle spiking by a large amount.

To avoid this situation, we instead follow this process:

  • All modeled datasets are backfilled to a separate table from the production table
  • Once backfill is complete, we “clone” the backfilled table into the production table, replacing the data all at once
  • Once we’ve replaced the data, we deploy the code change automatically

This means no more worrying about inconsistent data!
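Here’s a hedged sketch of what that backfill-then-swap flow can look like. The `SWAP WITH` statement is Snowflake-style syntax; other warehouses have similar atomic-replace primitives, and the table names and modeling query are illustrative rather than our exact pipeline.

```sql
-- A hedged sketch of the backfill-then-swap flow. The SWAP WITH syntax is
-- Snowflake-style; other warehouses have similar atomic-replace primitives.
-- Table names and the modeling query are illustrative.

-- 1. Backfill the new version of the model into a separate table,
--    leaving the production table untouched while it runs.
CREATE TABLE modeled.daily_lesson_completions_backfill AS
SELECT
    user_id,
    CAST(event_time AS DATE) AS activity_date,
    COUNT(*)                 AS lessons_completed
FROM raw.lesson_completed_events
GROUP BY user_id, CAST(event_time AS DATE);

-- 2. Atomically swap the backfilled table into production, so consumers
--    never query a half-updated table.
ALTER TABLE modeled.daily_lesson_completions
    SWAP WITH modeled.daily_lesson_completions_backfill;
```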

Observability should be a key part of the process – but don’t overcomplicate it!

A system is incomplete without some form of observability, both for the people maintaining the system and for consumers. Monitoring allows dataset developers to automatically learn when something is broken (ideally before consumers report an issue!), but “broken” can mean many things when it comes to data. It’s tempting to fixate on complex approaches to data monitoring, such as statistical tests or anomaly detection.

Our monitoring solution is extremely straightforward and focuses on the situations that matter most to us. We need to know 1) when things fail and 2) when datasets fall behind due to processing delays. By keeping the process simple, we also provide alerts that are accessible to our data consumers.
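As a sketch of how simple a freshness check can be, here’s a query against the hypothetical table from earlier that flags it as stale when the newest data is older than expected (date arithmetic varies by warehouse; this is illustrative, not our exact check):

```sql
-- A deliberately simple freshness check for the hypothetical table above:
-- flag the table as stale if its newest data is older than expected.
SELECT
    MAX(activity_date)                      AS latest_date,
    CURRENT_DATE - MAX(activity_date)       AS days_behind,
    (CURRENT_DATE - MAX(activity_date)) > 1 AS is_stale
FROM modeled.daily_lesson_completions;
```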

A series of automated Slack messages, including messages that a job has failed, that a table is out of date, and that the table has then caught up.

Conclusion

Our modeled dataset ecosystem is now a core part of how we work with data at Duolingo. These datasets are used regularly in hundreds of A/B test analyses and dashboards, with over 10,000 queries per week from people across the company.

Our dataset development tooling allows anyone with some SQL experience to productionize a modeled dataset within a day, and Data Scientists, Marketing Analysts, Learning Scientists, and Software Engineers frequently create and modify modeled datasets. We didn’t do all of this by reinventing the wheel – we stuck to what we know works in engineering and stood up a robust dataset development process!

If you want to use your engineering skills to streamline intuitive data structures that power millions of learners and drive educational improvements, we’re hiring!