At Duolingo, we value speed in executing new ideas and in iterating on existing ones. However, new features and myriad other changes introduce the potential for regressions. As such, the QA team spends a substantial portion of its bandwidth monitoring and testing our weekly releases to ensure a polished final product for our learners.

We have long desired an automated solution for regression testing, which would allow the team to focus on higher-ROI tasks like supporting bug fixes and testing new features. We needed a tool that could be maintained easily by anyone on the team, regardless of their depth of coding knowledge, and one robust enough not to crumble under the many changes introduced every week.

In 2024, we were excited to partner with MobileBoost on GPT Driver to explore whether such a solution had finally arrived. 

New frontiers in AI-based testing

Enter AI-based automation tools. GPT Driver is a toolset that can take natural language instructions and execute them on a virtual device.

Example of test directions listed and written in a natural language style, “Tap on the profile tab icon located at the bottom of the screen.”

The benefits of this approach were immediately apparent from an accessibility perspective. With little training, we could put the tool in front of any team member, regardless of coding knowledge, have them type out a description of the test case they sought to fulfill, hit run, and watch the test play out in a browser in real-time. Within a few hours, we were able to generate tests for key scenarios like session progression, onboarding, and social features.

Learning to effectively test using GPT-based prompts

For such a solution to work, we needed to overcome two major problems: 1) the highly iterative nature of Duolingo, and 2) the sheer number of variables and variants in the Duolingo experience. Between our features and our A/B testing methodology, it is often difficult to say with certainty which screen a user will see next. Such ambiguity is easy enough for a manual tester to accept but quite challenging for automation. After weeks of playing whack-a-mole and watching our tests balloon into long lists of eventualities, we knew there had to be another way.

In collaboration with MobileBoost, we learned that our tests could be more robust and reliable if we allowed GPT to be more creative in reaching our goals. Rather than telling GPT Driver to tap a particular button on one screen, then a different button on the next, and so on (buttons that might change over time or by scenario), we instead wrote tests around a broader goal like “Progress through the screens until you see XYZ.” GPT Driver would then interpret each screen with that end goal in mind and keep progressing until it either met the success criteria or could no longer work out what to do with a screen. This natural language reframing of test cases allowed for more reliable runs even in the face of uncertainty.

A test direction written broadly to accomplish completing a lesson, “Complete the quiz until you are on the `Lesson complete!` screen.”
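The goal-driven pattern described above can be sketched as a simple loop: interpret the current screen, take one action toward the goal, and stop when the success criterion appears or no sensible action remains. This is an illustrative Python sketch only, not GPT Driver's actual API; `run_goal_test`, `interpret`, and the toy screen flow (including the A/B-variant promo screen) are all hypothetical.

```python
# Illustrative sketch of goal-driven test execution (hypothetical names;
# not GPT Driver's actual API). Instead of scripting every tap, the runner
# asks an interpreter for the next action toward a stated goal and loops
# until the success criterion is visible or the interpreter is stuck.

def run_goal_test(start_screen, goal_text, interpret, max_steps=20):
    """Advance screen by screen until `goal_text` is visible."""
    screen = start_screen
    for _ in range(max_steps):
        if goal_text in screen["text"]:
            return True, screen["name"]        # success criterion reached
        action = interpret(screen, goal_text)  # an LLM call in practice
        if action is None:                     # interpreter is stuck: fail
            return False, screen["name"]
        screen = action()                      # perform the tap/swipe
    return False, screen["name"]               # safety cap on steps

def make_flow():
    # A toy screen flow standing in for a lesson, including an A/B-variant
    # screen a step-by-step script would not have anticipated.
    done = {"name": "lesson_complete", "text": "Lesson complete!"}
    promo = {"name": "promo", "text": "Try Super Duolingo", "next": done}
    quiz = {"name": "quiz", "text": "Pick the correct answer", "next": promo}
    return quiz

def interpret(screen, goal_text):
    # Stand-in for model interpretation: advance if any next step exists.
    nxt = screen.get("next")
    return (lambda: nxt) if nxt else None

ok, final = run_goal_test(make_flow(), "Lesson complete!", interpret)
```

The point of the structure is that the test encodes only the goal; unexpected intermediate screens (like the variant promo above) are handled by interpretation rather than enumerated in the script.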

This approach does have drawbacks. Though more reliable, it can mask issues that GPT Driver was simply able to work around; many bugs do not strictly block progress, so we still need to know what occurred. Thankfully, GPT Driver records live test runs and stores them for easy review, and reviewing these runs has become core to our regression testing workflow. Rather than manually running through these workflows, a process that previously took several QA team members hours every week, we can now scrub through the recordings in minutes to confirm each workflow completed as expected. This has reduced manual regression testing work by as much as 70%.

A list of run tests indicating which have passed and which have failed with links to view recordings

Challenges of a GPT-based approach

As anyone who has worked with GPT or similar LLMs knows, providing directions that get you exactly what you need can be challenging. Misinterpretation happens, and learning which directions must be spelled out and which should remain open-ended takes time.

Additionally, and frustratingly, GPT's interpretations can at times “go rogue” in ways that are difficult to troubleshoot or understand. MobileBoost has been a good partner in curbing some of these more bizarre behaviors, introducing checks in their software layer or bypassing the GPT layer altogether where a clear next step can be executed directly.
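The idea of bypassing the GPT layer where the next step is unambiguous can be sketched as a deterministic fast path with a model fallback. Again, this is a hypothetical illustration of the concept, not MobileBoost's implementation; `choose_action`, the screen dictionary, and the fallback function are invented for the example.

```python
# Hypothetical sketch of skipping model interpretation when the next step
# is unambiguous: if exactly one on-screen element matches the target,
# act on it directly; otherwise fall back to the (slower, less
# predictable) model interpretation layer.

def choose_action(screen, target_label, llm_fallback):
    matches = [e for e in screen["elements"] if e["label"] == target_label]
    if len(matches) == 1:
        return ("tap", matches[0]["id"])       # clear next step: execute it
    return llm_fallback(screen, target_label)  # ambiguous screen: ask model

screen = {"elements": [{"id": "btn-7", "label": "CONTINUE"},
                       {"id": "btn-2", "label": "SKIP"}]}

def fallback(screen, label):
    # Stand-in for a model call; in practice this is the expensive path.
    return ("ask_model", label)

action = choose_action(screen, "CONTINUE", fallback)  # ("tap", "btn-7")
```

Keeping a deterministic path for the common case both speeds up runs and shrinks the surface area where model misinterpretation can occur.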

We also face challenges with particularly difficult content. GPT can struggle with our app's most complex challenge types, whether because of long and nuanced translations or timed, intricate interactions. Still, as the range of tasks LLMs can address expands and interpretation speed and accuracy improve, we expect these cases to become tractable.

Conclusion

We’re very pleased with how accessible GPT Driver has been for the QA team, and we’re excited by the productivity gains we’ve seen as a result. GPT Driver has sped up regression testing we feared might never be automatable, allowing the team to focus more on higher-ROI tasks like supporting bug fixes and testing new features. We’re excited to explore what other workflows it may benefit.

If you’re looking to join an innovative team where your engineering skills help revolutionize education through advanced AI solutions, we’re hiring! 

Acknowledgements 
A special thanks to all who have contributed to the MobileBoost project: Sharanya Viswanath, Aaron Wang, Sam Potter, Pat Dalsass, Josh Abrams, Shelby Wengreen, Clark Munson, and Eva Raimondi.