Five years ago, we launched the first high-stakes, digital-first test that could be taken anytime and anywhere in the world, because we saw how technology could radically improve testing for students and institutions. In this five-part series, we take a look back at the enormous amount of research and development that went into reinventing the world of high-stakes testing.

The Duolingo English Test is a new breed of assessment, a digital-first high-stakes exam that is available on demand, 24/7 anywhere in the world. But this continuous, global test administration poses enormous quality assurance challenges, as every test must be monitored to ensure score validity. Learn how we combine human expertise and artificial intelligence to ensure the quality of a new generation of assessments.

The contest and the measurement

High-stakes exams impact people’s lives, so it’s crucial they meet what assessment scientists call “the contest and the measurement” standards of a test — that is, they must give everyone a fair opportunity to prove their ability in a certain area, and measure that ability accurately (Holland, 1994).

One way to ensure this is to monitor test scores over many administrations of the test; if scores are comparable, this helps test developers determine that the test is valid. To help test developers identify and prevent possible errors that might jeopardize test score validity, the International Test Commission has laid out quality assurance guidelines: step-by-step procedures for regulating and monitoring high-stakes assessments.

The challenge is, the Duolingo English Test isn’t your average test. Unlike traditional brick-and-mortar operations, where a single form of the exam is administered to people gathered together on a set day and time, the Duolingo English Test is administered continuously, on demand, anytime and anywhere in the world. The old guidelines simply weren’t designed to accommodate the complexity of this new system.

“We can’t just conduct quality assurance for a few days during and after a test and be done,” says Mancy Liao, a scientist on the Assessment Research team. “We need to monitor everything, everywhere, all the time—it’s a huge undertaking.”

AQuAA: A fluid system

Here, tools that facilitate continuous pattern monitoring and swift communication are paramount. Enter AQuAA: Analytics for Quality Assurance in Assessment, an interactive dashboard that blends educational data mining techniques and psychometric theory.

AQuAA: a research-based automatic monitoring system

The AQuAA dashboard allows our psychometricians to evaluate the interaction between the test items, the adaptive test, and the samples of test takers, to ensure scores are consistent over many test administrations. First, a wide range of data mining, psychometric, and visualization techniques are used to gather assessment data and describe trends and seasonal patterns of test scores, in order to detect atypical changes. The dashboard then communicates this information to our team of assessment scientists and psychometricians, helping them to discover issues and respond in a timely manner.
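To make the idea of detecting atypical changes concrete, here is a minimal sketch in Python. It is not AQuAA's actual method: it assumes a simple control-chart-style rule in which each day's mean score is compared with the mean and standard deviation of the preceding week, and it does not model seasonality. The function name `flag_atypical_days`, the 7-day window, the threshold of 3, and the sample scores are all hypothetical choices for illustration.

```python
from statistics import mean, stdev


def flag_atypical_days(daily_means, window=7, threshold=3.0):
    """Return the indices of days whose mean score is atypical.

    Hypothetical rule (an assumption, not AQuAA's method): a day is
    flagged when its mean score lies more than `threshold` standard
    deviations from the mean of the previous `window` days.
    """
    flagged = []
    for i in range(window, len(daily_means)):
        baseline = daily_means[i - window:i]  # trailing window, excludes day i
        center = mean(baseline)
        spread = stdev(baseline)
        if spread == 0:
            # Perfectly flat baseline: treat any change as atypical.
            if daily_means[i] != center:
                flagged.append(i)
            continue
        z_score = (daily_means[i] - center) / spread
        if abs(z_score) > threshold:
            flagged.append(i)
    return flagged


# Made-up daily mean scores with one sudden jump on day 8.
scores = [100, 101, 99, 100, 102, 98, 100, 101, 120, 100]
print(flag_atypical_days(scores))  # [8]
```

In a setup like this, a flagged day would not be treated as an error automatically. It would be a prompt for a human expert to look at what changed, in keeping with the review process described in this article.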

Humans in the loop

The Duolingo English Test is the first of its kind; creating a unique quality assurance system to match required a tremendous amount of ingenuity. From determining which statistics should be used as indicators of score validity, to identifying patterns and irregularities relevant to the test’s quality, our experts had to research, test, and deliberate over every step.

Subject Matter Expert (SME) review process for maintaining AQuAA

Our assessment scientists designed AQuAA to accommodate the features of the Duolingo English Test that make it maximally accessible. Because many key aspects of this digital-first exam are accomplished automatically, quality assurance requires far more extensive data mining than it does for traditional tests. AQuAA leverages computational psychometrics to mine and model the educational data needed to run the system, while also allowing our experts to conduct pattern monitoring and reviews continuously, to keep up with our round-the-clock administration.

Designed to adapt

Beyond facilitating quality assurance, AQuAA helps our test developers continually improve the exam. Insights drawn from AQuAA are used to direct the maintenance and improvement of other aspects of the assessment, such as item development. The system was built to be flexible, so that it can adapt and evolve along with the Duolingo English Test.

What’s more, the methodologies for designing AQuAA can be adapted to ensure the quality of other digital-first assessments. Our research team outlines this system in a forthcoming paper; they hope it will help other assessment scientists advance the field of digital assessments, breaking down barriers and increasing access around the world.