Five years ago, we launched the first high-stakes, digital-first test that could be taken anytime and anywhere in the world, because we saw how technology could radically improve testing for students and institutions. In this five-part series, we take a look back at the enormous amount of research and development that went into reinventing the world of high-stakes testing.
High-stakes tests can be stressful — even demoralizing. At Duolingo, we believe that the content of a test should be challenging, but not the experience of taking it. That’s why we designed the Duolingo English Test to quickly adapt to a test taker’s learning level, withholding content that is likely too difficult — or too easy.
Read on to learn how our team of experts in assessment science and natural language processing collaborated with our machine learning engineers to design a rigorous test that can determine English proficiency far more efficiently (read: in a lot less time) than a traditional fixed-form exam.
Fixed-forms aren't flexible
Language proficiency is on a spectrum from basic to advanced. The goal of a language proficiency exam is to assess where test takers fall on that spectrum by seeing how they respond to questions and tasks — known by researchers as “items” — that range from simple to more challenging.
One of the reasons that traditional language proficiency tests take so long to complete (about 3 hours on average) is that they are fixed-form: items are pre-determined before the exam begins, with no knowledge of the test taker’s language ability. By evaluating a test taker’s responses to items across all proficiency levels, the test places them on the scale. In order for a fixed-form test to do this, it must include items across all of these levels.
In this fixed-form model, every test taker ends up responding to test items that are not “informative” — that is, their response to the item doesn’t help much to estimate their true proficiency. Having an advanced English speaker correctly responding to many simple questions isn’t very helpful in precisely measuring their ability; for a complete beginner, guessing the answer to advanced questions is a waste of time.
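To make "informative" concrete, here is a minimal sketch using the Rasch model, a textbook item response model. It is an assumption chosen for illustration, not a description of the Duolingo English Test's actual scoring model, and the ability and difficulty values are made up. Under this model, an item tells us the most when the test taker has about a 50% chance of answering it correctly.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability, difficulty):
    """Fisher information of one Rasch item at a given ability: p * (1 - p)."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)

# Hypothetical advanced test taker (ability on an arbitrary logit scale).
advanced = 3.0
for difficulty in (-3.0, 0.0, 3.0):
    info = item_information(advanced, difficulty)
    print(f"difficulty {difficulty:+.1f}: information {info:.4f}")
```

In this sketch, the very easy item contributes almost nothing for the advanced test taker, while the item matched to their level contributes the maximum of 0.25.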
We knew there was a better way to go about this, so our team of experts in assessment design, natural language processing, and machine learning went to work generating a computer adaptive test that takes less than an hour to complete — an industry first for English language assessment.
CATs are nimble, CATs are quick
A Computer Adaptive Test (CAT) is just that: adaptive. As an individual moves through the test, the items they’re given are determined by their responses to the previous items. So if a test taker gets a reading task right, they’ll move on to a more challenging task, perhaps a speaking task. If they don’t do so well on an item, the next item they’re presented with will be somewhat easier. That means they won’t waste time on items that are far above or below their proficiency level. Compared to fixed-form exams, CATs like the Duolingo English Test take less time to complete, because far fewer items are needed to accurately place test takers on the scale of proficiency.
So how does a CAT actually work? It all starts with the item bank: the collection of items that the algorithm pulls from to generate each individual test. Our natural language processing experts and machine learning engineers leverage human-in-the-loop AI to generate a bank of tens of thousands of items. The bank is built in two steps.
First, we collected English words and texts for which subject matter experts had already assigned a value on the Common European Framework of Reference for Languages (CEFR): an international standard that describes language ability on a six-level scale, from A1 (basic) to C2 (advanced).
Then, we analyzed these words and texts to determine what language features made them more likely to fall into different CEFR levels. For example, the simpler language in the sentence “I took a test” makes it A1, whereas “I elected to take a computer adaptive test” uses more advanced B and C level language. Based on this analysis, our machine learning models were able to generate new test items spanning the CEFR range, to measure test takers at all proficiency levels.
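As a rough illustration of what "language features" can mean, the sketch below computes a few surface features for the two example sentences. The feature set, the `TOY_WEIGHTS` values, and the scoring function are all hypothetical and chosen only for this example; real models learn their weights from expert-labeled data and use far richer features.

```python
import re

def text_features(text):
    """Compute a few simple surface features of a text."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    n = len(words)
    return {
        "word_count": n,
        "mean_word_length": sum(len(w) for w in words) / n,
        "long_word_share": sum(len(w) >= 7 for w in words) / n,
    }

# Made-up weights for illustration only; not learned from any data.
TOY_WEIGHTS = {"word_count": 0.1, "mean_word_length": 0.5, "long_word_share": 2.0}

def toy_difficulty_score(text):
    """Weighted sum of features; higher means 'looks more advanced'."""
    features = text_features(text)
    return sum(TOY_WEIGHTS[name] * features[name] for name in TOY_WEIGHTS)

for sentence in ("I took a test", "I elected to take a computer adaptive test"):
    print(sentence, "->", text_features(sentence))
```

Even these crude features separate the two sentences: the second is longer, has longer words on average, and contains several words of seven or more letters.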
As a test taker moves through the exam, each item is selected at the level that is the algorithm’s best guess of their ability, based on their responses to previous items. The more items a test taker responds to, the closer the algorithm’s estimate gets to their true proficiency. Because these items are selected in real time from our bank of tens of thousands, no two tests are identical, so answers cannot be obtained in advance. (Curious how our experts use AI to supervise and score the test? Check out our previous post in this series!)
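The select-respond-update loop described above can be sketched in a few lines. This is a simplified simulation under stated assumptions, not the Duolingo English Test algorithm: it uses a Rasch response model, a standard normal prior over ability, a grid-based posterior mean as the "best guess," and a toy bank of 121 item difficulties. The names `run_cat`, `GRID`, and `bank` are invented for this example.

```python
import math
import random

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

GRID = [i / 10 for i in range(-40, 41)]  # candidate ability values, -4.0 to 4.0

def mean_and_sd(weights):
    """Mean and standard deviation of the (unnormalized) posterior on GRID."""
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(true_ability, bank, n_items, seed=0):
    """Simulate one adaptive test; returns (item, correct, estimate, sd) tuples."""
    rng = random.Random(seed)
    weights = [math.exp(-0.5 * t * t) for t in GRID]  # standard normal prior
    unused = list(bank)
    estimate, _ = mean_and_sd(weights)
    history = []
    for _ in range(n_items):
        # Pick the unused item whose difficulty is closest to the current guess.
        item = min(unused, key=lambda b: abs(b - estimate))
        unused.remove(item)
        # Simulate the test taker's response.
        correct = rng.random() < p_correct(true_ability, item)
        # Bayesian update: reweight each candidate ability by the likelihood.
        weights = [
            w * (p_correct(t, item) if correct else 1.0 - p_correct(t, item))
            for t, w in zip(GRID, weights)
        ]
        estimate, sd = mean_and_sd(weights)
        history.append((item, correct, estimate, sd))
    return history

bank = [i / 20 for i in range(-60, 61)]  # 121 hypothetical item difficulties
history = run_cat(true_ability=1.0, bank=bank, n_items=20, seed=42)
for item, correct, estimate, sd in history[::5]:
    print(f"item {item:+.2f} correct={correct} estimate={estimate:+.2f} sd={sd:.2f}")
```

The uncertainty column (`sd`) is the useful thing to watch: as responses accumulate, the spread of plausible ability values narrows, which is why an adaptive test can stop after far fewer items than a fixed form.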
A healthy distribution
Being adaptive means that compared to a fixed-form exam, the Duolingo English Test requires far less time to determine a test taker’s proficiency level — most people complete the exam in under an hour. Because of this, and because they encounter few items far above or below their proficiency level, test takers might find that the test feels less stressful, and perhaps easier, than a longer, fixed-form exam.
So how can we be sure the test is rigorous? Assessment scientists use a host of tools and methods to ensure that the difficulty of an exam is at an appropriate level. One simple way to check is to look at the distribution of scores.
“If the test wasn’t hard enough, we’d see everyone scoring very highly,” says Duolingo English Test Chief of Assessment, Alina von Davier. “But that’s not what we’re seeing. In fact, we have a very good range across the scale. As expected in any good assessment, few test takers get the highest scores.”
Test taker first
The bottom line is that how challenging a test feels is about more than just the objective difficulty of the items a test taker responds to across the CEFR scale; the test taker experience is also a huge factor.
“We thought a lot about the test taker experience when designing the Duolingo English Test,” says Director of Duolingo Research, Burr Settles. “We pride ourselves on being student-centered and know there’s a lot riding on the outcome of this high-stakes exam, so we wanted to make it as stress-free as possible.”
Thanks to its computer adaptive delivery, whenever and wherever they choose to take the exam, Duolingo English Test takers can focus on showcasing what really matters: their English proficiency. Want to learn more about the design of the Duolingo English Test? Check out our test taker guide, or read the technical manual if you want a deeper dive!