If you’ve taken any kind of standardized test (the SAT, the ACT, AP exams, or others), you’ve come across what test developers call items: the questions, prompts, and tasks that let test takers show their proficiency. But these items aren’t developed out of thin air, nor is writing them as simple as coming up with a question! On a language proficiency test, for example, students must demonstrate their ability to speak, write, read, and listen in the language being tested, so the test needs written and audio passages for them to answer questions about, sentences with missing words for them to complete, and so on.

For years, the only way to generate that content was for teams of expert test developers to research and write it. This involves people ideating on subject matter, finding source material, researching, and of course writing, all of which takes time and money. Those costs are then passed on to test takers: most high-stakes English proficiency tests cost several hundred dollars. But now, using the latest machine learning technology, our team of test developers is able to automatically generate test content, making room for more innovation in the test development process.

[Illustration: speech bubbles representing different kinds of test items]
At Duolingo, we believe everyone should have access to high-quality education, and we’re experts at using the power of technology to break down barriers. Instead of having test developers spend countless hours researching and writing content, we’ve reimagined the item development process, harnessing machine learning to automatically generate content for the Duolingo English Test!

How AI makes test creation easier

How is this possible? With GPT-3: an extremely powerful machine learning model developed by the OpenAI research lab to mimic any kind of text it’s given.

Let’s say you give GPT-3 a few textbook articles; it can start generating more textbook articles that mimic the style and tone of those examples. It can do the same thing with recipes. Or poems. Because this model has such a strong understanding of language and the contexts in which it’s used, it doesn’t need much input to generate content. Trained on massive amounts of data, it’s able to produce natural, coherent texts on a wide range of topics, in virtually any genre.
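To make that concrete, here’s a minimal sketch of what this kind of few-shot prompting looks like with OpenAI’s Python library (the legacy Completion API). The example passages, model name, and parameters are illustrative assumptions, not our production setup:

```python
import openai  # OpenAI's Python client (legacy Completion API shown)

# Two short example passages establish the style and genre; GPT-3 then
# continues in kind. Both examples are invented for illustration.
few_shot_prompt = """Passage: Coral reefs are among the most diverse ecosystems on Earth, supporting thousands of species while covering less than one percent of the ocean floor.

Passage: The printing press transformed how information spread, turning books from rare luxuries into everyday objects.

Passage:"""

response = openai.Completion.create(
    model="text-davinci-003",  # one of the GPT-3 models; the choice is illustrative
    prompt=few_shot_prompt,
    max_tokens=120,            # cap the length of the generated passage
    temperature=0.8,           # higher values produce more varied passages
)

print(response.choices[0].text.strip())
```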

Our test developers have incorporated this technology into what’s known as a ‘human-in-the-loop’ process, allowing them to work in more efficient and creative ways: filtering, editing, and reviewing AI-generated content to produce test items that are indistinguishable from items written entirely by humans.
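In outline, that workflow looks something like the sketch below: the model proposes candidate passages, cheap automated checks filter out obvious failures, and everything that survives goes to a human review queue. The function names and filter rules here are hypothetical simplifications of the idea, not our actual tooling:

```python
def generate_candidates(prompt: str, n: int) -> list[str]:
    """Stand-in for n calls to the language model (see the GPT-3 sketch above)."""
    return [f"{prompt} (generated passage {i})" for i in range(n)]

def passes_automated_checks(passage: str) -> bool:
    """Hypothetical cheap filters applied before any human sees the text."""
    long_enough = len(passage.split()) >= 5
    not_truncated = not passage.endswith("...")
    return long_enough and not_truncated

def queue_for_human_review(passages: list[str]) -> list[str]:
    """In the real process, test developers edit, approve, or reject each
    passage here; this stub just passes everything through."""
    return passages

def human_in_the_loop(prompt: str, n: int = 20) -> list[str]:
    # Model generates, machines filter, people decide.
    candidates = generate_candidates(prompt, n)
    filtered = [p for p in candidates if passes_automated_checks(p)]
    return queue_for_human_review(filtered)
```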

This technology works especially well for standardized testing because we’re not trying to accomplish a rhetorical goal: we’re not trying to persuade test takers that a certain argument is true, teach them something new, or make them feel a certain emotion; we’re simply assessing their ability to understand and use the language. So while this technology might not be able to produce a Pulitzer-worthy op-ed column, the texts it can generate work perfectly for this application.

For example, to create a fill-in-the-blank item, we’ll first use GPT-3 to generate a passage, then select the sentence that’s most natural to remove: the passage should still make sense without it, and it should leave context clues that help the test taker figure out what the missing sentence says. Of course, to present test takers with a choice of sentences to fill in the blank, we need wrong answers, too. By generating lots of texts about similar topics, we can mine those other texts for incorrect answers.
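Here’s a rough sketch of that assembly step. The sentence splitting is deliberately naive, and in practice the blanked sentence and the distractors are chosen by test developers rather than at random; everything below is a hypothetical simplification:

```python
import random

def make_fill_in_blank_item(passage: str, distractor_passages: list[str],
                            blank_index: int) -> dict:
    """Turn one generated passage into a fill-in-the-blank item.

    blank_index marks the sentence judged most natural to remove.
    Splitting on '. ' is a naive stand-in for real sentence segmentation.
    """
    sentences = passage.split(". ")
    answer = sentences[blank_index]
    stem = ". ".join("_____" if i == blank_index else s
                     for i, s in enumerate(sentences))

    # Wrong answers come from other generated passages on similar topics,
    # so they're plausible in tone but don't fit this passage's context.
    distractors = [random.choice(p.split(". ")) for p in distractor_passages[:3]]

    options = distractors + [answer]
    random.shuffle(options)
    return {"stem": stem, "options": options, "answer": answer}
```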

This ‘human-in-the-loop’ process also opens the door for our test developers to experiment with innovative item types that weren’t feasible before this kind of technology, like interactive reading, a new item on the test that assesses targeted skills related to academic reading tasks.

Once the items are developed, our teams step in to review again for accuracy, fairness, and potential bias in the material: a crucial step in any test development process, whether items are generated by machine learning or by humans!

[Illustration: a line drawing of a human brain, half maze-like and half organic, overlaid with a network of connected orange dots]

The future of testing is here

Our approach to automatic item generation ensures that we have a wide variety of interesting content on the test, at the volume necessary to support our computer-adaptive test format. We’re not trying to minimize our need for human collaboration and innovation; rather, we’re using the technology to supplement all of the hard work our test creators do! And instead of spending months ideating on and researching the topics they want to write about, our test developers can choose from a far greater range of content and produce items far more efficiently!

By streamlining the test development process, we’re able to offer a faster, more innovative test, at a much more affordable price point, making it possible for more people to test their best!

To learn more about DET test design, check out our recent white paper outlining our theoretical assessment ecosystem for digital-first assessment.