Subscores: Improving how we report Duolingo English Test results

Duolingo developed the Duolingo English Test as a convenient and accurate way for English learners to certify their English level. There are many reasons why people need to show their English language proficiency, such as applying to university or getting a job.

The goal of the test is to measure the test taker’s English proficiency and estimate how well they can use language to communicate their ideas. Their overall proficiency is reported on a 160-point scale that aligns with the CEFR. Today we are sharing a new feature of the test: subscores!

Why add subscores?

As more universities began accepting the Duolingo English Test as proof of English proficiency for international students, they expressed an interest in subscores, both to better understand their applicants and to have greater confidence in those applicants' ability to function in English at their institution.

It’s common to report subscores as proficiency in different components of language such as reading, writing, speaking, and listening. These subscores can provide more nuanced information about a test taker’s language abilities, without requiring them to take another test.

How are we reporting subscores?

Research shows that natural language use is complex and varies by situation. Additionally, English language programs and assessment tasks often follow a design that integrates various components of language¹⁻⁴. In practice, language skills are often integrated, which means people use multiple skills simultaneously to communicate. For example, understanding a university lecture draws on students’ comprehension skills (listening and reading), whereas participating in a study group draws on a different pair of skills (listening and speaking).

So, in an effort to reflect natural language use, we report subscores that represent integrated modalities: Literacy, Conversation, Comprehension, and Production.

How did we develop the subscores?

In order to create subscores that are useful for all stakeholders, we considered three criteria. These criteria required that the subscores:

Reflect the internal structure of the test. The subscores should be a good reflection of how questions work together to measure different components of language proficiency.
Be reliable. The subscores should be consistent.
Have added value. The information in each subscore should be different, and add measurement value above and beyond the total score.

Internal structure

The Duolingo English Test has seven types of questions drawn from language testing research that measure different aspects of language proficiency⁵. Each question type contributes to two subscores.

Literacy: C-test, writing, yes/no (text)
Conversation: Speaking, dictation, elicited speech, yes/no (audio)
Comprehension: C-test, dictation, elicited speech, yes/no (text), yes/no (audio)
Production: Speaking, writing
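
As a quick consistency check on the list above, the mapping can be inverted to confirm that each of the seven question types feeds exactly two subscores. This is only an illustrative sketch: the dictionary and function names are made up for the example and are not part of the test's scoring code.

```python
# Subscore composition, copied from the list above.
SUBSCORES = {
    "Literacy": ["c-test", "writing", "yes/no (text)"],
    "Conversation": ["speaking", "dictation", "elicited speech", "yes/no (audio)"],
    "Comprehension": ["c-test", "dictation", "elicited speech",
                      "yes/no (text)", "yes/no (audio)"],
    "Production": ["speaking", "writing"],
}


def subscores_by_question_type(subscores):
    """Invert the mapping: question type -> sorted list of subscores it feeds."""
    inverted = {}
    for subscore, question_types in subscores.items():
        for question_type in question_types:
            inverted.setdefault(question_type, []).append(subscore)
    return {qt: sorted(names) for qt, names in inverted.items()}


BY_TYPE = subscores_by_question_type(SUBSCORES)

for question_type, names in sorted(BY_TYPE.items()):
    print(f"{question_type}: {', '.join(names)}")
```

Every question type appears under one modality pair (Literacy or Conversation) and one skill pair (Comprehension or Production), which is what "integrated" means here.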

To better understand the relationships among the test questions, we use a statistical technique called non-metric multidimensional scaling (MDS), which looks at how similar to (or different from) one another the questions are, based on patterns in our test takers' scores on those questions. This analysis lets us represent the test questions in a smaller number of dimensions (in this case, two), and it produces a useful figure that illustrates the relationships among the different questions.

Figure 2 Duolingo English Test Questions in Two Dimensions (see the note at the end of this post)

The results of the MDS analysis are displayed in Figure 2. This shows that the questions are working together to measure integrated modalities of language: 1) understanding and producing written language (Literacy), 2) understanding and producing spoken language (Conversation), 3) understanding spoken and written language (Comprehension), and 4) producing spoken and written language (Production).
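
How well a two-dimensional picture like this fits the data is usually summarized by Kruskal's Stress-1, the statistic reported in the figure note (0.026; lower is better). The sketch below shows how that statistic can be computed for a given configuration. It is not our analysis code: the points and dissimilarities are hypothetical, and the function names are invented for the example. Non-metric MDS uses only the rank order of the dissimilarities, so the configuration distances are first fitted with a monotone (isotonic) regression.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of labelled 2-D points."""
    return {
        (a, b): math.dist(points[a], points[b])
        for a, b in combinations(sorted(points), 2)
    }


def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block is [sum of values, number of values]
    for value in values:
        blocks.append([value, 1])
        # Merge backwards while adjacent block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def stress1(points, dissimilarities):
    """Kruskal's Stress-1 of a configuration against observed dissimilarities."""
    distances = pairwise_distances(points)
    # Order the pairs by observed dissimilarity; only this rank order matters.
    pairs = sorted(distances, key=lambda pair: dissimilarities[pair])
    d = [distances[pair] for pair in pairs]
    d_hat = isotonic_fit(d)  # the "disparities"
    numerator = sum((x - y) ** 2 for x, y in zip(d, d_hat))
    denominator = sum(x ** 2 for x in d)
    return math.sqrt(numerator / denominator)


# Hypothetical three-item example (not Duolingo English Test data).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0)}
same_order = {("A", "B"): 0.1, ("A", "C"): 0.5, ("B", "C"): 0.9}
reversed_order = {("A", "B"): 0.9, ("A", "C"): 0.5, ("B", "C"): 0.1}

print(stress1(points, same_order))      # rank order preserved, so stress is zero
print(stress1(points, reversed_order))  # rank order violated, so stress is positive
```

A full non-metric MDS alternates this stress calculation with moving the points to reduce it; the sketch covers only the evaluation step.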

Reliability

Once we have determined the internal structure of the test and used it to create subscores, we need to evaluate their reliability, or consistency. We do this by estimating their internal consistency and test-retest reliability. Internal consistency indicates how well the questions that make up a subscore work together to measure its component of language (Literacy, Conversation, Comprehension, or Production). Test-retest reliability represents the relationship between two test scores from the same people within a short window of time (30 days). The results for both measures are satisfactory for all four subscores.

Table 2 Duolingo English Test Subscore Reliability

Subscore	Internal consistency	Test-retest reliability
Literacy	0.89	0.82
Conversation	0.93	0.80
Comprehension	0.95	0.78
Production	0.76	0.83
Note: The closer these values are to 1.00, the better. Internal consistency was calculated using split-half reliability methods.
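
For readers unfamiliar with these two statistics, here is a minimal sketch of how they are commonly computed. It rests on assumptions that this post does not confirm: the split-half estimate uses a Pearson correlation between two half-test scores followed by the Spearman-Brown correction, and test-retest reliability is a plain Pearson correlation between two sittings. The function names and scores are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_half_reliability(half_a, half_b):
    """Correlate two half-test scores, then apply the Spearman-Brown step-up."""
    r = pearson(half_a, half_b)
    return 2 * r / (1 + r)


def retest_reliability(first_sitting, second_sitting):
    """Correlation between the same people's scores on two sittings."""
    return pearson(first_sitting, second_sitting)


# Hypothetical scores for five test takers (not real data).
half_a = [40, 55, 60, 72, 80]
half_b = [42, 50, 65, 70, 85]
first = [85, 100, 120, 135, 150]
second = [90, 105, 115, 140, 145]

print(round(split_half_reliability(half_a, half_b), 2))
print(round(retest_reliability(first, second), 2))
```

The Spearman-Brown step is needed because each half contains only half the questions, so the raw correlation between halves understates the reliability of the full-length subscore.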

Added value

To determine whether a subscore adds measurement value, we compare how much of the subscore’s variation is accounted for by the total score with how much is accounted for by the subscore alone (its reliability)⁶. This measure is called proportional reduction in mean squared error (PRMSE). If the subscore PRMSE is larger than the total score PRMSE, we conclude that the subscore provides additional, meaningful information beyond the total score. Table 3 shows that every subscore PRMSE is larger than the corresponding total score PRMSE, which means that the subscores have added measurement value beyond the total score.

Table 3 Duolingo English Test Subscore Added Measurement Value

Subscore	Subscore PRMSE	Total score PRMSE
Literacy	0.89	0.78
Conversation	0.93	0.76
Comprehension	0.95	0.89
Production	0.76	0.45
Note: It is desirable for the subscore PRMSE to be larger than the total score PRMSE.
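
The decision rule behind Table 3 is a simple comparison, sketched below using the values from the table. The variable and function names are illustrative only.

```python
# Values copied from Table 3: (subscore PRMSE, total score PRMSE).
PRMSE = {
    "Literacy": (0.89, 0.78),
    "Conversation": (0.93, 0.76),
    "Comprehension": (0.95, 0.89),
    "Production": (0.76, 0.45),
}


def has_added_value(subscore_prmse, total_prmse):
    """A subscore adds value when it beats the total score's PRMSE."""
    return subscore_prmse > total_prmse


ADDED_VALUE = {
    name: has_added_value(subscore, total)
    for name, (subscore, total) in PRMSE.items()
}

for name, (subscore, total) in PRMSE.items():
    margin = subscore - total
    print(f"{name}: margin {margin:+.2f}, added value: {ADDED_VALUE[name]}")
```

The margin is smallest for Comprehension and largest for Production, so Production is the subscore that the total score predicts least well on its own.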

We set three criteria to evaluate the quality of our subscores. We required that they reflect the internal structure of the test, be reliable, and have added measurement value. The MDS analysis showed that the questions work together to measure integrated components of language. The subsequent analyses showed that our subscores are reliable and that they add value for score interpretation. These results support this configuration of subscores for the Duolingo English Test.

How can subscores be used?

Institutions that identify the skill profiles that are important to them can use subscores to make more informed decisions about applicants. For example, if an institution values its applicants' ability to converse in study groups, it can set a higher minimum admissions score for Conversation. For test takers, subscores can signal their strengths and weaknesses, allowing better focus in future language studies.

Duolingo English Test subscores are available as of July 7, 2020. Please visit our FAQ page and read our subscore whitepaper for more details!

References

1. Cumming, A. (2013). Assessing integrated skills. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 1, pp. 216-229). Hoboken, NJ: Wiley-Blackwell.
2. Hinkel, E. (2010). Integrating the four skills: Current and historical perspectives. In R. B. Kaplan (Ed.), Oxford Handbook in Applied Linguistics (2nd ed., pp. 110-126). Oxford University Press.
3. Hinkel, E. (2006). Current perspectives on teaching the four skills. TESOL Quarterly, 40(1), 109-131.
4. Widdowson, H. G. (1978). Teaching language as communication. Oxford: Oxford University Press.
5. LaFlair, G. T., & Settles, B. (2019). Duolingo English Test: Technical manual. Pittsburgh, PA: Duolingo.
6. Sinharay, S., & Haberman, S. J. (2008). Reporting subscores: A survey. Research Memorandum 08-18.

Note on Figure 2: The vertical dimension represents Literacy (bottom) and Conversation (top). The horizontal dimension represents Comprehension (left) and Production (right) (n = 47,654; Stress-1 = 0.026).