Our characters are finally finding their voices! In this post, we’ll share how we’re creating custom text-to-speech voices for all of our characters.

Illustration of the Duolingo characters racing through space. They look excited and determined.

Our characters make learning more fun

After our art team illustrated, animated, and added the characters to the app, we saw a lot of love for them on social media. This enthusiasm motivated us to put even more time into our cast and really flesh them out.

We saw an opportunity to make language learning more fun and engaging and, as a result, to build a stronger bond between our learners and Duolingo. Learners could come back again and again to learn and to discover more about our characters through engaging storytelling.

Now we’re also adding custom voices for each of our characters. In addition to giving our characters more personality, these voices will expose learners to a wider variety of speakers, which is an advantage for learning. When you use your new language out in the world, you'll be interacting with people of different ages, genders, and backgrounds. That's why hearing a variety of voices in your lessons is so important: it helps you develop flexible listening skills for real-life language situations!

So who are our characters?

To answer this question, we spent many months developing the characters and discussing their personalities, backstories, and relationships with one another. At the same time, we were also writing Stories featuring them, which helped us further uncover their personalities.

We realized that familiarity with the characters could be a great shortcut for storytelling. In Stories, we have constraints around length and the kinds of words and grammar we can use (especially for beginners). But using our characters, with their strong, distinctive personalities and familiar dynamics with one another, we suddenly had an easier path to stronger storytelling. We no longer have to explain Lily’s motivations in each story: her blithe, unamused avatar already gives our learners a nuanced understanding of them, which lets us tell engaging stories even with only beginner-level language.

Lily and Duo have a low-energy chat while walking

Finding voices for our characters

The next step in bringing our characters to life was to give them their own voices. That’s why we’re building a custom text-to-speech, or TTS, voice for each character that really shows off their unique personalities. We're excited about how their voices can make the language-learning experience more engaging, more effective, and even more fun.

Of course, developing unique voices for nine characters across multiple languages isn’t easy or fast. Casting just the English voices took many months of reviewing auditions and deliberating on which actors captured our characters best. Did this Eddy audition sound too intellectual? Should Oscar have a deeper, more resonant voice? And just how deadpan can Lily be without negatively impacting the learning experience?

After casting and recording the characters in English, we used those performances as a blueprint for Spanish, French, German, and Japanese. Even with the English voices established as a reference, each language presented new challenges, both creative and logistical. For instance, sarcasm sounds different in Japanese than it does in English. Should Lily sound different as well? With our team of language experts, phoneticians, and creative consultants, we worked through each voice to be sure it captured the characters’ personalities in a culturally suitable way.

Some voices in other languages sound almost identical to our English characters. In other instances, we played up a particular element of a character’s personality. Lin is an especially interesting character across languages. For example, she’s languid and matter-of-fact in Japanese, but perpetually amused in English.

Lin on stage singing at a microphone while playing guitar

Building their voices

After casting and recording the characters with their own personality and style, we used machine learning to build state-of-the-art text-to-speech voices. These can be used to say any sentence in the course — even the ones that haven't been written yet! There's a lot of great technology already available to build and develop voices, but what we at Duolingo need them to do is teach languages, and that's pretty different from how the technology is currently used in other applications.

We carefully designed the sentences for the recordings to cover all the contexts we'd need for our lessons — different combinations of speech sounds, a variety of sentence types, and all sorts of contexts, including exclamations and single words. This range of recordings was necessary to represent all the ways learners encounter the language in their courses. We also worked to push the limits of the technology to get the right delivery — intonation, rate, and pausing — to make the voices as realistic and effective as possible for language learning.
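To make the idea of "covering" speech-sound combinations concrete, here is a minimal sketch of one common way to pick a recording script: greedily choose the sentences that add the most not-yet-covered pairs of adjacent sounds. This is an illustration only, not a description of how our scripts were actually built. The function names, the toy sentences, and their phoneme lists are all made up for the example.

```python
def sound_pairs(phonemes):
    """Return the set of adjacent sound pairs in a phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))


def pick_script(candidates, max_sentences):
    """Greedily pick sentences that add the most uncovered sound pairs.

    `candidates` maps each sentence to its (hypothetical) phoneme list.
    Returns the chosen sentences and the set of pairs they cover.
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_sentences:
        # Sentence whose pairs add the most new coverage.
        best = max(remaining, key=lambda s: len(sound_pairs(remaining[s]) - covered))
        gain = sound_pairs(remaining[best]) - covered
        if not gain:
            break  # Nothing left would add new coverage.
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy data: illustrative phoneme lists, not real lexicon entries.
candidates = {
    "go": ["g", "oʊ"],
    "go now": ["g", "oʊ", "n", "aʊ"],
    "no": ["n", "oʊ"],
}
chosen, covered = pick_script(candidates, max_sentences=3)
print(chosen)
```

A real script would also need to balance sentence types, exclamations, and single words, which a pair-counting heuristic like this does not capture on its own.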

With our new voices, we aimed to balance the expressiveness of the voice actors with our very specific teaching needs. For the recordings, the voice actors had to invent scenarios to make the lines meaningful, and sometimes that extra acoustic "flavor," like imagining the character being angry, presented a challenge to the technology, which is trained on more neutral speech.

It was also really important to us to match the recordings and TTS voices with their eventual goal in a real lesson. For our learners, the TTS voices need to be a reliable model of how to pronounce and use the language. For example, in the English sentence "I read the book," the word "read" will be pronounced differently in a lesson about present tense ("I read the book [every night before bed]") compared to a lesson about past tense ("I read the book [last summer]"). This was also challenging when working to get the rhythm and intonation right in different kinds of sentences. In English, our voices go up and down in very particular but very different ways depending on the kind of question we're asking: "Do you want to go?" has a different rhythm compared to "Where do you want to go?" Our TTS voices are only as good as the speech examples we give the system, so our language experts and engineers worked together to give the system hints or correct the speech when necessary.
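What might a pronunciation "hint" look like? One widely supported option is SSML, the W3C speech markup standard, whose `<phoneme>` element lets you spell out how a word should be spoken. The sketch below builds that markup for the "read" example. It is an assumption for illustration, not a description of our internal tooling. The helper name, the tense labels, and the IPA strings are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical IPA hints for the two pronunciations of "read".
READ_IPA = {"present": "riːd", "past": "rɛd"}


def read_sentence_ssml(tense, ending):
    """Build SSML for 'I read the book ...' with an explicit hint for 'read'."""
    ipa = READ_IPA[tense]
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    phoneme = f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>read</phoneme>"
    return f"<speak>I {phoneme} the book {escape(ending)}.</speak>"


print(read_sentence_ssml("present", "every night before bed"))
print(read_sentence_ssml("past", "last summer"))
```

The written sentence is identical in both lessons, so the lesson's tense has to decide which hint is attached before the text ever reaches the voice.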

Who can you hear next?

Learners in our English courses can now hear all the characters' voices in their lessons!

And if you’re studying multiple languages on Duolingo, you’ll get to hear many different interpretations of our cast of characters!