Voiced by Darth Vader

Opportunities and limitations of speech synthesis

Digital media is shifting the way knowledge is transferred, from pure text to images and spoken language. In many areas, spoken language is simply better suited for communicating content than pure text.

Sounds good. But does it deliver?

Especially for videos and e-learning, spoken language in the form of dubbing (lip-synced if necessary) or voice-over (off-screen) is often the better choice. Until now, dubbing and voice-overs have required elaborate production in a sound studio with voice actors, sound engineers and sometimes even directors.

Speech synthesis is an alternative here: the software-based generation of a synthetic yet human-sounding voice. This technology (also known as text-to-speech, TTS) has evolved greatly in recent years, not least due to the increasing availability of deep learning, a form of artificial intelligence (AI).

The evolution of speech synthesis:

Approach | Features
Phoneme-based speech synthesis | Often mechanical or artificial sound, limited voice selection
AI-based speech synthesis (machine learning) | Largely natural-sounding voices, few adjustment options
AI-based speech synthesis with SSML capability | SSML markup language allows much better fine-tuning of many aspects

While narrators are still the gold standard for high-quality marketing videos, speech synthesis – now also based on trained AI speech models – offers a serious alternative for a number of areas.

Less expensive. And what else?

The main advantages of speech synthesis are lower costs and shorter production times. This makes spoken language possible even where it was previously inconceivable, and subtitles no longer have to serve as a stopgap for dubbing. What’s more, speech synthesis makes things possible that are not feasible in a recording studio. For example, you can create a voice model with any voice and dub your e-learning session with it.

The advantages and disadvantages of voice recordings versus speech synthesis:

Aspect | Voice actor | Speech synthesis
Detailed control over the effect of the voice (pronunciation, intonation, sentence melody, etc.) | yes | no
Quick production | mostly no | yes
Manageable costs (especially with multiple languages) | no | yes
Simple, inexpensive retakes | no | yes
Voices available at almost any time | no | yes
Simple, low-cost licensing models | no | yes
Creation of custom speech synthesis models with any voice (e.g., Darth Vader) | no | yes

As always: It depends.

However, depending on the intended use, target audience and quality requirements, speech synthesis also has limitations. In certain application scenarios, the goal of the communication can still only be achieved with a recording by a professional voice actor. This is especially true in image and product videos, where the type of speech and the associated emotionality in the voice are an essential part of the marketing message.

Whether to use speech synthesis should be decided carefully, based on the content to be dubbed and the languages and language varieties required.

Here are some decision criteria:

Type of content or general conditions | Implementation possible through
Glossy image video with a voice that significantly contributes to atmosphere and emotionality in the video | Studio recording with voice actor
Technical how-to video | Speech synthesis
Product video | Studio recording with voice actor or speech synthesis
E-learning | Speech synthesis
If no speech synthesis model exists for a particular language | Studio recording with voice actor

Synthesized voices also have a number of peculiarities that should be considered when deciding for or against speech synthesis:

  • Not all synthesized voices are equal in phonetic quality
  • Not all languages have the same number of synthesized voices available – some languages have dozens, while others currently only have one or two
  • The pronunciation of certain words can only be adapted to a certain extent (especially non-lexical company and product names as well as loanwords)
  • There are limitations in the adaptability of intonation, speech melody and timing
  • Various limitations cannot be assessed in advance but only become apparent during the production of the speech synthesis files

Conclusion

The ongoing development of speech synthesis opens up interesting, cost-effective possibilities, especially since SSML (Speech Synthesis Markup Language) allows much better fine-tuning.
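To make the idea of fine-tuning through SSML more concrete, here is a minimal sketch in Python that assembles a short SSML document. It uses three standard SSML elements: `sub` to control how a product name is spoken, `break` to insert a pause, and `prosody` to adjust speed and pitch. The function name `build_ssml`, the product name "ACME-VX" and its spoken alias are hypothetical and only serve as an illustration. Which elements and attribute values a particular speech synthesis tool actually supports is an assumption that has to be checked for each tool.

```python
import xml.etree.ElementTree as ET


def build_ssml(product_name: str, spoken_alias: str) -> str:
    """Return a small SSML document as a string (illustrative sketch only)."""
    # Root element of every SSML document.
    speak = ET.Element("speak", {
        "version": "1.1",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })

    # <s> marks one sentence.
    sentence = ET.SubElement(speak, "s")
    sentence.text = "Welcome to "

    # <sub> keeps the written form but tells the engine what to say instead,
    # which helps with company and product names that are not in the lexicon.
    sub = ET.SubElement(sentence, "sub", {"alias": spoken_alias})
    sub.text = product_name
    sub.tail = "."

    # <break> inserts a pause to adjust timing.
    ET.SubElement(speak, "break", {"time": "500ms"})

    # <prosody> changes speaking rate and pitch for the enclosed text.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
    prosody.text = "Please wear your safety helmet at all times."

    return ET.tostring(speak, encoding="unicode")
```

Calling `build_ssml("ACME-VX", "ack-mee vox")` returns a string that can be passed to a speech synthesis tool that accepts SSML input. Building the markup with an XML library rather than by string concatenation keeps the document well-formed and escapes special characters in names automatically.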

The speech synthesis tools themselves are widely available online. To use them productively, however, especially for multilingual content, suitable processes and practical experience with SSML & Co. are necessary.

Would you like Darth Vader to voice your next safety instruction video for visitors to your company premises? Then get in touch with us.